(What was meant to be a quick) Overview of Statistics for Linear Regression

Jojo de León
15 min read · Aug 2, 2021


I made my first linear regression model recently and found myself wishing I’d made a cheat sheet covering what I should keep in mind while creating it, as I seemed to be constantly googling concepts that I thought I had a better grasp on. Well, as a good friend of mine says, you win some, you lose some. Post-model, the following is what I devised, with the full knowledge that this is just the tip of the iceberg. Some of what is to follow is overly simple but was included because it will be referenced later for more complicated concepts.

So the “simple”…

Discrete data is information that can only take certain values i.e. the values are fixed. Oftentimes, discrete data is represented using a tally (frequency count) and visually represented in bar graphs or histograms.

Continuous data can take any value and may change over time. This data is best shown on a line graph, as this will show how the data changes over a given period of time.

A population is the collection of all the items of interest in a study. The numbers obtained when using a population are called parameters. A sample is a subset of a population. The numbers obtained when working with a sample are called statistics. In creating a model, you are trying to infer population parameters from a sample, which leads us to Inferential Statistics: the approach of making claims and communicating uncertainty about unknown values. Procuring the data for an entire population is unrealistic. Instead, data scientists use a sample deemed representative of the population and then use, for example, the standard deviation of the sample as an estimate for the standard deviation of the population. In doing so, we need to ensure that the sample used is both random and representative of the whole population.

There is an enormous number of distributions available to classify data, but a handful of distributions can represent the majority of situations a data scientist will encounter. Here, I’m referring to statistical distributions, which are representations of the frequencies of data or percentages of grouped data. A few examples of discrete distributions are Bernoulli, Poisson, and (discrete) uniform. A few examples of continuous distributions are normal (or Gaussian), exponential, logistic, and Weibull.

Normal and Standard Normal Distributions are very special…

Normal distributions (aka the “bell curve” or the “Gaussian curve”) are continuous distributions which are symmetric around their mean. Their mean, median, and mode are equal. Additionally, they are defined by two parameters, the mean and standard deviation (i.e. the spread of the distribution). Visually, this characteristic is seen by their graph having a denser center and less density in the tails. The area under the bell curve represents probability and the total area under the curve sums to 1.0. Around 68% of the area of a normal distribution is within one standard deviation of the mean. Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.

Standard Normal Distributions, aka Z-Distributions, are special cases of a normal distribution where the mean is 0 and the standard deviation is 1. You can transform any normal distribution to a standard normal distribution. The z-score of a value is the number of standard deviations it lies above or below the mean, and from it you can look up the probability of a score at least that extreme occurring within the given normal distribution. Calculating a z-score along with graphing the distribution gives quick and easy access to understanding how extreme a certain result is. When speaking about a 𝑧-score, use “above” or “below” (the mean) in your phrasing.
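As a quick sketch (using scipy.stats and made-up numbers), here is how you might compute a z-score and the probability that goes with it, plus a sanity check of the 68% / 95% rule mentioned above:

```python
from scipy import stats

# Hypothetical example: scores with a population mean of 70 and
# standard deviation of 10. How extreme is a score of 85?
mu, sigma, x = 70, 10, 85

z = (x - mu) / sigma                 # standard deviations above the mean
p_below = stats.norm.cdf(z)          # probability of a score at or below 85
print(f"z = {z:.2f}, P(X <= 85) = {p_below:.3f}")

# Sanity check of the empirical rule: ~68% within 1 sd, ~95% within 2 sd
print(stats.norm.cdf(1) - stats.norm.cdf(-1))   # ~0.683
print(stats.norm.cdf(2) - stats.norm.cdf(-2))   # ~0.954
```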

The Central Limit Theorem allows data scientists to treat sample statistics drawn from non-normal distributions as approximately normally distributed, thus providing a way to estimate parameters about a population. This theorem states that, under fairly general conditions, the sum (or mean) of many independent random variables will converge to a normal distribution as the number of variables increases. This becomes very useful for applying statistical logic to sample statistics in order to estimate population parameters.
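To see the theorem in action, here is a small simulation sketch (assuming numpy and an arbitrary, clearly non-normal exponential population): the means of repeated samples cluster around the population mean in a roughly normal way.

```python
import numpy as np

rng = np.random.default_rng(42)

# A deliberately non-normal (exponential) population.
population = rng.exponential(scale=2.0, size=100_000)

# Distribution of the means of many samples of size 50.
sample_means = [rng.choice(population, size=50).mean() for _ in range(1_000)]

# The sample means cluster around the population mean with a roughly
# normal spread, even though the underlying data are heavily skewed.
print(np.mean(sample_means), population.mean())
```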

And now, the more complicated..

Ways to represent probability distribution: Probability Mass Function (PMF), Probability Density Function (PDF), and Cumulative Distribution Function (CDF)

The PMF, aka the discrete density function or simply the frequency function, is a function that associates probabilities with the mass or frequency of each value of a discrete (categorical) random variable. As this is a discrete distribution, there are a known number of possible outcomes, enabling you to calculate an exact probability for each. You can visualize a PMF via simple bar graphs where the y-axis represents the probability (the mass) of each category shown on the x-axis.
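For instance, here is a minimal PMF sketch using scipy.stats and a made-up binomial example (the number of heads in 10 fair coin flips):

```python
from scipy import stats

# PMF of a binomial variable: number of heads in 10 fair coin flips.
n, p = 10, 0.5
for k in range(n + 1):
    print(k, round(stats.binom.pmf(k, n, p), 4))   # exact probability of each outcome

# The probabilities of all possible outcomes sum to 1.
print(sum(stats.binom.pmf(k, n, p) for k in range(n + 1)))
```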

The PDF is analogous to the PMF but for continuous variables (whose exact value cannot be identified, only approximated to a given accuracy), allowing for a way to model the probability of occurrences for continuous variables. It helps identify regions in the distribution where observations are more likely to occur, i.e. where the observations are more dense. An area under the curve of a PDF corresponds to a probability. Do not lose sight of the fact that for continuous variables the probability of any single exact value is always 0, because there are infinitely many possible values. Therefore, when using a PDF, the only way to express a probability associated with a particular value is over an interval that contains that value. You can visualize a PDF by using histograms and density plots. These are used to get a sense of the data density, as you cannot readily read probabilities off the y-axis.
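A small sketch of the “exact values have probability 0, intervals don’t” idea, using scipy.stats and a standard normal distribution:

```python
from scipy import stats

dist = stats.norm(loc=0, scale=1)

# The density at a point is NOT a probability...
print(dist.pdf(0))                      # ~0.399, just the height of the curve

# ...the probability of any exact value is 0; probabilities live on intervals.
print(dist.cdf(0.5) - dist.cdf(-0.5))   # P(-0.5 <= X <= 0.5) ~ 0.383
```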

The CDF, aka the percentile probability function, is a function which gives, for any value x, the probability that the random variable takes a value less than or equal to x. Cumulative here refers to the sum of the probabilities: a cumulative probability is the sum of the probabilities of all values up to a given point. Visually, it looks like steps for discrete random variables, whereas it is a smooth curve for continuous random variables. From these graphs, you can obtain the probability directly from the y-axis.
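As a toy example (a fair six-sided die, using numpy), the CDF is just the running sum of the PMF, which is why it looks like steps for a discrete variable:

```python
import numpy as np

# CDF of a fair six-sided die: cumulative sums of the PMF.
pmf = np.full(6, 1 / 6)
cdf = np.cumsum(pmf)
print(dict(zip(range(1, 7), cdf.round(3))))   # P(X <= 1), P(X <= 2), ...

# P(X <= 4) can be read directly from the cumulative values.
print(cdf[3])   # ~0.667
```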

Hypothesis Testing

The following assumes you understand the Scientific Method and how to formulate your Null and Alternative Hypotheses to test.

Statistical parameters

Both confidence intervals and statistical tests are inferential techniques based on sampling and distributions. You choose some threshold of incorrectness (i.e. the confidence level) that you are willing to tolerate and subsequently use that threshold in statistical tests. Then, once a test statistic is computed and compared to the distribution of that statistic, the confidence level is used to determine if a result is statistically significant. Statistical significance is based on a few concepts: samples and populations, hypothesis testing, the normal distribution, and p-values (more on these to come).

A Significance Level, oftentimes referred to as 𝛼 (alpha), is the marginal threshold at which you’re okay with rejecting the null hypothesis. It can be thought of as the false positive rate you are willing to tolerate. An alpha value can be any value between 0 and 1. Typical significance levels are 1%, 5%, or 10% (corresponding to 99%, 95%, or 90% confidence levels). Choosing a lower alpha leads to a stricter test, so you will be less likely to be able to reject your null hypothesis (which is generally what you are hoping to do). Choosing a higher alpha leads to a higher probability of rejecting the null hypothesis, a downside of which is that you run a higher risk of falsely concluding that there is a difference between your null hypothesis and your observed results when there actually isn’t one. By setting α to 5%, you would expect that 1 in 20 times, if you were performing the exact same experiment, you would reject the null hypothesis when it is actually true. A statistically significant result means that the probability of obtaining that result (a false positive), if the null hypothesis were true, is below the specified significance level.

The p-value is the probability of observing a test statistic at least as extreme as the one observed, by random chance, assuming that the null hypothesis is true. It is a statistical summary of the compatibility between the observed data and what you would expect to see in a population assuming the statistical model is correct. If your p-value is low, you say that the result is significant, in the sense that you conclude that the sample is significantly different from the population. This is where the significance level, or 𝛼, comes into play, as it is the threshold value that defines whether a p-value is low or high. A related idea is the confidence interval: a range of values surrounding an estimated parameter. The width of the range depends on the variance of the data (more variance results in a wider confidence interval, less variance in a narrower one).

The effect size is how different two samples are: a measure of the real-world difference between two groups, used to quantify the size of the difference between the groups under observation. To calculate it, you take the difference of the means of the two groups divided by the pooled standard deviation. An unstandardized effect size simply finds the difference between the two groups by calculating the difference between the distribution means. Since the magnitude of that difference depends on the units of measure, it is hard to compare across studies that may be conducted with different units of measurement. In a data analytics domain, effect size calculation serves three primary goals. Firstly, it communicates the practical significance of results. Secondly, effect size calculation and interpretation allows you to draw meta-analytical conclusions: you can group together a number of existing studies, calculate the meta-analytic effect size, and get the best estimate of the effect size for the population. And thirdly, it lets you perform Power Analysis, which helps determine the number of participants (sample size) that a study requires to achieve a certain probability of finding a true effect, if there is one.

You can also look for the amount of overlap (or misclassification rate) between the two distributions. To define overlap, you choose a threshold between the two means. The simple threshold is the midpoint between the means. The “overlap” is the total AUCs (Area Under the Curves) of the intersection of the two distributions. You can use this to identify the samples that end up on the wrong side of the threshold. Another “non-parametric” way to quantify the difference between distributions is what’s called “probability of superiority”.

There is one other common way to express the difference between distributions (i.e. the difference in means): standardizing it by dividing by the standard deviation.

When conducting hypothesis testing, there will almost always be the chance of accidentally rejecting a null hypothesis when it should not have been rejected. This scenario is a type I error, more commonly known as a False Positive.

Another type of error is beta (𝛽) or type II error, aka a False Negative, which is the probability that you fail to reject the null hypothesis when it is actually false. Beta is related to something called Power, which is the probability of rejecting the null hypothesis given that it actually is false. Mathematically, Power = 1 - 𝛽. When designing an experiment, scientists will frequently choose a power level they want for an experiment and from that obtain their type II error rate. As with any probability, the power of a statistical test ranges from 0 to 1, with 1 being a perfect test that guarantees rejecting the null hypothesis when it is indeed false. Power is related to alpha, sample size, and effect size. Typically a researcher will select an acceptable alpha value and then examine the sample sizes required to achieve the desired power, such as 0.8 (or higher).
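As a sketch of how that sample-size question might be answered in code (this uses statsmodels rather than scipy, which is my own choice here, along with made-up planning numbers):

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical planning numbers: medium effect (d = 0.5), alpha = 0.05, power = 0.8.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))   # roughly 64 observations per group
```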

The two error types are inversely related to one another; reducing type I errors will increase type II errors and vice versa.

Statistical Tests

In frequentist hypothesis testing, you construct a test statistic from the measured data and use the value of that statistic to decide whether to reject the null hypothesis. The test statistic is a lower-dimensional summary of the data but still maintains the discriminatory power necessary to determine whether a result is statistically significant for a given significance level.

A One Sample Z-Test is the most basic type of hypothesis test. It is performed when the population mean and standard deviation are known. When running a one-sample z-test, you test whether the average of the sample suggests that it comes from a certain population (with a known mean) or whether it may come from a different population.

The real value from a z-test comes from comparing it against a z-table. A 𝑧-table contains cumulative probabilities of a standard normal distribution up until a given 𝑧-score value therefore providing “less than” probabilities. By transforming normal distributions with various means and standard deviations, you can use this 𝑧-table for any value that follows a normal distribution. The 𝑧-table is short for the “Standard Normal 𝑧-table”. When using the idea of cumulative probability (the CDF) in the context of the standard normal distribution where the AUC=1 (i.e. 100%), we look at the cumulative distribution until a point 𝑧. Thus, you can use the 𝑧-table to calculate probabilities on both sides of the 𝑧-score under the standard normal distribution.
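Rather than reading the z-table by hand, you can let scipy.stats do the lookup. A minimal sketch with hypothetical numbers (a known population mean of 100 and standard deviation of 15, and a sample of 36 with mean 105):

```python
import numpy as np
from scipy import stats

mu, sigma, n, x_bar = 100, 15, 36, 105

z = (x_bar - mu) / (sigma / np.sqrt(n))   # z-statistic for the sample mean
p_one_sided = stats.norm.sf(z)            # P(Z >= z), the upper-tail probability
p_two_sided = 2 * p_one_sided

print(z, p_one_sided, p_two_sided)        # z = 2.0, p ~ 0.023 (one-sided)
```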

The T-Distribution allows us to work with samples where the population standard deviation is unknown (as well as with smaller samples), in order to form confidence intervals. It is similar to a normal distribution in shape but has heavier tails. T-distributions have a parameter known as degrees of freedom: the higher the degrees of freedom, the more closely the distribution resembles the normal distribution. Use the t-distribution to find estimates for the population mean even without knowing any specific parameters concerning the population. Then, calculate your interval estimate using a t-distribution and your various sample quantities. The calculation requires four inputs: the sample mean, the sample standard deviation, the degrees of freedom (this is 1 less than the number of items in the sample, n), and the confidence level you wish to have in your estimate.
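A minimal sketch of that interval calculation, using scipy.stats and a made-up sample of 15 measurements:

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 15 measurements.
sample = np.array([12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9,
                   12.4, 12.1, 11.7, 12.6, 12.0, 11.9, 12.2])

n = len(sample)
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean
df = n - 1                              # degrees of freedom

# 95% confidence interval using the t critical value.
t_crit = stats.t.ppf(0.975, df)
print(mean - t_crit * sem, mean + t_crit * sem)
```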

T-tests (also called Student’s T-test) are very practical hypothesis tests for t-distributions that can be employed to compare two averages (means) to assess if they are different from each other. You should run a t-test when you either don’t know the population standard deviation or have a small sample size. Like a 𝑧-test, the t-test also tells you how significant the differences are i.e. it lets you know if those differences could have happened by chance.

A One-Sample T-Test is a statistical procedure used to determine whether a sample of observations could have been generated by a process with a specific mean. That is to say, it compares the mean of your sample data to a known mean value.
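In practice you would rarely compute this by hand; scipy.stats has it built in. A sketch with made-up data and a hypothesized mean of 12.0:

```python
from scipy import stats

# Hypothetical sample; could it have come from a process with mean 12.0?
sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 12.1]

t_stat, p_value = stats.ttest_1samp(sample, popmean=12.0)
print(t_stat, p_value)   # reject the null hypothesis if p_value < alpha
```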

A Two-Sample T-Test is used to determine if two population means are equal. The main types of two-sample t-tests are paired and independent tests. Paired tests are useful for determining how a sample is affected by a certain treatment; in other words, the individual items in the sample remain the same and you are comparing how they change after treatment. Independent two-sample t-tests are for when you are comparing two different, unrelated samples to one another.
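A quick sketch of both flavors using scipy.stats and invented scores (the paired version reuses the same subjects before and after a treatment):

```python
from scipy import stats

before = [72, 68, 75, 80, 66, 77, 71, 74]   # hypothetical pre-treatment scores
after  = [75, 70, 78, 83, 69, 80, 72, 77]   # same subjects after treatment
group_b = [70, 65, 73, 76, 64, 72, 69, 71]  # an unrelated comparison group

# Paired test: the same individuals measured twice.
print(stats.ttest_rel(before, after))

# Independent test: two unrelated samples (equal variances assumed by default).
print(stats.ttest_ind(before, group_b))
```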

When performing various kinds of t-tests, you assume that the sample observations have numeric and continuous values. You also assume that the sample observations are independent of each other (that is, that you have a simple random sample) and that the samples have been drawn from normal distributions. You can visually inspect the distribution of your sample using a histogram. In the case of unpaired two-sample t-tests, you also assume that the populations from which the samples have been drawn have the same variance. For paired two-sample t-tests, you assume that the differences between the two sets of samples are normally distributed.

Assuming that the three requirements for a t-test mentioned above are fulfilled (i.e. normality, independence, and randomness), you are ready to calculate your t-statistic. A positive t-value indicates that the sample mean is greater than the population mean. You must assess whether the increase is high enough to reject the null hypothesis, which says that there is no significant increase. You answer this question by calculating a critical t-value. It’s possible to calculate a critical t-value with a t-table or by using the Python scipy.stats module. The critical value approach involves determining whether a result is “likely” or “unlikely” by determining whether or not the observed test statistic is more extreme than would be expected if the null hypothesis were true. This involves comparing the observed test statistic to a cutoff value, called the critical value. If the test statistic is more extreme than the critical value, then the null hypothesis is rejected in favor of the alternative hypothesis. If the test statistic is not as extreme as the critical value, then the null hypothesis is not rejected. You need two values to find the critical t-value: the alpha level (𝛼), which determines what p-value to look up in the t-table, and the degrees of freedom.

More often, you will use code to find the critical t-value rather than looking it up in a table. There is a function in scipy.stats that is the inverse of the CDF, called the “percent point function”, or PPF. Given a probability, you use the PPF to compute the corresponding value, in this case the critical t-value. The critical t-value marks the boundary of the rejection region: if the observed t-statistic falls in the rejection region, you reject the null hypothesis.
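A minimal sketch of that PPF lookup with scipy.stats, assuming an arbitrary alpha of 0.05 and 24 degrees of freedom:

```python
from scipy import stats

alpha, df = 0.05, 24            # hypothetical: n = 25, so 24 degrees of freedom

# One-tailed critical value: the t-value with 5% of the area above it.
t_crit_one = stats.t.ppf(1 - alpha, df)

# Two-tailed critical values split alpha between the two tails.
t_crit_two = stats.t.ppf(1 - alpha / 2, df)

print(t_crit_one, t_crit_two)   # ~1.711 and ~2.064

t_observed = 2.3                # hypothetical observed t-statistic
print(abs(t_observed) > t_crit_two)   # True -> it falls in the rejection region
```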

Cohen’s d is a way to measure effect size. It represents the magnitude of the difference between two (or more) groups on a given variable, with larger values representing greater differentiation between the groups on that variable. Its basic formula is 𝑑 = effect size (difference of means) / pooled standard deviation. Cohen’s d can be cautiously interpreted as: small effect = 0.2, medium effect = 0.5, large effect = 0.8. A few nice properties of Cohen’s d: since the difference in means and the standard deviation have the same units, their ratio is dimensionless, so you can compare 𝑑 across different studies. Moreover, given 𝑑 (and the assumption that the distributions are normal), you can compute overlap, superiority, and related statistics.
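A small sketch of that formula using numpy and made-up groups (the helper function cohens_d is mine, not a library call):

```python
import numpy as np

def cohens_d(group1, group2):
    """Difference in means divided by the pooled standard deviation."""
    g1, g2 = np.asarray(group1, dtype=float), np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Hypothetical groups
print(cohens_d([75, 70, 78, 83, 69, 80], [70, 65, 73, 76, 64, 72]))
```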

A popular statistical test for checking the normality assumption is the Kolmogorov-Smirnov Test, or simply the K-S test. The normality assumption can also be checked through visualization techniques like qqplots or boxplots, but the K-S test is the more rigorous option. In the K-S test, the distributions are compared in their cumulative form as Empirical Cumulative Distribution Functions, and the test statistic used to compare them is simply the maximum vertical distance between the two functions. Essentially, you are testing the sample data against another sample (or a reference distribution) to compare their distributions for similarities.
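A sketch of both the one-sample and two-sample versions with scipy.stats (note that plugging the sample’s own mean and standard deviation into the one-sample test makes it only a rough normality check):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2, size=200)   # hypothetical sample

# One-sample K-S test against a normal distribution with the sample's own
# mean and standard deviation (a rough normality check).
stat, p = stats.kstest(sample, 'norm', args=(sample.mean(), sample.std(ddof=1)))
print(stat, p)   # a large p-value gives no evidence against normality

# Two-sample version: compare the ECDFs of two samples directly.
other = rng.normal(loc=5, scale=2, size=200)
print(stats.ks_2samp(sample, other))
```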

ANOVA (Analysis of Variance) is a method for generalizing statistical tests to multiple groups. ANOVA analyses the overall variance of a dataset by partitioning the total sum of squared deviations (from the mean) into the sum of squares for each of the groups and the sum of squares for error. Because it compares multiple groups within a single test, it can serve as a useful alternative to running many t-tests when you wish to test multiple factors simultaneously.
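A minimal one-way ANOVA sketch with scipy.stats and invented group scores:

```python
from scipy import stats

# Hypothetical scores from three groups.
group_a = [85, 86, 88, 75, 78, 94, 98]
group_b = [91, 92, 93, 85, 87, 84, 82]
group_c = [79, 78, 88, 94, 92, 85, 83]

# One-way ANOVA: do the group means differ more than chance would suggest?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```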

Skew is used to measure asymmetry in a distribution. It is the degree of distortion or deviation from the symmetrical normal distribution. Skew helps to identify extreme values in one of the tails. Symmetrical distributions have a skewness of 0. Positive skewness is when the tail on the right side of the distribution is longer, or “fatter”, than the tail on the left; when there is positive skewness, the mean and median are bigger than the mode. Negative skewness is when the tail on the left side of the distribution is longer or fatter than the tail on the right side; the mean and median are smaller than the mode when there is negative skewness. A skew between -0.5 and 0.5 means that the data are pretty symmetrical. Moderate skew has a value between -1 and -0.5 or between 0.5 and 1. Skewness smaller than -1 or larger than 1 means that the data are highly skewed.

Kurtosis defines whether a distribution is truly “normal” or whether it may have so-called “fatter” or “thinner” tails than you would observe when data are normally distributed. It aims to identify extreme values in both tails at the same time, and can be thought of as a measure of the outliers present in the distribution. A high kurtosis indicates a lot of outliers in the data, whereas a low kurtosis indicates that the data have light tails or lack outliers. A mesokurtic distribution looks similar to a normal distribution and has a kurtosis close to that of a normal distribution, with a score of about 3. A platykurtic distribution has a shorter and broader peak with thinner tails than a normal distribution and a kurtosis score less than 3. A leptokurtic distribution, often described as “skinny”, has a higher and sharper peak with longer and fatter tails in comparison to a normal distribution and a kurtosis score greater than 3.
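Both measures are one line each with scipy.stats; here is a sketch on a deliberately right-skewed (exponential) sample. Note that scipy’s kurtosis defaults to “excess” kurtosis (normal ≈ 0), so you pass fisher=False to get the scale used above where normal ≈ 3:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=1_000)   # a right-skewed sample

print(stats.skew(data))                    # clearly positive for exponential data
print(stats.kurtosis(data, fisher=False))  # "Pearson" kurtosis; normal ~ 3
print(stats.kurtosis(data))                # default "excess" kurtosis; normal ~ 0
```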

and I didn’t even get to covering residuals…

In linear regression, one goal is to have the residuals of your model follow a normal distribution.

But I’ll leave off with this little nugget:

In applied statistics, the question is not whether the data/residuals are perfectly normal, but normal enough for the assumptions to hold.
