What does OLS assume re: the normal distribution?

Conditional vs. unconditional distributions of the outcome

Awareness that the linear regression model involves some kind of assumption regarding the normal distribution far exceeds awareness of what that assumption actually is.

First, sometimes people think that linear regression assumes that the outcome variable is normally distributed. Hooboy: this is wrong.

The assumption, such as it is, is that the error term is normally distributed.

We can write the linear regression model as:

\[ y_i = \mathbf{x}_i\boldsymbol{\beta} + \varepsilon_i \]

(Notice: We are not talking about the predicted outcome \(\hat{y}\) here. We are talking about the actual, observed value of our outcome \(y\).)

For observations that share the same values for all explanatory variables \(\mathbf{x}\), \(\mathbf{x}\boldsymbol{\beta}\) is a constant. So, according to the model, differences in \(y_i\) for observations that share the same values of \(\mathbf{x}\) are due to having different values of the error term.

As a result, if the errors are normally distributed, then conditional on \(\mathbf{x}\), the outcome variable is normally distributed. But a variable being normally distributed conditional on \(\mathbf{x}\) does not necessarily imply the variable is unconditionally normally distributed.

Say your outcome is height and you have a sample of mothers and their toddlers. The mothers’ heights may be normally distributed, and the toddlers’ heights may be normally distributed. Yet if you were to put the two together, the result would not be normally distributed: instead there would be a hump for mothers and a hump for toddlers and a big gap in between.
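A quick simulation makes the point concrete. The group means and spreads below are made up for illustration; each group is drawn from a normal distribution, but the pooled sample is a bimodal mixture that a normality test decisively rejects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical heights in cm (made-up means and SDs for illustration)
mothers = rng.normal(163, 6, 5000)   # normally distributed group
toddlers = rng.normal(88, 5, 5000)   # normally distributed group

# Pooling the two groups yields a bimodal mixture, not a normal distribution;
# the D'Agostino-Pearson test rejects normality for the pooled sample
pooled = np.concatenate([mothers, toddlers])
print(stats.normaltest(pooled).pvalue)  # essentially zero
```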

OLS as unbiased estimator of conditional mean

Even if the errors are not normally distributed, OLS remains an unbiased estimator of the conditional mean. In the sense of just what is needed to avoid having wrong or misleading coefficients, then, OLS does not involve an assumption about the shape of the distribution of the error term at all. For avoiding biased coefficients, the key assumptions about the error term in regression are that the population mean of the error term is 0 (both unconditionally and conditional on \(\mathbf{x}\)) and that the error term is uncorrelated with the explanatory variables.
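A minimal simulation sketch of this point: the errors below are drawn from a heavily skewed (centered exponential) distribution, far from normal, yet the OLS slope estimates average out to the true value across repeated samples. The true coefficients are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_sims = 200, 2000
beta0, beta1 = 2.0, 3.0  # true (made-up) coefficients

slopes = np.empty(n_sims)
for s in range(n_sims):
    x = rng.uniform(0, 10, n)
    # Heavily skewed errors: exponential shifted to have mean 0
    e = rng.exponential(2.0, n) - 2.0
    y = beta0 + beta1 * x + e
    # OLS via least squares on [1, x]
    X = np.column_stack([np.ones(n), x])
    slopes[s] = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(slopes.mean())  # close to 3: unbiased despite skewed errors
```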

At the same time, the assumption of normally distributed errors does undergird the calculation of standard errors in OLS. So violations of the assumption do mean that standard errors (and hence confidence intervals and p-values) may be incorrect. The problem is worse in small samples and with more severe deviations from normality.

However, due to the Central Limit Theorem, the consequence of non-normal errors per se decreases as sample size increases. This is related to why inferential statistics for OLS are officially calculated using the t-distribution rather than the normal distribution, and to how, as sample size increases, the t-distribution becomes closer and closer to the normal distribution.
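The convergence of the t-distribution to the normal is easy to see by comparing critical values at a few degrees of freedom:

```python
from scipy import stats

# 97.5th percentile (the two-sided 5% critical value) at various
# degrees of freedom: the t critical value shrinks toward the
# normal distribution's 1.96 as degrees of freedom grow
for df in [5, 30, 100, 1000]:
    print(df, round(stats.t.ppf(0.975, df), 3))

print("normal:", round(stats.norm.ppf(0.975), 3))  # about 1.96
```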

Homoskedasticity

Similar points may be made about a different assumption about the error term: that the variance of the error term is constant over \(\mathbf{x}\), also known as homoskedasticity. Again, violation of this assumption does not bias coefficients. OLS is a tough critter in this respect. But again it has implications for the standard errors.

Robust standard errors are also known as heteroskedasticity-consistent standard errors because they remain valid when the constant-variance assumption is violated.
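A sketch of the contrast, assuming a made-up data-generating process in which the error spread grows with \(x\). The conventional standard error uses a single residual variance; the robust (HC1, one common variant) sandwich estimator lets each observation's squared residual enter separately.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.uniform(0, 10, n)
# Heteroskedastic errors: standard deviation grows with x (made-up example)
e = rng.normal(0.0, 0.5 + 0.3 * x)
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
k = X.shape[1]

# Conventional OLS standard errors assume constant error variance
s2 = resid @ resid / (n - k)
se_conv = np.sqrt(np.diag(s2 * XtX_inv))

# HC1 sandwich estimator: (X'X)^-1 [sum e_i^2 x_i x_i'] (X'X)^-1 * n/(n-k)
meat = X.T @ (X * resid[:, None] ** 2)
se_hc1 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv) * n / (n - k))

# The two disagree for the slope when errors are heteroskedastic
print("conventional:", se_conv[1], "robust (HC1):", se_hc1[1])
```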

Even so, when the error term is normally distributed with constant variance, this has benefits beyond making the standard errors more straightforward to calculate. It offers a more thorough way of talking about the linear regression model as an actual model, by which I mean a model of the outcome-generating process for individual observations. That perspective also provides a better interface between the linear regression model and some of the other models we will subsequently consider.

Key points

  1. As long as the mean of the errors is 0, OLS estimates are unbiased regardless of the shape of the distribution of the error term.
  2. Normally distributed and homoskedastic errors are assumed in deriving the conventional standard errors produced by default from the OLS routine in software packages.
  3. Beyond the computation of standard errors, normally distributed errors make available a way of thinking about linear regression as a model of the data generating process that provides a better connection between it and some of the other models we will consider.