Least squares

OLS is the conventional method for estimating the parameters in the linear regression model. The LS in OLS stands for Least Squares, and the O just means that we are talking about the Ordinary form of least squares rather than a more elaborate variant.

Least squares is a criterion for comparing estimates of \(\beta\) and identifying the best one. Under least squares, the best estimates are the estimates that minimize the sum of the squared errors, or squared residuals. The error is the difference between the predicted value of the outcome (\(\hat{y}\)) given the estimates and the observed value of the outcome (\(y\)).

Sometimes this is talked about with the language of a cost function. The idea is to think of errors as being costly, and so the preferred estimates are those that minimize cost. OLS, then, involves defining the “cost” of an error as equal to the magnitude of the error squared.
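To make that concrete, here is a minimal sketch of such a cost function in R, written for a model with a single slope and no constant (the setup we use in the example below). The function name and arguments are placeholders, not from any package:

ols_cost <- function(beta_hat, x, y) {
  errors <- beta_hat * x - y  # predicted value minus observed value
  sum(errors^2)               # total cost: the sum of squared errors
}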

Example

To make the least squares criterion concrete, we are going to find the best \(\hat{\beta}\) ourselves, by hand, without asking the computer to fit the model. This is worth doing because it shows exactly what OLS is minimizing, and because later in the course we will encounter models that do not have a tidy analytic solution, where understanding an alternative approach, iterative estimation, becomes essential.

Is the critical appraisal of a movie associated with how much money it makes? We can look at data on 609 movies released between 2008 and 2012. Our explanatory variable \(\texttt{rtscore}\) is the film’s score on Rotten Tomatoes (0-100 scale, based on critics’ reviews). Our outcome \(\texttt{usgross}\) is US gross box-office revenue in millions of dollars.

For the purposes of this exercise, I will center both of these variables, meaning that I will subtract the mean from each variable, so that the new mean of the variable is 0. (More information on how to do this in Stata or R is at the end of this page.)

Centering a variable does not change its standard deviation. Centering the explanatory variable and outcome does not affect the slope of the ensuing regression line. What centering does do in this case is change the constant of the regression model so that it is 0.

We do this so that the regression model has only one parameter: the slope, or the change in revenue associated with a one-unit change in Rotten Tomatoes score.

Aside: Why does centering the variables make the constant zero? In OLS, the regression line always passes through the centroid of the data. In our case, the regression line must pass through the point where the mean of \(\texttt{rtscore}\) intersects with the mean of \(\texttt{usgross}\). If we center both variables, then the regression line must pass through the point at which \(x\) and \(y\) are both 0. The intercept in the regression model is the predicted value of \(y\) when \(x\) is 0, so if both variables are centered the constant is 0.
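One line of algebra makes the aside concrete: in simple regression with an intercept, the OLS estimate of the constant is

\[ \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} \]

so if both variables are centered, \(\bar{x} = \bar{y} = 0\) and the constant is 0.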

Our model is:

\[ \widehat{\texttt{usgross}}_i = \hat{\beta}\texttt{rtscore}_i \]

where \(\widehat{\texttt{usgross}}_i\) is our \(\hat{y}\).

So what is \(\hat{\beta}\)? This is what we will now figure out, by ourselves, without having the computer fit the model for us.

What we want is the \(\hat{\beta}\) that minimizes the sum of squared errors. For any candidate value of \(\hat{\beta}\), we can calculate the sum of squared errors.

\[ \sum_{i=1}^{N} (\hat{y}_i - y_i)^2 \]

For example, say the candidate value of \(\hat{\beta}\) that we want to try is .5. We could follow these steps:

  1. Calculate \(\hat{y}_i = .5 \times \texttt{rtscore}_i\) for every observation.
Code (load packages and data; center variables):
library(tidyverse)
library(tulaverse)
library(haven)

df <- read_dta("../dta/movieprofits.dta") %>%
  mutate(
    # scale(..., scale = FALSE) subtracts the mean; [,1] drops the
    # matrix structure that scale() returns
    usgross = scale(usgross, scale = FALSE)[,1],
    rtscore = scale(rtscore, scale = FALSE)[,1]
  )
df <- df %>%
  mutate(yhat = 0.5 * rtscore)
::: {.statacode}
. gen yhat = .5*rtscore
:::
  2. Calculate the error for each observation by taking the difference between \(\hat{y}_i\) and the observed outcome \(y_i\).
df <- df %>%
  mutate(error = yhat - usgross)
::: {.statacode}
. gen error = yhat - usgross
:::
  3. Square the errors for each observation.
df <- df %>%
  mutate(error_squared = error^2)
::: {.statacode}
. gen error_squared = error * error
:::
  4. Add up all the squared errors.
sum(df$error_squared)
[1] 3804829
::: {.statacode}
. quietly summarize error_squared
. display r(sum) // after summarize, provides total sum of variable
3804829.5
:::

The sum of the squared errors is about 3804830. Let's try \(\hat{\beta} = 1\) instead:

df <- df %>%
  mutate(
    yhat = 1 * rtscore,
    error = yhat - usgross,
    error_squared = error^2
  )

sum(df$error_squared)
[1] 3659126
::: {.statacode}
. drop yhat error error_squared

. gen yhat = 1*rtscore

. gen error = yhat - usgross

. gen error_squared = error * error

. quietly summarize error_squared

. display r(sum) // after summarize, provides total sum of variable
3659125.9
:::

The sum of the squared errors is now 3659126, which is smaller than the previous sum. So, by the least squares criterion, \(\hat{\beta} = 1\) is a better estimate than \(\hat{\beta} = .5\).

We can try a bunch of different guesses and plot the results. Here is what we get for candidate values between 0 and 2:

R code that draws the plot:
betas <- seq(0, 2, by = 0.005)
sse <- sapply(betas, function(b) sum((b * df$rtscore - df$usgross)^2))
sse_df <- data.frame(beta = betas, sse = sse)

ggplot(sse_df, aes(x = beta, y = sse)) +
  geom_hline(yintercept = min(sse), color = "gray70", linewidth = 0.7) +
  geom_line(color = "#8B1538", linewidth = 1) +
  scale_x_continuous(breaks = seq(0, 2, 0.2)) +
  scale_y_continuous(labels = function(x) paste0(round(x / 1e6, 1), "M")) +
  labs(x = "Candidate estimate of beta", y = "Sum of squared errors") +
  theme_tula()

The lowest sums of squares look like they are between 1 and 1.2. So we can plot guesses in this range and look again:

R code that draws the plot:
betas2 <- seq(1, 1.2, by = 0.001)
sse2 <- sapply(betas2, function(b) sum((b * df$rtscore - df$usgross)^2))
sse_df2 <- data.frame(beta = betas2, sse = sse2)

ggplot(sse_df2, aes(x = beta, y = sse)) +
  geom_hline(yintercept = min(sse2), color = "gray70", linewidth = 0.7) +
  geom_line(color = "#8B1538", linewidth = 1) +
  scale_x_continuous(breaks = seq(1, 1.2, 0.05)) +
  scale_y_continuous(labels = function(x) paste0(format(round(x / 1e6, 3), nsmall = 3), "M")) +
  labs(x = "Candidate estimate of beta", y = "Sum of squared errors") +
  theme_tula()

The value of \(\hat{\beta}\) corresponding to the lowest sum of squares appears to be just over 1.1. Now let's have the computer fit the model and confirm that our answer matches the OLS result:

# Reload to get fresh centered data
df <- read_dta("../dta/movieprofits.dta") %>%
  mutate(
    usgross = scale(usgross, scale = FALSE)[,1],
    rtscore = scale(rtscore, scale = FALSE)[,1]
  )

model <- lm(usgross ~ rtscore, data = df)
tula(model)
AIC = 7032.303                                         Number of obs =    609
BIC = 7045.538                                         R-squared     =   0.12
                                                       Adj R-squared = 0.1186
                                                       Root MSE      =  77.59
─────────────────────────────────────────────────────────────────────────────
usgross     │      Coef  Std. Err.          t     P>|t|  [95% Conf  Interval]
─────────────────────────────────────────────────────────────────────────────
rtscore     │     1.111      .1221      9.099    <.0001      .8708       1.35
(Intercept) │  6.97e-15      3.144  2.217e-15    1.0000     -6.175      6.175
─────────────────────────────────────────────────────────────────────────────
::: {.statacode}
. regress usgross rtscore

      Source |       SS           df       MS      Number of obs   =       609
-------------+----------------------------------   F(1, 607)       =     82.79
       Model |   498405.65         1   498405.65   Prob > F        =    0.0000
    Residual |  3654187.53       607  6020.07831   R-squared       =    0.1200
-------------+----------------------------------   Adj R-squared   =    0.1186
       Total |  4152593.18       608    6829.923   Root MSE        =    77.589

------------------------------------------------------------------------------
     usgross |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     rtscore |   1.110545   .1220522     9.10   0.000     .8708488    1.350241
       _cons |   2.09e-07   3.144072     0.00   1.000    -6.174579    6.174579
------------------------------------------------------------------------------
:::

A one-point increase in a film’s Rotten Tomatoes score is associated with a $1.11 million increase in US gross box office revenue.

Analytic and iterative solutions

When you fit an OLS model in a stats package, the package does not follow the process we used above to find \(\hat{\beta}\), because OLS has an analytic solution. Instead, it solves directly for the \(\hat{\boldsymbol{\beta}}\) that minimizes the sum of squared errors using the following matrix algebra operation:

\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{y}) \]

If you take a course on OLS regression taught in matrix algebra, this formula is seared into your brain forever. I am sure there are folks out there with it tattooed.
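As a sanity check, we can carry out that operation directly in R, assuming \(\texttt{df}\) still holds the centered data from above. With no constant in the model, the design matrix is just the \(\texttt{rtscore}\) column:

X <- cbind(df$rtscore)              # design matrix: one column, no constant
y <- df$usgross
solve(t(X) %*% X) %*% (t(X) %*% y)  # (X'X)^{-1} (X'y); about 1.11, matching lm()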

But, as we showed with our simple example, you don't need the analytic solution to figure out the least squares \(\hat{\beta}\) to whatever level of precision you care about. Because least squares tells you how to compare candidate values of \(\hat{\beta}\) and decide which is better, you can find the answer iteratively: try out different estimates, compare them using the criterion, and progressively zero in on the best one.

What's great about this is that other models involve estimation problems that have no analytic solution but can be solved iteratively. The way that statistics packages find solutions iteratively is not as clunky as what we did above; there are methods for finding the optimum with far fewer (and much smarter) guesses.
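For a one-parameter problem like ours, R's built-in \(\texttt{optimize()}\) function does exactly this kind of smarter search. A sketch, again assuming the centered \(\texttt{df}\) from above:

# Search candidate slopes between 0 and 2 for the one that minimizes
# the sum of squared errors
optimize(function(b) sum((b * df$rtscore - df$usgross)^2), interval = c(0, 2))

The \(\texttt{\$minimum}\) element of the result should land at roughly 1.11, matching the OLS estimate.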

Bonus: How to center a variable

In Stata, if you summarize a variable, then it will temporarily store the mean of that variable as \(\texttt{r(mean)}\). You can use that to center a variable, like this:

. summarize usgross rtscore

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     usgross |        609    75.81977    82.64335   2.778752   760.5076
     rtscore |        609    47.19212    25.78123          0         99

. quietly summarize usgross // suppresses output

. replace usgross = usgross - r(mean)
(609 real changes made)

. quietly summarize rtscore // suppresses output

. replace rtscore = rtscore - r(mean)
variable rtscore was byte now float
(609 real changes made)

. summarize usgross rtscore

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
     usgross |        609   -3.67e-08    82.64335  -73.04102   684.6879
     rtscore |        609   -2.22e-07    25.78123  -47.19212   51.80788

More generally, Stata has various commands that write results to \(\texttt{r()}\). Those results stay there until another command writes to \(\texttt{r()}\), at which point the previous contents are erased. At any point, you can see what is currently stored in \(\texttt{r()}\) by typing \(\texttt{return list}\).

Here are a couple of ways of centering a variable in R. One uses the \(\texttt{scale()}\) function and the other uses the \(\texttt{mean()}\) function:

# using scale()
df$usgross <- scale(df$usgross, scale = FALSE)[,1]

# using mean()
df$rtscore <- df$rtscore - mean(df$rtscore)

With R's \(\texttt{scale()}\) function, we need to specify \(\texttt{scale = FALSE}\) because the default, \(\texttt{scale = TRUE}\), standardizes the variable: it not only centers it but also rescales it so that its standard deviation is 1.
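A quick check of the claims above, assuming \(\texttt{df\$usgross}\) is a fresh, uncentered copy of the variable:

sd(df$usgross)                     # SD of the original variable
sd(df$usgross - mean(df$usgross))  # after centering: SD is unchanged
sd(scale(df$usgross)[, 1])         # after standardizing: SD is 1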