Least squares, the standard deviation, and the RMSE

The standard deviation

Tables of descriptive statistics usually provide the standard deviation alongside means. The standard deviation is the most commonly used measure of how much dispersion (or variation) there is in a variable–how spread out the values are.

If we look at the summary statistics for adult height in the NHANES data, we can see that the mean height is 65.88 inches and that the standard deviation is just under 4 inches:

. sum height

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
      height |     38,495    65.88727    3.996099   48.54321   80.51165
library(tidyverse)
library(haven)
library(tulaverse)
df <- read_dta("../dta/nhanes_bodymeasures.dta") %>%
    filter(age >= 18) %>%
    drop_na(height)

# Summary statistics for height
df %>% summarise(
    n = n(),
    mean = mean(height),
    sd = sd(height),
    min = min(height),
    max = max(height)
)
# A tibble: 1 × 5
      n  mean    sd   min   max
  <int> <dbl> <dbl> <dbl> <dbl>
1 38495  65.9  4.00  48.5  80.5

Elsewhere we have discussed how the mean of a variable is connected to minimizing least squares. Specifically, the mean is the value of \(c\) that minimizes \(\sum_{i=1}^{N}\left[(y_i - c)^2\right]\).

For the standard deviation, we can return to this way of thinking about the mean as minimizing least squares.

We would get the same answer–the mean–if instead of defining it as minimizing the sum of the squared differences

\[\sum_{i=1}^{N}\left[(y_i - c)^2\right]\text{,}\]

we defined it as the value of \(c\) that minimized the mean of the squared differences \[\frac{\sum_{i=1}^{N}\left[(y_i - c)^2\right]}{N}\text{.}\]

When \(c\) is the mean \(\mu\), this mean of the squared differences is called the variance: \[\text{Variance}(y) = \frac{\sum_{i=1}^{N}\left[(y_i - \mu)^2\right]}{N}\text{.}\]

We would also get the same answer if we minimized the square root of the mean of the squared differences, that is, the square root of the variance. This quantity is the standard deviation.

\[ \begin{align} \text{Standard deviation}(y) &= \sqrt{\text{Variance}(y)} \\ \\ &= \sqrt{\frac{\sum_{i=1}^{N}\left[(y_i - \mu)^2\right]}{N}}\text{.} \end{align} \]
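These formulas are easy to check directly. Here is a minimal base R sketch using a handful of made-up height values. (One caution: R's built-in \(\texttt{var()}\) and \(\texttt{sd()}\) divide by \(N - 1\) rather than \(N\), so for small samples they differ slightly from the population formulas above.)

```r
# Variance and standard deviation computed from the formulas above
y  <- c(61, 64, 65, 68, 70)   # a few made-up height values
mu <- mean(y)

variance <- sum((y - mu)^2) / length(y)   # mean of squared differences
std_dev  <- sqrt(variance)                # square root of the variance

variance   # 9.84
std_dev    # about 3.14

# R's built-in var() and sd() divide by N - 1 instead of N,
# so they give slightly larger values for small samples
var(y)     # 12.3
sd(y)      # about 3.51
```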

For a single variable, the mean minimizes the sum of the squared differences, and by doing so it also minimizes the standard deviation. That is, if you inserted any value for \(c\) other than the mean into \(\sqrt{\frac{\sum_{i=1}^{N}(y_i - c)^2}{N}}\), you would get a larger result than you do when you use the mean.
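You can verify this claim numerically. A quick base R check, trying a grid of candidate values of \(c\) around the mean of a few made-up values:

```r
# Check that the mean minimizes the sum of squared differences
y <- c(61, 64, 65, 68, 70)    # made-up values; mean(y) is 65.6

sum_sq <- function(c) sum((y - c)^2)

candidates <- seq(60, 70, by = 0.1)
results <- sapply(candidates, sum_sq)

candidates[which.min(results)]   # 65.6, the mean
sum_sq(mean(y))                  # 49.2
sum_sq(65)                       # 51, larger than at the mean
```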

As a notational matter, you will often see:

  • \(\mu\) (mu) used to refer to the population mean
  • \(\sigma\) (sigma) used to refer to the population standard deviation
  • \(\sigma^2\) used to refer to the population variance.

Aside: When people have to describe the standard deviation for a non-professional audience, they will often fudge and call it the average distance between an observation and the mean. This is not correct, and sometimes not even close to correct.

The average distance between an observation and the mean is called the mean absolute deviation. The mean absolute deviation is easy enough to calculate and present, and the world of descriptive statistics in tables would probably be more straightforward and accessible if that were the usual practice.
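Computing the mean absolute deviation by hand is a one-liner in R. (One caution: the built-in \(\texttt{mad()}\) function is the median absolute deviation, a different and rescaled quantity, so it is clearer to write the calculation out.)

```r
# Mean absolute deviation: the average distance from the mean
y <- c(61, 64, 65, 68, 70)               # made-up values; mean is 65.6

mean_abs_dev <- mean(abs(y - mean(y)))
mean_abs_dev                             # 2.72

# Compare with the standard deviation of the same values
sqrt(mean((y - mean(y))^2))              # about 3.14, noticeably larger
```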

Root mean squared error

Above we saw that the standard deviation of adult height in the NHANES data was just shy of 4 inches.

Now, we will fit a regression model with no explanatory variables. Remember, when we do this, the estimate of the constant is the mean of our outcome–in this case, 65.89 inches.

. regress height

      Source |       SS           df       MS      Number of obs   =    38,495
-------------+----------------------------------   F(0, 38494)     =      0.00
       Model |           0         0           .   Prob > F        =         .
    Residual |  614703.342    38,494  15.9688092   R-squared       =    0.0000
-------------+----------------------------------   Adj R-squared   =    0.0000
       Total |  614703.342    38,494  15.9688092   Root MSE        =   3.9961

------------------------------------------------------------------------------
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   65.88727  .0203673  3234.95   0.000     65.84735    65.92719
------------------------------------------------------------------------------
# Model with no explanatory variables
model <- lm(height ~ 1, data = df)

# show results
tula(model)
AIC = 215902.764                                        Number of obs = 38495
BIC = 215919.881                                        R-squared     =     0
                                                        Adj R-squared =     0
                                                        Root MSE      = 3.996
─────────────────────────────────────────────────────────────────────────────
height      │      Coef  Std. Err.          t     P>|t|  [95% Conf  Interval]
─────────────────────────────────────────────────────────────────────────────
(Intercept) │     65.89     .02037       3235    <.0001      65.85      65.93
─────────────────────────────────────────────────────────────────────────────
# Extract RMSE and R-squared
model_stats <- summary(model)
model_stats$sigma      # RMSE
model_stats$r.squared  # R-squared

If you look at the Stata regression output, you may notice that a value equal to the standard deviation–just under 4 inches–also appears as the “Root MSE.”

Root MSE stands for the root mean squared error, sometimes also called the standard error of the regression (as opposed to the standard errors of individual coefficients) or the \(\sigma\) (sigma) of the regression. The root MSE is also elsewhere regularly abbreviated as RMSE, and I will do that as well.

The RMSE is the standard deviation of the errors in a regression model: that is, the standard deviation of the distribution of \(y_i - \hat{y}_i\) over all the observations in our data. When there are no explanatory variables in our model, this is just the same as the standard deviation of the variable itself.

Let’s show this in Stata just to confirm. In the output below, I:

  1. take the results from the regression model
  2. generate predicted values of the outcome (as new variable \(\texttt{yhat}\))
  3. use these to generate values for the error for each observation (as new variable \(\texttt{error}\))
  4. compute the standard deviation of the errors.
(The commands below use the results from the regression model fit above.)

. predict yhat // creates variable 'yhat' with y-hat values
(option xb assumed; fitted values)

. gen error = yhat - height // creates variable with value of errors
(2,748 missing values generated)

. sum error // summary statistics for errors

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       error |     38,495   -2.70e-07    3.996099  -14.62438   17.34406
# Add predicted values and calculate errors
model <- lm(height ~ 1, data = df)
df_with_pred <- df %>%
    mutate(
        yhat = predict(model),
        error = yhat - height
    )

# Summary statistics of errors
df_with_pred %>% summarise(
    n = n(),
    mean_error = mean(error, na.rm = TRUE),
    sd_error = sd(error, na.rm = TRUE),
    min_error = min(error, na.rm = TRUE),
    max_error = max(error, na.rm = TRUE)
)
# A tibble: 1 × 5
      n mean_error sd_error min_error max_error
  <int>      <dbl>    <dbl>     <dbl>     <dbl>
1 38495   1.64e-15     4.00     -14.6      17.3

The standard deviation of the errors is 3.996, which is the same as the standard deviation of height itself. Again, this is only true when we have no explanatory variables in our model.

What happens when we do have explanatory variables in our model? Let’s add sex to our model:

. regress height male

      Source |       SS           df       MS      Number of obs   =    38,495
-------------+----------------------------------   F(1, 38493)     =  31426.28
       Model |  276287.741         1  276287.741   Prob > F        =    0.0000
    Residual |  338415.601    38,493  8.79161409   R-squared       =    0.4495
-------------+----------------------------------   Adj R-squared   =    0.4495
       Total |  614703.342    38,494  15.9688092   Root MSE        =    2.9651

------------------------------------------------------------------------------
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   5.360851   .0302404   177.27   0.000     5.301579    5.420123
       _cons |   63.29312   .0210362  3008.77   0.000     63.25188    63.33435
------------------------------------------------------------------------------
# Model with male as explanatory variable
model_male <- lm(height ~ male, data = df)

# Summary statistics
tula(model_male)
AIC = 192928.444                                       Number of obs =  38495
BIC = 192954.119                                       R-squared     = 0.4495
                                                       Adj R-squared = 0.4495
                                                       Root MSE      =  2.965
─────────────────────────────────────────────────────────────────────────────
height      │      Coef  Std. Err.          t     P>|t|  [95% Conf  Interval]
─────────────────────────────────────────────────────────────────────────────
male        │     5.361     .03024      177.3    <.0001      5.302       5.42
(Intercept) │     63.29     .02104       3009    <.0001      63.25      63.33
─────────────────────────────────────────────────────────────────────────────
# Extract RMSE and R-squared
model_male_stats <- summary(model_male)
model_male_stats$sigma      # RMSE
model_male_stats$r.squared  # R-squared

Let’s confirm that the RMSE is still the standard deviation of our errors by running the same code as before:

. predict yhat // creates variable 'yhat' with y-hat values
(option xb assumed; fitted values)

. gen error = yhat - height // creates variable with value of errors
(2,748 missing values generated)

. sum error // summary statistics for errors

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       error |     38,495    6.87e-07    2.965027  -13.16343   17.31549
# Add predicted values and calculate errors for male model
df_male_pred <- df %>%
    mutate(
        yhat = predict(model_male),
        error = yhat - height
    )

# Summary statistics of errors
df_male_pred %>% summarise(
    n = n(),
    mean_error = mean(error, na.rm = TRUE),
    sd_error = sd(error, na.rm = TRUE),
    min_error = min(error, na.rm = TRUE),
    max_error = max(error, na.rm = TRUE)
)
# A tibble: 1 × 5
      n mean_error sd_error min_error max_error
  <int>      <dbl>    <dbl>     <dbl>     <dbl>
1 38495  -4.47e-15     2.97     -13.2      17.3

Yes, then, the RMSE remains the standard deviation of the errors in our regression model.

With \(\texttt{male}\) added to the model, the RMSE is now 2.965, whereas it was 3.996 when there were no explanatory variables. The RMSE has gotten smaller.

Again, least squares estimates minimize the sum of the squared errors, which means they also minimize the square root of the mean of the squared errors. All of which means that our OLS estimates are those estimates that yield predicted values for which the RMSE is as small as possible.

Whenever we add explanatory variables, the RMSE will shrink (or at least not grow), and how much it shrinks indicates how much of the variance in our outcome is accounted for by the explanatory variables.

  • In the worst possible scenario, we have the model with no explanatory variables whatsoever, and the RMSE is the standard deviation of the outcome.
  • In the best possible scenario, we would predict every value of the outcome perfectly, meaning that the error is 0 for all observations, and if all the errors are 0 the RMSE would be 0.
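A quick sketch of the two extremes, again with made-up values:

```r
y <- c(61, 64, 65, 68, 70)          # made-up outcome values

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

# Worst case: predict the mean for everyone (no explanatory variables)
rmse(y, rep(mean(y), length(y)))    # about 3.14, the SD of y

# Best case: hypothetical perfect predictions, so every error is 0
rmse(y, y)                          # 0
```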

You’ve been taught about the \(R^2\) of a regression model before, which is often described as a proportion of the variance in the outcome explained by an OLS regression model.

The \(R^2\) is just a measure of how much smaller the Root MSE is than the standard deviation of the outcome.

\[ R^2 = 1 - \left(\frac{\text{RMSE}}{\text{Standard deviation of y}}\right)^2 \]

(The reason we square the ratio here is because \(R^2\) is a measure of the proportion of variance explained, and the variance is the square of the standard deviation.)
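You can confirm this relationship directly in R. Here is a sketch on simulated data (the simulation and its variable names are made up for illustration). Using \(N\) in both denominators makes the ratio exactly the sum of squared residuals over the total sum of squares, so the result matches the \(R^2\) that \(\texttt{lm()}\) reports; the software-reported Root MSE and \(\texttt{sd()}\) use degrees-of-freedom adjustments, so the ratio of those two matches only up to a small correction.

```r
# Confirm R^2 = 1 - (variance of errors / variance of outcome)
set.seed(1)
male   <- rbinom(500, 1, 0.5)                 # simulated sex indicator
height <- 63 + 5 * male + rnorm(500, sd = 3)  # simulated heights

model <- lm(height ~ male)

mse_errors <- mean(resid(model)^2)            # mean squared error (dividing by N)
var_y      <- mean((height - mean(height))^2) # variance of outcome (dividing by N)

1 - mse_errors / var_y        # identical to summary(model)$r.squared
summary(model)$r.squared
```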

In our example:

\[ R^2 = 1 - \left(\frac{2.965}{3.996}\right)^2 = 1 - (.742)^2 = 1 - .550 = .450 \]

This result, .450, matches the \(R^2\) presented in the regression output above.

Key points

  1. The standard deviation and the variance are measures of how much dispersion there is in our outcome. The standard deviation is just the square root of the variance.
  2. For a single variable, the mean may be defined as the value of \(c\) that minimizes the sum of the squared differences between \(c\) and each value of the variable. The mean of those squared differences is the variance, and the square root of that mean is the standard deviation.
  3. The root mean squared error (RMSE) is the standard deviation of the residuals in a regression model.
  4. The root mean squared error shrinks as new explanatory variables are added to a regression model.
  5. The \(R^2\) of the linear regression model indicates how much better the RMSE of a model is than the worst possible RMSE (which equals the standard deviation of the outcome).