Expand for code that loads packages
library(tidyverse)
library(haven)
library(tulaverse)

Tables of descriptive statistics usually provide the standard deviation alongside means. The standard deviation is the most commonly used measure of how much dispersion (or variation) there is in a variable–how spread out the values are.
If we look at the summary statistics for adult height in the NHANES data, we can see that the mean height is 65.89 inches and that the standard deviation is just under 4 inches:
. sum height
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
height | 38,495 65.88727 3.996099 48.54321 80.51165
# A tibble: 1 × 5
n mean sd min max
<int> <dbl> <dbl> <dbl> <dbl>
1 38495 65.9 4.00 48.5 80.5
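A summary like the one above can be produced with a dplyr pipeline along these lines. This is a sketch: since the data-loading code is collapsed above, it uses a simulated stand-in for the NHANES data frame `df` with a `height` column.

```r
library(dplyr)

# Simulated stand-in for the NHANES height data (the real data would be
# loaded with haven, per the package-loading code above)
set.seed(1)
df <- data.frame(height = rnorm(1000, mean = 65.9, sd = 4))

stats <- df %>%
  summarise(
    n    = n(),
    mean = mean(height, na.rm = TRUE),
    sd   = sd(height, na.rm = TRUE),
    min  = min(height, na.rm = TRUE),
    max  = max(height, na.rm = TRUE)
  )
stats
```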
Elsewhere we have discussed how the mean of a variable is connected to minimizing least squares. Specifically, the mean is the value of \(c\) that minimizes \(\sum_{i=1}^{N}(y_i - c)^2\).
For the standard deviation, we can return to this way of thinking about the mean as minimizing least squares.
We would get the same answer–the mean–if instead of defining it as minimizing the sum of the squared differences
\[\sum_{i=1}^{N}\left[(y_i - c)^2\right]\text{,}\]
we defined it as the value of \(c\) that minimized the mean of the squared differences \[\frac{\sum_{i=1}^{N}\left[(y_i - c)^2\right]}{N}\text{.}\]
When we plug in the minimizing value–the mean, \(\mu\)–the result is called the variance: \[\text{Variance}(y) = \frac{\sum_{i=1}^{N}\left[(y_i - \mu)^2\right]}{N}\text{.}\]
We would also get the same answer if we minimized the square root of the mean of the squared differences, that is, the square root of the variance. This quantity is the standard deviation.
\[ \begin{align} \text{Standard deviation}(y) &= \sqrt{\text{Variance}(y)} \\ \\ &= \sqrt{\frac{\sum_{i=1}^{N}\left[(y_i - \mu)^2\right]}{N}}\text{.} \end{align} \]
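As a quick check of these formulas, here is a base-R sketch with made-up values standing in for a variable like height. One caveat worth knowing: R's built-in `sd()` divides by \(N - 1\) rather than \(N\), so it runs slightly larger than the formula above; with tens of thousands of observations the difference is negligible.

```r
# Hypothetical values standing in for a variable like height
y <- c(62, 65, 65, 68, 70, 71, 74)

# Variance and standard deviation following the formulas above (divide by N)
variance_y <- mean((y - mean(y))^2)
sd_y <- sqrt(variance_y)

# R's built-in sd() divides by N - 1 instead of N, so it is slightly
# larger; with large N the two versions are nearly identical
sd(y)
```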
For a single variable, the mean minimizes the sum of the squared differences, and by doing so it also minimizes the standard deviation. That is, if you computed \(\sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - c)^2}\) inserting any value for \(c\) other than the mean, you would get a larger result than you do when you use the mean.
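We can verify this minimization claim numerically. The sketch below (again with made-up values) uses `optimize()` to search for the value of \(c\) that minimizes the sum of squared differences; the search lands on the mean.

```r
# Hypothetical values standing in for a variable like height
y <- c(62, 65, 65, 68, 70, 71, 74)

# Sum of squared differences from a candidate value c
ssq <- function(c) sum((y - c)^2)

# Numerically search for the minimizing c within the range of the data
best_c <- optimize(ssq, interval = range(y))$minimum

best_c   # agrees with mean(y), up to the optimizer's tolerance
mean(y)
```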
As a notational matter, you will often see the standard deviation written as \(\sigma\) for a population and \(s\) for a sample.
Aside: When people have to describe what the standard deviation is for a non-professional audience, they will often fudge and describe the standard deviation as being the average distance between an observation and the mean, which is not correct and sometimes not even close to correct.
The average distance between an observation and the mean is called the mean absolute deviation. The mean absolute deviation is easy enough to calculate and present. The world of descriptive statistics in tables would probably be more straightforward and accessible if that were the usual practice.
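To make the distinction concrete, here is a small sketch comparing the two quantities on made-up values. The standard deviation is always at least as large as the mean absolute deviation, and the gap tends to widen when the data contain extreme values, which is why the "average distance" description can be far off.

```r
# Hypothetical values standing in for a variable like height
y <- c(62, 65, 65, 68, 70, 71, 74)

# Mean absolute deviation: the average distance from the mean
mad_y <- mean(abs(y - mean(y)))

# Standard deviation (dividing by N, per the formula above)
sd_y <- sqrt(mean((y - mean(y))^2))

mad_y
sd_y   # always >= mad_y
```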
Above we saw that the standard deviation of adult height in the NHANES data was just shy of 4 inches.
Now, we will fit a regression model with no explanatory variables. Remember, when we do this, the estimate of the constant is the mean of our outcome–in this case, 65.89 inches.
. regress height
Source | SS df MS Number of obs = 38,495
-------------+---------------------------------- F(0, 38494) = 0.00
Model | 0 0 . Prob > F = .
Residual | 614703.342 38,494 15.9688092 R-squared = 0.0000
-------------+---------------------------------- Adj R-squared = 0.0000
Total | 614703.342 38,494 15.9688092 Root MSE = 3.9961
------------------------------------------------------------------------------
height | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_cons | 65.88727 .0203673 3234.95 0.000 65.84735 65.92719
------------------------------------------------------------------------------
AIC = 215902.764 Number of obs = 38495
BIC = 215919.881 R-squared = 0
Adj R-squared = 0
Root MSE = 3.996
─────────────────────────────────────────────────────────────────────────────
height │ Coef Std. Err. t P>|t| [95% Conf Interval]
─────────────────────────────────────────────────────────────────────────────
(Intercept) │ 65.89 .02037 3235 <.0001 65.85 65.93
─────────────────────────────────────────────────────────────────────────────
If we look at the regression output, you may notice that the value equal to the standard deviation–just under 4 inches–also appears as the “Root MSE”.
Root MSE stands for the root mean squared error, sometimes also called the standard error of the regression (as opposed to the standard errors of individual coefficients) or the \(\sigma\) (sigma) of the regression. The root MSE is regularly abbreviated as RMSE, and I will do that as well.
The RMSE is the standard deviation of the errors in a regression model: that is, the standard deviation of the distribution of \(y_i - \hat{y}_i\) over all the observations in our data. When there are no explanatory variables in our model, this is the same as the standard deviation of the variable itself.
Let’s show this just to confirm. In the output below, I generate the predicted values, calculate the errors, and then summarize them:
:: uses results from regression model fit above ::
. predict yhat // creates variable 'yhat' with y-hat values
(option xb assumed; fitted values)
. gen error = yhat - height // creates variable with value of errors
(2,748 missing values generated)
. sum error // summary statistics for errors
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
error | 38,495 -2.70e-07 3.996099 -14.62438 17.34406
# Add predicted values and calculate errors
model <- lm(height ~ 1, data = df)
df_with_pred <- df %>%
mutate(
yhat = predict(model),
error = yhat - height
)
# Summary statistics of errors
df_with_pred %>% summarise(
n = n(),
mean_error = mean(error, na.rm = TRUE),
sd_error = sd(error, na.rm = TRUE),
min_error = min(error, na.rm = TRUE),
max_error = max(error, na.rm = TRUE)
)

# A tibble: 1 × 5
n mean_error sd_error min_error max_error
<int> <dbl> <dbl> <dbl> <dbl>
1 38495 1.64e-15 4.00 -14.6 17.3
The standard deviation of the errors is 3.996, which is the same as the standard deviation of height itself. Again, this is only true when we have no explanatory variables in our model.
What happens when we do have explanatory variables in our model? Let’s add sex to our model:
. regress height male
Source | SS df MS Number of obs = 38,495
-------------+---------------------------------- F(1, 38493) = 31426.28
Model | 276287.741 1 276287.741 Prob > F = 0.0000
Residual | 338415.601 38,493 8.79161409 R-squared = 0.4495
-------------+---------------------------------- Adj R-squared = 0.4495
Total | 614703.342 38,494 15.9688092 Root MSE = 2.9651
------------------------------------------------------------------------------
height | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
male | 5.360851 .0302404 177.27 0.000 5.301579 5.420123
_cons | 63.29312 .0210362 3008.77 0.000 63.25188 63.33435
------------------------------------------------------------------------------
AIC = 192928.444 Number of obs = 38495
BIC = 192954.119 R-squared = 0.4495
Adj R-squared = 0.4495
Root MSE = 2.965
─────────────────────────────────────────────────────────────────────────────
height │ Coef Std. Err. t P>|t| [95% Conf Interval]
─────────────────────────────────────────────────────────────────────────────
male │ 5.361 .03024 177.3 <.0001 5.302 5.42
(Intercept) │ 63.29 .02104 3009 <.0001 63.25 63.33
─────────────────────────────────────────────────────────────────────────────
Let’s confirm that the RMSE is still the standard deviation of our errors by running the same code as before:
. predict yhat // creates variable 'yhat' with y-hat values
(option xb assumed; fitted values)
. gen error = yhat - height // creates variable with value of errors
(2,748 missing values generated)
. sum error // summary statistics for errors
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
error | 38,495 6.87e-07 2.965027 -13.16343 17.31549
# Fit the model with male as an explanatory variable, then
# add predicted values and calculate errors
model_male <- lm(height ~ male, data = df)

df_male_pred <- df %>%
mutate(
yhat = predict(model_male),
error = yhat - height
)
# Summary statistics of errors
df_male_pred %>% summarise(
n = n(),
mean_error = mean(error, na.rm = TRUE),
sd_error = sd(error, na.rm = TRUE),
min_error = min(error, na.rm = TRUE),
max_error = max(error, na.rm = TRUE)
)

# A tibble: 1 × 5
n mean_error sd_error min_error max_error
<int> <dbl> <dbl> <dbl> <dbl>
1 38495 -4.47e-15 2.97 -13.2 17.3
Yes, then, the RMSE remains the standard deviation of the errors in our regression model.
With \(\texttt{male}\) added to the model, the RMSE is now 2.965, whereas it was 3.996 when there were no explanatory variables. The RMSE has gotten smaller.
Again, least squares estimates minimize the sum of the squared errors, which means they also minimize the square root of the mean of the squared errors. All of which means that our OLS estimates are those estimates that yield predicted values for which the RMSE is as small as possible.
Whenever we add explanatory variables, the RMSE will shrink (or, at worst, stay the same), and how much it shrinks indicates how much of the variance in our outcome is accounted for by the explanatory variables.
You’ve been taught about the \(R^2\) of a regression model before, which is often described as a proportion of the variance in the outcome explained by an OLS regression model.
The \(R^2\) is just a measure of how much smaller the RMSE is than the standard deviation of the outcome.
\[ R^2 = 1 - \left(\frac{\text{RMSE}}{\text{Standard deviation of y}}\right)^2 \]
(The reason we square the ratio here is because \(R^2\) is a measure of the proportion of variance explained, and the variance is the square of the standard deviation.)
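We can verify this relationship in R. The sketch below uses simulated data as a stand-in for the NHANES variables; when the RMSE and the standard deviation are both computed with \(N\) in the denominator, the identity holds exactly. (Stata's reported Root MSE divides by \(N - k\) instead, which is why hand calculations from rounded output come out only approximately.)

```r
set.seed(42)

# Simulated stand-in: height depends on a binary male indicator
df <- data.frame(male = rbinom(1000, 1, 0.5))
df$height <- 63.3 + 5.4 * df$male + rnorm(1000, mean = 0, sd = 3)

fit <- lm(height ~ male, data = df)

# RMSE and standard deviation, both computed with N in the denominator
rmse <- sqrt(mean(resid(fit)^2))
sd_y <- sqrt(mean((df$height - mean(df$height))^2))

r2_from_ratio <- 1 - (rmse / sd_y)^2
summary(fit)$r.squared   # identical to r2_from_ratio
```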
In our example:
\[ R^2 = 1 - \left(\frac{2.965}{3.996}\right)^2 = 1 - (.742)^2 = 1 - .550 = .450 \]
This result, .450, matches the \(R^2\) presented in our output above.
Key points