Predicting the outcome in OLS

Here we will review how to compute the predicted value of the outcome in OLS given a set of estimates for our coefficients.

Using the NHANES data from a sample of US children and adolescents (ages 3-18), we fit a linear regression model in which our outcome is height (in inches) and our explanatory variables are sex (as binary), age (in years), and race/ethnicity (operationalized here as five mutually exclusive categories).

Load packages and data:
# dependencies
library(tidyverse)
library(haven)
library(tulaverse)

# keep only ages 3-18 and convert race/ethnicity to a factor
df <- read_dta("../dta/nhanes_bodymeasures.dta") %>% 
  filter(age >= 3 & age <= 18) %>% 
  mutate(raceeth5cat = as_factor(raceeth5cat))

# fit the model; ref = TRUE shows the base (reference) levels,
# like Stata's allbaselevels option
model <- lm(height ~ male + age + raceeth5cat, data = df)
tula(model, ref = TRUE)
AIC = 124991.034                                        Number of obs =  23467
BIC = 125055.541                                        R-squared     = 0.8705
                                                        Adj R-squared = 0.8704
                                                        Root MSE      =  3.469
──────────────────────────────────────────────────────────────────────────────
height       │      Coef  Std. Err.          t     P>|t|  [95% Conf  Interval]
──────────────────────────────────────────────────────────────────────────────
male         │     1.675      .0453      36.97    <.0001      1.586      1.763
age          │      1.93    .004901      393.8    <.0001       1.92       1.94
raceeth5cat  │                                                                
  Oth-Hisp   │     .2111     .09516      2.219     .0265     .02462      .3976
  NonH-White │     .8946     .05993      14.93    <.0001      .7771      1.012
  NonH-Black │     1.231     .05897      20.87    <.0001      1.115      1.346
  NonH-Other │     .3371     .09741      3.461     .0005      .1462       .528
  Mex-Am     │         0                                                      
(Intercept)  │      34.4     .07164      480.2    <.0001      34.26      34.54
──────────────────────────────────────────────────────────────────────────────
For comparison, here is the same model fit in Stata:

. regress height i.male age i.raceeth5cat, allbaselevels

      Source |       SS           df       MS      Number of obs   =    23,467
-------------+----------------------------------   F(6, 23460)     =  26278.59
       Model |  1897894.47         6  316315.745   Prob > F        =    0.0000
    Residual |  282388.312    23,460  12.0370125   R-squared       =    0.8705
-------------+----------------------------------   Adj R-squared   =    0.8704
       Total |  2180282.78    23,466  92.9124171   Root MSE        =    3.4694

------------------------------------------------------------------------------
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |
     female  |          0  (base)
       male  |    1.67467   .0453024    36.97   0.000     1.585874    1.763465
             |
         age |   1.930062   .0049009   393.82   0.000     1.920456    1.939668
             |
 raceeth5cat |
     Mex-Am  |          0  (base)
   Oth-Hisp  |   .2111317   .0951575     2.22   0.027     .0246168    .3976467
 NonH-White  |   .8945881   .0599273    14.93   0.000     .7771268    1.012049
 NonH-Black  |   1.230825   .0589714    20.87   0.000     1.115237    1.346413
 NonH-Other  |   .3371038   .0974073     3.46   0.001     .1461791    .5280285
             |
       _cons |   34.39853   .0716407   480.15   0.000     34.25811    34.53895
------------------------------------------------------------------------------

We could write the model we are fitting as a linear equation:

\[ \begin{split} \textrm{Height}_i = \enspace & \beta_0 + \beta_1(\texttt{male}_i) + \beta_2(\texttt{age}_i) + \beta_3(\texttt{othhisp}_i) \\ & + \beta_4(\texttt{nonhwhite}_i) + \beta_5(\texttt{nonhblack}_i) + \beta_6(\texttt{nonhother}_i) \\ & + \varepsilon_i \end{split} \]

where the subscript \(i\) indexes observations (our outcome is the height of observation \(i\)), and each race/ethnicity term is a 0/1 indicator, with Mexican-American as the omitted reference category.
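Under the hood, lm() expanded raceeth5cat into the four 0/1 indicator columns in this equation. To see the design matrix it actually used (assuming the model object fit above):

# one row per observation, one column per term in the equation
head(model.matrix(model))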

Inserting our estimates (rounded to two decimal places):

\[ \begin{split} \textrm{Height}_i = \enspace & 34.40 + 1.67(\texttt{male}_i) + 1.93(\texttt{age}_i) + .21(\texttt{othhisp}_i) \\ & + .89(\texttt{nonhwhite}_i) + 1.23(\texttt{nonhblack}_i) + .34(\texttt{nonhother}_i) \\ & + \varepsilon_i \end{split} \]
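These rounded numbers come straight from the fitted model object; in R, coef() recovers the full-precision estimates in the same order they appear in the equation:

# full-precision estimates: intercept, male, age, and the race/ethnicity dummies
coef(model)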

We can use these estimates to compute a predicted height for any combination of values of the explanatory variables, whether or not that combination is observed in our data. A hat over a quantity in statistics conventionally means “estimated” (or, in the context of an outcome, “predicted”):

\[ \begin{split} \widehat{\textrm{Height}}_i = \enspace & 34.40 + 1.67(\texttt{male}_i) + 1.93(\texttt{age}_i) + .21(\texttt{othhisp}_i) \\ & + .89(\texttt{nonhwhite}_i) + 1.23(\texttt{nonhblack}_i) + .34(\texttt{nonhother}_i) \end{split} \]

(Note that when we are predicting the outcome, the error term in our model drops out: OLS assumes that the expected value of the error \(\varepsilon_i\) is 0 for every observation.)

Now, say that we want to predict the height for a 15-year-old non-Hispanic Black girl. Because the observation is female, the term for male drops out, and because the observation is non-Hispanic Black, the other terms for race/ethnicity drop out as well.

\[ \begin{split} \widehat{\textrm{Height}}_i = \enspace & 34.40 + 1.67(0) + 1.93(15) + .21(0) \\ & + .89(0) + 1.23(1) + .34(0) \end{split} \]

Leaving: \[ \begin{align} \widehat{\textrm{Height}}_i & = 34.40 + 1.93(15) + 1.23(1) \\ & = 34.40 + 28.95 + 1.23 \\ & = 64.58 \end{align} \]

Our predicted height is 64.58 inches (i.e., about five-foot-four-and-a-half).
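We can check this arithmetic with R's predict(), which uses the full-precision coefficients. A minimal sketch, assuming the model and df objects from the code above (the factor label "NonH-Black" is taken from the regression output):

# a 15-year-old non-Hispanic Black girl, as a one-row data frame
newobs <- data.frame(
  male = 0,
  age = 15,
  raceeth5cat = factor("NonH-Black", levels = levels(df$raceeth5cat))
)

# predicted height in inches; matches the hand calculation up to rounding
predict(model, newdata = newobs)

Adding interval = "confidence" to the predict() call would also return a confidence interval around the prediction.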

Aside: As it turns out, if we are actually trying to predict an outcome from linear regression estimates, in the sense of making the best guess about an outcome that has not yet been observed (or that is not part of the data on which we fit our model), the best prediction is often not \(\hat{y}\).

The problem is that, in practice, regression models overfit: they fit the data on which they were estimated better than they fit new data. In that case, one can often predict better by computing \(\hat{y}\) as if all the regression coefficients were a bit closer to zero than the estimates, an idea known as shrinkage. When you read papers in which authors describe using the “LASSO” or an “elastic net,” some version of this is what they are doing.
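A minimal sketch of that idea, using the glmnet package (an assumption on my part; it is not part of the code above). Setting alpha = 1 gives the LASSO, and 0 < alpha < 1 gives an elastic net:

library(glmnet)

# glmnet needs a numeric matrix with no missing values, so keep complete cases
cc <- na.omit(df[, c("height", "male", "age", "raceeth5cat")])
x  <- model.matrix(height ~ male + age + raceeth5cat, data = cc)[, -1]
y  <- cc$height

# cross-validated LASSO: coefficients are pulled toward zero,
# with the penalty strength lambda chosen by cross-validation
cvfit <- cv.glmnet(x, y, alpha = 1)

coef(cvfit, s = "lambda.min")                       # shrunken coefficients
predict(cvfit, newx = x[1:3, ], s = "lambda.min")   # predictions use the shrunken fit

Cross-validation picks the penalty strength lambda, which controls how far the coefficients are pulled toward zero relative to the OLS estimates.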