Reviewing the basic interpretation of OLS regression coefficients

Let’s review how one interprets coefficients from an OLS regression model using an example in which our outcome is the height of children and adolescents in the US (ages 3-18). In this example, we will have three explanatory variables:

Sex, measured here as a male/female binary
Age, measured in years
Race/ethnicity, here divided into five categories: (1) Hispanic, Mexican-American; (2) Hispanic, Other; (3) Non-Hispanic, White; (4) Non-Hispanic, Black; (5) Non-Hispanic, Other

We fit the model using the \(\texttt{lm}\) function in R or the \(\texttt{regress}\) command in Stata:

Load packages and data

library(tidyverse)
library(haven)
library(tulaverse) # used to summarize models
nhanes <- read_dta("../dta/nhanes_bodymeasures.dta")  %>%
  filter(age >= 3 & age <= 18) %>%
  mutate(raceeth5cat = as_factor(raceeth5cat))

model <- lm(height ~ male + age + raceeth5cat, data = nhanes)
tula(model, ref=TRUE) # from tulaverse package, better output than summary()

AIC = 124991.034                                        Number of obs =  23467
BIC = 125055.541                                        R-squared     = 0.8705
                                                        Adj R-squared = 0.8704
                                                        Root MSE      =  3.469
──────────────────────────────────────────────────────────────────────────────
height       │      Coef  Std. Err.          t     P>|t|  [95% Conf  Interval]
──────────────────────────────────────────────────────────────────────────────
male         │     1.675      .0453      36.97    <.0001      1.586      1.763
age          │      1.93    .004901      393.8    <.0001       1.92       1.94
raceeth5cat  │                                                                
  Oth-Hisp   │     .2111     .09516      2.219     .0265     .02462      .3976
  NonH-White │     .8946     .05993      14.93    <.0001      .7771      1.012
  NonH-Black │     1.231     .05897      20.87    <.0001      1.115      1.346
  NonH-Other │     .3371     .09741      3.461     .0005      .1462       .528
  Mex-Am     │         0                                                      
(Intercept)  │      34.4     .07164      480.2    <.0001      34.26      34.54
──────────────────────────────────────────────────────────────────────────────

. use ../dta/nhanes_bodymeasures, clear
(Height and weight data from NHANES adults (cumulative to 2016))

. drop if age > 18 | age < 3 // restrict sample to ages 3-18
(47,355 observations deleted)
 
. regress height i.male age i.raceeth5cat, allbaselevels

      Source |       SS           df       MS      Number of obs   =    23,467
-------------+----------------------------------   F(6, 23460)     =  26278.59
       Model |  1897894.47         6  316315.745   Prob > F        =    0.0000
    Residual |  282388.312    23,460  12.0370125   R-squared       =    0.8705
-------------+----------------------------------   Adj R-squared   =    0.8704
       Total |  2180282.78    23,466  92.9124171   Root MSE        =    3.4694

------------------------------------------------------------------------------
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |
     female  |          0  (base)
       male  |    1.67467   .0453024    36.97   0.000     1.585874    1.763465
             |
         age |   1.930062   .0049009   393.82   0.000     1.920456    1.939668
             |
 raceeth5cat |
     Mex-Am  |          0  (base)
   Oth-Hisp  |   .2111317   .0951575     2.22   0.027     .0246168    .3976467
 NonH-White  |   .8945881   .0599273    14.93   0.000     .7771268    1.012049
 NonH-Black  |   1.230825   .0589714    20.87   0.000     1.115237    1.346413
 NonH-Other  |   .3371038   .0974073     3.46   0.001     .1461791    .5280285
             |
       _cons |   34.39853   .0716407   480.15   0.000     34.25811    34.53895
------------------------------------------------------------------------------

Quantitative explanatory variable

The coefficient for age is 1.93. Age is included in the model as a quantitative variable, where “quantitative” in this context just means that the different values of the variable are intended to be interpreted as units of a quantity, such that one can meaningfully talk about an x-unit change in the variable. (Continuous variables are quantitative variables, but counts are also quantitative variables.)

In OLS, the coefficient of a quantitative explanatory variable reflects the change in the outcome associated with a one-unit increase in that variable. In this case:

Net of sex and race/ethnicity, each additional year of age is associated with an increase of 1.93 inches in child height.

This interpretation is an oversimplification that we could address with additional models. For example, the relationship between age and height differs for different ages; it is not linear. Also, we could examine whether the relationship differs by sex or by race/ethnicity. Right now, we are just trying to understand interpretation in the simplest case.
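In that simplest case, the age coefficient can be read directly off the fitted equation. As a sketch, using coefficients rounded from the output above (the indicator names OthHisp, NHWhite, NHBlack, and NHOther are shorthand introduced here, each equal to 1 for children in that category and 0 otherwise):

\[
\widehat{\text{height}} = 34.40 + 1.67\,\text{male} + 1.93\,\text{age} + 0.21\,\text{OthHisp} + 0.89\,\text{NHWhite} + 1.23\,\text{NHBlack} + 0.34\,\text{NHOther}
\]

For a 10-year-old and an 11-year-old Mexican-American girl (all indicators equal to 0), the predicted heights are

\[
34.40 + 1.93(10) = 53.70 \qquad \text{and} \qquad 34.40 + 1.93(11) = 55.63,
\]

a difference of \(55.63 - 53.70 = 1.93\) inches. The same difference results for any pair of children one year apart in age who share the same sex and race/ethnicity, because those terms cancel in the subtraction.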

Aside: Indicating covariates in interpretation. Above I note that the associated change in height is “Net of sex and race/ethnicity.” Other ways of saying this that one often sees are:

“Adjusting for sex and race/ethnicity”
“Holding sex and race/ethnicity constant”
“Controlling for sex and race/ethnicity”

There are people out there who do not like the “Holding constant” or “Controlling” phrasings. The main objection is that they may sound as though the other variables were accounted for through study design, rather than through the purely statistical adjustment of including them in a regression model.

Your instructor sees the point of this critique but not enough to be particularly concerned. In any case, all these and other phrasings are used interchangeably in social science.

Note that as the number of variables gets larger, listing each one becomes more ungainly. One might look to other ways of grouping variables together: “Net of other demographic characteristics…” or “Net of other sociodemographic characteristics…” and at some point maybe just “Net of other variables in the model…” This is all a matter of making stylistic decisions about what balances precision and effectiveness in writing.

Categorical explanatory variable

Sex and race/ethnicity are measured as categorical variables. A categorical variable is a variable in which different values denote membership in one category versus another. Another name for a categorical variable is a factor variable.

When we treat a measure as categorical, there is no sense in talking about a unit increase in it. Instead, what we may talk about is a difference in the expected value of the outcome between one category and another, net of our other variables.

Coefficients for categorical variables are usually presented relative to a reference category (also called an omitted category). The coefficient for the reference category is effectively constrained to equal 0. In R, omitted categories are not included in standard regression output; the \(\texttt{ref=TRUE}\) option of \(\texttt{tula()}\) used above lists them. (In Stata, they can be listed with the \(\texttt{allbaselevels}\) option for estimation commands.)
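To make the reference-category idea concrete, here is a minimal base R sketch using a made-up three-level factor (the variable name group and its levels are illustrative assumptions, not part of the NHANES data). It uses model.matrix() to show the indicator columns that lm() would build: one for each non-reference level, and none for the reference level.

```r
# Hypothetical toy factor; "a" is the first level, so R treats it
# as the reference category by default.
group <- factor(c("a", "b", "c", "a", "b", "c"), levels = c("a", "b", "c"))

# model.matrix() builds the design matrix lm() would use for this factor:
# an intercept plus one 0/1 indicator column per non-reference level.
design <- model.matrix(~ group)
print(colnames(design))

# After changing the reference level to "b", the omitted indicator
# is the one for "b" instead of the one for "a".
group_b <- relevel(group, ref = "b")
design_b <- model.matrix(~ group_b)
print(colnames(design_b))
```

Because no column exists for the reference level, its expected outcome is absorbed into the intercept, and each remaining coefficient is a difference from that level.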

Regarding sex, the coefficient for male is 1.67. This is interpreted as the difference for boys compared to girls:

Net of age and race/ethnicity, boys are 1.67 inches taller than girls on average.

For race/ethnicity, the reference category in the above example is Mexican-American. The coefficient for non-Hispanic Blacks is 1.23. We would interpret this as:

Net of age and sex, non-Hispanic Black children are 1.23 inches taller on average than Mexican-American children.

Or, for non-Hispanic Whites:

Net of age and sex, non-Hispanic White children are .89 inches taller on average than Mexican-American children.

If we wanted to compare non-Hispanic Blacks to non-Hispanic Whites, we could simply subtract the coefficients: \(1.23 - .89 = .34\), so non-Hispanic Black children are on average .34 inches taller than non-Hispanic White children, net of age and sex. We could also fit the model with either of these categories as the reference category instead of Mexican-American. Here we use non-Hispanic Black:

nhanes <- nhanes %>%
  mutate(raceeth5cat = relevel(raceeth5cat, ref="NonH-Black"))
 
model <- lm(height ~ male + age + raceeth5cat, data = nhanes)
tula(model, select="raceeth5cat", ref=TRUE)

──────────────────────────────────────────────────────────────────────────────
height       │      Coef  Std. Err.          t     P>|t|  [95% Conf  Interval]
──────────────────────────────────────────────────────────────────────────────
raceeth5cat  │                                                                
  Mex-Am     │    -1.231     .05897     -20.87    <.0001     -1.346     -1.115
  Oth-Hisp   │     -1.02      .0954     -10.69    <.0001     -1.207     -.8327
  NonH-White │    -.3362     .06031     -5.575    <.0001     -.4544      -.218
  NonH-Other │    -.8937     .09765     -9.153    <.0001     -1.085     -.7023
  NonH-Black │         0                                                      
──────────────────────────────────────────────────────────────────────────────

. regress height i.male age ib4.raceeth5cat, allbaselevels

:: output omitted ::

 raceeth5cat |
     Mex-Am  |  -1.230825   .0589714   -20.87   0.000    -1.346413   -1.115237
   Oth-Hisp  |  -1.019694   .0954002   -10.69   0.000    -1.206684    -.832703
 NonH-White  |  -.3362371   .0603081    -5.58   0.000    -.4544449   -.2180293
 NonH-Black  |          0  (base)
 NonH-Other  |  -.8937214   .0976476    -9.15   0.000    -1.085117   -.7023257

 :: output omitted ::
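Subtracting coefficients and refitting with a new reference category give the same point estimate. The following base R sketch checks this on simulated data. The data frame sim, its sample size, and the coefficient values used to generate it are all assumptions made up for illustration; this is not the NHANES file, and the numbers it produces will not match the output above.

```r
set.seed(1)
n <- 500
race_levels <- c("Mex-Am", "Oth-Hisp", "NonH-White", "NonH-Black", "NonH-Other")

# Simulated stand-in for the real data (hypothetical values throughout).
sim <- data.frame(
  male = rbinom(n, 1, 0.5),
  age = sample(3:18, n, replace = TRUE),
  raceeth5cat = factor(sample(race_levels, n, replace = TRUE), levels = race_levels)
)
race_effect <- c(0, 0.2, 0.9, 1.2, 0.3)  # assumed effects, in level order
sim$height <- 34 + 1.7 * sim$male + 1.9 * sim$age +
  race_effect[as.integer(sim$raceeth5cat)] + rnorm(n, sd = 3.5)

# Fit with Mex-Am as the reference and subtract the two coefficients.
fit_mex <- lm(height ~ male + age + raceeth5cat, data = sim)
b <- coef(fit_mex)
diff_by_subtraction <- unname(b["raceeth5catNonH-Black"] - b["raceeth5catNonH-White"])

# Refit with NonH-Black as the reference and read the coefficient directly.
sim$race_black_ref <- relevel(sim$raceeth5cat, ref = "NonH-Black")
fit_black <- lm(height ~ male + age + race_black_ref, data = sim)
coef_white_vs_black <- unname(coef(fit_black)["race_black_refNonH-White"])

# The Black-minus-White difference equals minus the White coefficient
# in the releveled model, up to floating-point error.
print(all.equal(diff_by_subtraction, -coef_white_vs_black))
```

One practical reason to refit rather than subtract by hand is that the releveled model also reports a standard error and confidence interval for the comparison, which the two separate coefficients do not provide on their own.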

Changing the reference category in R and Stata

The code above changes the reference category. To explain more specifically how this is done:

One changes the reference category in R using the relevel() function, with the ref argument indicating the new reference category. For example, relevel(as.factor(raceeth5cat), ref="NonH-Black") changes the reference category to “NonH-Black”. Note that relevel() returns a new factor rather than modifying the variable in place. The change is persistent in the code above because the result is stored back into the data frame with mutate(); once that is done, if we fit models with the same variable later in our R session, the reference category will remain “NonH-Black” until we set it to a different category.
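The following small base R sketch illustrates this point about persistence. The vector race is a made-up example, not the NHANES variable: calling relevel() without storing the result leaves the original factor untouched, while assigning the result back changes the reference level for everything that follows.

```r
# Hypothetical factor whose first (reference) level is "Mex-Am".
race <- factor(
  c("Mex-Am", "NonH-Black", "NonH-White", "Mex-Am"),
  levels = c("Mex-Am", "NonH-White", "NonH-Black")
)

# relevel() returns a modified copy; `race` itself is unchanged here.
releveled_copy <- relevel(race, ref = "NonH-Black")
first_level_before <- levels(race)[1]
print(first_level_before)

# Assigning the result back is what makes the change stick.
race <- relevel(race, ref = "NonH-Black")
first_level_after <- levels(race)[1]
print(first_level_after)
```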

To change the reference category in Stata, use ib#.varname, where # is a value of the variable (as in ib4.raceeth5cat above). This applies only to the model being fit.

You can also set this as a feature of a variable in Stata with \(\texttt{fvset base }\textit{# }\textit{varname}\). If you do this and then save the dataset, the fact that you want this to be treated as the default base category will be stored as part of the dataset.