Load packages and data
library(tidyverse)
library(haven)
library(tulaverse) # used to summarize models
nhanes <- read_dta("../dta/nhanes_bodymeasures.dta") %>%
  filter(age >= 3 & age <= 18) %>%
  mutate(raceeth5cat = as_factor(raceeth5cat))

Let’s review how one interprets coefficients from an OLS regression model using an example in which our outcome is the height of children and adolescents in the US (ages 3-18). In this example, we will have three explanatory variables: sex (male), age, and race/ethnicity (raceeth5cat, in five categories).
We fit the model using the \(\texttt{lm}\) function in R or the \(\texttt{regress}\) command in Stata:
AIC = 124991.034                              Number of obs =  23467
BIC = 125055.541                              R-squared     = 0.8705
                                              Adj R-squared = 0.8704
                                              Root MSE      =  3.469
──────────────────────────────────────────────────────────────────────────────
      height │      Coef  Std. Err.        t    P>|t|   [95% Conf Interval]
──────────────────────────────────────────────────────────────────────────────
        male │     1.675      .0453    36.97   <.0001      1.586      1.763
         age │      1.93    .004901    393.8   <.0001       1.92       1.94
 raceeth5cat │
    Oth-Hisp │     .2111     .09516    2.219    .0265     .02462      .3976
  NonH-White │     .8946     .05993    14.93   <.0001      .7771      1.012
  NonH-Black │     1.231     .05897    20.87   <.0001      1.115      1.346
  NonH-Other │     .3371     .09741    3.461    .0005      .1462       .528
      Mex-Am │         0
 (Intercept) │      34.4     .07164    480.2   <.0001      34.26      34.54
──────────────────────────────────────────────────────────────────────────────
. use ../dta/nhanes_bodymeasures, clear
(Height and weight data from NHANES adults (cumulative to 2016))

. drop if age > 18 | age < 3 // restrict sample to ages 3-18
(47,355 observations deleted)

. regress height i.male age i.raceeth5cat, allbaselevels

      Source |       SS           df       MS      Number of obs   =    23,467
-------------+----------------------------------   F(6, 23460)     =  26278.59
       Model |  1897894.47         6  316315.745   Prob > F        =    0.0000
    Residual |  282388.312    23,460  12.0370125   R-squared       =    0.8705
-------------+----------------------------------   Adj R-squared   =    0.8704
       Total |  2180282.78    23,466  92.9124171   Root MSE        =    3.4694

------------------------------------------------------------------------------
      height |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |
     female  |          0  (base)
       male  |    1.67467   .0453024    36.97   0.000     1.585874    1.763465
             |
         age |   1.930062   .0049009   393.82   0.000     1.920456    1.939668
             |
 raceeth5cat |
     Mex-Am  |          0  (base)
   Oth-Hisp  |   .2111317   .0951575     2.22   0.027     .0246168    .3976467
 NonH-White  |   .8945881   .0599273    14.93   0.000     .7771268    1.012049
 NonH-Black  |   1.230825   .0589714    20.87   0.000     1.115237    1.346413
 NonH-Other  |   .3371038   .0974073     3.46   0.001     .1461791    .5280285
             |
       _cons |   34.39853   .0716407   480.15   0.000     34.25811    34.53895
------------------------------------------------------------------------------
The coefficient for age is 1.93. Age is included in the model as a quantitative variable, where “quantitative” in this context just means that the different values of the variable are intended to be interpreted as units of a quantity, such that one can meaningfully talk about an x-unit change in the variable. (Continuous variables are quantitative variables, but counts are also quantitative variables.)
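In OLS, the coefficient of a quantitative explanatory variable reflects the change in the expected outcome associated with a one-unit increase in that variable. In this case: net of sex and race/ethnicity, each additional year of age is associated with a 1.93 inch increase in expected height.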
This interpretation is an oversimplification that we could address with additional models. For example, the relationship between age and height differs for different ages; it is not linear. Also, we could examine whether the relationship differs by sex or by race/ethnicity. Right now, we are just trying to understand interpretation in the simplest case.
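To make the one-unit interpretation concrete, here is a minimal sketch that plugs the coefficients from the Stata output above into the fitted equation by hand. The helper `expected_height()` and the two example profiles are hypothetical, introduced only for illustration; nothing here refits the model.

```r
# Coefficients copied from the Stata output above (height is in inches).
b <- c(intercept = 34.39853, male = 1.67467, age = 1.930062)

# Race/ethnicity coefficients; the reference category (Mex-Am) is fixed at 0.
b_race <- c("Mex-Am" = 0, "Oth-Hisp" = 0.2111317, "NonH-White" = 0.8945881,
            "NonH-Black" = 1.230825, "NonH-Other" = 0.3371038)

# Hypothetical helper: expected height for a given covariate profile.
expected_height <- function(age, male, raceeth) {
  unname(b["intercept"] + b["male"] * male + b["age"] * age + b_race[raceeth])
}

# Two profiles that differ only by one year of age.
h10 <- expected_height(age = 10, male = 0, raceeth = "Mex-Am")
h11 <- expected_height(age = 11, male = 0, raceeth = "Mex-Am")

h11 - h10  # the difference is the age coefficient
```

Because the model is linear and additive, the same difference results for any starting age and for any fixed values of sex and race/ethnicity, which is what "net of sex and race/ethnicity" expresses.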
Aside: Indicating covariates in interpretation. Above I note that the associated change in height is “Net of sex and race/ethnicity.” Other ways of saying this that one often sees include “Holding sex and race/ethnicity constant…” and “Controlling for sex and race/ethnicity…”
There are people out there who do not like the “Holding constant” or “Controlling” interpretation. The main objection is that these phrasings may sound like the other variables were accounted for through study design, rather than through the purely statistical adjustment of including them in a regression model.
Your instructor sees the point of this critique but not enough to be particularly concerned. In any case, all these and other phrasings are used interchangeably in social science.
Note that as the number of variables gets larger, listing each one becomes more ungainly. One might look to other ways of grouping variables together: “Net of other demographic characteristics…” or “Net of other sociodemographic characteristics…” and at some point maybe just “Net of other variables in the model…” This is all a matter of making stylistic decisions about what balances precision and effectiveness in writing.
Sex and race/ethnicity are measured as categorical variables. A categorical variable is a variable in which different values denote membership in one category versus another. Another name for a categorical variable is a factor variable.
When we treat a measure as categorical, there is no sense in talking about a unit increase in it. Instead, what we may talk about is a difference in the expected value of the outcome between one category and another, net of our other variables.
The way that coefficients for categorical variables are usually presented involves a reference category (also called an omitted category). The coefficient for the reference category is effectively constrained to equal 0. In R, reference categories are not listed in the default regression output (for example, from summary()). (In Stata, they can be listed with the \(\texttt{allbaselevels}\) option for estimation commands.)
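As a rough illustration of what "omitted" means, the sketch below builds the indicator columns that R would create for a five-category factor. The factor is typed in by hand using the category labels from the output above (one observation per category), so it is a toy example rather than the NHANES data.

```r
# Toy factor with the five race/ethnicity labels; by default the first
# level listed is treated as the reference category.
lv <- c("Mex-Am", "Oth-Hisp", "NonH-White", "NonH-Black", "NonH-Other")
raceeth5cat <- factor(lv, levels = lv)

# model.matrix() builds the indicator (dummy) columns that lm() would use.
X <- model.matrix(~ raceeth5cat)

colnames(X)  # an intercept plus four indicators; no column for Mex-Am
X[1, ]       # the Mex-Am row: intercept is 1, every indicator is 0
```

With four indicators for five categories, each race/ethnicity coefficient is a difference from the category that has no column of its own, which is why the reference category's coefficient is shown as 0.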
Regarding sex, the coefficient for male is 1.67. This is interpreted as the difference for boys compared to girls: net of age and race/ethnicity, boys are expected to be 1.67 inches taller than girls.
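For race/ethnicity, the reference category in the above example is Mexican-American. The coefficient for non-Hispanic Blacks is 1.23. We would interpret this as: net of sex and age, non-Hispanic Blacks are expected to be 1.23 inches taller than Mexican-Americans.
Or, for non-Hispanic Whites, whose coefficient is .89: net of sex and age, non-Hispanic Whites are expected to be .89 inches taller than Mexican-Americans.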
If we wanted to compare non-Hispanic Blacks to non-Hispanic Whites, we could simply subtract the coefficients to get the result: \(1.23 - .89 = .34\) inches. We could also fit the model with either of these categories as the reference category instead of Mexican-Americans. Here we use non-Hispanic Black:
──────────────────────────────────────────────────────────────────────────────
      height │      Coef  Std. Err.        t    P>|t|   [95% Conf Interval]
──────────────────────────────────────────────────────────────────────────────
 raceeth5cat │
      Mex-Am │    -1.231     .05897   -20.87   <.0001     -1.346     -1.115
    Oth-Hisp │     -1.02      .0954   -10.69   <.0001     -1.207     -.8327
  NonH-White │    -.3362     .06031   -5.575   <.0001     -.4544      -.218
  NonH-Other │    -.8937     .09765   -9.153   <.0001     -1.085     -.7023
  NonH-Black │         0
──────────────────────────────────────────────────────────────────────────────
. regress height i.male age ib4.raceeth5cat, allbaselevels

:: output omitted ::
 raceeth5cat |
     Mex-Am  |  -1.230825   .0589714   -20.87   0.000    -1.346413   -1.115237
   Oth-Hisp  |  -1.019694   .0954002   -10.69   0.000    -1.206684    -.832703
 NonH-White  |  -.3362371   .0603081    -5.58   0.000    -.4544449   -.2180293
 NonH-Black  |          0  (base)
 NonH-Other  |  -.8937214   .0976476    -9.15   0.000    -1.085117   -.7023257
:: output omitted ::
The models above were refit with a different reference category. More specifically, this is done as follows:
One changes the reference category in R using the relevel() function, with the ref argument indicating the new reference category. For example, relevel(as.factor(raceeth5cat), ref="NonH-Black") changes the reference category to “NonH-Black”. If the releveled factor is saved back to the variable in the data, this is a persistent change; that is, if we fit models with the same variable later in our R session, the reference category will remain “NonH-Black” until we set it to a different category. If relevel() is instead called only inside a model formula, the change applies only to that model.
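The following is a minimal sketch of relevel() in use. It runs on simulated data, not the NHANES file: the data frame `sim`, the seed, the sample size, and the data-generating values are all assumptions made so the example is self-contained. It also checks that subtracting two coefficients under the original reference category gives the same contrast as reading the coefficient directly after releveling.

```r
set.seed(1)  # simulated data for illustration only (not the NHANES file)
n  <- 500
lv <- c("Mex-Am", "Oth-Hisp", "NonH-White", "NonH-Black", "NonH-Other")
sim <- data.frame(
  age         = sample(3:18, n, replace = TRUE),
  male        = rbinom(n, 1, 0.5),
  raceeth5cat = factor(sample(lv, n, replace = TRUE), levels = lv)
)
sim$height <- 34 + 1.9 * sim$age + 1.7 * sim$male + rnorm(n, sd = 3.5)

# Reference category is the first level, "Mex-Am".
fit_mex <- lm(height ~ male + age + raceeth5cat, data = sim)

# Saving the result of relevel() back to the column makes the change persist
# for any model fit with this data frame afterwards.
sim$raceeth5cat <- relevel(sim$raceeth5cat, ref = "NonH-Black")
fit_black <- lm(height ~ male + age + raceeth5cat, data = sim)

# White-versus-Black contrast, computed two ways.
diff_by_subtraction <- unname(coef(fit_mex)["raceeth5catNonH-White"] -
                              coef(fit_mex)["raceeth5catNonH-Black"])
diff_direct <- unname(coef(fit_black)["raceeth5catNonH-White"])

all.equal(diff_by_subtraction, diff_direct)
```

Note the direction of the contrast: with “NonH-Black” as the reference, the “NonH-White” coefficient is White minus Black, so it has the opposite sign of the Black-minus-White subtraction.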
To change the reference category in Stata, use ib#.varname where # is a value of the variable. This applies only to the model being fit.
You can also set this as a feature of a variable in Stata with \(\texttt{fvset base }\textit{# }\textit{varname}\). If you do this and then save the dataset, the fact that you want this to be treated as the default base category will be stored as part of the dataset.