Often we fit regression models because we are interested in the association between a key explanatory variable and our outcome net of other variable(s). These other variable(s) that we include in our model in order to adjust our estimate for the key explanatory variable we are interested in are called covariates or control variables.
The point of this page is to reinforce what we are trying to do by including covariates in our model.
NHANES Example
We will use the NHANES data and restrict the sample to Hispanic girls in the US (age 3-18). NHANES asks a more detailed question about Hispanic ancestry, which allows us to compare Hispanic respondents who report Mexican ancestry to those who do not.
Aside: I use “Hispanic” throughout instead of “Latino” or “Latinx” because that is what NHANES uses, at least at the time of these data.
Our dependent variable is height, in inches.
I fit two regression models. In Model 1, the only explanatory variable is Mexican ancestry. In Model 2, we also include age as an explanatory variable.
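library(modelsummary)

# Model 1: Mexican ancestry only; Model 2: adds age
model1 <- lm(height ~ mexican_am, data = df)
model2 <- lm(height ~ mexican_am + age, data = df)

# show t statistics in parentheses and the number of observations
modelsummary(list("(1)" = model1, "(2)" = model2),
             statistic = "({statistic})",
             gof_map = c("nobs"))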
                  (1)          (2)
(Intercept)       54.561       37.302
                  (184.197)    (220.317)
mexican_am1       0.971        -0.198
                  (2.953)      (-1.466)
age                            1.675
                               (146.092)
Num.Obs.          4301         4301
# compare mean height and age for mexican_am vs. not
tulatab(mexican_am, data = df, mean=c(height, age))
             │          Means
  mexican_am │  height      age       N
─────────────┼──────────────────────────
           0 │  54.561   10.303     812
           1 │  55.532   11.001   3,489
─────────────┼──────────────────────────
       Total │  55.349   10.869   4,301
--------------------------------------------
                      (1)             (2)
                   height          height
--------------------------------------------
mexican_am          0.971**        -0.198
                   (2.95)         (-1.47)
age                                 1.675***
                                 (146.09)
_cons               54.56***        37.30***
                 (184.20)        (220.32)
--------------------------------------------
N                    4301            4301
--------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
What is important about these results is that the sign of the coefficient for Mexican ancestry changes between Model 1 and Model 2. The coefficients for these two models can be interpreted as follows:
Mexican-American girls are .97 inches taller on average than other Hispanic girls in the US.
Net of age, Mexican-American girls are .20 inches shorter on average than other Hispanic girls in the US.
It is important that we understand how both of these things can be true.
Let’s look at the differences in height and age between Mexican-American and other Hispanic respondents.
             │          Means
    ancestry │  height      age       N
─────────────┼──────────────────────────
  Mexican-Am │  55.532   11.001   3,489
  Other Hisp │  54.561   10.303     812
─────────────┼──────────────────────────
       Total │  55.349   10.869   4,301
In the first column of these results, we can see that the mean for Mexican-American girls is .97 inches higher than for other Hispanic girls. This is exactly the same as the coefficient in Model 1, which is another way of making clear what the coefficient is: the unconditional difference in means between the two groups.
Unconditional here means not adjusting for other variables.
In the second column of these results, we see that Mexican-American girls are also older than other Hispanic girls in the NHANES data: specifically, .70 years older on average.
Mexican-American girls are taller on average but also older on average. It could be, then, that the reason they are taller in these data is just because they are older.
This is what we are investigating when we add age as a covariate to the model. We are adjusting for the relationship between age and height, and seeing what the association between ancestry and height is conditional on age; that is, what the difference in means is for Mexican-American vs. other Hispanic girls who are the same age.
In Model 2 above, note that the coefficient for age is 1.68. Net of ancestry, each additional year of age is associated with a 1.68-inch increase in height. The difference in mean age between Mexican-American and other Hispanic girls was .7 years. Just as a result of Mexican-American girls being older, then, we would expect them to be \(1.68 \times .7 \approx 1.18\) inches taller on average.
But Mexican-American girls are not 1.18 inches taller on average. They are only .97 inches taller. In other words, Mexican-American girls are about .2 inches shorter than we would expect if age were the only reason that Mexican-American and other Hispanic girls differed in height. That -.2 inches corresponds to the coefficient for Mexican-American in Model 2 in the table above.
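As a quick check on this arithmetic, here is a small sketch that redoes the calculation with the less-rounded values from the tables above. The variable names are mine and only for illustration; the numbers are copied from the output shown earlier.

```r
# Values copied from the regression table and the table of means above
b_model1 <- 0.971             # Model 1 coefficient for mexican_am
b_age    <- 1.675             # Model 2 coefficient for age
age_gap  <- 11.001 - 10.303   # difference in mean age between the groups

# Height difference we would expect from the age gap alone
implied <- b_age * age_gap

# Observed difference minus the part implied by age
net_of_age <- b_model1 - implied

round(c(age_gap = age_gap, implied = implied, net_of_age = net_of_age), 3)
```

With the less-rounded inputs, the implied difference is about 1.169 inches, and the observed difference minus the implied difference is about -0.198, which matches the Model 2 coefficient for Mexican ancestry.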
This leads to one way to think about what we are doing when we control for other variables in a regression model.
We are interested in the relationship between ancestry and height, but we want to control for the potentially confounding influence of age. We know that:
age is associated with ancestry; and
age is associated with height
These two facts alone will imply that there is some relationship between ancestry and height, even if ancestry differences per se played no role in height differences.
But we want to know what the relationship between ancestry and height is net of age. In effect, as in the example above, we figure out what we would expect the difference in height between groups to be if it were due simply to age. Then we adjust the observed difference between groups by this amount, and the result is our coefficient for the association net of age.
Key points so far:
In regression, we have an observed association between a key explanatory variable and an outcome, like here between ancestry and height.
When we include control variables, there is also what we might call an implied association: the association we would see between the key explanatory variable and the outcome if their common relationship with the control variables were the only reason the two were related.
If our observed association equals this implied association, then the resulting coefficient when we control for other variables is zero.
If the observed association is greater than the implied association, the resulting coefficient is positive. If it is less than the implied association, the resulting coefficient is negative.
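To make these points concrete, here is a minimal simulation sketch. The data are made up, not NHANES: I assume two groups that differ only in their average age, and an outcome that depends only on age. All names and numbers are illustrative assumptions.

```r
set.seed(1)
n <- 20000

# Hypothetical data: two groups, with group 1 older on average
group  <- rbinom(n, 1, 0.5)
age    <- 10 + 0.7 * group + rnorm(n, sd = 3)

# Height depends on age only; group plays no role of its own
height <- 37 + 1.7 * age + rnorm(n, sd = 2)

fit1 <- lm(height ~ group)         # no covariate
fit2 <- lm(height ~ group + age)   # age as covariate

observed <- unname(coef(fit1)["group"])   # unconditional difference in means
net      <- unname(coef(fit2)["group"])   # difference net of age

# Implied association: age slope times the group difference in mean age
age_gap <- mean(age[group == 1]) - mean(age[group == 0])
implied <- unname(coef(fit2)["age"]) * age_gap

round(c(observed = observed, implied = implied, net = net), 3)
```

Because group has no role of its own in these simulated data, the observed and implied associations should come out nearly equal, and the coefficient net of age should be close to zero.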
When does including a covariate make the biggest difference?
When we include a covariate (i.e., a control variable), we are adjusting the unconditional association between our key explanatory variable and outcome. Specifically, we are adjusting for the implied association that arises because both the key explanatory variable and the outcome are associated with the covariate. The bigger these associations, the bigger the change in the coefficient for the key explanatory variable when we include vs. exclude the covariate.
Specifically, the change in the coefficient for the key explanatory variable will reflect the product of:
The association between the covariate and the key explanatory variable.
The association between the covariate and the outcome.
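For the linear regression model with a single added covariate, this can be written as a short identity. The notation is mine, not something defined elsewhere on this page: \(b_{\text{without}}\) and \(b_{\text{with}}\) are the coefficients for the key explanatory variable without and with the covariate, \(b_{\text{cov}}\) is the covariate's coefficient in the model that includes it, and \(d\) is the slope from regressing the covariate on the key explanatory variable (for a binary key explanatory variable, the difference in the covariate's mean between the two groups).

\[ b_{\text{without}} - b_{\text{with}} = b_{\text{cov}} \times d \]

With these particular definitions, the relationship holds exactly in the sample. In the NHANES example, using the values reported above, \(0.971 - (-0.198) = 1.169 \approx 1.675 \times 0.698\).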
In this diagram, the change in the coefficient for the key explanatory variable (the red dashed arrow) when the covariate is included in the model is approximately the product of the two black arrows:
The fact that the change reflects the product, rather than for example the sum, has a very important consequence. Because anything times 0 equals 0, if either of the above associations is approximately zero, then the change in the coefficient for the key explanatory variable is approximately zero.
Consequently: if either of the above associations is (approximately) zero, then we will get (approximately) the same association whether we include the control variable or not.
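Here is a hedged sketch of this point using simulated data. The variables `x`, `y`, and `z` and the helper `coef_change()` are hypothetical names that I made up for illustration. The three cases differ only in which associations with the covariate `z` are present.

```r
set.seed(2)
n <- 10000
x <- rnorm(n)   # key explanatory variable (hypothetical)

# Change in the coefficient for x when covariate z is added to the model
coef_change <- function(y, x, z) {
  unname(coef(lm(y ~ x))["x"] - coef(lm(y ~ x + z))["x"])
}

# Case A: z is related to x, but has no relationship with y net of x
z_a <- x + rnorm(n)
y_a <- 2 * x + rnorm(n)

# Case B: z is related to y, but unrelated to x
z_b <- rnorm(n)
y_b <- 2 * x + 3 * z_b + rnorm(n)

# Case C: z is related to both x and y
z_c <- x + rnorm(n)
y_c <- 2 * x + 3 * z_c + rnorm(n)

changes <- c(A = coef_change(y_a, x, z_a),
             B = coef_change(y_b, x, z_b),
             C = coef_change(y_c, x, z_c))
round(changes, 2)
```

Under these assumptions, the change should be close to zero in cases A and B, where one of the two associations is absent, and large only in case C, where both are present.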
Spoiler alert! That coefficients of a key explanatory variable do not change when one adds an uncorrelated covariate is true for the linear regression model. It is not true of a number of models we will cover in this class.
What does this mean for experiments?
In a typical experiment, the key explanatory variable is the treatment condition of the experiment. When we use random assignment to assign observations to the treatment or control condition of the experiment, this means that the expected value of the correlation between the treatment and any covariate is zero.
As a result, when we randomly assign which observations receive the treatment, then there will be no association between the treatment variable and the sex of respondents or the race or education or anything else, except whatever happens by pure chance. Imbalances caused by chance are just as likely to be in one direction as the other, and they get smaller in expectation as the size of the experiment gets larger.
As mentioned above, if the association between a covariate and the key explanatory variable is approximately zero, then we have little reason to worry about whether or not we include it in our model.
What is so great about experiments is that this logic carries through to the potential covariates that we did not measure. The expected value of our coefficients does not vary depending on the inclusion of any pre-treatment covariate in a randomized experiment, and that fact holds for any covariate, regardless of whether we measured it.
In contrast, with non-experimental data we are often limited to the covariates we happened to measure, and concerns about unmeasured covariates are one of the biggest problems for inferences from such data.
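A minimal simulated experiment can illustrate this. Everything here is an assumption for the sake of the sketch: a hypothetical pre-treatment covariate called `education` that strongly predicts the outcome, random assignment by coin flip, and a true treatment effect of 1.

```r
set.seed(3)
n <- 5000

# Hypothetical pre-treatment covariate that strongly predicts the outcome
education <- rnorm(n)

# Random assignment: treatment is unrelated to education except by chance
treated <- rbinom(n, 1, 0.5)

# Assumed true treatment effect of 1
outcome <- 1 * treated + 2 * education + rnorm(n)

b_without <- unname(coef(lm(outcome ~ treated))["treated"])
b_with    <- unname(coef(lm(outcome ~ treated + education))["treated"])

round(c(without = b_without, with = b_with), 2)
```

Both estimates should land near the assumed effect of 1. The first model never uses `education` at all, which is the situation we are in with any covariate we did not measure.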
Elaboration: Here we are talking about the change in the coefficient for our key explanatory variable when the first covariate is added to a model. Say that we already have some covariates in our model and we add another one. What matters in this case are the associations of the new covariate with the outcome and with our key explanatory variable beyond what is already accounted for by the other covariates in our model.
For example, if we are studying the relationship between a person’s educational attainment and earnings and we have a bunch of family background variables in our model, then a new family background variable will only change the coefficient to the extent to which its relationship with education and earnings is not already accounted for by the other family background variables that we have already included.
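The sketch below illustrates this with simulated data. The two family background measures, the effect sizes, and the variable names are all assumptions I chose for illustration; the second measure is built to overlap heavily with the first.

```r
set.seed(4)
n <- 10000

# Two hypothetical family background measures that overlap heavily
background1 <- rnorm(n)
background2 <- background1 + rnorm(n, sd = 0.3)

# Assumed process: background1 affects both education and earnings
education <- 0.8 * background1 + rnorm(n)
earnings  <- 1 * education + 2 * background1 + rnorm(n)

b_none <- unname(coef(lm(earnings ~ education))["education"])
b_one  <- unname(coef(lm(earnings ~ education + background1))["education"])
b_both <- unname(coef(lm(earnings ~ education + background1 + background2))["education"])

round(c(none = b_none, one = b_one, both = b_both), 2)
```

Adding the first background measure should change the education coefficient substantially. Adding the second should change it very little, because what it shares with education and earnings is already accounted for by the first.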
When is it appropriate to adjust for a covariate?
This is easiest to think about when we treat our key explanatory variable as a potential cause of the outcome, and when we think about the arrows in our diagrams as reflecting the causal influence of one variable upon another.
What we want to estimate is the causal effect of the cause on the outcome, that is, how much we would expect the outcome to change if we changed the cause. This causal effect is the red dashed arrow in the diagram.
When another variable is associated with both the cause and the outcome, there are three basic scenarios for how the relationships among these three variables might work. Each of these scenarios means something different for the question of whether we would actually want to include this variable as a control variable in our model. The alarming thing here is that a regression model will give the same results regardless of which of these scenarios is true, even though they imply very different interpretations.
We are going to go through each using the example of trying to estimate the causal effect of educational attainment on earnings.
1. Confounding.
A confounding variable is one that influences both our potential causal variable and our outcome. When confounding variables exist, and our research design does nothing to account for them, they will distort our estimate of a cause-effect relationship if that is what we are trying to estimate. It could even be that our potential cause does not affect the outcome at all, and that the association is entirely due to the confounding variable.
In our example, this would mean that education didn’t really have any effect on earnings, and going to school longer didn’t actually cause anyone to make more money, but it just looked that way because people with more advantaged family backgrounds got more education and then subsequently made more money.
This is the basic “correlation does not imply causation” scenario. Research designs intended to estimate causal effects are flawed to whatever extent unaddressed confounding exists.
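As an illustration of the extreme case, here is a simulation sketch in which education has no effect on earnings at all. The data-generating process and variable names are assumptions of mine, with `background` standing in for family background.

```r
set.seed(5)
n <- 10000

# Assumed scenario: family background raises both education and earnings,
# and education itself has no effect on earnings
background <- rnorm(n)
education  <- 1 * background + rnorm(n)
earnings   <- 0 * education + 2 * background + rnorm(n)

b_unadjusted <- unname(coef(lm(earnings ~ education))["education"])
b_adjusted   <- unname(coef(lm(earnings ~ education + background))["education"])

round(c(unadjusted = b_unadjusted, adjusted = b_adjusted), 2)
```

Without the control, education should appear to have a sizable positive association with earnings. With family background in the model, the coefficient should be close to the assumed true effect of zero.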
2. Mediation.
A mediating variable (or “mediator”) is a consequence of the causal variable that influences the outcome. In this way, the mediator captures one of the pathways by which our cause has the influence on the outcome that it does.
When including a mediating variable changes our estimate, it doesn’t mean that our previous estimate of the causal effect was incorrect. Instead, comparing the previous estimate to our new estimate provides an assessment of how important the mediator is for understanding why the causal variable affects the outcome.
In our example, say that we observed a strong relationship between education and earnings but that, when we added occupation to the model, this association largely disappeared. This would not mean that education did not influence earnings; it would still be the case that going to school longer increased one’s prospects for higher income. Instead, our finding would be that the way education affected earnings was primarily by causing people to have higher-earning occupations.
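A simulated version of this scenario might look like the following sketch. I assume, purely for illustration, that education affects earnings only through a hypothetical numeric `occupation` score.

```r
set.seed(6)
n <- 10000

# Assumed scenario: education affects earnings only through occupation
education  <- rnorm(n)
occupation <- 1 * education + rnorm(n)   # hypothetical occupational standing score
earnings   <- 2 * occupation + rnorm(n)

b_total  <- unname(coef(lm(earnings ~ education))["education"])
b_direct <- unname(coef(lm(earnings ~ education + occupation))["education"])

round(c(without_occupation = b_total, with_occupation = b_direct), 2)
```

The coefficient without occupation should be near 2, which is the full effect of education under these assumptions. With occupation included, the coefficient should be near zero. That does not mean education has no effect; it means the effect operates through occupation.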
3. Post-outcome consequences.
A post-outcome consequence is a consequence of the outcome. If a cause influences an outcome, then it also influences anything the outcome influences, if for no other reason than through the outcome. In other words, if education leads people to have more money, then it indirectly causes anything that having more money causes.
In our example, higher earnings lead people to have more expensive homes, because they use those earnings to buy more expensive real estate. In that event, higher education would also lead people to have more expensive homes, if for no other reason than through earnings.
Post-outcome consequences are fundamentally irrelevant for the question that leads us to estimate our model in the first place, as they occur after the cause-effect relationship and so after the outcome. If including post-outcome consequences changes our coefficient, then all we have done is introduce bias into our coefficient. We never want to include post-outcome consequences as covariates.
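The bias can be seen in a small simulation sketch. The setup is assumed for illustration: education has a true effect of 1 on earnings, and earnings in turn raise a hypothetical `home_value` variable, which education affects in no other way.

```r
set.seed(7)
n <- 10000

# Assumed scenario: education raises earnings (true effect 1), and earnings
# in turn raise home value; education has no other path to home value
education  <- rnorm(n)
earnings   <- 1 * education + rnorm(n)
home_value <- 1 * earnings + rnorm(n)

b_without <- unname(coef(lm(earnings ~ education))["education"])
b_with    <- unname(coef(lm(earnings ~ education + home_value))["education"])

round(c(without_home_value = b_without, with_home_value = b_with), 2)
```

Without home value in the model, the estimate should be close to the assumed true effect of 1. With home value included, it should fall to roughly 0.5 under these particular assumptions, even though nothing about the actual effect of education has changed.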
Key points
If we are trying to estimate a cause-effect relationship:
We have to address confounding, one way or another. Experiments do so in the data collection phase by eliminating the association between variables that would be confounders and the cause. In any event, if confounding exists and we do nothing about it, our estimates of the effect of the cause on the outcome will be wrong. With non-experimental data, we should control for confounders.
We do not have to do anything about post-outcome consequences, and in fact we would introduce bias if we added a post-outcome consequence to our model. We should never control for post-outcome consequences.
We do not include mediators in our model to improve our estimate of the cause-effect relationship. Failing to include mediators in the model does not bias our estimate. But, including mediators in a model is a way of interrogating a relationship between a key explanatory variable and outcome, and trying to understand the mechanisms that bring about the relationship.
A complication with real data is that we may have a set of variables and not be sure whether a given variable is a confounder, mediator, or post-outcome consequence; often, it might seem like a combination of these. For example, “health” could be a confounder, mediator, or post-outcome consequence of the relationship between education and earnings. If all we had in our data were a single measure of general health taken at the same time as our measurement of earnings, we would not really know how to distinguish among these possibilities. Because health at one point in time is correlated with health at other points in time, the fact that our single measurement of health officially refers to the day of the survey interview doesn’t rule out that health was a confounder earlier and that we simply failed to measure it at the time. This is a limitation of our measures that we can do very little about with statistics after the fact, except wish we had more complete measures.