Sometimes in regression model we want to explain how much of the relationship between an key explanatory variable and outcome is accounted for by another explanatory variable.
For example, when there is a mediator, we may be interested in how much of the relationship between the explanatory variable and outcome is explained by the mediator. If our outcome is an person’s earnings and the explanatory variable is that person’s parents’ education, we might be interested in how much of the association between parents’ education and earnings is explained by their own education.
In OLS, we would answer this question by fitting two models:
A baseline model that includes the key explanatory variable and other covariates but not the mediator.
An full model that includes all the regressors in the baseline model plus the mediator.
We will refer to the coefficient of our key explanatory variable in the first model as \(\beta_{baseline}\) and in the second model as \(\beta_{full}\).
The percent change attributed to the mediator is just the percent change in the coefficient:
\[
\textrm{proportion accounted for by mediator} = \left(\frac{\beta_{baseline} - \beta_{full}}{\beta_{baseline}}\right)
\]
If \(\beta_{baseline}\) was 2 and \(\beta_{full}\) was 1.5, then (2-1.5)/2 = .25, meaning that 25 percent of the relationship between the key explanatory variable and outcome was accounted for by the mediator.
Sometimes in causal analyses the language of total effect and indirect effect is used, where one would say that 25 percent of the total effect of the explanatory variable on the outcome is due to the indirect effect via the mediator.
In OLS, if the mediator is not correlated with the explanatory variable, then the coefficient of the explanatory variable will not change when the mediator is added to the model. In that case, \(\beta_{baseline} = \beta_{full}\), making the result for the above formula 0.
But in logit models, as we have already seen, coefficients do change when variables uncorrelated with the key explanatory variable are added to the model, as long as the added variables are associated with the outcome. Specifically, logit coefficients when uncorrelated regressors that predict the outcome become larger in maginitude.
When we talk about mediation, this rescaling may seem undesirable for accurately characterizing the extent to which the mediator accounts for the explanatory variable and the outcome. We might consider the true amount of mediation instead to reflect the change of the coefficient net of the consequence of the rescaling of the underlying latent variable.
Aside: This is not just about mediation. The basic points here do not apply just to mediation; one might be interested in how much of an association is in fact the result of a confounded. But for simplicity of language I am going to refer to the variable whose impact on the relationship between the key explanatory variable and outcome we are interested in assessing as a mediator.
Solutions
There are several ways that one could eliminate the scaling artifact.
The average marginal effects of a logit model are not affected by the addition of uncorrelated regressors. So instead of looking at the percentage change in the coefficient, one could look at the percentage change in the average marginal effects.
Like OLS, the coefficients of the linear probability model are not affected by the addition of uncorrelated regressors. So one could study mediation using a linear probability model rather than logit/probit.
The Karlsen-Holm-Breen method, often referred to as the KHB method, involves making a residualized version of the mediating variable. The residualized mediator is uncorrelated with the key explanatory variable but retains its independent explanatory power for predicting the outcome. The residualized mediator can be used to determine how much of the change in the logit/probit coefficient of the key explanatory variable is due to the scaling difference, and so then one can “correct” the estimated change so that it is net of the scaling difference.
Logit coefficients can be transformed using y-standardization to coefficients for which the standard deviation of the underlying latent variable is 1. Hence, comparing y-standardized coefficients across models does not have any scaling difference.
Below I am going to work through each of these methods. Here is the big picture of the overall lesson, though.
Big Picture points about these solutions
If the idea is to explain what percentage of the relationship between the explanatory variable and outcome is explained by a third variable (e.g., a mediator), comparing logit coefficients will understate the amount of change relative to these other methods. The problem motivated a little flurry of papers about this issue: comparing logit coefficients underestimates mediation.
Average marginal effects, \(y\)-standardization, and the KHB method, both in the example below and my more general experience, give quite similar results. The authors of the KHB model have done some simulations showing that their method performs better under certain conditions, but I am a bit skeptical that these differences are big enough or broad enough in practice to be worth it.
The linear probability model results are going to be more different than the other methods. If you think doing the logit model vs. the linear probability model is better in the first place, it’s not clear to me why you would change your mind because of this issue given that the average marginal effects are straightforward enough to calculate.
My conclusion:If I had an application, like mediation analysis, where I needed to be able to compare the magnitude of results across models, my go-to approach would be average marginal effects.
Example
Our outcome is subjective social class identification in the 2000-2018 General Social Survey. Our question is the extent to which the relationship between a person’s education and their subjective social class identification is mediated by their household income.
In the GSS, social class identification is measured using four categories: upper, middle, working, and lower. We make a dichotomous variable called \(\texttt{middleupper}\) that \(=1\) if upper or middle and \(=0\) if working or lower.
The variable in the GSS is household income in 1986 dollars. Because 2010 is about the midpoint between the range of years we are examining (2000-2018), I converted it to 2010 dollars instead. From online CPI calculators, I converted 1986 dollars to 2010 dollars by multiplying by 1.99; the new variable is called \(\texttt{inc2010}\).
Expand to show code for opening and recoding data
# dependencieslibrary(tidyverse)library(haven)library(marginaleffects)library(texreg)options(scipen =9)df <-read_dta("../dta/gss_soc383.dta") %>%filter(year >=2000& year <=2018) %>%mutate(middleupper =case_when( class %in%1:2~0, class %in%3:4~1,TRUE~NA_real_ )) %>%mutate(inc2010 = (realinc *1.99)/1000) %>%# convert to 2010 dollarsdrop_na(middleupper, educ, realinc)
In the GSS, social class identification is measured using four categories: upper, middle, working, and lower. We are going to make a dichotomous variable called \(\texttt{middleupper}\) that \(=1\) if upper or middle and \(=0\) if working or lower.
. use ../dta/gss_soc383.dta, clear
(GSS 1972-2018: selected variables)
. drop if year < 2000 | year > 2018
(38,116 observations deleted)
. recode class (1/2=0) (3/4=1) (*=.), gen(middleupper)
(26698 differences between class and middleupper)
The GSS measure of real household income (\(\mathtt{realinc}\)) is measured in 1986 dollars. We will convert this to 2010 dollars – according to online sources this requires inflating by a factor of 1.99 – and we will make the metric of our new variable \(\texttt{inc2010}\) in thousands of dollars instead of dollars.
. gen inc2010 = (realinc * 1.99)/1000
(3,025 missing values generated)
. label variable inc2010 "household income in 2010 dollars (/1000)"
We will also drop cases with missing data so that we do not have to worry about the sample size changing if we add a variable with missing data to a model.
. drop if middleupper >= .
(1,692 observations deleted)
. drop if educ >= .
(43 observations deleted)
. drop if realinc >= .
(2,696 observations deleted)
Difference in logit coefficients
We fit two logit models, one without the mediator and one with the mediator.
The logit coefficient decreases from .2336 to .1522, a difference of 34.4% \(\left(\frac{.2336 - .1532}{.2336}\right)\).
Again, if this was OLS we might then offer the interpretation that household income accounts for 34.4% of the association and class identification. But someone might object to this statement given that logit coefficients increase when uncorrelated variables are added to the model.
Methods for estimating % mediation
Average marginal change
While the logit coefficients change when uncorrelated regressors are added to the model, the average marginal change does not. Accordingly, if we measure the % difference between models in terms of the average marginal change, the estimate will not be affected by the inclusion of uncorrelated regressors.
Expand to show code for fitting models and displaying results
reduced_ame <-avg_comparisons( reduced_model,variables ="educ",comparison ="dydx", vcov =FALSE,wts =TRUE# use weights when computing average - default is FALSE!)reduced_ame
The difference is .0528 vs .0318, or a decrease of (.0528 - .0318) / .0528 = 39.7%. We could interpret this as:
Household income accounts for 39.7% of the relationship between educational attainment and whether someone identifies as upper or middle class (vs. working or lower class).
After fitting each logit model, we calculate average marginal effects for educ by using \(\texttt{margins}\) with the \(\texttt{dydx()}\) option.
. quietly logit middleupper educ [pweight=wtssall]
. margins, dydx(educ)
Average marginal effects Number of obs = 22,267
Model VCE: Robust
Expression: Pr(middleupper), predict()
dy/dx wrt: educ
------------------------------------------------------------------------------
| Delta-method
| dy/dx std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
educ | .0527973 .0012731 41.47 0.000 .0503021 .0552925
------------------------------------------------------------------------------
. quietly logit middleupper educ inc2010 [pweight=wtssall]
. margins, dydx(educ)
Average marginal effects Number of obs = 22,267
Model VCE: Robust
Expression: Pr(middleupper), predict()
dy/dx wrt: educ
------------------------------------------------------------------------------
| Delta-method
| dy/dx std. err. z P>|z| [95% conf. interval]
-------------+----------------------------------------------------------------
educ | .0316367 .0013925 22.72 0.000 .0289074 .034366
------------------------------------------------------------------------------
The difference is .05280 vs .0316, or a decrease of 40.1%. We could interpret this as:
Household income accounts for 40.1% of the relationship between educational attainment and whether someone identifies as upper or middle class (vs. working or lower class).
Change in linear probability model coefficients
Linear probability model coefficients are often similar to the average marginal change. So, someone interested in comparing across models might decide to eschew fitting a logit model in favor of a linear probability model.
=========================================================
Without income With income
---------------------------------------------------------
Years of education 0.0507 *** 0.0322 ***
(0.0011) (0.0011)
Household income (2010 $) 0.0021 ***
(0.0000)
Intercept -0.2168 *** -0.1202 ***
(0.0149) (0.0144)
---------------------------------------------------------
R^2 0.0916 0.1676
Num. obs. 22267 22267
=========================================================
*** p < 0.01; ** p < 0.05; * p < 0.1
The difference in coefficients here is .0507 vs. .0322, or a difference of 36.4% We could offer the same interpretation as above, only using 36.4 instead of the earlier result.
. regress middleupper educ [pweight=wtssall]
(sum of wgt is 21,967.0887912027)
Linear regression Number of obs = 22,267
F(1, 22265) = 1792.78
Prob > F = 0.0000
R-squared = 0.0916
Root MSE = .47589
------------------------------------------------------------------------------
| Robust
middleupper | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
educ | .0507396 .0011984 42.34 0.000 .0483908 .0530885
_cons | -.2167826 .016873 -12.85 0.000 -.2498549 -.1837104
------------------------------------------------------------------------------
. regress middleupper educ inc2010 [pweight=wtssall]
(sum of wgt is 21,967.0887912027)
Linear regression Number of obs = 22,267
F(2, 22264) = 2293.40
Prob > F = 0.0000
R-squared = 0.1676
Root MSE = .45556
------------------------------------------------------------------------------
| Robust
middleupper | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
educ | .0322284 .0012773 25.23 0.000 .0297248 .034732
inc2010 | .0020581 .0000494 41.70 0.000 .0019614 .0021548
_cons | -.120162 .0164134 -7.32 0.000 -.1523335 -.0879905
------------------------------------------------------------------------------
The difference in coefficients here is .05074 vs. .03223, or a difference of 36.5% We could offer the same interpretation as above, only using this instead of the 40.1% that was above.
KHB method
The KHB method (for Karlson, Holm, and Breen) involves first generating a measure of income (the mediator) that is residualized on education (the key explanatory variable). That is, we use OLS regression to predict income from education. Residualized income is then the difference between observed income and the predicted income from this regression.
# ols predicting income from educationols <-lm(inc2010 ~ educ, data = df, weights = wtssall)# create new variable that is residualized incomedf <- df %>%mutate(inc2010_resid =predict(ols) - inc2010)
While income and education are correlated, residualized income and education are not.
library(weights) # for obtaining weighted correlations# show that inc2010 and educ are correlatedwtd.cor(x = df$inc2010, y = df$educ, weight = df$wtssall)
correlation std.err t.value p.value
Y 0.3718302 0.006221246 59.76781 0
# show that inc2010 and educ are not correlatedwtd.cor(x = df$inc2010_resid, y = df$educ, weight = df$wtssall)
correlation std.err t.value p.value
Y 2.86596e-15 0.006701757 0.0000000000004276431 1
If we include residualized income in a logit model with education, the coefficient of education will be larger than the logit coefficient because we are adding uncorrelated information that predicts the outcome.
Expand to show code for fitting models and displaying results
=======================================================
Bivariate With resid. income
-------------------------------------------------------
Years of education 0.234 *** 0.254 ***
(0.005) (0.006)
Residualized income -0.011 ***
(0.000)
Intercept -3.300 *** -3.514 ***
(0.077) (0.079)
-------------------------------------------------------
Log Likelihood -13433.227 -12482.607
Num. obs. 22267 22267
=======================================================
*** p < 0.01; ** p < 0.05; * p < 0.1
We can think of the model with residualized income as indicating what the logit coefficient would have been if income was uncorrelated with education but still predicted the outcome. In effect, we will use the coefficient of the .254 from the second model in the table instead of the .234 in order to assess the decrease in the coefficient when (unresidualized) income is added to the logit model.
. regress inc2010 educ [pweight=wtssall]
(sum of wgt is 21,967.0887912027)
Linear regression Number of obs = 22,267
F(1, 22265) = 2283.74
Prob > F = 0.0000
R-squared = 0.1383
Root MSE = 66.884
------------------------------------------------------------------------------
| Robust
inc2010 | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
educ | 8.99433 .1882111 47.79 0.000 8.625423 9.363237
_cons | -46.94649 2.415632 -19.43 0.000 -51.6813 -42.21168
------------------------------------------------------------------------------
. predict yhat
(option xb assumed; fitted values)
. gen inc2010_resid = yhat - inc2010
As the remaining step, we will compare logit estimates in which residualized vs. unresidualized income are included along with education in the model.
========================================================
Resid. income Unresid. income
--------------------------------------------------------
Years of education 0.254 *** 0.153 ***
(0.006) (0.006)
Residualized income -0.011 ***
(0.000)
(Unresidualized) income 0.011 ***
(0.000)
Intercept -3.514 *** -2.988 ***
(0.079) (0.078)
--------------------------------------------------------
Log Likelihood -12482.607 -12482.607
Num. obs. 22267 22267
========================================================
*** p < 0.01; ** p < 0.05; * p < 0.1
The decrease is (.254 - .153) / .254, or 39.8%
Then we compare the coefficient for education for a model that includes residualized income and the coefficient for eduacation for a model that includes the actual income measure.
The difference in coefficients is .2539 vs. .1532, or 39.7%.
Change in y-standardized coefficients
Note: Not yet implemented in R.
We can compute the y-standardized coefficients in Stata using the add-on command \(\mathtt{listcoef}\). We run into a slight problem, however, as listcoef only reports results to 3 decimal places. However, results from \(\mathtt{listcoef}\)’s calculations are stored after the command is run in a matrix called \(\mathtt{r(table)}\), and so we can use that to get the four significant digits we have been using in our earlier calculations.
. quietly logit middleupper educ [pweight=wtssall]
. listcoef, std
note: pweights not compatible with summarize; weights treated as aweights.
logit (N=22267): Unstandardized and standardized estimates
Observed SD: 0.4993
Latent SD: 1.9426
-------------------------------------------------------------------------------
| b z P>|z| bStdX bStdY bStdXY SDofX
-------------+-----------------------------------------------------------------
educ | 0.2336 34.515 0.000 0.696 0.120 0.358 2.979
constant | -3.3000 -34.574 0.000 . . . .
-------------------------------------------------------------------------------
. mat list r(table)
r(table)[2,7]
b z P>|z| bStdX bStdY bStdXY SDofX
educ .23357486 34.514975 4.78e-261 .69571051 .1202353 .35812484 2.9785334
constant -3.2999562 -34.573959 6.22e-262 . . . .
. quietly logit middleupper educ inc2010 [pweight=wtssall]
. listcoef, std
note: pweights not compatible with summarize; weights treated as aweights.
logit (N=22267): Unstandardized and standardized estimates
Observed SD: 0.4993
Latent SD: 2.1031
-------------------------------------------------------------------------------
| b z P>|z| bStdX bStdY bStdXY SDofX
-------------+-----------------------------------------------------------------
educ | 0.1532 21.539 0.000 0.456 0.073 0.217 2.979
inc2010 | 0.0112 26.868 0.000 0.807 0.005 0.384 72.049
constant | -2.9879 -31.422 0.000 . . . .
-------------------------------------------------------------------------------
. mat list r(table)
r(table)[3,7]
b z P>|z| bStdX bStdY bStdXY SDofX
educ .15317255 21.539081 6.70e-103 .45622954 .0728329 .21693522 2.9785334
inc2010 .01119957 26.868427 5.14e-159 .80691507 .00532535 .3836847 72.04877
constant -2.9878963 -31.422282 1.00e-216 . . . .
The \(\texttt{listcoef}\) output would lead us to calculate a change of .120 vs. .078, which is 35.0%. But, using the expanded precision from looking at more decimal places we get a difference of .12024 vs. .07283, or 39.4%.