Uncorrelated regressors and logit coefficients

Back when I introduced the linear probability model, I said that many social scientists used to be taught simply that, if you had a binary outcome, the linear probability model was wrong and you needed to use logit instead. I also said that recent years have seen a reappraisal, with more misgivings about logit and more appreciation of the linear probability model.

We have now reached the point where I explain more about perhaps the biggest reason for misgivings about logit.

The big issue

When I talked about adjusting for covariates in the linear regression model, one topic was how adding a covariate to a model changes the coefficient for a key explanatory variable. I said that the change in the coefficient depends on the product of two things:

The association between the covariate and the key explanatory variable.
The association between the covariate and the outcome.

If the added covariate is uncorrelated with either the key explanatory variable or the outcome, the coefficient for the key explanatory variable does not change when the covariate is added.
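To make the product rule concrete, here is a minimal sketch on made-up data. The variable names (x for the key explanatory variable, z for the covariate) and all the numbers are my own assumptions for illustration; they are not part of the voting example that follows.

```r
# Hypothetical data: x is the key explanatory variable, z is the added covariate.
set.seed(1)
n <- 5000
x <- rnorm(n)
z <- 0.6 * x + rnorm(n)                 # z is associated with x
y <- 1 + 0.5 * x + 0.8 * z + rnorm(n)   # z is associated with y

b_short <- coef(lm(y ~ x))[["x"]]       # coefficient for x without the covariate
long    <- coef(lm(y ~ x + z))
b_long  <- long[["x"]]                  # coefficient for x with the covariate
gamma   <- long[["z"]]                  # association between z and y (holding x constant)
delta   <- coef(lm(z ~ x))[["x"]]       # association between z and x

# The change in the coefficient for x equals the product of the two associations.
change  <- b_short - b_long
product <- gamma * delta
print(c(change = change, product = product))
```

If either gamma or delta is zero, the product is zero and the coefficient for x does not move.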

This is not true for the logit model.

Specifically, for logit: if a covariate is associated with the outcome, then adding it to the model will change the coefficient for other explanatory variables, regardless of whether those explanatory variables are associated with the covariate.

Simulation example

We are going to use a simulated example here to help make the issue plain.

Keep in mind that the reason to use a simulation is that it allows me to make the situation more obvious and ominous than if I were to use a real, non-cherry-picked, example.

In our simulated example, we are going to imagine an election-eve experiment in which voters randomly either do or do not receive a special voicemail message from a celebrity encouraging them to vote.

In our study, we will have the variables:

\(\texttt{voicemail}\) = whether the person randomly received the special celebrity voicemail message
\(\texttt{vote}\) = whether the person voted in the election

We also obtained information from a market research company about everyone in our study, which allows us to aggregate this information into two predictors of whether someone votes:

\(\texttt{vote\_last\_time}\) = voted in last election (yes/no)
\(\texttt{sociodem}\) = propensity to vote based on different sociodemographic characteristics (standardized)

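\(\texttt{vote\_last\_time}\) is substantially correlated with \(\texttt{sociodem}\): in the simulation, the continuous variable from which \(\texttt{vote\_last\_time}\) is derived correlates with \(\texttt{sociodem}\) at \(r=.5\).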

The reason we have the two variables is that we are going to show what happens to our coefficient for \(\texttt{voicemail}\) when we add covariates that themselves predict voting increasingly well.

Simulation details

The way our simulation will work is that there will be a continuous variable \(y^*\) that is a function of our explanatory variables and a random error term for each person:

\[ \begin{gather*} y^* = \texttt{voicemail} + \texttt{vote\_last\_time} + \texttt{sociodem} + \texttt{error} \\ \textrm{(Not including coefficients for these terms and a constant)} \end{gather*} \]

One can think of \(y^*\) as representing a person’s ultimate inclination to vote. But we don’t measure that; what we ultimately observe is whether somebody votes or not. In our model, if \(y^*\) is above a threshold \(\tau\), the person votes, and if it is at or below that threshold, they do not.

\[ \begin{align} \texttt{vote} & = 1 \textrm{ if } y^* > \tau \\ \texttt{vote} & = 0 \textrm{ if } y^* \leq \tau \end{align} \]

Expand to show code for simulation

library(tidyverse)
library(MASS)   # for mvrnorm()
library(texreg) # for screenreg(), used to display model results below

set.seed(47)

# create two correlated normalized variables
n <- 1000000
sigma <- matrix(c(1, .5, .5, 1), nrow = 2)
xy <- mvrnorm(n, c(0,0), sigma)

data <- tibble(
  vote_last_time_cont = xy[,1],
  sociodem = xy[,2],
  error = rnorm(n, mean = 0, sd = 1), # random error
)

data <- data %>%
  mutate(vote_last_time = ifelse(vote_last_time_cont > .5, 1, 0)) %>%
  mutate(voicemail = ifelse(runif(n) > 0.5, 1, 0)) # randomly assigned experimental variable


data <- data %>%
  mutate(ystar = (.3*voicemail) + vote_last_time + sociodem + error) %>%
  mutate(vote = ifelse(ystar > 0, 1, 0))

Elaborating proportions in our simulated data

In our simulated data, we can look at the proportion of the entire sample that voted. We get:

data %>%
  group_by() %>%
  summarize(pr1 = mean(vote)) %>%
  mutate(across(where(is.numeric), ~round(., 4)))

# A tibble: 1 × 1
    pr1
  <dbl>
1 0.603

The proportion who voted is 60.3%. We can break this down by whether the respondent received the voicemail:

data %>%
  group_by(voicemail) %>%
  summarize(pr1 = mean(vote)) %>%
  mutate(across(where(is.numeric), ~round(., 4)))

# A tibble: 2 × 2
  voicemail   pr1
      <dbl> <dbl>
1         0 0.568
2         1 0.638

In our simulation, people who didn’t receive the voicemail had a 56.8% chance of voting, and those who did had a 63.8% chance of voting. So the voicemail increased the probability of voting in our sample by about 7 percentage points.

If we fit the linear probability model with only voicemail as our explanatory variable, this is the coefficient we would obtain:

model1 <- lm(vote ~ voicemail, data = data)

# display results
screenreg(
  list(model1),
  custom.model.names = c("Model"),
  custom.coef.map = list(
    "voicemail" = "Celebrity voicemail",
    "(Intercept)" = "Intercept"
  ),
  stars = c(0.01, 0.05, 0.1),
  digits = 3,
  include.rsquared = TRUE,
  include.adjrs = FALSE
)


====================================
                     Model          
------------------------------------
Celebrity voicemail        0.071 ***
                          (0.001)   
Intercept                  0.568 ***
                          (0.001)   
------------------------------------
R^2                        0.005    
Num. obs.            1000000        
====================================
*** p < 0.01; ** p < 0.05; * p < 0.1

Note that the intercept here also matches the probability of voting for those who did not receive the voicemail.
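The correspondence between the coefficients and the group proportions is a general property of the linear probability model with a single binary regressor. Here is a small check on made-up data; the names treated and outcome and the probabilities .55 and .65 are assumptions for illustration only.

```r
# Hypothetical experiment: a binary treatment and a binary outcome.
set.seed(2)
n <- 2000
treated <- rbinom(n, 1, 0.5)
outcome <- rbinom(n, 1, ifelse(treated == 1, 0.65, 0.55))

fit <- lm(outcome ~ treated)

mean_control <- mean(outcome[treated == 0])   # proportion with outcome = 1, control
mean_treated <- mean(outcome[treated == 1])   # proportion with outcome = 1, treated

intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["treated"]]

# The intercept reproduces the control proportion;
# the slope reproduces the difference in proportions.
print(c(intercept = intercept, mean_control = mean_control))
print(c(slope = slope, difference = mean_treated - mean_control))
```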

We can now add to this linear probability model the measure of whether the participant voted last time:

model2 <- lm(vote ~ voicemail + vote_last_time, data = data)

# display results
screenreg(
  list(model2),
  custom.model.names = c("Model"),
  custom.coef.map = list(
    "voicemail" = "Celebrity voicemail",
    "vote_last_time" = "Voted last time",
    "(Intercept)" = "Intercept"
  ),
  stars = c(0.01, 0.05, 0.1),
  digits = 3,
  include.rsquared = TRUE,
  include.adjrs = FALSE
)


====================================
                     Model          
------------------------------------
Celebrity voicemail        0.071 ***
                          (0.001)   
Voted last time            0.428 ***
                          (0.001)   
Intercept                  0.436 ***
                          (0.001)   
------------------------------------
R^2                        0.169    
Num. obs.            1000000        
====================================
*** p < 0.01; ** p < 0.05; * p < 0.1

While having voted before increases the predicted probability of voting by a whopping 42.8 percentage points, it does not affect the average predicted effect of the voicemail.

Voted last time?	Received voicemail?	Predicted Pr(vote)
Yes	Yes	.935
Yes	No	.864
	Difference	.071
No	Yes	.507
No	No	.436
	Difference	.071

The model predicts the same .071 difference in both groups by construction. If we look at the data separately by whether participants voted last time, however, the actual effect of receiving the celebrity voicemail is not the same across groups.

Expand to show code that generates values for table

data %>%
  group_by(vote_last_time, voicemail) %>%
  summarize(mean = mean(vote),
            n = n(),
            .groups = "drop") # drop the grouping so no regrouping message is printed

# A tibble: 4 × 4
  vote_last_time voicemail  mean      n
           <dbl>     <dbl> <dbl>  <int>
1              0         0 0.428 345887
2              0         1 0.514 345255
3              1         0 0.881 154689
4              1         1 0.917 154169

Voted last time?	Received voicemail?	Actual Pr(vote)
Yes	Yes	.917
Yes	No	.880
	Difference	.037
No	Yes	.514
No	No	.428
	Difference	.086

However, when we average these differences, weighting each by the relative size of its group (those who voted last time vs. those who didn’t), the average effect of receiving the voicemail remains the same (.071). Likewise, when we fit the linear probability model, the coefficient remains what it was before, as we showed earlier.
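The averaging can be checked by hand. This sketch types in the rounded proportions and cell counts from the grouped summary printed earlier, so the result is only approximate; the object names are my own.

```r
# Cell proportions and counts copied (rounded) from the grouped summary above.
cells <- data.frame(
  vote_last_time = c(0, 0, 1, 1),
  voicemail      = c(0, 1, 0, 1),
  pr_vote        = c(0.428, 0.514, 0.881, 0.917),
  n              = c(345887, 345255, 154689, 154169)
)

# Voicemail effect within each group (difference in proportions).
diff_no  <- with(cells, pr_vote[vote_last_time == 0 & voicemail == 1] -
                        pr_vote[vote_last_time == 0 & voicemail == 0])
diff_yes <- with(cells, pr_vote[vote_last_time == 1 & voicemail == 1] -
                        pr_vote[vote_last_time == 1 & voicemail == 0])

# Share of the sample in each group.
share_yes <- sum(cells$n[cells$vote_last_time == 1]) / sum(cells$n)
share_no  <- 1 - share_yes

# Average of the two differences, weighted by group size.
weighted_diff <- share_no * diff_no + share_yes * diff_yes
print(round(weighted_diff, 3))
```

Because the inputs are rounded to three decimals, the weighted average should land near .071 rather than match it exactly.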

Indeed, if we also add the sociodemographic predictor of voting, we will see that doing so does not affect our coefficient estimate for the celebrity voicemail either. We can put the estimates from all three models side-by-side:

Expand to show code that fits models and shows results

model3 <- lm(vote ~ voicemail + vote_last_time + sociodem, data = data)

# Display results using texreg
screenreg(
  list(model1, model2, model3),
  custom.model.names = c("Model 1", "Model 2", "Model 3"),
  custom.coef.map = list(
    "voicemail" = "Celebrity voicemail",
    "vote_last_time" = "Voted last time",
    "sociodem" = "Sociodemographics",
    "(Intercept)" = "Intercept"
  ),
  stars = c(0.01, 0.05, 0.1),
  digits = 3,
  include.rsquared = TRUE,
  include.adjrs = FALSE
)


======================================================================
                     Model 1          Model 2          Model 3        
----------------------------------------------------------------------
Celebrity voicemail        0.071 ***        0.071 ***        0.071 ***
                          (0.001)          (0.001)          (0.001)   
Voted last time                             0.428 ***        0.228 ***
                                           (0.001)          (0.001)   
Sociodemographics                                            0.242 ***
                                                            (0.000)   
Intercept                  0.568 ***        0.436 ***        0.497 ***
                          (0.001)          (0.001)          (0.001)   
----------------------------------------------------------------------
R^2                        0.005            0.169            0.377    
Num. obs.            1000000          1000000          1000000        
======================================================================
*** p < 0.01; ** p < 0.05; * p < 0.1

The coefficient for celebrity voicemail does not change across the three models. As with the linear regression model for continuous outcomes, with the linear probability model for binary outcomes, our estimates do not change when we add uncorrelated regressors to the model.

The coefficient for \(\texttt{vote\_last\_time}\) does change between Model 2 and Model 3. But this is because \(\texttt{vote\_last\_time}\) is correlated with the sociodemographics variable.

If we compare the \(R^2\) across models, the additional regressors in the simulation greatly increase the extent to which we have accounted for variation in our outcome. Even though we can predict the outcome much better, this has no bearing on our estimate of the effect of celebrity voicemail.

That it would have no effect on the coefficient makes sense if we consider that our estimate – 7.1 percentage points – can also be interpreted as how we would predict the overall turnout rate would change if everyone had received the celebrity voicemail (vs. nobody receiving it). The 7.1 percentage point increase is an estimate of the change in the population proportion as well as the probability for individuals. That estimated change in the turnout rate does not and should not depend on whether other (uncorrelated) variables are included in the model.

Increasing coefficients for the logit model

If we use the logit model instead of the linear probability model to fit the counterparts to the above table, the coefficients for the celebrity voicemail do not stay the same across the three models.

Expand to show code that fits models and shows results

logit_model1 <- glm(vote ~ voicemail, data = data, family="binomial")
logit_model2 <- glm(vote ~ voicemail + vote_last_time, data = data, family="binomial")
logit_model3 <- glm(vote ~ voicemail + vote_last_time + sociodem, data = data, family="binomial")

# Display results using texreg
screenreg(
  list(logit_model1, logit_model2, logit_model3),
  custom.model.names = c("Model 1", "Model 2", "Model 3"),
  custom.coef.map = list(
    "voicemail" = "Celebrity voicemail",
    "vote_last_time" = "Voted last time",
    "sociodem" = "Sociodemographics",
    "(Intercept)" = "Intercept"
  ),
  stars = c(0.01, 0.05, 0.1),
  digits = 3, 
  include.aic = FALSE,
  include.bic = FALSE,
  include.deviance = FALSE
)


======================================================================
                     Model 1          Model 2          Model 3        
----------------------------------------------------------------------
Celebrity voicemail        0.296 ***        0.354 ***        0.511 ***
                          (0.004)          (0.004)          (0.005)   
Voted last time                             2.316 ***        1.758 ***
                                           (0.006)          (0.007)   
Sociodemographics                                            1.727 ***
                                                            (0.004)   
Intercept                  0.273 ***       -0.295 ***        0.006    
                          (0.003)          (0.003)          (0.004)   
----------------------------------------------------------------------
Log Likelihood       -669141.660      -575858.605      -428310.027    
Num. obs.            1000000          1000000          1000000        
======================================================================
*** p < 0.01; ** p < 0.05; * p < 0.1

As we can see from the table above, the logit coefficients increase from .296 when there were no other regressors in the model to .511 in the full model.

Why does this happen? Instead of probabilities, let’s think of the same quantities that we calculated above in terms of log-odds. Let’s look first at the log-odds of voting in the sample as a whole. (I will also include the probabilities and odds, so that one can see that the log-odds is equal to \(\ln(\textrm{odds})\), or \(\ln\left(\frac{\textrm{Pr}(y=1)}{\textrm{Pr}(y=0)}\right)\).)

Expand to show code that computes results

data %>%
  group_by() %>%
  summarize(pr1 = mean(vote),
            pr0 = 1- pr1,
            odds = pr1/pr0,
            logodds = log(odds)) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  relocate(logodds)

# A tibble: 1 × 4
  logodds   pr1   pr0  odds
    <dbl> <dbl> <dbl> <dbl>
1   0.418 0.603 0.397  1.52

Now we divide the sample into those who did and did not receive a celebrity voicemail.

Expand to show code that computes results

data %>%
  group_by(voicemail) %>%
  summarize(pr1 = mean(vote),
            pr0 = 1- pr1,
            odds = pr1/pr0,
            logodds = log(odds)) %>%
  relocate(logodds)

# A tibble: 2 × 5
  logodds voicemail   pr1   pr0  odds
    <dbl>     <dbl> <dbl> <dbl> <dbl>
1   0.273         0 0.568 0.432  1.31
2   0.568         1 0.638 0.362  1.77

When we do, we can see that the log-odds for those who did not receive the voicemail is .273, while the log-odds for those who did receive the voicemail is .568. The difference (.296 before rounding) corresponds to the coefficient in our logit model (Model 1 above).

As before, we can elaborate this further by looking at the log-odds also by whether or not the participant voted last time:

Expand to show code that computes results

data %>%
  group_by(vote_last_time, voicemail) %>%
  summarize(pr1 = mean(vote),
            pr0 = 1- pr1,
            odds = pr1/pr0,
            logodds = log(odds)) %>%
  relocate(logodds)

Voted last time?	Voicemail?	Actual Log-odds(vote)	Actual Pr(vote)
Yes	Yes	2.40	.917
Yes	No	1.99	.880
	Difference	.41	.037
No	Yes	.055	.514
No	No	-.290	.428
	Difference	.345	.086

For those who voted last time, the difference in the log-odds between those who did and did not receive the voicemail is .41. For those who did not vote last time, the log-odds difference is .345.

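Both these differences are bigger than our original logit coefficient of .296. Any weighted average of them must fall between .345 and .41, so when we combine the two groups we cannot get the Model 1 coefficient of .296. Instead we get .354, which is our coefficient for Model 2. (The logit fit weights the groups roughly by how precisely each difference is estimated, not only by their relative size, which is why .354 sits closer to the difference for those who did not vote last time.)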
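Here is a rough check of how the two group differences combine, using the rounded cell proportions and counts printed earlier. The variance approximation for a cell's log-odds, 1 / (n p (1 - p)), and the comparison of two weighting schemes are my own assumptions for illustration; they are not something the fitted model reports.

```r
# Cell proportions and counts copied (rounded) from the grouped summary above.
cells <- data.frame(
  vote_last_time = c(0, 0, 1, 1),
  voicemail      = c(0, 1, 0, 1),
  pr_vote        = c(0.428, 0.514, 0.881, 0.917),
  n              = c(345887, 345255, 154689, 154169)
)
cells$logodds <- qlogis(cells$pr_vote)                    # ln(p / (1 - p))
cells$info    <- with(cells, n * pr_vote * (1 - pr_vote)) # approx. 1 / variance of the log-odds

# For one group: log-odds difference, group size, and approximate precision.
group_summary <- function(g) {
  d <- cells[cells$vote_last_time == g, ]
  d <- d[order(d$voicemail), ]
  data.frame(
    vote_last_time = g,
    diff   = d$logodds[2] - d$logodds[1],
    n      = sum(d$n),
    weight = 1 / sum(1 / d$info)  # inverse of the approx. variance of the difference
  )
}
groups <- rbind(group_summary(0), group_summary(1))

size_weighted      <- weighted.mean(groups$diff, groups$n)
precision_weighted <- weighted.mean(groups$diff, groups$weight)
print(round(c(size_weighted = size_weighted,
              precision_weighted = precision_weighted), 3))
```

In this sketch the precision-weighted average should land close to the Model 2 coefficient, while weighting by group size alone gives a somewhat larger value. Either way, the combined value lies between the two group differences and above .296.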

Does this mean that logit coefficients are biased?

No, logit coefficients are not biased.

The reason we can be confident that they are not biased is that, even though the coefficients change, what those coefficients imply in terms of the average change in the predicted probability of voting does not change.

Expand to show code to compute average comparisons

library(marginaleffects)
model1_avgcomp <- avg_comparisons(
  logit_model1,
  variables = "voicemail",
  comparison = "difference", 
  newdata = data
)

model2_avgcomp <- avg_comparisons(
  logit_model2,
  variables = "voicemail",
  comparison = "difference", 
  newdata = data
)

model3_avgcomp <- avg_comparisons(
  logit_model3,
  variables = "voicemail",
  comparison = "difference", 
  newdata = data
)

# display average changes in predicted probability
model1_avgcomp


 Estimate Std. Error    z Pr(>|z|)   S  2.5 % 97.5 %
   0.0706   0.000976 72.3   <0.001 Inf 0.0687 0.0725

Term: voicemail
Type: response
Comparison: 1 - 0

model2_avgcomp


 Estimate Std. Error    z Pr(>|z|)   S 2.5 % 97.5 %
   0.0707   0.000892 79.3   <0.001 Inf 0.069 0.0725

Term: voicemail
Type: response
Comparison: 1 - 0

model3_avgcomp


 Estimate Std. Error    z Pr(>|z|)   S  2.5 % 97.5 %
   0.0713   0.000746 95.6   <0.001 Inf 0.0699 0.0728

Term: voicemail
Type: response
Comparison: 1 - 0

Even though the logit coefficients differ over the three models, the changes in the predicted probability are effectively the same.
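The average comparison can also be computed by hand, which may make clearer what it measures. This is a sketch in base R on a smaller simulated sample built to resemble the one above. The data frame sim, the helper avg_change(), and the sample size of 50,000 are my own choices, so the numbers will not match the tables exactly.

```r
# A smaller simulated sample in the spirit of the one above (base R only).
set.seed(47)
n  <- 50000
z1 <- rnorm(n)
sim <- data.frame(
  vote_last_time = as.numeric(z1 > 0.5),
  sociodem       = 0.5 * z1 + sqrt(0.75) * rnorm(n),  # correlated .5 with z1
  voicemail      = rbinom(n, 1, 0.5)                   # randomly assigned
)
ystar    <- 0.3 * sim$voicemail + sim$vote_last_time + sim$sociodem + rnorm(n)
sim$vote <- as.numeric(ystar > 0)

# Average change in predicted probability: predict for everyone with
# voicemail = 1, then with voicemail = 0, and average the differences.
avg_change <- function(fit, data) {
  p1 <- predict(fit, newdata = transform(data, voicemail = 1), type = "response")
  p0 <- predict(fit, newdata = transform(data, voicemail = 0), type = "response")
  mean(p1 - p0)
}

formulas <- list(
  vote ~ voicemail,
  vote ~ voicemail + vote_last_time,
  vote ~ voicemail + vote_last_time + sociodem
)
fits <- lapply(formulas, function(f) glm(f, data = sim, family = binomial))

results <- data.frame(
  model      = 1:3,
  logit_coef = sapply(fits, function(f) coef(f)[["voicemail"]]),
  avg_change = sapply(fits, avg_change, data = sim)
)
print(results)
```

The logit_coef column should grow across the three models while the avg_change column stays roughly constant.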

As noted earlier, in the case of the experiment, the average change in the predicted probability corresponds to how we would expect overall turnout to differ if everyone had received the celebrity voicemail compared with no one receiving it. We would expect turnout to be about 7.1 percentage points higher. This expectation about how the turnout rate would change does not depend on whether we include uncorrelated regressors in our model, and, accordingly, our logit estimates of this overall change do not change.

Why do the logit coefficients change?

The better we are able to predict an outcome, the more our predictions will vary. When we only had \(\texttt{voicemail}\) in the model, every observation had one of two predicted values, depending on which experimental condition it was in. When we added \(\texttt{vote\_last\_time}\), we still only had four predicted values, but they had considerably more range because \(\texttt{vote\_last\_time}\) was such a strong predictor.

In the linear probability model, the predicted effect of the voicemail on the probability of voting does not vary depending on the baseline probability. If the linear probability model coefficient is .071, then we predict a .071 probability increase regardless of whether we are talking about a baseline probability of .1 or .5.

Across Model 1 to Model 3, even though the \(R^2\) increases and the predictions get more spread out, the predicted difference for a given observation remains the same: the coefficient in the linear probability model.

As for the logit model, when the baseline probability varies, the relationship between the change in log-odds and a change in probability also varies. As our baseline probabilities approach 0 or 1, the “same” change in predicted probability corresponds to a bigger change in log-odds.
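A quick way to see this is to apply the same .071 change in probability at several baselines and convert each to a change in log-odds. The baseline values below are arbitrary choices of mine.

```r
# Same change in probability, different baseline probabilities.
baseline       <- c(0.1, 0.3, 0.5, 0.7, 0.9)
change_in_prob <- 0.071

# qlogis() converts a probability to log-odds: ln(p / (1 - p)).
change_in_logodds <- qlogis(baseline + change_in_prob) - qlogis(baseline)

print(data.frame(baseline, change_in_logodds = round(change_in_logodds, 3)))
```

Among these baselines, the change in log-odds is smallest at .5 and grows as the baseline moves toward 0 or 1.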

When we add uncorrelated regressors to our model that help predict the outcome, the predictions will vary more. In order for the effect of the voicemail on the average predicted probability to remain the same as the variation in the baseline probabilities increases, the logit coefficient must change. Specifically, the magnitude of the logit coefficient will increase, and that increase will be greater the better the uncorrelated regressors predict the outcome.

Because the odds ratio is obtained by exponentiating the logit coefficient, the same logic applies to odds ratios as well.
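As a small illustration, we can exponentiate the three voicemail coefficients from the logit table above. The coefficients are typed in by hand here rather than pulled from the fitted models.

```r
# Voicemail coefficients copied from the logit table (Models 1 to 3).
logit_coef <- c(model1 = 0.296, model2 = 0.354, model3 = 0.511)

# Odds ratios are the exponentiated logit coefficients.
odds_ratio <- exp(logit_coef)
print(round(odds_ratio, 2))
```

The odds ratios rise across the models just as the coefficients do, even though the implied average change in probability is the same.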