and the errors were distributed normally with a mean of 0 and standard deviation of 1.
The unobserved \(y^\ast\) mapped onto our observed binary variable according to whether \(y^\ast\) was greater than a threshold \(\tau\).
For ordered responses, we will follow the same principle. The key difference is that, instead of having one threshold \(\tau\), we will have \(k-1\) thresholds, where \(k\) is our number of categories.
As for how this maps onto the ordered variable we observe, imagine a survey item that has as its response categories: (1) strongly disagree (2) disagree (3) agree (4) strongly agree. We would have three thresholds: \(\tau_1\), \(\tau_2\), and \(\tau_3\). We would observe:
\(y =\) “strongly disagree” if \(y^\ast < \tau_1\)
\(y =\) “disagree” if \(\tau_1 < y^\ast < \tau_2\)
\(y =\) “agree” if \(\tau_2 < y^\ast < \tau_3\)
\(y =\) “strongly agree” if \(y^\ast > \tau_3\)
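This mapping is a pure lookup: count how many thresholds \(y^\ast\) has crossed. Here is a minimal sketch in Python; the threshold values are hypothetical, chosen only for illustration.

```python
import bisect

# Hypothetical thresholds tau_1, tau_2, tau_3 (illustrative values only)
TAUS = [-1.5, 0.0, 1.5]
LABELS = ["strongly disagree", "disagree", "agree", "strongly agree"]

def observe(y_star):
    """Map a latent y* onto the observed category.

    bisect_left counts how many thresholds y* has crossed,
    which is exactly the index of the observed category.
    """
    return LABELS[bisect.bisect_left(TAUS, y_star)]
```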
For purposes of estimation, all this needs to map onto the number line somehow.
In the binary probit model, we did this by constraining \(\tau\) to 0 because that made things simplest. Now that we have multiple thresholds, that constraint is no longer the simplest choice.
Instead, we will constrain the constant of our model (\(\beta_0\)) to be 0, and we will estimate values for the thresholds at the same time that we estimate our coefficients.
Here is a visualization of the ordered probit model for a three-category outcome.
In this visualization, notice that:
As x increases, the probability of the lowest (red) category decreases.
As x increases, the probability of the highest (blue) category increases.
As x increases, the probability of the middle (green) category first increases and then decreases.
For this middle category, whether the probability increases or decreases depends on whether more of the normal curve is shifting over \(\tau_1\) (thus entering from the lowest category) or shifting over \(\tau_2\) (thus exiting into the highest category).
Predicted probability of different outcomes
We will be able to fit our model using maximum likelihood so long as we have a means of computing the predicted probability of observing each of our outcome categories, for whatever value of \(\mathbf{x}\mathbf{\beta}\).
The math is easiest if we consider our \(k\) categories to be numbered as consecutive integers starting with 1.
We proceed again as in binary probit, by remembering that \(y^\ast = \mathbf{x}\mathbf{\beta} + \varepsilon\) and that \(\varepsilon\) is distributed as standard normal. Consequently, for a given value of \(\mathbf{x}\mathbf{\beta}\), the predicted probability will be based on what different random draws of \(\varepsilon\) imply for \(\mathbf{x}\mathbf{\beta} + \varepsilon\) relative to our thresholds.
For binary outcomes, the predicted probability that \(y = 1\) was \(\mathtt{cdf}_{\mathtt{normal}}(\mathbf{x}\mathbf{\beta} - \tau)\), where \(\mathtt{cdf}_{\mathtt{normal}}()\) is the cumulative distribution function of the standard normal distribution.
For ordered outcomes, we can follow the same reasoning to consider \(\Pr(y > m)\), where \(m\) is one of our categories. Then, \(\Pr(y > m) = \mathtt{cdf}_{\mathtt{normal}}(\mathbf{x}\mathbf{\beta}-\tau_m)\).
To make the math complete, we will add extra thresholds past the ends of our latent continuum:
\(\tau_0\) at \(-\infty\) so that \(\Pr(y > 0) = 1\)
\(\tau_k\) at \(\infty\) so that \(\Pr(y > k) = 0\)
Given that we can calculate \(\Pr(y > m)\) for any outcome, we can then calculate \(\Pr(y=m)\) for any outcome using subtraction:
\(\Pr(y = m) = \Pr(y > m-1) - \Pr(y > m) = \mathtt{cdf}_{\mathtt{normal}}(\mathbf{x}\mathbf{\beta}-\tau_{m-1}) - \mathtt{cdf}_{\mathtt{normal}}(\mathbf{x}\mathbf{\beta}-\tau_m)\)
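Putting the pieces together, the category probabilities take only a few lines of Python. This is a sketch, not code from any particular package: \(\mathtt{cdf}_{\mathtt{normal}}\) is built from the error function, and the infinite end thresholds make the subtraction work at the extremes.

```python
import math

def norm_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def category_probs(xb, taus):
    """Pr(y = m) for m = 1..k, given x*beta and the k-1 interior thresholds."""
    # Pad with tau_0 = -inf and tau_k = +inf so Pr(y > 0) = 1 and Pr(y > k) = 0.
    padded = [-math.inf] + list(taus) + [math.inf]
    pr_gt = [norm_cdf(xb - t) for t in padded]  # Pr(y > m) for m = 0..k
    # Pr(y = m) = Pr(y > m - 1) - Pr(y > m)
    return [pr_gt[m - 1] - pr_gt[m] for m in range(1, len(padded))]
```

Note that the probabilities always sum to 1, because the subtraction telescopes from \(\Pr(y > 0) = 1\) down to \(\Pr(y > k) = 0\).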
Let’s demonstrate how predicted probabilities are calculated with an example. The General Social Survey starts with a battery of items asking respondents for their opinion about national spending in several domains. One is spending on “improving and protecting the nation’s health.” The response categories are: (1) too little, (2) about right, and (3) too much. We will consider this categorical variable as the observed manifestation of the underlying latent variable “opposition to government health care spending.”
We fit a simple model in which our explanatory variables are sex (measured as male/female) and political party (measured as Dem/Ind/Rep). We fit the model using the probit model for ordered responses, which in Stata can be fit using the \(\texttt{oprobit}\) command.
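The Stata output itself is not reproduced here, but the maximum-likelihood logic \(\texttt{oprobit}\) relies on can be sketched in Python. The sketch below uses simulated data and made-up parameter values, not the GSS data or its estimates: it simulates a three-category outcome from a known ordered probit model and evaluates the log likelihood that estimation would maximize.

```python
import math
import random

def norm_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def log_likelihood(beta, taus, xs, ys):
    """Ordered-probit log likelihood with the constant constrained to 0."""
    padded = [-math.inf] + list(taus) + [math.inf]
    ll = 0.0
    for x, y in zip(xs, ys):  # y is coded 1..k
        xb = beta * x
        p = norm_cdf(xb - padded[y - 1]) - norm_cdf(xb - padded[y])
        ll += math.log(p)
    return ll

# Simulate a three-category outcome from known (hypothetical) parameters.
random.seed(1)
TRUE_BETA, TRUE_TAUS = 0.8, (-0.5, 0.7)
xs = [random.gauss(0, 1) for _ in range(2000)]
ys = []
for x in xs:
    y_star = TRUE_BETA * x + random.gauss(0, 1)     # latent y* = x*beta + error
    ys.append(1 + sum(y_star > t for t in TRUE_TAUS))  # count thresholds crossed
```

With enough observations, the log likelihood evaluated at the true parameters will beat the log likelihood at distorted parameters, which is what lets maximum likelihood recover both the coefficients and the thresholds.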
With the binary probit model, we said that we could have derived the binary logit model following the same reasoning with a couple of twists. Specifically, instead of assuming the errors are normally distributed, we assume they follow the logistic distribution, and instead of assuming the standard deviation of the error term is 1, we assume that it is \(\frac{\pi}{\sqrt{3}}\). We assume both these things because then and only then is the resulting model equivalent to what we developed earlier by writing a model in which the outcome was log odds.
If we make the same changes to the above model, we get what’s known as the ordered logit model instead of the ordered probit model. There’s a different way to get to the ordered logit model that is more directly in the spirit of logit as a log odds model, and so when we talk about it more it will be in that context.
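Concretely, the only computational change from ordered probit to ordered logit is which CDF we apply to \(\mathbf{x}\mathbf{\beta}-\tau_m\). The sketch below writes the probability calculation generically, so the error distribution is a swappable assumption rather than baked in.

```python
import math

def norm_cdf(z):
    """Standard normal CDF (ordered probit)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(z):
    """Standard logistic CDF (ordered logit); its sd is pi/sqrt(3), not 1."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(xb, taus, cdf):
    """Pr(y = m) for m = 1..k under whichever error distribution `cdf` encodes."""
    padded = [-math.inf] + list(taus) + [math.inf]
    pr_gt = [cdf(xb - t) for t in padded]  # Pr(y > m) for m = 0..k
    return [pr_gt[m - 1] - pr_gt[m] for m in range(1, len(padded))]
```

Because the logistic error distribution has a larger standard deviation and heavier tails, the same \(\beta\)s and \(\tau\)s yield different probabilities under the two CDFs; in practice the fitted coefficients rescale to compensate, which is why probit and logit estimates differ by roughly a constant factor.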