The multinomial logit model is the most commonly used model for unordered outcomes.
In the multinomial logit model, one category is chosen as the base category. The choice is arbitrary: the estimates obtained using one base category can be exactly recovered from estimates obtained using a different one.
Multiple contrasts, multiple sets of coefficients
The multinomial logit model has a different set of \(\mathbf{\beta}\) coefficients for each of the other categories. Each of those \(\mathbf{\beta}\) is like a binary logit of that category vs. the base category.
Consider our example in which the categories of the outcome are liberal, moderate, and conservative. We will use moderate as our base category. Our model has two equations and two sets of beta coefficients.
In R, we fit the model using multinom() from the nnet package. Because I find the output from multinom() hard to understand, I will use the tula() function from my tulaverse package to reformat it.
library(nnet)
library(tulaverse)
model <- multinom(libcon3cat ~ ed3cat + race4cat + age, data = data, trace = FALSE)
tula(model)
AIC = 117605.442                 Number of obs = 55052
BIC = 117730.266                 McFadden R-sq = 0.02798
Log likelihood = -58788.721
────────────────────────────────────────────────────────────────────────────────
lib             │       Coef   Std. Err.        z    P>|z|   [95% Conf Interval]
────────────────────────────────────────────────────────────────────────────────
ed3cat          │
  Some college  │      .2881      .02661    10.83   <.0001       .236      .3403
  College grad  │      .9223       .0271    34.04   <.0001      .8691      .9754
race4cat        │
  Black         │       .331      .03078    10.75   <.0001      .2707      .3913
  Hispanic      │      .1434      .04041    3.549    .0004     .06423      .2226
  Other         │     .06598      .06215    1.062    .2883    -.05582      .1878
age             │   -.008382    .0006463   -12.97   <.0001   -.009649   -.007116
(Intercept)     │     -.3212      .03505   -9.162   <.0001     -.3899     -.2524
────────────────────────────────────────────────────────────────────────────────
con             │       Coef   Std. Err.        z    P>|z|   [95% Conf Interval]
────────────────────────────────────────────────────────────────────────────────
ed3cat          │
  Some college  │      .2332      .02487    9.377   <.0001      .1845       .282
  College grad  │      .6161      .02608    23.62   <.0001       .565      .6673
race4cat        │
  Black         │     -.3289      .03258   -10.09   <.0001     -.3928      -.265
  Hispanic      │     -.2192      .04082    -5.37   <.0001     -.2992     -.1392
  Other         │     -.3165      .06441   -4.914   <.0001     -.4427     -.1903
age             │    .006492    .0005826    11.14   <.0001    .005351    .007634
(Intercept)     │     -.5668      .03309   -17.13   <.0001     -.6317      -.502
────────────────────────────────────────────────────────────────────────────────
Base outcome: mod
I have also implemented an option parallel in tula() that presents coefficients for the different equations of a multinomial logit model side-by-side:
tula(model, parallel=TRUE)
AIC = 117605.442                 Number of obs = 55052
BIC = 117730.266                 McFadden R-sq = 0.02798
Log likelihood = -58788.721
────────────────────────────────────────────────────────────────
libcon3cat      │              lib                 con
────────────────────────────────────────────────────────────────
ed3cat          │
  Some college  │            .2881***            .2332***
                │          (.02661)            (.02487)
  College grad  │            .9223***            .6161***
                │           (.0271)            (.02608)
race4cat        │
  Black         │             .331***           -.3289***
                │          (.03078)            (.03258)
  Hispanic      │            .1434***           -.2192***
                │          (.04041)            (.04082)
  Other         │           .06598              -.3165***
                │          (.06215)            (.06441)
age             │         -.008382***          .006492***
                │        (.0006463)          (.0005826)
(Intercept)     │           -.3212***           -.5668***
                │          (.03505)            (.03309)
────────────────────────────────────────────────────────────────
Base outcome: mod
* p<0.05 ** p<0.01 *** p<0.001
In this output, we have two sets of coefficients for the same explanatory variables. One, labeled “lib”, is for the contrast of liberal vs. moderate, and one, labeled “con”, is for the contrast of conservative vs. moderate. The coefficients labeled lib are our \(\mathbf{\beta}_{\textrm{LIBvMOD}}\), and the coefficients labeled con are our \(\mathbf{\beta}_{\textrm{CONvMOD}}\).
As we discussed earlier, we do not estimate the contrast that doesn’t involve the base category (lib vs. con), but we can obtain it by simple subtraction: \[
\mathbf{\beta}_{\textrm{LIBvCON}} = \mathbf{\beta}_{\textrm{LIBvMOD}} - \mathbf{\beta}_{\textrm{CONvMOD}}
\]
For example, the coefficient of some college for the liberal vs. conservative contrast would be the difference between the corresponding lib vs. mod (.2881) and con vs. mod (.2332) coefficients, or \(.2881 - .2332 = .0549\).
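The subtraction can be checked for every coefficient at once. This is a minimal sketch in base R that assumes the coefficients are typed in by hand from the output above; the object names (`term_names`, `b_lib_v_mod`, and so on) are my own and not part of any package. It also shows the same logic applied to re-expressing the model with lib as the base category, which is why the choice of base is arbitrary.

```r
# Coefficients typed in from the output above (base outcome: mod).
term_names <- c("Some college", "College grad", "Black", "Hispanic",
                "Other", "age", "(Intercept)")
b_lib_v_mod <- c(.2881, .9223, .331, .1434, .06598, -.008382, -.3212)
b_con_v_mod <- c(.2332, .6161, -.3289, -.2192, -.3165, .006492, -.5668)
names(b_lib_v_mod) <- names(b_con_v_mod) <- term_names

# The contrast that does not involve the base category is the
# difference between the two estimated contrasts.
b_lib_v_con <- b_lib_v_mod - b_con_v_mod
round(b_lib_v_con, 4)

# The same subtraction re-expresses the model with lib as the base:
b_mod_v_lib <- -b_lib_v_mod               # mod vs. lib
b_con_v_lib <- b_con_v_mod - b_lib_v_mod  # con vs. lib
```

Refitting the model with a different base category should reproduce these values up to rounding in the displayed coefficients.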
To make the math ultimately simpler, we will also note that if we consider the contrast of the base category with itself, the betas are 0. That is, in our example, if we consider the case in which: \[
\ln\left[\frac{\Pr(y_i=\textrm{MOD})}{\Pr(y_i=\textrm{MOD})}\right] = \mathbf{x}_i \mathbf{\beta}_{\textrm{MODvMOD}}
\]
\(\frac{\Pr(y_i=\textrm{MOD})}{\Pr(y_i=\textrm{MOD})}\) is always 1. Therefore, \(\ln \left[\frac{\Pr(y_i=\textrm{MOD})}{\Pr(y_i=\textrm{MOD})}\right]\) is always 0, and all \(\beta_{\textrm{MODvMOD}} = 0\).
Predicted probabilities for the multinomial logit model
To estimate \(\mathbf{\beta}\) via maximum likelihood, we need to calculate the predicted probability of each outcome category given observed \(\mathbf{x}\) and candidate estimates of \(\mathbf{\beta}\).
The above equations provide everything we need. For example, say we want to calculate the probability of being liberal given \(\mathbf{x}_i\) and \(\mathbf{\beta}\).
\[
\begin{align}
\Pr(y_i = \textrm{LIB}) & = \frac{\Pr(y_i = \textrm{LIB})}{1} \\
\\
& = \frac{\Pr(y_i = \textrm{LIB})}{\Pr(y_i = \textrm{LIB})+\Pr(y_i = \textrm{CON})+\Pr(y_i = \textrm{MOD})}
\end{align}
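\] We can make the latter substitution above because the sum of the probabilities of all our categories is 1. If we now divide the numerator and every term of the denominator by \(\Pr(y_i = \textrm{MOD})\), we get: \[
\Pr(y_i = \textrm{LIB}) = \frac{\frac{\Pr(y_i = \textrm{LIB})}{\Pr(y_i = \textrm{MOD})}}{\frac{\Pr(y_i = \textrm{LIB})}{\Pr(y_i = \textrm{MOD})}+\frac{\Pr(y_i = \textrm{CON})}{\Pr(y_i = \textrm{MOD})}+\frac{\Pr(y_i = \textrm{MOD})}{\Pr(y_i = \textrm{MOD})}}
\]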
Going back to the equations for the multinomial logit model, we can obtain different terms above by exponentiating the \(\mathbf{x}_i\mathbf{\beta}\). For example: \[
\frac{\Pr(y_i=\textrm{LIB})}{\Pr(y_i=\textrm{MOD})} = \exp\left(\mathbf{x}_i\mathbf{\beta}_\textrm{LIBvMOD}\right)
\] If we make this substitution for all of the terms above, we get: \[
\Pr(y_i = \textrm{LIB}) = \frac{\exp\left(\mathbf{x}_i\mathbf{\beta}_\textrm{LIBvMOD}\right)}{\exp\left(\mathbf{x}_i\mathbf{\beta}_\textrm{LIBvMOD}\right)+\exp\left(\mathbf{x}_i\mathbf{\beta}_\textrm{CONvMOD}\right)+\exp\left(\mathbf{x}_i\mathbf{\beta}_\textrm{MODvMOD}\right)}
\]
To make clear how this works, let’s take a hypothetical case in which we have \(\mathbf{x}\) and \(\mathbf{\beta}\) such that \(\mathbf{x}_i\mathbf{\beta}_\textrm{LIBvMOD} = 1\) and \(\mathbf{x}_i\mathbf{\beta}_\textrm{CONvMOD} = -1\). (And, as above, \(\mathbf{x}_i\mathbf{\beta}_\textrm{MODvMOD}\) is always 0.) In that case:
\[
\begin{align}
\Pr(y_i = \textrm{LIB}) & = \frac{\exp(1)}{\exp(1)+\exp(-1)+\exp(0)} \\
\\
& = \frac{2.718}{2.718+.368+1} \\
\\
& = \frac{2.718}{4.086} \\
\\
& = .665
\end{align}
\] That is, given \(\mathbf{x}_i\) and candidate \(\mathbf{\beta}\), the probability of liberal is .665. We can calculate the predicted probabilities for conservative and moderate by just changing which \(\beta\) is included in the numerator: \[
\begin{align}
\Pr(y_i = \textrm{CON}) & = \frac{\exp(-1)}{\exp(1)+\exp(-1)+\exp(0)} \\
\\
& = \frac{.368}{2.718+.368+1} \\
\\
& = \frac{.368}{4.086} \\
\\
& = .090
\end{align}
\] and: \[
\begin{align}
\Pr(y_i = \textrm{MOD}) & = \frac{\exp(0)}{\exp(1)+\exp(-1)+\exp(0)} \\
\\
& = \frac{1}{2.718+.368+1} \\
\\
& = \frac{1}{4.086} \\
\\
& = .245
\end{align}
\] Note that from the above, our three predicted probabilities .665 + .090 + .245 sum to 1, as they must.
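The hypothetical case above can be reproduced in a few lines of base R. This is only a sketch of the arithmetic; the vector name `xb` is my own.

```r
# Linear predictors for the hypothetical case; the base category (MOD) is 0.
xb <- c(LIB = 1, CON = -1, MOD = 0)

# Each probability is exp(xb) for that category over the sum across categories.
p <- exp(xb) / sum(exp(xb))

round(p, 3)
#   LIB   CON   MOD
# 0.665 0.090 0.245

sum(p)  # the probabilities sum to 1
```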
To generalize what we have shown here, we can talk about the probability that \(y_i\) equals the \(j\)th of \(k\) categories, where \(b\) is used to indicate the base category:
\[\Pr(y_i=j|\mathbf{x}_i) = \frac{\exp(\mathbf{x}_i\mathbf{\beta}_{j\textrm{ vs. }b})}{\sum_{m=1}^k\left[\exp(\mathbf{x}_i\mathbf{\beta}_{m\textrm{ vs. }b})\right]}\]
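The general formula, and the log likelihood that maximum likelihood estimation works with, could be sketched in base R as follows. The function names `mnl_probs()` and `mnl_loglik()` are hypothetical helpers written for illustration; they are not from nnet or any other package, and this is not how multinom() is implemented internally.

```r
# X: n x p matrix of explanatory variables (including a column of 1s).
# B: p x (k - 1) matrix of coefficients, one named column per non-base category.
# Returns an n x k matrix of predicted probabilities.
mnl_probs <- function(X, B, base_name = "base") {
  xb <- cbind(X %*% B, 0)                  # the base category's xb is always 0
  colnames(xb) <- c(colnames(B), base_name)
  exp(xb) / rowSums(exp(xb))               # each row is divided by its own sum
}

# Log likelihood: the sum over observations of the log of the predicted
# probability of the category that was actually observed.
mnl_loglik <- function(X, B, y, base_name = "base") {
  p <- mnl_probs(X, B, base_name)
  observed <- cbind(seq_len(nrow(p)), match(as.character(y), colnames(p)))
  sum(log(p[observed]))
}

# The hypothetical case from above: an intercept-only "model" with
# xb = 1 for LIB and xb = -1 for CON.
X_example <- matrix(1, nrow = 1)
B_example <- matrix(c(1, -1), nrow = 1, dimnames = list(NULL, c("LIB", "CON")))
p_example <- mnl_probs(X_example, B_example, base_name = "MOD")
round(p_example, 3)
```

An optimizer would then search over candidate values of B for the ones that make `mnl_loglik()` as large as possible.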
To move now to an actual example given our output above, let’s figure out the predicted probability of liberal, moderate, or conservative identification for someone who is 40 years old, Hispanic, and has a college degree.
Plugging these values into our two equations gives \(\mathbf{x}\mathbf{\beta}_\textrm{LIBvMOD} = -.3212 + .9223 + .1434 + 40(-.008382) = .409\) and \(\mathbf{x}\mathbf{\beta}_\textrm{CONvMOD} = -.5668 + .6161 - .2192 + 40(.006492) = .090\).
Because \(\mathbf{x}\mathbf{\beta}_\textrm{LIBvMOD}\) is greater than \(\mathbf{x}\mathbf{\beta}_\textrm{CONvMOD}\), the predicted probability of liberal will be greater than the predicted probability of conservative. Because both \(\mathbf{x}\mathbf{\beta}\) are greater than 0, the predicted probabilities of both liberal and conservative will be greater than that of moderate. More precisely: \[
\begin{align}
\Pr(y = \textrm{LIB}) & = \frac{\exp(.409)}{\exp(.409)+\exp(.090)+\exp(0)} = .418 \\
\\
\Pr(y = \textrm{CON}) & = \frac{\exp(.090)}{\exp(.409)+\exp(.090)+\exp(0)} = .304 \\
\\
\Pr(y = \textrm{MOD}) & = \frac{\exp(0)}{\exp(.409)+\exp(.090)+\exp(0)} = .278
\end{align}
\]
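A minimal sketch of this calculation in base R, assuming the relevant coefficients are typed in by hand from the output above (the object names are my own):

```r
# Coefficients for the terms that are "switched on" for this person,
# typed in from the output above (base outcome: mod).
b_lib <- c("(Intercept)" = -.3212, "College grad" = .9223,
           "Hispanic" = .1434, "age" = -.008382)
b_con <- c("(Intercept)" = -.5668, "College grad" = .6161,
           "Hispanic" = -.2192, "age" = .006492)

# Values of x for a 40-year-old Hispanic college graduate.
x <- c("(Intercept)" = 1, "College grad" = 1, "Hispanic" = 1, "age" = 40)

# Linear predictor for each category; the base category is always 0.
xb <- c(lib = sum(x * b_lib), con = sum(x * b_con), mod = 0)
round(xb, 3)
#   lib   con   mod
# 0.409 0.090 0.000

p <- exp(xb) / sum(exp(xb))
round(p, 3)
#   lib   con   mod
# 0.418 0.304 0.278
```

With the fitted model object in hand, `predict(model, newdata = ..., type = "probs")` from nnet should return the same probabilities up to rounding. The `newdata` data frame would need the factor labels exactly as they are coded in the data, which I am only assuming match the labels in the output.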
As with the binary logit model, we interpret the exponentiated coefficients as multiplicative changes in odds.
In the case of the binary logit model, our interpretation was in terms of the change in the odds of \(y=1\) vs. \(y=0\), e.g., the odds of donating to a Republican or Democrat or the odds of identifying as a religious none (vs. not). In the multinomial logit model, the interpretation will be in terms of the change in odds of one category vs. the base category.
We will obtain the exponentiated coefficients for our models and then interpret them.
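The exponentiated coefficients can be interpreted as the factor by which the odds of the specified category versus moderate are multiplied for a one-unit increase in the explanatory variable (or, for a categorical variable, relative to its reference category).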
Compared to those with only high school education or less, having a BA degree increases the odds of being a liberal versus moderate by 151%, net of race and age. It increases the odds of being a conservative versus moderate by 85%.
Compared to Whites, Blacks have 39% higher odds of being liberal versus moderate, and 28% lower odds of being conservative versus moderate, adjusting for education and age.
Each additional year of age increases the odds of being conservative versus moderate by about .7%, and decreases the odds of being liberal versus moderate by about .8%.
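The percentage changes in odds come from exponentiating the coefficients and subtracting 1. Here is a short sketch in base R, assuming a few of the coefficients are typed in by hand from the output above; the object names are my own.

```r
# A few coefficients typed in from the output above (base outcome: mod).
coefs <- rbind(
  lib = c("College grad" = .9223, "Black" = .331,   "age" = -.008382),
  con = c("College grad" = .6161, "Black" = -.3289, "age" = .006492)
)

# Exponentiating gives the factor by which the odds (vs. moderate) are multiplied.
odds_mult <- exp(coefs)

# Converting to a percentage change in the odds.
pct_change <- 100 * (odds_mult - 1)

round(odds_mult, 3)
round(pct_change, 1)
```

With the fitted model object, `exp(coef(model))` should give the full set of multipliers in one step.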
Sources vary in their endorsement of the use of “odds” above. One could object to calling it an odds ratio because odds are typically defined as the probability of an outcome occurring over the probability of it not occurring (i.e., \(\Pr(y=1)/\Pr(y=0)\)).
For this reason, Stata refers to estimates from exponentiating multinomial logit coefficients as “relative risk ratios,” and so someone following this logic could substitute “relative risk” instead of “odds” in the interpretations above. However, this is confusing because “relative risk” usually refers to probabilities relative to one category of an explanatory variable vs. another, not (as in this case) one category of the outcome versus another. Accordingly I will just stick with “odds,” but the important point is that we are talking about the odds of one outcome category versus the base category (i.e., liberal vs. moderate) and not one category vs. everything else (i.e., liberal vs. not-liberal).
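To see that the two kinds of odds really are different quantities, we can compute both from the hypothetical predicted probabilities used earlier (.665, .090, and .245). This is a sketch in base R with object names of my own choosing.

```r
# Hypothetical predicted probabilities from the worked example earlier.
p <- c(LIB = .665, CON = .090, MOD = .245)

# Odds of liberal vs. the base category (moderate): this is what the
# multinomial logit coefficients describe. It equals exp(1) up to rounding.
odds_lib_vs_mod <- unname(p["LIB"] / p["MOD"])

# Odds of liberal vs. everything else (not liberal): a different quantity.
odds_lib_vs_not <- unname(p["LIB"] / (1 - p["LIB"]))

round(c(lib_vs_mod = odds_lib_vs_mod, lib_vs_not = odds_lib_vs_not), 2)
```

The first is about 2.7 and the second about 2.0, so interpretations written in terms of one cannot be read as statements about the other.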