Unordered outcomes

The General Social Survey includes a question that presents respondents with this list:

To obey
To help others
To be well-liked or popular
To think for oneself
To work hard

Respondents are then asked, “If you had to choose, which thing on this list would you pick as the most important for a child to learn to prepare him or her for life?”

A dataset might code this variable using the above order so that 1=To obey and 5=To work hard. But this order is completely arbitrary. If you were given this list of categories and asked to put them “in order,” you would likely be quite puzzled, and if you did decide on some order you would probably not have much confidence that somebody else would order them the same way.

This is an example of an unordered variable. An unordered variable is a categorical variable in which the different categories are not to be considered ordered. Unordered variables are also called nominal variables, reflecting the idea of a “nominal” level of measurement that consists simply of placing observations into unordered categories. Sometimes binary variables are included under the heading of unordered/nominal variables, but here we will keep things simpler by considering binary variables their own thing and only calling something an unordered variable if it has at least three categories.

It is probably best to think of categorical variables not as inherently ordered or unordered, but as outcomes that can usefully be treated as ordered in certain contexts. For example, on a political ideology scale, it might seem that moderate is inherently between liberal and conservative. But increasing educational attainment might be associated with people being more likely to identify as liberal and also more likely to identify as conservative (while less likely to identify as moderate), in which case moderate does not behave like a middle category with respect to education.

Formal aside: To put this a little formally, we might consider an outcome ordered with respect to a set of explanatory variables \(\mathbf{x}\) if it can be ordered such that:

\[\ln\left(\frac{\Pr(y > j)}{\Pr(y \leq j)}\right)\]

has, for each explanatory variable, an association whose sign (as a population parameter) is the same at every cutpoint \(j\) of the putative ordering.
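To make the condition concrete, here is a minimal Python sketch that checks it for a single binary explanatory variable. The probabilities are hypothetical, invented only to mirror the education and ideology example above, and the function names are illustrative rather than taken from any library.

```python
import math

def cumulative_logit(probs, order, j):
    """ln(Pr(y > j) / Pr(y <= j)), where y <= j means the first j categories in `order`."""
    p_le = sum(probs[c] for c in order[:j])
    return math.log((1 - p_le) / p_le)

def same_sign_at_every_cutpoint(order, probs_a, probs_b):
    """True if moving from group a to group b shifts every cumulative logit in the same direction."""
    changes = [
        cumulative_logit(probs_b, order, j) - cumulative_logit(probs_a, order, j)
        for j in range(1, len(order))
    ]
    return all(d > 0 for d in changes) or all(d < 0 for d in changes)

# Hypothetical category probabilities (assumed values, not survey estimates).
less_educated = {"liberal": 0.25, "moderate": 0.50, "conservative": 0.25}
more_educated = {"liberal": 0.35, "moderate": 0.30, "conservative": 0.35}

conventional = ["liberal", "moderate", "conservative"]
print(same_sign_at_every_cutpoint(conventional, less_educated, more_educated))  # False
```

With these made-up numbers, the conventional ordering fails the check: more education lowers the log odds at the first cutpoint and raises it at the second.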

While datasets will often assign numbers to the values of unordered outcomes, this is mostly the legacy of how data were traditionally stored. It might be clearer and better if numbers were not used for these variables at all (which is, incidentally, how R often handles them).

Unordered variables make no use of any of the properties that make numbers numbers. We would get to the same end with unordered outcomes using any arbitrary system of symbols to distinguish categories, so long as we used the same symbol for things in the same category and different symbols for things in different categories.
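A small Python sketch of this point, using a handful of hypothetical responses (invented for illustration) and three arbitrary coding schemes: every scheme sorts the observations into the same groups, while an arithmetic summary such as the mean depends entirely on which numbers happened to be assigned.

```python
# Hypothetical responses, invented for illustration only.
responses = ["think", "work", "obey", "think", "help", "think", "popular", "work"]

# Three arbitrary ways to label the same five categories.
survey_order = {"obey": 1, "help": 2, "popular": 3, "think": 4, "work": 5}
shuffled = {"obey": 3, "help": 5, "popular": 1, "think": 2, "work": 4}
symbols = {"obey": "#", "help": "@", "popular": "%", "think": "&", "work": "*"}

def recode(scheme):
    """Apply a coding scheme to every response."""
    return [scheme[r] for r in responses]

def partition(labels):
    """Group observation indices by label, then discard the labels themselves."""
    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(label, set()).add(i)
    return {frozenset(members) for members in groups.values()}

# The grouping of observations is identical under all three schemes.
same_grouping = (
    partition(recode(survey_order))
    == partition(recode(shuffled))
    == partition(recode(symbols))
)
print(same_grouping)  # True

# The mean of the numeric codes, by contrast, changes with the coding.
mean_survey = sum(recode(survey_order)) / len(responses)
mean_shuffled = sum(recode(shuffled)) / len(responses)
print(mean_survey, mean_shuffled)  # 3.5 2.875
```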

Can we just use OLS?

With binary and ordered outcomes, there was the question of whether problems emerge, and which ones, if one simply fits a linear regression model by OLS. With unordered outcomes, since even the order of the numbers assigned to categories is arbitrary, a linear regression model cannot be fit coherently. If you ever see somebody do this, a sadistic exercise is to ask them innocently what one of their regression coefficients means. I predict they will not respond with a wrong answer but will instead start to say something and then get tongue-tied as they realize there is nothing to say that makes any sense.
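The incoherence is easy to demonstrate. The following Python sketch uses hypothetical data (years of education and most-valued trait for eight invented respondents) and regresses the numeric category code on education under two equally arbitrary codings. The helper ols_slope is written out by hand for illustration.

```python
def ols_slope(x, y):
    """Slope from a simple linear regression of y on x, fit by OLS."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

# Hypothetical respondents, invented for illustration only.
education = [10, 12, 12, 14, 16, 16, 18, 20]
trait = ["obey", "obey", "work", "help", "work", "think", "think", "think"]

# Two arbitrary numeric codings of the same categories.
survey_order = {"obey": 1, "help": 2, "popular": 3, "think": 4, "work": 5}
recoded = {"think": 1, "work": 2, "help": 3, "popular": 4, "obey": 5}

slope_survey = ols_slope(education, [survey_order[t] for t in trait])
slope_recoded = ols_slope(education, [recoded[t] for t in trait])

print(slope_survey > 0)   # True: a positive "effect" of education
print(slope_recoded < 0)  # True: same data, negative "effect"
```

Nothing about the respondents changed between the two fits; only the arbitrary numbers did, and the sign of the coefficient changed with them.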