Binary outcomes

A binary outcome is a categorical outcome with exactly two categories.

Binary variables may or may not represent actual binaries in reality. There’s an old saying to the effect that one cannot be a little bit pregnant. The implication is that this is a binary in reality: a person either is or is not pregnant, and it doesn’t make sense to imagine anything in-between.

In contrast, we might have a measure in a survey about whether the respondent believes a woman should have the right to obtain a legal abortion for any reason. The question may be asked as a yes/no dichotomy, but a given respondent’s actual opinion might be much fuzzier than that.

Coding binary variables

In R, categorical variables do not need to be coded in terms of numeric values. For example, in a simple experiment, one could have a variable named \(\texttt{condition}\) in which the categories are “treatment” and “control,” and these do not have to be associated with numeric value codes.

If your categorical variables are indicated by numeric values, R will likely assume that they are quantities. You may make them explicitly understood as factor variables using the as.factor() function.

Load libraries and data:
library(tidyverse)
library(haven)
library(tulaverse)
gss <- read_dta("../dta/gss_norelig_thru2018.dta") %>%
  mutate(norelig.cat = as.factor(norelig))

We can see the difference this makes in how R handles the variable by asking for summary statistics. As a quantitative variable:

summary(gss$norelig)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0000  0.0000  0.0000  0.1208  0.0000  1.0000     289 

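For quantitative variables, summary() reports the minimum, the quartiles, the mean, and the maximum, along with the number of missing values. For a factor variable, it instead reports the count in each category: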

summary(gss$norelig.cat)
    0     1  NA's 
56728  7797   289 

The relevel() function can be used to change the reference category when a variable is stored as a factor, but not when it is stored as a quantitative variable.
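As a minimal sketch of this point, the following uses a small made-up vector (not the GSS data above) to show relevel() changing the reference category of a factor and failing on a numeric vector. The variable names here are hypothetical and chosen only for illustration.

```r
# Toy data (hypothetical), not the GSS file used above
condition <- factor(c("treatment", "control", "control", "treatment"))

# By default, factor levels are sorted alphabetically,
# so "control" is the first (reference) level
levels(condition)

# Make "treatment" the reference level instead
condition_releveled <- relevel(condition, ref = "treatment")
levels(condition_releveled)

# The same information stored as numbers cannot be releveled:
# relevel() raises an error, which is caught here for illustration
condition_num <- c(1, 0, 0, 1)
relevel_result <- tryCatch(
  relevel(condition_num, ref = 1),
  error = function(e) "error"
)
relevel_result
```

If the numeric version needs a different reference category, it can first be converted with as.factor() and then releveled.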

In base R, values cannot have labels the way they can in some other stats packages (e.g., Stata, SAS, SPSS). Add-on packages extend R to provide labeling: haven allows one to import data with labels, and the labelled package provides functions for working with labels.

In haven, the function as_factor() works like as.factor(), but if the values have labels, it uses those labels as the factor levels:

gss <- gss %>%
  mutate(norelig.catlabel = as_factor(norelig))
summary(gss$norelig.catlabel)
Some religion   No religion          NA's 
        56728          7797           289 

The optional argument levels = "values" will instead use the numeric values, and thus yields the same result as as.factor():

gss <- gss %>%
  mutate(norelig.catnum = as_factor(norelig, levels="values"))
summary(gss$norelig.catnum)
    0     1  NA's 
56728  7797   289 

levels = "both" will instead use both labels and values for the new variable:

gss <- gss %>%
  mutate(norelig.catboth = as_factor(norelig, levels="both"))
summary(gss$norelig.catboth)
[0] Some religion   [1] No religion              NA's 
            56728              7797               289 

In Stata and a number of other packages, string variables cannot directly be used in models, but instead need to be associated with numeric values that may have labels. Whenever I am analyzing binary variables in Stata, either as outcomes or as explanatory variables, I do two things, and I would strongly urge you to do the same:

  1. I give the variable a name consistent with the idea of it being a yes/no question. For the above example of a variable with a name like \(\texttt{condition}\), I would instead transform it to a variable named \(\texttt{treated}\).
  2. Given the yes/no name, I code the variable numerically so that 1 means “yes” and 0 means “no.”

This way I never look at a binary variable and wonder how it is coded or get mixed up about how it is coded.
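The same two-step convention can be followed in R. The sketch below assumes a made-up character vector named condition with the categories "treatment" and "control"; the data and the name treated are illustrative only.

```r
# Toy data (hypothetical): a string-coded experimental condition
condition <- c("treatment", "control", "control", "treatment", NA)

# Step 1: name the new variable as a yes/no question ("treated?")
# Step 2: code it so that 1 means yes and 0 means no
# The comparison yields TRUE/FALSE/NA; as.integer() maps these to 1/0/NA
treated <- as.integer(condition == "treatment")
treated

# Cross-tabulate old against new to verify the recode
table(condition, treated, useNA = "ifany")
```

Cross-tabulating the original variable against the recoded one is a quick way to confirm that every category landed where intended and that missing values stayed missing.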

There are other advantages of using 0/1 coding specifically. The mean of the variable is the proportion “yes.” Also, it fits nicely with the common Boolean convention of 1 meaning TRUE and 0 meaning FALSE.
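Both advantages can be checked with a short sketch on made-up data (the vector treated below is hypothetical). The same arithmetic applies to the GSS output shown earlier: 7797 / (56728 + 7797) is approximately 0.1208, the mean that summary() reported for norelig.

```r
# Toy 0/1 variable (hypothetical), with one missing value
treated <- c(1, 0, 0, 1, 0, NA)

# The mean of a 0/1 variable is the proportion of 1s ("yes")
prop_from_mean <- mean(treated, na.rm = TRUE)

# The same proportion computed by counting directly
prop_from_count <- sum(treated == 1, na.rm = TRUE) / sum(!is.na(treated))

prop_from_mean
prop_from_count

# 0/1 coding lines up with R's logical values: 1 is TRUE, 0 is FALSE
treated_lgl <- as.logical(treated)
treated_lgl
```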