# Load libraries
library(tidyverse)
library(haven)
library(tulaverse)

A binary outcome is a categorical outcome with exactly two categories.
Binary variables may or may not represent actual binaries in reality. There’s an old saying to the effect that one cannot be a little bit pregnant. The implication is that this is a binary in reality: a person either is or is not pregnant, and it doesn’t make sense to imagine anything in-between.
In contrast, we might have a measure in a survey about whether the respondent believes a woman should have the right to obtain a legal abortion for any reason. The question may be asked as a yes/no dichotomy, but we would recognize that the reality of some respondent’s opinion might be much fuzzier than that.
In R, categorical variables do not need to be coded in terms of numeric values. For example, in a simple experiment, one could have a variable named condition in which the categories are “treatment” and “control,” and these do not have to be associated with numeric value codes.
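As a minimal sketch of this point, the example below uses made-up data: the vector condition holds the category names as plain text, and the outcome y is invented purely for illustration. Base R tabulates the text values directly, and lm() treats a character predictor as categorical without any numeric codes.

```r
# Hypothetical experiment data; none of these values come from the text
condition <- c("treatment", "control", "control", "treatment")
y <- c(5, 3, 4, 6)

# The categories are stored as text, with no numeric codes involved
is.character(condition)
table(condition)

# lm() converts the character predictor to a factor internally;
# the alphabetically first category ("control") becomes the reference
fit <- lm(y ~ condition)
names(coef(fit))
```

The second coefficient is named conditiontreatment, which shows that R built the contrast from the category names themselves.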
If your categorical variables are indicated by numeric values, R will likely assume that they are quantities. You may make them explicitly understood as factor variables using the as.factor() function.
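A small illustration with a made-up vector (the name x and its values are assumptions, not data from the text) shows what the conversion changes:

```r
# Hypothetical 0/1 variable with one missing value
x <- c(0, 1, 0, 0, 1, NA)

is.numeric(x)        # TRUE: R treats the codes as quantities

x_f <- as.factor(x)
is.factor(x_f)       # TRUE: now explicitly categorical
levels(x_f)          # "0" "1"; NA is not a level
```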
We can see the difference this makes in how R handles the variable by asking for summary statistics. As a quantitative variable:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 0.0000  0.0000  0.0000  0.1208  0.0000  1.0000     289
These are the summary statistics (minimum, quartiles, mean, and maximum, plus a count of missing values) that R provides for quantitative variables. For a factor variable:
    0     1  NA's
56728  7797   289
The relevel() function can be used to change the reference category when a variable is stored as a factor, but not when it is stored as a quantitative variable.
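The following sketch illustrates this with a made-up factor. The variable name norelig and the values are hypothetical, although the category names echo the labels used in this section. Calling relevel() on a plain numeric vector raises an error, which the example catches so that the script keeps running.

```r
# Hypothetical factor with "Some religion" as the first (reference) level
norelig <- factor(
  c("Some religion", "No religion", "Some religion"),
  levels = c("Some religion", "No religion")
)
levels(norelig)

# Make "No religion" the reference category instead
norelig2 <- relevel(norelig, ref = "No religion")
levels(norelig2)

# The same call on a quantitative variable fails
numeric_attempt <- tryCatch(
  relevel(c(0, 1, 0), ref = 1),
  error = function(e) "error"
)
numeric_attempt
```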
In base R, values cannot have labels the way they can in some other stats packages (e.g., Stata, SAS, SPSS). Add-on packages extend R to provide labeling: haven allows one to import data with labels, and the labelled package provides functions for working with labels.
In haven, the function as_factor() works like as.factor(), but if values have labels it will convert the variable to use these labels as the factor names:
Some religion   No religion          NA's
        56728          7797           289
The optional argument levels = "values" will instead use the numeric values, and thus yields the same outcome as as.factor():
    0     1  NA's
56728  7797   289
levels="both" will instead use both labels and values for the new variable.
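To see the three options side by side, here is a small sketch that builds a labelled vector by hand with haven's labelled() function rather than importing one. The variable name norelig and the five values are made up; the labels mirror the ones shown above.

```r
library(haven)

# Hypothetical 0/1 variable with value labels attached,
# similar to what haven produces when importing labelled data
norelig <- labelled(
  c(0, 1, 0, 0, NA),
  labels = c("Some religion" = 0, "No religion" = 1)
)

# Default: the labels become the factor level names
lev_default <- levels(as_factor(norelig))
lev_default

# levels = "values": the numeric codes become the level names
lev_values <- levels(as_factor(norelig, levels = "values"))
lev_values

# levels = "both": each level name combines the code and the label
lev_both <- levels(as_factor(norelig, levels = "both"))
lev_both
```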
In Stata and a number of other packages, string variables cannot directly be used in models, but instead need to be associated with numeric values that may have labels. Whenever I am analyzing binary variables in Stata, either as outcomes or as explanatory variables, I do two things, and I would strongly urge you to do the same:
This way I never look at a binary variable and wonder how it is coded or get mixed up about how it is coded.
There are other advantages of using 0/1 coding specifically. The mean of the variable is the proportion “yes.” Also, it fits nicely with the common Boolean convention of 1 meaning TRUE and 0 meaning FALSE.
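A short sketch of both points, using a made-up vector (yes is a hypothetical name) and then the counts reported in the as_factor() output earlier in this section, where 7797 cases are coded 1 and 56728 are coded 0:

```r
# Hypothetical 0/1 responses; 1 = "yes", NA = missing
yes <- c(0, 1, 0, 0, 1, NA)

# The mean of a 0/1 variable is the proportion coded 1
prop_yes <- mean(yes, na.rm = TRUE)
prop_yes                                  # 0.4

# 0/1 maps onto FALSE/TRUE, and the mean of a logical is the proportion TRUE
prop_true <- mean(as.logical(yes), na.rm = TRUE)
prop_true                                 # 0.4

# Counts from this section: the proportion coded 1 matches the reported Mean
prop_section <- round(7797 / (56728 + 7797), 4)
prop_section                              # 0.1208
```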