Introduction to event outcomes
The focus of survival analysis is modeling the duration until an event and its relationship to different explanatory variables. Survival analysis goes by a number of different names; perhaps the most common alternative name in social science is event history analysis.
Many different questions can be modeled using survival analysis. Here are some from actual social science papers:
- Do people who win major awards live longer than people who are nominated but lose?
- What predicts whether organizations were relatively early or late to adopt sexual harassment policies?
- What characteristics of countries predict when they abolished the death penalty?
- Are new businesses more or less likely to fail if the entrepreneur quits their “day job”?
- What predicts how long it takes executive nominees to be confirmed by the Senate?
- Does the effect of one spouse’s death on one’s own survival differ by race?
- Are marriages with daughters more likely to end in divorce than marriages with sons?
- Does having a stable job reduce recidivism?
As these examples indicate, “survival models” can be used in contexts where the key event is nothing like death and “survival” until that event is not necessarily desirable.
Nevertheless, regardless of whether the substance is positive or negative, didactic descriptions of survival models commonly use language that depicts the outcome as a bad thing. For example, the terminating event is often referred to as the failure event, even though things like abolishing the death penalty or having an executive nominee confirmed are not failures at all. For simplicity, I will sometimes follow this convention, although I will also use “outcome event” or “the event.”
Why don’t we just use linear regression?
Imagine a dataset recording the age at which an outcome event occurs — for example, the age at death. One might then wonder why we do not simply use linear regression.
As a first point, an additive model for survival times would rarely make substantive sense. An additive model would be one in which, for example, an explanatory variable might be associated with a 2-year increase in expected survival time, regardless of whether we were talking about an increase from 50 to 52 years or from 90 to 92 years. We might instead expect that a multiplicative model is more appropriate, in which explanatory variables are associated with percentage changes either from some baseline duration or in the likelihood of the event happening by some particular point in time.
But even though conventional linear regression is an additive model, we already know how to transform it into a multiplicative model: by logging the outcome variable. Since survival time is always positive, logging it is always possible and straightforward. The real question, then, is why we do not just model survival using a linear regression model with a logged outcome variable.
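To make the logged-outcome idea concrete, here is a minimal sketch in Python with NumPy. The data are simulated, and the variable names, the baseline duration of 40, and the 25 percent effect are all assumptions chosen for illustration. It shows that exponentiating the coefficient from a regression of log survival time recovers a multiplicative effect on duration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical binary explanatory variable (0 or 1).
x = rng.integers(0, 2, size=n)

# Simulated survival times: x = 1 multiplies the typical duration by 1.25.
# The baseline of 40 and the noise level are arbitrary illustrative choices.
true_ratio = 1.25
t = 40.0 * true_ratio**x * np.exp(rng.normal(0.0, 0.3, size=n))

# Ordinary least squares of log(t) on an intercept and x.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, np.log(t), rcond=None)

# Exponentiating the slope turns an additive effect on log(t)
# into a multiplicative effect (a ratio) on t itself.
time_ratio = np.exp(beta[1])
print(f"estimated time ratio: {time_ratio:.3f}")
```

The estimated ratio should land close to the simulated value of 1.25. Note that this works only because every simulated case has a fully observed survival time.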
There are three big issues undermining the use of linear regression here:
Issue #1: What do we do about individuals who are still alive?
This is known as the problem of censoring: for some observations, we do not know the value of our outcome variable, but we know that it is greater or less than some particular value. In this case, if we know somebody is still alive at age 70, we have information about their survival (namely, that the age at which they will die is at least 70) even though we do not know precisely the age at which they will die. This kind of censoring, in which we know only that the true value is at least some amount, is called right-censoring.
If we drop people who are still alive, then we are biasing our results because we are selecting observations based on their value of the dependent variable. In this case, our predictions of survival time would be underestimates of real survival time, and our coefficients would likely underestimate true relationships as well.
We could restrict attention to datasets in which every observation has experienced the outcome event. But this severely limits the analyses we can do. In a study of human mortality, for example, we would have to wait until every member of our sample had died, which could take decades, before doing any studies of mortality.
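The bias from dropping censored cases can be seen in a small simulation. The sketch below is illustrative only: it assumes exponentially distributed durations (a constant hazard) with a mean of 10 years and a study that stops following people after 8 years, none of which comes from a real dataset. It compares the mean duration among cases with an observed event against an estimate that also uses the time contributed by censored cases.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Assumed true mean duration (years from study entry to the event).
true_mean = 10.0
t = rng.exponential(true_mean, size=n)   # true durations, not fully observable

# The hypothetical study ends after 8 years of follow-up.
followup = 8.0
observed = np.minimum(t, followup)       # the duration we actually record
event = t <= followup                    # False means right-censored

# Naive approach: drop everyone who has not yet had the event.
naive_mean = observed[event].mean()

# Censoring-aware approach under the exponential assumption:
# the maximum likelihood rate is (number of events) / (total time at risk),
# where censored cases still contribute their observed time at risk.
rate = event.sum() / observed.sum()
censor_aware_mean = 1.0 / rate

print(f"naive mean (censored dropped): {naive_mean:.2f}")
print(f"censoring-aware mean:          {censor_aware_mean:.2f}")
```

The naive mean falls well below the simulated mean of 10, because only the shorter durations are ever observed, while the second estimate stays close to it. The exponential form is just one convenient assumption for the demonstration. The broader point is that censored cases carry usable information about time at risk.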
Issue #2: What do we do about covariates that change over time?
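For example, one of the questions mentioned above concerns how an individual’s mortality risk is affected by the death of their spouse. If the death of a spouse increases mortality risk, and so decreases ultimate survival time, then being widowed is not a fixed trait of a person but a status that changes partway through the period we observe. A regression with one row per person and one value per covariate has no obvious way to represent this. How would we model it?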
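One common way to represent a covariate that changes over time is to split each person's record into episodes, with one row per interval during which the covariate is constant. The sketch below is a hypothetical helper written for illustration. The function name, field names, and ages are assumptions, not part of any particular library.

```python
def split_at_change(person_id, entry_age, exit_age, died, widowed_age=None):
    """Split one person's record into (start, stop] episodes.

    Each episode has a constant value of the time-varying covariate
    `widowed`. The outcome event can occur only in the final episode.
    """
    changes_in_window = (
        widowed_age is not None and entry_age < widowed_age < exit_age
    )
    if not changes_in_window:
        # The covariate never changes while the person is observed.
        widowed = int(widowed_age is not None and widowed_age <= entry_age)
        return [{"id": person_id, "start": entry_age, "stop": exit_age,
                 "widowed": widowed, "event": int(died)}]
    return [
        # Before the spouse's death: not widowed, no event in this episode.
        {"id": person_id, "start": entry_age, "stop": widowed_age,
         "widowed": 0, "event": 0},
        # After the spouse's death: widowed, event recorded here if it occurred.
        {"id": person_id, "start": widowed_age, "stop": exit_age,
         "widowed": 1, "event": int(died)},
    ]


# Hypothetical person observed from age 60, widowed at 68, died at 75.
rows = split_at_change(person_id=1, entry_age=60, exit_age=75,
                       died=True, widowed_age=68)
for row in rows:
    print(row)
```

In this layout the first episode ends without an event, much like a censored observation, and the second episode carries both the new covariate value and the outcome.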
Issue #3: What do we do if we have competing risks (that is, different types of end events)?
For example, we may be interested in how long it takes unemployed people to become re-employed. Say somebody is unemployed for two years and then, instead of getting a job, dies. How should we handle this case? We could simply drop this person, but then we would be throwing away the information that they had been unemployed for an unusually long time before their death. Moreover, cases with unusually long periods of unemployment would be more likely to be dropped, because the longer someone is unemployed, the more time there is for death or another terminating event to happen first. This would lead us to underestimate survival times (here, durations of unemployment) and perhaps the effects of some explanatory variables on survival.
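The selection problem described above can also be illustrated by simulation. The sketch below is a simplified illustration under strong assumptions: times to re-employment and to death are simulated as independent exponential variables with made-up means of 2 and 10 years. It records which event ended each spell, shows what happens when spells ending in death are dropped, and contrasts that with one common alternative, which treats the competing event as censoring when estimating the re-employment rate.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Latent times for each type of end event (assumed independent).
job_time = rng.exponential(2.0, size=n)     # years until re-employment
death_time = rng.exponential(10.0, size=n)  # years until death

# We observe only whichever event happens first, and its type.
duration = np.minimum(job_time, death_time)
event_type = np.where(job_time <= death_time, "job", "death")

# Approach 1: drop spells that ended in death, then average the rest.
dropped_mean = duration[event_type == "job"].mean()

# Approach 2: keep everyone. Spells ending in death still count as
# time at risk of re-employment, but not as re-employment events.
job_rate = (event_type == "job").sum() / duration.sum()
rate_based_mean = 1.0 / job_rate

print(f"mean after dropping deaths: {dropped_mean:.2f}")
print(f"rate-based mean:            {rate_based_mean:.2f}")
```

Dropping the spells that ended in death gives a mean below the simulated value of 2 years, since long spells are the ones most likely to be cut off by the other event. The rate-based figure stays close to 2, but only because the simulation makes the two risks independent and the hazards constant. With real data those assumptions would need to be examined.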