AIC and BIC

AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are commonly used methods of model comparison—that is, as ways of judging whether one model is to be preferred to another.

In practical terms, AIC and BIC address a key limitation of \(R^2\) or Pseudo-\(R^2\) type measures for comparing models: \(R^2\) never gets worse, and in practice always improves, when you add more variables.

We may think there should be some premium for parsimony: from an analyst perspective, we may wish to have some way to resist making our model more complex for trivial gains, and we may also worry about overfitting. (Overfitting means selecting a model in ways that capitalize on chance features of the data; this causes problems because the results will not predict new data, or extrapolate, as well as we might imagine they should.)

The way AIC and BIC provide some premium for parsimony is by including a penalty for each new parameter that is added to the model. (Another way of saying “each new parameter” is to refer to each additional degree of freedom used by the model.)

Take the case when you add a new explanatory variable to a model. Whenever you do this, the model will have a better (higher) likelihood than the old model. However, for the model to be preferred by the AIC or BIC criterion, the improvement in the likelihood has to be big enough to overcome the penalty for adding new terms.

When comparing models, smaller (or more negative) values of AIC or BIC are better.

In R, after glm(), the tula() function (from the tulaverse package loaded below) includes AIC and BIC in the model output.

Code for dependencies and recoding the data:
library(tidyverse)
library(haven)
library(texreg) # for screenreg() function used to display output
library(tulaverse)

gss <- read_dta("../dta/gss_norelig_thru2018.dta") %>%
    mutate(relig_none = norelig,
           raised_as_none = norelig16) %>%
  filter(!is.na(educ) & !is.na(degree)) # omit if missing on either educ or degree
model <- glm(relig_none ~ cohort + year + female + educ, data=gss, family="binomial")
tula(model)
Family: binomial / Link: logit
AIC            =  43592.084                          Number of obs   =  64101
BIC            =  43637.425                          McFadden R-sq   = 0.0785
Log likelihood = -21791.042                          Nagelkerke R-sq = 0.1078
─────────────────────────────────────────────────────────────────────────────
relig_none  │      Coef  Std. Err.          z     P>|z|  [95% Conf  Interval]
─────────────────────────────────────────────────────────────────────────────
cohort      │    .02857   .0008171      34.97    <.0001     .02697     .03017
year        │    .01137    .001193      9.528    <.0001    .009028      .0137
female      │    -.5656     .02511     -22.52    <.0001     -.6148     -.5163
educ        │    .04705    .004315       10.9    <.0001     .03859     .05551
(Intercept) │    -80.87      1.997     -40.49    <.0001     -84.78     -76.95
─────────────────────────────────────────────────────────────────────────────

In Base R, after glm() and some other model-fitting functions, you can also obtain AIC and BIC with the functions AIC() and BIC(). (Both AIC() and BIC() must be in all caps.)

AIC(model)
[1] 43592.08
BIC(model)
[1] 43637.42
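Under the hood, these functions are computed from the model's log-likelihood; you can reproduce them by hand using the formulas spelled out in the next section. A minimal sketch, using the built-in mtcars data as a stand-in (since the GSS extract is not bundled here):

```r
# Check that AIC()/BIC() match the formulas -2*lnL + 2k and
# -2*lnL + ln(N)*k, using built-in data as a stand-in.
m <- glm(am ~ mpg, data = mtcars, family = "binomial")

ll <- as.numeric(logLik(m))   # log-likelihood
k  <- attr(logLik(m), "df")   # number of estimated parameters (here 2)
n  <- nobs(m)                 # sample size (here 32)

aic_manual <- -2 * ll + 2 * k
bic_manual <- -2 * ll + log(n) * k

all.equal(aic_manual, AIC(m))   # TRUE
all.equal(bic_manual, BIC(m))   # TRUE
```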

In Stata, you can get AIC and BIC after you fit a model with the command \(\texttt{estat ic}\).

. quietly logit norelig cohort year i.female educ

. estat ic

Akaike's information criterion and Bayesian information criterion

-----------------------------------------------------------------------------
       Model |          N   ll(null)  ll(model)      df        AIC        BIC
-------------+---------------------------------------------------------------
           . |     64,101  -23647.23  -21791.04       5   43592.08   43637.42
-----------------------------------------------------------------------------
Note: BIC uses N = number of observations. See [R] BIC note.

Calculating AIC and BIC

The formulas for AIC and BIC are very similar:

\[ \begin{align} \text{AIC} & = −2 \ln L + 2k \\ \text{BIC} & = −2 \ln L + \ln(N)k \end{align} \]

where \(\ln L\) is the log-likelihood of the model, \(N\) is the sample size, and \(k\) is the number of parameters in the model.

Both formulas share the \(−2 \ln L\) term but differ in their penalty term. Let’s consider each in turn.

Deviance (-2 log-likelihood)

The quantity \(−2 \ln L\) comes up a lot with maximum likelihood models; it is also known as the deviance.

As the likelihood of a model improves, this quantity gets smaller, and, again, a smaller AIC/BIC value is better.

Why deviance? Earlier, we showed that one can compute a likelihood ratio test by taking the difference between the log-likelihoods of two models and multiplying this difference by -2 (and then comparing to a chi-square distribution). If we had a model that fit the data perfectly, its likelihood would be 1, and its log-likelihood would be zero. So \(−2 \ln L\) is comparable to the test statistic for a likelihood ratio test that compares the model that we have fit to a model that fits the data perfectly.

\(−2 \ln L\) is referred to as the deviance because it corresponds to the deviation between our model and one that fits the data perfectly.
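As a refresher, the likelihood-ratio calculation described above looks like this in R. A sketch using the built-in mtcars data, with purely illustrative variable choices:

```r
# Likelihood-ratio test "by hand": -2 times the difference in
# log-likelihoods of two nested models, compared to chi-squared.
m0 <- glm(am ~ mpg,        data = mtcars, family = "binomial")
m1 <- glm(am ~ mpg + qsec, data = mtcars, family = "binomial")

lr <- -2 * (as.numeric(logLik(m0)) - as.numeric(logLik(m1)))
# equivalently: deviance(m0) - deviance(m1)
p  <- 1 - pchisq(lr, df = 1)   # p-value for the one added parameter
```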

Penalty term

This term adds an always-positive value to AIC/BIC, which is a penalty because, as I’ve said, a smaller AIC/BIC is better.

The difference between AIC and BIC is in the magnitude of this penalty term.

  • For AIC, the penalty for each new parameter is 2.
    • But, in the first part of the equation, we also multiplied the log-likelihood by 2.
    • Together, this means that a model with an added parameter must improve the log-likelihood by at least 1 to yield a better AIC.
  • For BIC, the penalty for each new parameter is \(\ln(N)\) .
    • The penalty therefore increases with sample size.
    • The amount the likelihood has to improve in order to yield a better BIC is thereby bigger in larger samples.

Aside: Relationship between AIC penalty term and the likelihood-ratio test. Earlier, when talking about the likelihood-ratio test, we noted that in order for a single variable to be statistically significant at the p < .05 level, it needed to improve the log-likelihood by 1.92. Above we see that in order for a model with an added variable to improve AIC it only needs to improve the log-likelihood by 1. Consequently, AIC is a less stringent criterion for deciding whether or not to add/omit a variable than doing so based on its statistical significance.

We can be even more precise than this. The reason for the value 1.92 above is that 1.92×2 = 3.84 and the p-value of a test statistic 3.84 evaluated using a chi-squared distribution with 1 degree of freedom is .0500. If we take 1×2 = 2 and evaluate the p-value of 2 using a chi-squared distribution with 1 df, we get:

1 - pchisq(2, df = 1)
[1] 0.1572992

The result is .157. Thus, in order for the addition of a new variable to a model to improve AIC, its p-value via a likelihood-ratio test needs to be .157 or less, which is a less stringent criterion than .05.
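The same calculation can be done for BIC. For one added parameter, the deviance must drop by at least \(\ln(N)\), so the implied p-value cutoff shrinks as the sample grows. At the sample size used in our example:

```r
# Implied p-value cutoff for adding one parameter under BIC,
# at N = 64,101: the deviance must drop by at least ln(N).
n <- 64101
1 - pchisq(log(n), df = 1)   # roughly .0009
```

In other words, at this sample size BIC is a much more stringent criterion than either AIC or a conventional .05 significance test.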

AIC vs BIC

Again, the entire difference between AIC and BIC is in the penalty for new parameters. For AIC the penalty for each new parameter is 2 and for BIC it is \(\ln(N)\). \(\ln(N)\) is bigger than 2 whenever N is bigger than 7. Hence, the penalty term for BIC will be larger than the penalty term for AIC whenever the sample size is bigger than 7, or effectively always in practice.

Two implications:

  1. BIC rewards parsimony more than AIC does.
  2. When used to compare two models, AIC and BIC will either agree on the preferred model or, if they disagree, BIC will prefer the model with fewer parameters.

In practice, my sense is that in science writ large, especially medicine, one sees AIC reported more. In social science, however, BIC is more common.

My opinion: Generally speaking, I use BIC. For the types of applications that I see in practice, if one wants to be using a penalized measure to select between models, AIC’s penalty seems too weak.

The case that might be different is if the only reason that I was interested in using a penalized measure was to reduce overfitting — that is, I had no reason to care about parsimony except for this — in which case I would probably use AIC.

Example: Comparing non-nested models

We are going to use data from the 1972 to 2018 General Social Survey. Our outcome will be \(\texttt{relig\_none}\) (whether the respondent does not identify with a religion).

We will fit two models. Both models will include \(\texttt{cohort}\), \(\texttt{year}\) and \(\texttt{female}\). Both will also include education, but one model will specify this as years of education (\(\texttt{educ}\)) and the other will specify education in terms of highest degree (\(\texttt{degree}\)). This is the difference between using 1 parameter for education and using 4 parameters (because our \(\texttt{degree}\) variable has 5 categories).

These are non-nested models, meaning that we cannot obtain one model by imposing constraints on the other. As a result, we cannot use a likelihood ratio test to compare them.

Important: AIC and BIC must be compared using the same sample. In our example, the variables \(\texttt{educ}\) and \(\texttt{degree}\) do not perfectly overlap in terms of missing cases—an observation can be missing on one but not the other.

To ensure this, when fitting both models, I dropped cases that are either missing on \(\texttt{educ}\) or on \(\texttt{degree}\).
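To see why this step matters, here is a small illustration with R's built-in airquality data (chosen only because it contains missing values): two formulas fit to the "same" data frame can silently end up fit to different subsets, because listwise deletion happens per model.

```r
# Solar.R has missing values; Wind does not. The two models
# therefore use different numbers of observations.
m_a <- lm(Temp ~ Wind,    data = airquality)
m_b <- lm(Temp ~ Solar.R, data = airquality)

nobs(m_a)   # 153
nobs(m_b)   # 146 -- AIC/BIC from these two are not comparable
```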

We will fit the two models and present a table that compares results:

model1 <- glm(relig_none ~ 
                cohort + year + female + educ, 
              data=gss, family="binomial")

model2 <- glm(relig_none ~ 
                cohort + year + female + as_factor(degree), 
              data=gss, family="binomial")
Code that creates the table:
screenreg(list(model1, model2),
       custom.model.names = c("Model 1", "Model 2"),
       custom.coef.names = c("Intercept", "Cohort", "Year", "Female", "Educ (yrs)", 
                             "High School", "Junior College", "Bachelor", "Graduate"),
       stars = c(0.001, 0.01, 0.05),
       digits = 4,
       include.loglik = TRUE,
       include.deviance = FALSE,
       custom.note = "*** p < 0.001; ** p < 0.01; * p < 0.05")

================================================
                Model 1          Model 2        
------------------------------------------------
Intercept          -80.8658 ***     -81.9757 ***
                    (1.9969)         (2.0014)   
Cohort               0.0286 ***       0.0301 ***
                    (0.0008)         (0.0008)   
Year                 0.0114 ***       0.0107 ***
                    (0.0012)         (0.0012)   
Female              -0.5656 ***      -0.5614 ***
                    (0.0251)         (0.0251)   
Educ (yrs)           0.0471 ***                 
                    (0.0043)                    
High School                          -0.1056 ** 
                                     (0.0363)   
Junior College                       -0.0319    
                                     (0.0585)   
Bachelor                              0.1373 ** 
                                     (0.0434)   
Graduate                              0.3827 ***
                                     (0.0508)   
------------------------------------------------
AIC              43592.0836       43576.7079    
BIC              43637.4246       43649.2537    
Log Likelihood  -21791.0418      -21780.3540    
Num. obs.        64101            64101         
================================================
*** p < 0.001; ** p < 0.01; * p < 0.05
The same comparison in Stata:

. use ../dta/gss_norelig_thru2018.dta, clear
(Extract from GSS cumulative file through 2018)

. label variable year "survey year"

. label variable educ "years of ed"

. drop if educ >= . | degree >= .
(271 observations deleted)

. quietly logit norelig cohort year i.female educ 

. estimates store educ

. quietly logit norelig cohort year i.female i.degree

. estimates store degree

. etable, ///
>         est(educ degree) ///
>         column(estimates) ///
>         showstars showstarsnote

--------------------------------------------
                          educ      degree  
--------------------------------------------
year of birth            0.029 **   0.030 **
                       (0.001)    (0.001)   
survey year              0.011 **   0.011 **
                       (0.001)    (0.001)   
Female?                                     
  female                -0.566 **  -0.561 **
                       (0.025)    (0.025)   
years of ed              0.047 **           
                       (0.004)              
r's highest degree                          
  high school                      -0.106 **
                                  (0.036)   
  junior college                   -0.032   
                                  (0.059)   
  bachelor                          0.137 **
                                  (0.043)   
  graduate                          0.383 **
                                  (0.051)   
Intercept              -80.866 ** -81.976 **
                       (1.997)    (2.001)   
Number of observations   64101      64101   
--------------------------------------------

Showing AIC and BIC for these models:

. etable, ///
>         est(educ degree) ///
>         column(estimates) ///
>         keep(none) /// suppress coefficients
>         mstat(N) mstat(ll) mstat(aic) mstat(bic) 

------------------------------------------
                          educ     degree 
------------------------------------------
Number of observations     64101     64101
Log likelihood         -21791.04 -21780.35
AIC                     43592.08  43576.71
BIC                     43637.42  43649.25
------------------------------------------

Looking first at the coefficients, both models are consistent with the conclusion that more education is associated with being more likely to identify as a religious none. In Model 1, this is indicated by the positive coefficient for education in years. In Model 2, the reference category is not having a high school diploma; those with only a high school degree are actually less likely to be religious nones than those without a diploma, but those with bachelor’s and graduate degrees are the most likely to be religious nones.

If we count the parameters above, we can see that Model 1 has 5 parameters and Model 2 has 8 parameters (the number of parameters here is one for each coefficient plus one for the constant).

We verify that we are using the same sample for both models by showing that the sample sizes are the same in the output above. This means that we can compare the models’ AIC and BIC values.

First, let’s just verify these calculations of AIC and BIC. The formulas are:

\[ \begin{align} \text{AIC} & = −2 \ln L + 2k \\ \text{BIC} & = −2 \ln L + \ln(N)k \end{align} \]

For Model 1, the log-likelihood is \(-21791.042\), so the deviance is \(-2 \times (-21791.042) = 43582.084\).

\[ \begin{align} \text{AIC} & = 43582.084 + 2k \\ \text{BIC} & = 43582.084 + \ln(N)k \end{align} \]

Model 1 had five parameters (that is, it used five degrees of freedom). For AIC this means the penalty term is \(2 \times 5 = 10\).

\[ \text{AIC}= 43582.084 + 10 = 43592.084 \]

For BIC, the penalty term is \(\ln(N)\) times \(k\). In our example, \(\ln(N)=\ln(64101)=11.068\). So the penalty term is \(11.068 \times 5 = 55.341\).

\[ \text{BIC} = 43582.084 + 55.341 = 43637.425 \]
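These hand calculations are easy to reproduce in R, plugging in the numbers reported above:

```r
# Plugging Model 1's reported log-likelihood, k, and N into the formulas.
ll <- -21791.042
k  <- 5
n  <- 64101

-2 * ll + 2 * k        # AIC = 43592.084
-2 * ll + log(n) * k   # BIC = 43637.425 (penalty: log(n) * k = 55.341)
```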

We compare models by preferring the smaller value of AIC or BIC. Here, we can see that the two measures contradict: Model 2 is preferred by AIC, and Model 1 is preferred by BIC. Whenever AIC and BIC disagree, it will always be like this, with AIC preferring the model with more parameters and BIC preferring the model with fewer.

(In this example, however, given that Model 2’s results suggest a potential curvilinear relationship, in which those with only a high school diploma were less likely to be religious nones than both those with less and those with more education, we might consider opting for the less parsimonious model regardless.)