Problem Set: Background

Concept Comprehension

For the model fit below, \(\texttt{birwtoz}\) is the birthweight of an infant (measured in ounces), \(\texttt{smoke}\) is whether or not the mother smokes (1=yes, 0=no), and \(\texttt{gestat}\) is the gestational period (the amount of time the mother was pregnant before giving birth, measured in weeks).

lm_fit <- lm(birwtoz ~ smoke + gestat, data = df)
tula(lm_fit)
AIC = 72880.913                                        Number of obs =   8604
BIC = 72909.153                                        R-squared     = 0.1925
                                                       Adj R-squared = 0.1923
                                                       Root MSE      =  16.71
─────────────────────────────────────────────────────────────────────────────
birwtoz     │      Coef  Std. Err.          t     P>|t|  [95% Conf  Interval]
─────────────────────────────────────────────────────────────────────────────
smoke       │    -11.19      .5194     -21.54    <.0001     -12.21     -10.17
gestat      │     3.297     .08389       39.3    <.0001      3.133      3.462
(Intercept) │    -5.641      3.305     -1.707     .0879     -12.12      .8373
─────────────────────────────────────────────────────────────────────────────
  1. The estimate for \(\texttt{gestat}\) is 3.3. In one sentence, describe the relationship between gestational period and birthweight using this coefficient. [1]

  2. The estimate for \(\texttt{smoke}\) is -11.2. In one sentence, describe the relationship between smoking and birthweight using this coefficient. [1]

  3. Given the above estimates, what is the predicted birthweight of a baby born at 37 weeks whose mother smoked during her pregnancy? Explain or show how you calculated this. [2]
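
A hand-computed prediction of this kind can be checked against R's `predict()`. The sketch below uses simulated stand-in data: the data frame `sim`, its generating coefficients, and the 40-week example case are all made up for illustration and are not taken from the course data, so it shows the mechanics only.

```r
# Simulated stand-in data; NOT the course data set.
set.seed(1)
n <- 200
sim <- data.frame(
  smoke  = rbinom(n, 1, 0.3),            # hypothetical smoking indicator
  gestat = rnorm(n, mean = 39, sd = 2)   # hypothetical gestation in weeks
)
sim$birwtoz <- 10 + 3 * sim$gestat - 10 * sim$smoke + rnorm(n, sd = 15)

fit <- lm(birwtoz ~ smoke + gestat, data = sim)

# Prediction by hand: intercept + b_smoke * smoke + b_gestat * gestat
b <- coef(fit)
by_hand <- b["(Intercept)"] + b["smoke"] * 1 + b["gestat"] * 40

# The same prediction from predict()
from_predict <- predict(fit, newdata = data.frame(smoke = 1, gestat = 40))

# Compare the two values, ignoring element names
all.equal(unname(by_hand), unname(from_predict))
```

The same pattern applies to any fitted `lm` object: supply one row of covariate values in `newdata` and compare it with the sum of each coefficient times its covariate value.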

From the 2018 General Social Survey, here are estimates from a regression of the number of hours of television watched per day on marital status. The reference category is married.

model <- lm(tvhours ~ marstatus, data = gss18)
tula(model)
AIC = 7629.774                                           Number of obs =    1554
BIC = 7661.866                                           R-squared     = 0.02106
                                                         Adj R-squared = 0.01853
                                                         Root MSE      =   2.811
────────────────────────────────────────────────────────────────────────────────
tvhours        │      Coef  Std. Err.          t     P>|t|  [95% Conf  Interval]
────────────────────────────────────────────────────────────────────────────────
marstatus      │                                                                
  widowed      │      1.45      .2642      5.488    <.0001      .9316      1.968
  divorced     │     .5506      .1987      2.771     .0057      .1608      .9404
  separated    │     .5486       .463      1.185     .2362     -.3595      1.457
  never marr~d │     .4347       .175      2.485     .0131     .09155      .7779
(Intercept)    │      2.58       .108      23.89    <.0001      2.368      2.791
────────────────────────────────────────────────────────────────────────────────
  1. The coefficient for \(\texttt{widowed}\) is 1.45. In one sentence, interpret this coefficient. [1]

  2. Ignoring the question of whether the difference is large enough to be statistically significant, who watches more TV per day according to these results: divorced people, or people who have never married? By how much? [1]

  3. Suppose that we had used never married as the reference category instead of married. What would the coefficient for \(\texttt{widowed}\) have been? [1]

  4. When a researcher conducts an experiment in which participants have been randomly assigned to a treatment or a control group, there is no need to include any control variables (aka covariates) in a regression model of the outcome in order to estimate an unbiased treatment effect. Why not? [1]

    For the next three questions, imagine a researcher was interested in estimating the effect of earnings on happiness using observational data (e.g., data from a survey).

  5. What is a potential confounding variable that a researcher might want to adjust for in estimating this effect? (Present your answer in a way that makes clear you understand what makes something a confounding variable.) [1]

  6. What is a potential mediating variable that, depending on one’s purpose, one might want to include as a covariate? (Present your answer in a way that makes clear you understand what a mediating variable is.) [1]

  7. Including a potential confounding variable as a covariate in a regression analysis serves a different purpose than fitting a model in which one includes a potential mediating variable. What are these different purposes? [2]

    For the next two questions, here are estimates from a model using the NHANES height data (persons aged 3-18):

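library(haven)  # read_dta()
library(dplyr)  # %>%, filter(), mutate()

# Keep persons older than 2 and no older than 18; label the race/ethnicity categories
nhanes <- read_dta("../dta/nhanes_bodymeasures.dta") %>%
  filter(age > 2 & age <= 18) %>%
  mutate(raceeth = factor(raceeth5cat, levels = c(1, 2, 3, 4, 5),
                          labels = c("mex-am", "oth-hisp", "nonh-white", "nonh-black", "nonh-other")))
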
model <- lm(height ~ male + age + raceeth, data = nhanes)
tula(model)
AIC = 124991.034                                        Number of obs =  23467
BIC = 125055.541                                        R-squared     = 0.8705
                                                        Adj R-squared = 0.8704
                                                        Root MSE      =  3.469
──────────────────────────────────────────────────────────────────────────────
height       │      Coef  Std. Err.          t     P>|t|  [95% Conf  Interval]
──────────────────────────────────────────────────────────────────────────────
male         │     1.675      .0453      36.97    <.0001      1.586      1.763
age          │      1.93    .004901      393.8    <.0001       1.92       1.94
raceeth      │                                                                
  oth-hisp   │     .2111     .09516      2.219     .0265     .02462      .3976
  nonh-white │     .8946     .05993      14.93    <.0001      .7771      1.012
  nonh-black │     1.231     .05897      20.87    <.0001      1.115      1.346
  nonh-other │     .3371     .09741      3.461     .0005      .1462       .528
(Intercept)  │      34.4     .07164      480.2    <.0001      34.26      34.54
──────────────────────────────────────────────────────────────────────────────
  1. From these data, what is the predicted height of a 14-year-old non-Hispanic White boy? Explain/show how you arrive at this prediction. [1]

  2. In OLS, the linear predictor is often referred to as \(\hat{y}\). Why might it be preferable to refer to the linear predictor as \(\mathbf{x}\mathbf{\beta}\) rather than \(\hat{y}\), especially in a course about models beyond OLS? [2]

The following estimates are from a model of the logged number of hours a day the respondent watches television, with marital status as the explanatory variable.

  • Unlike the earlier example, the reference category is never married.
  • Respondents who report watching no TV are dropped, so the population is adults who report watching at least some TV on a typical day.
  • The coefficients have been exponentiated.
gss18_tvwatchers <- gss18 %>%
  filter(tvhours > 0) %>%
  mutate(marstatus = relevel(marstatus, ref = "never married"))

model <- glm(tvhours ~ marstatus, 
             family = gaussian(link = "log"), 
             data = gss18_tvwatchers)
tula(model, exp = TRUE)
Family: gaussian / Link: log
AIC            =  6883.797                          Number of obs   =    1409
BIC            =  6915.301                          McFadden R-sq   = 0.02692
Log likelihood = -3435.899                          Nagelkerke R-sq =  0.1916
─────────────────────────────────────────────────────────────────────────────
tvhours     │    exp(b)       DMSE          t     P>|t|  [95% Conf  Interval]
─────────────────────────────────────────────────────────────────────────────
marstatus   │                                                                
  married   │     .8017     .04647     -3.813     .0001      .7156      .8982
  widowed   │     1.208     .08657      2.637     .0085       1.05       1.39
  divorced  │     1.015     .06608      .2246     .8223      .8931      1.153
  separated │     1.101      .1491      .7102     .4777      .8442      1.436
(Intercept) │     3.463      .1458      29.51    <.0001      3.189      3.761
─────────────────────────────────────────────────────────────────────────────
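
The `exp(b)` column reports coefficients after exponentiation. As a rough sketch of how such values can be obtained in base R, the example below fits a log-link Gaussian model to simulated data. The data frame `sim`, the group labels, and the use of Wald intervals via `confint.default()` are assumptions for illustration; `tula()` may compute its intervals differently.

```r
# Simulated stand-in data with strictly positive outcomes; NOT the GSS data.
set.seed(2)
n <- 300
sim <- data.frame(group = factor(sample(c("a", "b"), n, replace = TRUE)))
sim$hours <- ifelse(sim$group == "b", 2.5, 3.5) + runif(n, min = -1, max = 1)

fit <- glm(hours ~ group, family = gaussian(link = "log"), data = sim)

# Coefficients and Wald intervals are on the log scale; exponentiate both
exp_b  <- exp(coef(fit))
exp_ci <- exp(confint.default(fit))

# One row per coefficient: exponentiated estimate, lower bound, upper bound
cbind(exp_b, exp_ci)
```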
  1. In one sentence, interpret the exponentiated coefficient of \(\texttt{married}\). The interpretation should be in terms of hours, not logged hours. [1]

  2. In one sentence, interpret the exponentiated coefficient of \(\texttt{widowed}\). The interpretation should be in terms of hours, not logged hours. [1]