Problem Set: Background

Concept Comprehension

For the model fit below, \(\texttt{birwtoz}\) is the birthweight of an infant (measured in ounces), \(\texttt{smoke}\) is whether or not the mother smokes (1=yes, 0=no), and \(\texttt{gestat}\) is the gestational period (the amount of time the mother was pregnant before giving birth, measured in weeks).

lm_fit <- lm(birwtoz ~ smoke + gestat, data = df)
tula(lm_fit)

AIC = 72880.913                                        Number of obs =   8604
BIC = 72909.153                                        R-squared     = 0.1925
                                                       Adj R-squared = 0.1923
                                                       Root MSE      =  16.71
─────────────────────────────────────────────────────────────────────────────
birwtoz     │      Coef  Std. Err.          t     P>|t|  [95% Conf  Interval]
─────────────────────────────────────────────────────────────────────────────
smoke       │    -11.19      .5194     -21.54    <.0001     -12.21     -10.17
gestat      │     3.297     .08389       39.3    <.0001      3.133      3.462
(Intercept) │    -5.641      3.305     -1.707     .0879     -12.12      .8373
─────────────────────────────────────────────────────────────────────────────

The estimate for \(\texttt{gestat}\) is 3.3. In one sentence, describe the relationship between gestational period and birthweight using this coefficient. [1]
The estimate for \(\texttt{smoke}\) is -11.2. In one sentence, describe the relationship between smoking and birthweight using this coefficient. [1]
Given the above estimates, what is the predicted birthweight of a baby born at 37 weeks whose mother smoked during her pregnancy? Explain or show how you calculated this. [2]
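A prediction from this model is the fitted equation evaluated at chosen covariate values. The sketch below assumes the coefficients are copied by hand from the table above and uses a hypothetical helper, `predict_birwtoz`, that is not part of the original materials. It is shown for a different case than the one asked about (a non-smoking mother at 40 weeks).

```r
# Coefficients copied from the regression table above (rounded as printed)
b <- c(intercept = -5.641, smoke = -11.19, gestat = 3.297)

# Hypothetical helper: linear prediction of birthweight in ounces
predict_birwtoz <- function(smoke, gestat) {
  unname(b["intercept"] + b["smoke"] * smoke + b["gestat"] * gestat)
}

# Illustrative case only: non-smoker (smoke = 0), 40 weeks of gestation
predict_birwtoz(smoke = 0, gestat = 40)  # about 126.2 ounces
```

Because the coefficients are rounded as printed, the result may differ slightly from what `predict()` would return on the fitted object.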

From the 2018 General Social Survey, here are estimates from a regression of the number of hours of television watched per day on marital status. The reference category is married.

model <- lm(tvhours ~ marstatus, data = gss18)
tula(model)

AIC = 7629.774                                           Number of obs =    1554
BIC = 7661.866                                           R-squared     = 0.02106
                                                         Adj R-squared = 0.01853
                                                         Root MSE      =   2.811
────────────────────────────────────────────────────────────────────────────────
tvhours        │      Coef  Std. Err.          t     P>|t|  [95% Conf  Interval]
────────────────────────────────────────────────────────────────────────────────
marstatus      │                                                                
  widowed      │      1.45      .2642      5.488    <.0001      .9316      1.968
  divorced     │     .5506      .1987      2.771     .0057      .1608      .9404
  separated    │     .5486       .463      1.185     .2362     -.3595      1.457
  never marr~d │     .4347       .175      2.485     .0131     .09155      .7779
(Intercept)    │      2.58       .108      23.89    <.0001      2.368      2.791
────────────────────────────────────────────────────────────────────────────────

The coefficient for \(\texttt{widowed}\) is 1.45. In one sentence, interpret this coefficient. [1]
Ignoring the question of whether the difference is large enough to be statistically significant, who watches more TV per day according to these results: divorced people, or people who have never married? By how much? [1]
Say that, instead of using married as the reference category, we had used never married. What would the coefficient for \(\texttt{widowed}\) have been? [1]
When a researcher conducts an experiment in which participants have been randomly assigned to a treatment or a control group, there is no need to include any control variables (aka covariates) in a regression model of the outcome in order to estimate an unbiased treatment effect. Why not? [1]
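On the reference-category question above, it can help to see how coefficients for a categorical predictor relate under different baselines. The sketch below uses simulated data with made-up groups `a`, `b`, and `c` (not the GSS), so every variable name and value in it is an assumption for illustration. It refits the same model after `relevel()` and compares the result to a difference of the original coefficients.

```r
# Simulated data, for illustration only (not the GSS)
set.seed(42)
n <- 300
grp <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
y <- unname(c(a = 1, b = 2, c = 4)[as.character(grp)]) + rnorm(n)

# Same model, two reference categories
fit_ref_a <- lm(y ~ grp)            # reference category: "a"
grp_b <- relevel(grp, ref = "b")
fit_ref_b <- lm(y ~ grp_b)          # reference category: "b"

# Coefficient for "c" relative to "b", obtained two ways
diff_from_a <- unname(coef(fit_ref_a)["grpc"] - coef(fit_ref_a)["grpb"])
coef_from_b <- unname(coef(fit_ref_b)["grp_bc"])
all.equal(diff_from_a, coef_from_b)
```

Changing the reference category re-expresses the same fitted group means, so the two models fit the data identically.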

For questions 8–10, imagine a researcher was interested in estimating the effect of earnings on happiness using observational data (e.g., data from a survey).
What is a potential confounding variable that a researcher might want to adjust for in estimating this effect? (Present your answer in a way that makes clear you understand what makes something a confounding variable.) [1]
What is a potential mediating variable that, depending on one’s purpose, one might want to include as a covariate? (Present your answer in a way that makes clear you understand what a mediating variable is.) [1]
Including a potential confounding variable as a covariate in a regression analysis serves a different purpose than fitting a model in which one includes a potential mediating variable. What are these different purposes? [2]

For questions 11–12, here are estimates from a model using the NHANES height data (persons aged 3-18):

nhanes <- read_dta("../dta/nhanes_bodymeasures.dta") %>%
  filter(age > 2 & age <= 18) %>%
  mutate(raceeth = factor(raceeth5cat, levels = c(1, 2, 3, 4, 5),
                          labels = c("mex-am", "oth-hisp", "nonh-white", "nonh-black", "nonh-other")))
  
model <- lm(height ~ male + age + raceeth, data = nhanes)
tula(model)

AIC = 124991.034                                        Number of obs =  23467
BIC = 125055.541                                        R-squared     = 0.8705
                                                        Adj R-squared = 0.8704
                                                        Root MSE      =  3.469
──────────────────────────────────────────────────────────────────────────────
height       │      Coef  Std. Err.          t     P>|t|  [95% Conf  Interval]
──────────────────────────────────────────────────────────────────────────────
male         │     1.675      .0453      36.97    <.0001      1.586      1.763
age          │      1.93    .004901      393.8    <.0001       1.92       1.94
raceeth      │                                                                
  oth-hisp   │     .2111     .09516      2.219     .0265     .02462      .3976
  nonh-white │     .8946     .05993      14.93    <.0001      .7771      1.012
  nonh-black │     1.231     .05897      20.87    <.0001      1.115      1.346
  nonh-other │     .3371     .09741      3.461     .0005      .1462       .528
(Intercept)  │      34.4     .07164      480.2    <.0001      34.26      34.54
──────────────────────────────────────────────────────────────────────────────

From these data, what is the predicted height of a 14-year-old non-Hispanic White boy? Explain/show how you arrive at this prediction. [1]
In OLS, the linear predictor is often referred to as \(\hat{y}\). Why might it be preferable to refer to the linear predictor as \(\mathbf{x}\mathbf{\beta}\) rather than \(\hat{y}\), especially in a course about models beyond OLS? [2]

The following estimates are from a model of the number of hours a day the respondent watches television, fit on the log scale (a Gaussian GLM with a log link), with marital status as the explanatory variable.

Unlike the earlier example, the reference category is never married.
Respondents who report watching no TV are dropped, so the population is adults who report watching at least some TV in a typical day.
The coefficients have been exponentiated.

gss18_tvwatchers <- gss18 %>%
  filter(tvhours > 0) %>%
  mutate(marstatus = relevel(marstatus, ref = "never married"))

model <- glm(tvhours ~ marstatus, 
             family = gaussian(link = "log"), 
             data = gss18_tvwatchers)
tula(model, exp = TRUE)

Family: gaussian / Link: log
AIC            =  6883.797                          Number of obs   =    1409
BIC            =  6915.301                          McFadden R-sq   = 0.02692
Log likelihood = -3435.899                          Nagelkerke R-sq =  0.1916
─────────────────────────────────────────────────────────────────────────────
tvhours     │    exp(b)       DMSE          t     P>|t|  [95% Conf  Interval]
─────────────────────────────────────────────────────────────────────────────
marstatus   │                                                                
  married   │     .8017     .04647     -3.813     .0001      .7156      .8982
  widowed   │     1.208     .08657      2.637     .0085       1.05       1.39
  divorced  │     1.015     .06608      .2246     .8223      .8931      1.153
  separated │     1.101      .1491      .7102     .4777      .8442      1.436
(Intercept) │     3.463      .1458      29.51    <.0001      3.189      3.761
─────────────────────────────────────────────────────────────────────────────

In one sentence, interpret the exponentiated coefficient of \(\texttt{married}\). The interpretation should be in terms of hours, not logged hours. [1]
In one sentence, interpret the exponentiated coefficient of \(\texttt{widowed}\). The interpretation should be in terms of hours, not logged hours. [1]
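To see what an exponentiated coefficient from a log-link model measures, the sketch below fits the same kind of model to simulated data with two made-up groups, `a` and `b`. The data, names, and values are assumptions for illustration and have nothing to do with the GSS. It compares `exp(b)` from the model with the ratio of the two group means computed directly.

```r
# Simulated positive outcome for two hypothetical groups (not the GSS)
set.seed(7)
n <- 400
grp <- factor(sample(c("a", "b"), n, replace = TRUE))
mu <- ifelse(grp == "b", 3, 2)
y <- mu * exp(rnorm(n, sd = 0.3))

# Gaussian GLM with a log link, reference category "a"
fit <- glm(y ~ grp, family = gaussian(link = "log"))

# Exponentiated coefficient versus the ratio of observed group means
ratio_from_model <- unname(exp(coef(fit)["grpb"]))
ratio_of_means <- mean(y[grp == "b"]) / mean(y[grp == "a"])
c(model = ratio_from_model, means = ratio_of_means)
```

With a single categorical predictor, the two numbers should agree up to the convergence tolerance of `glm()`, which is why exponentiated coefficients in this setting read as multiplicative comparisons with the reference group.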