Problem Set: Linear Probability Model

Concept Comprehension

The example below shows a linear probability model using General Social Survey data from 1972-2018. The outcome is whether the respondent is a religious none (\(\texttt{norelig}\)), and the explanatory variable is years of education (\(\texttt{educ}\)).

model <- lm(norelig ~ educ, data = gss, weights = wtssall)
summary(model)


Call:
lm(formula = norelig ~ educ, data = gss, weights = wtssall)

Weighted Residuals:
     Min       1Q   Median       3Q      Max 
-0.38262 -0.13490 -0.10906 -0.07909  2.20734 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.0096048  0.0054034   1.778   0.0755 .  
educ        0.0085585  0.0004081  20.971   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3235 on 64191 degrees of freedom
Multiple R-squared:  0.006804,  Adjusted R-squared:  0.006789 
F-statistic: 439.8 on 1 and 64191 DF,  p-value: < 2.2e-16

For the case of someone with 12 years of education, what is \(\mathbf{x}\boldsymbol{\beta}\)? Explain or show how you arrived at your answer. [1]
Substantively, how would one interpret this value of \(\mathbf{x}\boldsymbol{\beta}\)? [1]
What is linear about the linear probability model? [1]
In published empirical articles, you will regularly see tables of descriptive statistics that report the mean and standard deviation of variables. For binary variables, providing both these quantities is not in fact necessary. Why not? [1]
The reason for using robust standard errors in the linear probability model is tied to an assumption of linear regression that is necessarily violated if the outcome is binary. Provide the term for the assumption that is violated, and explain what the assumption is. [1]
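
As a worked illustration of the mechanics only, the sketch below computes the linear predictor by hand from the coefficients printed above. It deliberately uses 16 years of education, a value chosen here for illustration rather than the one the question asks about. The names b, x, and xb are assumptions introduced for this sketch.

```r
# Coefficients copied from the summary(model) output above
b <- c(intercept = 0.0096048, educ = 0.0085585)

# Covariate vector for a hypothetical respondent with 16 years of education:
# a leading 1 multiplies the intercept, then the value of educ
x <- c(1, 16)

# Linear predictor: the sum of each covariate value times its coefficient
xb <- sum(x * b)
xb
```

With the fitted model object available, predict(model, newdata = data.frame(educ = 16)) should return the same quantity, up to the rounding in the printed coefficients.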

Here are estimates from a linear probability model. The data are from NHANES; the sample is persons aged 3-18; and the outcome is whether the respondent weighs over 100 pounds.

model <- lm(over100lbs ~ age, data = nhanes)
summary(model)


Call:
lm(formula = over100lbs ~ age, data = nhanes)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.99622 -0.17777 -0.01408  0.16747  0.98592 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.2314595  0.0047902  -48.32   <2e-16 ***
age          0.0818454  0.0004073  200.95   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2882 on 23377 degrees of freedom
  (1182 observations deleted due to missingness)
Multiple R-squared:  0.6333,    Adjusted R-squared:  0.6333 
F-statistic: 4.038e+04 on 1 and 23377 DF,  p-value: < 2.2e-16

Age here is measured in integer years. Given these estimates, how old would a child need to be for their predicted probability of being over 100 pounds to be greater than 1 (i.e., an impossible prediction)? Explain or show how you arrived at your answer.
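
One way to reason about this kind of threshold, sketched under assumptions: with a positive slope, the prediction b0 + b1 * x first exceeds 1 once x is larger than (1 - b0) / b1. The helper first_impossible_x and the coefficients passed to it below are hypothetical, chosen for illustration, and are not the NHANES estimates.

```r
# Smallest integer x at which b0 + b1 * x is strictly greater than 1.
# Assumes a positive slope (b1 > 0); hypothetical helper for illustration.
first_impossible_x <- function(b0, b1) {
  stopifnot(b1 > 0)
  floor((1 - b0) / b1) + 1
}

# Made-up coefficients, not taken from any output in this problem set
b0 <- -0.2
b1 <- 0.07

x_star <- first_impossible_x(b0, b1)

# Predictions just below and at the threshold, to check the crossing
pred_before <- b0 + b1 * (x_star - 1)
pred_at <- b0 + b1 * x_star
c(x_star = x_star, pred_before = pred_before, pred_at = pred_at)
```

The same two steps (solve for the crossing point, then round up to the next integer) apply to any linear probability model with a single integer-valued predictor.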