Problem Set: Background
Practice
If you complete this exercise using the Quarto file used to generate this page, you can change the format: header above to say format: pdf, and it will render as a PDF file that includes just the questions and your answers (and not, for example, this text).
Here you will use a dataset of your choosing, which may be data that you’ve used before. From these data, you will provide a substantively intelligible, empirical example in which the observed association between an explanatory variable and a continuous outcome variable is altered when a third variable (hereafter referred to as the covariate) is taken into account. All I mean by substantively intelligible is that you can articulate some plausible reason why the explanatory variable would be associated with the outcome and why conditioning on the covariate would alter that association.
For this:
To avoid potential problems with specifics, I would recommend your dataset have at least 400 observations.
You may consider an outcome to be continuous if it has at least 5 unique values in a metric that you are willing to treat as interval level, and you may transform, recode, or scale the variable as appropriate to achieve the latter.
You may use a binary or continuous variable for both the explanatory variable and the covariate. You can transform a factor variable into a dichotomy to this end (that is, pick category \(k\) of the factor variable and make the binary variable \(k\) vs not-\(k\)). If you use a continuous explanatory variable or covariate, you will need to transform it into a binary variable for some purposes below, as noted, but otherwise just leave it as continuous.
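The two recodings described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not a required approach; the variable names and values (`marital`, `age`) are hypothetical, and the median split is only one of many defensible cutpoints for turning a continuous variable into a binary one.

```python
import numpy as np

# Hypothetical data: a factor variable and a continuous variable.
marital = np.array(["married", "divorced", "never married", "married", "widowed"])
age = np.array([23.0, 35.0, 41.0, 58.0, 67.0])

# Factor -> dichotomy: category k ("married") vs. not-k.
is_married = (marital == "married").astype(int)

# Continuous -> binary: split at the median (an arbitrary choice of cutpoint).
age_above_median = (age > np.median(age)).astype(int)

print(is_married)
print(age_above_median)
```

Whatever cutpoint you choose, keep the original continuous variable around, since it is only the binary version that is needed "for some purposes below."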
- In 1-2 sentences, describe the dataset that you will be using, in the sense of what it is a dataset of.
- In 1-3 sentences, clearly describe the rationale of your example.
- That is, explain why it makes sense that the key explanatory variable would be associated with the outcome, and why the covariate would be associated with both the explanatory variable and the outcome. (You don’t have to worry about being particularly intellectual about this – or even correct! – the point is just to affirm you aren’t haphazardly hurling variables together.)
- Explain if you are thinking of the relationship between the explanatory variable and the outcome as causal or something else. This is not to require you to make any claims that the data you have are adequate to support a causal inference. Do you think of the covariate as a potential confounder, as a potential mediator, or something else with respect to the relationship between the explanatory variable and outcome?
- Give your three variables mnemonically transparent names if they do not already have them. If the explanatory variables include a factor variable, give the levels transparent labels if that would make it clearer.
- Provide summary statistics of your explanatory variables.
- Fit a linear regression model of the outcome on the key explanatory variable (not including the covariate). In one sentence, interpret the coefficient for the key explanatory variable.
- Fit a linear regression model of the outcome on the key explanatory variable and the covariate. In one sentence, interpret the coefficient for the key explanatory variable.
- Make a pleasant-looking table that presents the coefficients from the models fit without your covariate added (called “Model 1”) and with your covariate added (called “Model 2”).
The rows should be given substantive labels.
Standard errors should be placed in parentheses below the coefficients.
Stars should be used to denote significance at the .05, .01, and .001 levels, with a note below the table indicating this.
All coefficients and standard errors should be presented to the same number of decimal places, and no more than 4 digits after the decimal point should be presented.
  - If you have any statistically significant coefficients in which the magnitude of the coefficient is less than .04, you should use at least 3 decimal places.
  - If you have any statistically significant coefficients in which the magnitude of the coefficient is less than .004, you should use 4 decimal places.
  - Avoid using 4 decimal places unless you must given the above.
  - If you find yourself pressed to include more decimal places than you otherwise would because of continuous explanatory variables with really small coefficient magnitudes, you are urged to rescale them. Rescaling a continuous explanatory variable can reduce the number of decimal places needed: coding income in thousands of dollars instead of dollars, for example, multiplies its coefficient by 1,000, moving the decimal point three places to the right (so .00002 becomes .02).
  - In any case, the point of the foregoing is to keep you from presenting coefficients like .00 or .000 with stars after them in tables, which makes the actual magnitude of the coefficients in question mostly inscrutable.
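The decimal-place rules and the effect of rescaling can be sketched in a few lines of Python with NumPy. Everything here is illustrative: the data are simulated, the names (`decimals_needed`, `income_dollars`, `outcome`) are hypothetical, and the default of 2 decimal places when no small coefficients are present is an assumption, since the rules above only set a minimum and a maximum.

```python
import numpy as np

def decimals_needed(significant_coefs, default=2):
    """Decimal places implied by the rules above (default=2 is an assumption)."""
    smallest = min((abs(b) for b in significant_coefs), default=1.0)
    if smallest < 0.004:
        return 4
    if smallest < 0.04:
        return 3
    return default

def slope(x, y):
    """OLS slope from a regression of y on an intercept and x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Simulated (hypothetical) data: the outcome rises slightly with each dollar of income.
rng = np.random.default_rng(0)
n = 500
income_dollars = rng.uniform(20_000, 120_000, size=n)
outcome = 2.0 + 0.00003 * income_dollars + rng.normal(0, 1, size=n)

b_dollars = slope(income_dollars, outcome)            # tiny coefficient per dollar
b_thousands = slope(income_dollars / 1000, outcome)   # same relationship, per $1,000

print(f"per dollar:  {b_dollars:.4f}  -> decimals needed: {decimals_needed([b_dollars])}")
print(f"per $1,000:  {b_thousands:.4f}  -> decimals needed: {decimals_needed([b_thousands])}")
```

The per-dollar coefficient is unreadable even at 4 decimal places, while the per-thousand coefficient is exactly 1,000 times larger and legible at 3.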
You can look at various published papers for examples of tables of the form that I want. As an investment in future time-saving and reproducibility, I would recommend figuring out how to automate as much of the process as you can.
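As one possible starting point for that automation, here is a rough sketch in Python using only NumPy and the standard library. It is not the required method: the data are simulated, all names (`ols`, `stars`, `format_table`, `educ`, `parent_educ`, `income`) are hypothetical, and the p-values use a normal approximation rather than the t distribution that regression software normally reports, which is an assumption that is reasonable only for large samples.

```python
import math
import numpy as np

def ols(y, X):
    """OLS coefficients and classical standard errors (X must include the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, se

def stars(b, se):
    # Two-sided p-value from a normal approximation (assumption; fine for large n).
    p = math.erfc(abs(b / se) / math.sqrt(2))
    return "***" if p < .001 else "**" if p < .01 else "*" if p < .05 else ""

def format_table(models, labels, digits=2):
    """models: list of (names, beta, se); labels: dict of variable name -> row label."""
    header = f"{'':<22}" + "".join(f"{'Model ' + str(i + 1):>12}" for i in range(len(models)))
    lines = [header]
    for name, label in labels.items():
        coef_row, se_row = f"{label:<22}", f"{'':<22}"
        for names, beta, se in models:
            if name in names:
                j = names.index(name)
                coef_row += f"{beta[j]:.{digits}f}{stars(beta[j], se[j])}".rjust(12)
                se_row += f"({se[j]:.{digits}f})".rjust(12)  # SE in parentheses below
            else:
                coef_row += " " * 12
                se_row += " " * 12
        lines += [coef_row, se_row]
    lines.append("* p < .05, ** p < .01, *** p < .001")
    return "\n".join(lines)

# Simulated (hypothetical) data in which the covariate affects both other variables.
rng = np.random.default_rng(1)
n = 1000
parent_educ = rng.normal(12, 3, n)                                   # covariate
educ = 6 + 0.5 * parent_educ + rng.normal(0, 2, n)                   # explanatory variable
income = 10 + 2.0 * educ + 1.5 * parent_educ + rng.normal(0, 5, n)   # outcome

ones = np.ones(n)
m1 = (["const", "educ"], *ols(income, np.column_stack([ones, educ])))
m2 = (["const", "educ", "parent_educ"],
      *ols(income, np.column_stack([ones, educ, parent_educ])))

labels = {"educ": "Years of education",
          "parent_educ": "Parents' education",
          "const": "Intercept"}
table = format_table([m1, m2], labels)
print(table)
```

In this simulated example the coefficient on the explanatory variable shrinks once the covariate enters in Model 2, which is the kind of change the exercise asks you to find and explain in real data. Dedicated table packages will generally do this more robustly than a hand-rolled formatter.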