Soc 383 – Friendly Background Refresher

The questions below cover basic material that you are encouraged to review if you do not remember it. They are presented as questions because it is easy to mistake recognizing an answer when it is presented for actually knowing it. It is also easy to mistake thinking that one could articulate the answer to a question for actually being able to articulate it.

(NAQ) means Not A Question, which means either that I couldn't figure out a good way to word it as a question or decided it wasn't worth the bother.

The answers here were composed by Claude Opus 4.6, although I have reviewed them all and tweaked some.

Notation

What is $\sum_{i=1}^{4} i$ ?

Show answer

The sigma notation tells us to add up the values of $i$ from 1 to 4: $1 + 2 + 3 + 4 = 10$.

What is $3!$ ?

Show answer

The exclamation point denotes a factorial. $3! = 3 \times 2 \times 1 = 6$.
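As a quick check on the two notation answers above, here is a minimal Python sketch. It is not part of the original course materials, and the variable names are made up for illustration.

```python
import math

# Sigma notation: add up i for i = 1, 2, 3, 4.
# range(1, 5) yields 1, 2, 3, 4 because the upper bound is exclusive.
sigma_total = sum(i for i in range(1, 5))

# Factorial: 3! = 3 * 2 * 1.
three_factorial = math.factorial(3)

print(sigma_total)      # 10
print(three_factorial)  # 6
```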

When I write $\Pr(A|B)$, what does the $\Pr(\,)$ mean?

Show answer

$\Pr(\,)$ denotes probability — the probability that whatever is inside the parentheses is true.

When I write $\Pr(A|B)$, what does the $|$ mean?

Show answer

The vertical bar means "given" or "conditional on." So $\Pr(A|B)$ is read as "the probability of A given B" — i.e., the probability that A is true among cases where B is true.

When I write $y_i$ as an outcome variable, what does the $_i$ mean?

Show answer

The subscript $i$ indexes a specific observation (or individual, or case) in the data. So $y_i$ is the value of $y$ for the $i$th observation.

When I write $\hat{y}$, what does the $\hat{\phantom{y}}$ mean?

Show answer

The "hat" indicates a predicted or estimated value. So $\hat{y}$ is the predicted value of $y$ from a model.

When I write $\bar{x}$, what does the $\bar{\phantom{x}}$ mean?

Show answer

The bar denotes the mean (average). So $\bar{x}$ is the mean of $x$.

When I write $\mathbf{E}(x)$, what does the $\mathbf{E}$ mean?

Show answer

$\mathbf{E}$ denotes the expected value (or expectation) of a random variable. $\mathbf{E}(x)$ is the long-run average value of $x$, which for a population is equivalent to the population mean.

Univariate Statistics

How is the mean of a variable calculated?

Show answer

Sum all the values and divide by the number of values: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$.

What is the median of a variable?

Show answer

The median is the middle value when the observations are arranged in order from smallest to largest. If there is an even number of observations, the median is the average of the two middle values. It is the value that divides the distribution in half — 50% of values are above it and 50% are below it.

We can calculate the squared deviation of each value $x$ from the mean $\bar{x}$. What is the mean of the square of these deviations $(x_i - \bar{x})^2$ called?

Show answer

The variance.

What is the square root of this quantity called?

Show answer

The standard deviation.

How do you interpret the mean of a binary variable? For example, below we will have a variable smoke that is 1 if a mother smoked while pregnant and 0 if not. How would you interpret a mean of .14 for this variable?

Show answer

The mean of a binary (0/1) variable equals the proportion of cases that are 1. A mean of .14 means that 14% of mothers in the sample smoked while pregnant.
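A small sketch of why this works, using hypothetical data (100 made-up mothers, 14 of whom smoked; these are not the course data): summing a 0/1 variable counts the 1s, so dividing by the number of cases gives the proportion.

```python
# Hypothetical 0/1 data: 14 smokers (1) and 86 non-smokers (0).
smoke = [1] * 14 + [0] * 86

# The sum of a 0/1 variable is the count of 1s,
# so the mean is the share of cases coded 1.
mean_smoke = sum(smoke) / len(smoke)

print(mean_smoke)  # 0.14
```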

What is meant by the distribution of a variable?

Show answer

The distribution of a variable describes the set of possible values the variable can take and how frequently (or with what probability) each value or range of values occurs.

What does it mean for a variable to be standardized?

Show answer

A variable has been standardized when it has been transformed to have a mean of 0 and a standard deviation of 1. This is done by subtracting the mean and dividing by the standard deviation: $z_i = \frac{x_i - \bar{x}}{s}$.
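The mean, variance, standard deviation, and standardization steps from this section can be sketched together in Python. The data are hypothetical toy values chosen so the arithmetic is round. The sketch follows the definition given above (variance as the mean of the squared deviations, dividing by $N$); be aware that most statistical software divides by $N - 1$ for a sample, which gives slightly larger values.

```python
import math
import statistics

# Hypothetical toy data, chosen so the arithmetic comes out to round numbers.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

x_bar = statistics.fmean(x)                      # mean
squared_devs = [(xi - x_bar) ** 2 for xi in x]   # squared deviations from the mean
variance = statistics.fmean(squared_devs)        # mean of the squared deviations
sd = math.sqrt(variance)                         # standard deviation

# Standardize: subtract the mean, divide by the standard deviation.
z = [(xi - x_bar) / sd for xi in x]

print(x_bar, variance, sd)  # 5.0 4.0 2.0
print(z)                    # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

The standardized values have mean 0 and (using the same divide-by-$N$ definition) standard deviation 1.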

Normal Distribution

Draw the basic shape of a normal distribution.

Show answer

The normal distribution is a symmetric, bell-shaped curve centered on the mean, with tails that taper off equally on both sides.

Show that you understand the meaning of skew by drawing the example of the distribution of a variable that is skewed right.

Show answer

A right-skewed (positively skewed) distribution has a longer tail extending to the right. The bulk of the data is concentrated on the left.

Income is a classic example of a right-skewed variable: most people earn moderate amounts, but a small number earn extremely high amounts, creating a long right tail.

What are the mean and standard deviation of the standard normal distribution?

Show answer

The standard normal distribution has a mean of 0 and a standard deviation of 1.

Probability

What are possible values of a probability?

Show answer

A probability can range from 0 to 1, inclusive. A probability of 0 means the event is impossible; a probability of 1 means it is certain.

Interpret: $\Pr(y=1) = .6$

Show answer

The probability that $y$ equals 1 is 0.6, or 60%. In other words, there is a 60% chance that $y$ takes the value 1.

What does it mean to say that two outcomes are exclusive?

Show answer

Two outcomes are (mutually) exclusive if they cannot both occur at the same time. If one happens, the other cannot.

What does it mean to say that two outcomes are exhaustive?

Show answer

Two (or more) outcomes are exhaustive if they cover all possible outcomes — at least one of them must occur.

If a set of possible outcomes are exclusive and exhaustive, what is the sum of their probabilities?

Show answer

The sum of their probabilities is 1.

What does it mean to say that two events, say A and B, are independent of one another?

Show answer

Two events are independent if the occurrence of one does not affect the probability of the other occurring. Formally, $\Pr(A|B) = \Pr(A)$ and $\Pr(B|A) = \Pr(B)$. For example, if you roll two dice, the result of the first die does not affect the result of the second — knowing you rolled a 6 on the first die does not change the probability of rolling any particular number on the second.

If A and B are independent events, then how do we calculate the probability that A and B will both occur?

Show answer

Multiply their individual probabilities: $\Pr(A \text{ and } B) = \Pr(A) \times \Pr(B)$.
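The two-dice example can be checked by brute force. This is only an illustrative sketch: the events chosen (a 6 on the first die, a 3 on the second) and the helper names are made up, and it assumes all 36 outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event = share of outcomes for which it is true."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_is_six(o):
    return o[0] == 6

def second_is_three(o):
    return o[1] == 3

p_a = pr(first_is_six)
p_b = pr(second_is_three)
p_a_and_b = pr(lambda o: first_is_six(o) and second_is_three(o))

# Conditional probability: Pr(B | A) = Pr(A and B) / Pr(A).
p_b_given_a = p_a_and_b / p_a

print(p_a, p_b, p_a_and_b)  # 1/6 1/6 1/36
print(p_b_given_a)          # 1/6
```

Here $\Pr(B|A)$ equals $\Pr(B)$, and $\Pr(A \text{ and } B)$ equals $\Pr(A) \times \Pr(B)$, which is what independence requires.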

Exponents

What is $x^2 \times x^3$? [Hint: $(x \times x) \times (x \times x \times x)$]

Show answer

$x^2 \times x^3 = x^{2+3} = x^5$. When multiplying terms with the same base, you add the exponents.

What is $z^x \times z^y$?

Show answer

$z^x \times z^y = z^{x+y}$. Same rule: when multiplying terms with the same base, add the exponents.

What is $(z^{1/2})^2$ ?

Show answer

$(z^{1/2})^2 = z^{(1/2) \times 2} = z^1 = z$. Taking the square root and then squaring returns you to the original value.

What is $(z^x)^y$? [Hint: $(z^3)^2 = (z \times z \times z) \times (z \times z \times z)$]

Show answer

$(z^x)^y = z^{xy}$. When raising a power to another power, you multiply the exponents.

Logarithms

What is $\ln(1)$?

Show answer

$\ln(1) = 0$. This is because $e^0 = 1$.

Does every value of $x$ have a $\ln(x)$? If not, over what range of values is $\ln(x)$ defined?

Show answer

No. $\ln(x)$ is only defined for $x > 0$ (positive numbers). There is no natural log of zero or of a negative number.

When is the value of $\ln(x)$ negative?

Show answer

$\ln(x)$ is negative when $0 < x < 1$. For example, $\ln(0.5) \approx -0.69$.

If $\ln(x)$ is defined, then $\ln(x+1)$ is always greater than $\ln(x)$. As $x$ increases, however, does the difference $\ln(x+1) - \ln(x)$ get bigger, smaller, or stay the same?

Show answer

The difference gets smaller. The natural log is a concave function — it increases at a decreasing rate. So going from $x$ to $x + 1$ adds less and less to $\ln(x)$ as $x$ gets larger.
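To see the shrinking differences numerically, here is a short sketch at a few arbitrarily chosen values of $x$.

```python
import math

# ln(x + 1) - ln(x) at increasingly large x (values chosen arbitrarily).
xs = [1, 10, 100, 1000]
diffs = [math.log(x + 1) - math.log(x) for x in xs]

for x, d in zip(xs, diffs):
    print(x, d)
```

The printed differences fall from roughly 0.69 at $x = 1$ to roughly 0.001 at $x = 1000$. They are always positive but keep shrinking.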

What is $e^x e^y$? (Hint: You already answered this question earlier, just not explicitly for $e$.)

Show answer

$e^x e^y = e^{x+y}$. This is the same exponent rule from the previous section: when multiplying terms with the same base, add the exponents.

Since $x = e^{\ln(x)}$ and $y = e^{\ln(y)}$, we can write: $$xy \;=\; e^{\ln(x)} \cdot e^{\ln(y)} \;=\; e^{\,\ln(x)\,+\,\ln(y)}$$ But we also know that $xy = e^{\ln(xy)}$, so: $$\ln(xy) \;=\; \ln(x) + \ln(y)$$ In other words, logarithms turn multiplication into addition. This is why, historically, large tables of logarithms and mechanical devices like slide rules were so useful: to multiply $x$ and $y$, one could look up $\ln(x)$ and $\ln(y)$, add them together, and exponentiate the result.
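The "look up, add, exponentiate" procedure can be sketched as follows, with two arbitrary numbers standing in for $x$ and $y$. Results match only up to floating-point rounding, which is why the comparison uses a tolerance.

```python
import math

x, y = 8.0, 125.0  # arbitrary positive numbers for illustration

# "Look up" the logs, add them, then exponentiate the sum.
log_sum = math.log(x) + math.log(y)
product_via_logs = math.exp(log_sum)

print(math.isclose(log_sum, math.log(x * y)))   # True
print(math.isclose(product_via_logs, x * y))    # True
```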

Correlation

What does it mean to say that two variables are negatively correlated (or inversely correlated) with one another?

Show answer

Two variables are negatively correlated when higher values of one variable tend to be associated with lower values of the other, and vice versa. As one goes up, the other tends to go down.

What does it mean for two variables to be uncorrelated? (Other terms for uncorrelated that will be used are independent and orthogonal.)

Show answer

Two variables are uncorrelated when there is no linear relationship between them. Knowing the value of one variable does not help predict the value of the other (at least not in a linear way).

What is the range of possible values for the correlation coefficient?

Show answer

The correlation coefficient ranges from $-1$ to $+1$. A value of $-1$ indicates a perfect negative linear relationship, $+1$ indicates a perfect positive linear relationship, and 0 indicates no linear relationship.

What letter is commonly used in journal articles and elsewhere to denote the correlation coefficient?

Show answer

$r$. (Fun fact: the "r" is for "regression.")

What does the statement "correlation is not causation" mean?

Show answer

Just because two variables are correlated — that is, they tend to move together — does not mean that one causes the other. The association could be due to reverse causation (B causes A rather than A causes B), a common cause (some third variable C causes both A and B), or some other non-causal explanation.

Variable Types

The words "dummy variable", "indicator variable", "dichotomous variable", and "binary variable" are often used interchangeably. Whatever it is called, what does it mean?

Show answer

A variable that can take only two possible values, typically coded as 0 and 1. The value 1 usually indicates the presence of some characteristic and 0 its absence.

What does it mean for a variable to be an ordinal variable?

Show answer

An ordinal variable is a categorical variable whose categories have a meaningful rank or order, but the distances between categories are not necessarily equal. For example, education level (less than high school, high school, some college, BA, post-BA) is ordinal — the categories have a clear ordering, but the "distance" between each category is not the same.

What does it mean for a variable to be a nominal variable?

Show answer

A nominal variable is a categorical variable whose categories have no inherent order or ranking. Examples include race/ethnicity, religious affiliation, or region of residence. The categories are simply labels — there is no sense in which one category is "higher" or "lower" than another.

Explanatory Variables and Outcomes

Below we will be estimating a regression model for the research question of whether pregnant women who smoke give birth to smaller birthweight babies. We will call the variable that indicates whether the mother smokes smoke, and the variable measuring the weight of the baby birthweight.

Which of these is the independent variable in the regression model and which is the dependent variable?

Show answer

smoke is the independent variable (the variable we think might do the explaining). birthweight is the dependent variable (the outcome we are trying to explain).

I will often use the terms explanatory variable and outcome variable instead of independent variable and dependent variable. Which terms are synonyms?

Show answer

Explanatory variable = independent variable. Outcome variable = dependent variable.

When talking about regression models, I will sometimes use the terms slope and intercept instead of coefficient and constant. Which terms are synonyms?

Show answer

Slope = coefficient (the effect of an explanatory variable). Intercept = constant (the predicted value when all explanatory variables are 0).

I will often use $x$ and $y$ to refer to explanatory variables and outcomes algebraically. Which letter corresponds to which term?

Show answer

$x$ = explanatory variable (independent variable). $y$ = outcome variable (dependent variable).

Sometimes you will hear people refer to left hand side variables and right hand side variables (sometimes abbreviated to LHS or RHS). This refers to how regression models are conventionally written as equations. Which side corresponds to the explanatory variables and which to the outcome?

Show answer

The outcome variable is on the left hand side (LHS), and the explanatory variables are on the right hand side (RHS). This comes from the conventional form $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots$

We will often plot relationships using the $x$- and $y$-axis of a graph. Which is the vertical axis and which is the horizontal?

Show answer

The $x$-axis is horizontal and the $y$-axis is vertical. So in a regression context, the explanatory variable is on the horizontal axis and the outcome variable is on the vertical axis.

Additional Basic Terminology

What Greek letter is used for regression coefficients?

Show answer

Beta ($\beta$). The coefficients in a regression model are typically written as $\beta_0$ (the intercept), $\beta_1$ (the coefficient for the first explanatory variable), and so on.

What does the $N$ of a study refer to?

Show answer

The sample size — the total number of observations (individuals, cases) in the study.

regress fits an OLS regression model. What does OLS stand for?

Show answer

Ordinary Least Squares.

We will use the words error and residual more or less interchangeably in the class. In the context of regression, what is a residual?

Show answer

A residual is the difference between an observation's actual (observed) value of the outcome and its predicted value from the regression model: $e_i = y_i - \hat{y}_i$. It represents the part of the outcome that the model does not explain.

Interpreting Regression Output

In these data, birwtoz is the birth weight of an infant (measured in ounces), smoke is whether or not the mother smokes (1=yes, 0=no), and gestat is the gestational period (the amount of time the mother was pregnant before giving birth, measured in weeks).

The coefficient for gestat is 3.17. In one sentence, describe the relationship between gestational period and birth weight using this coefficient.

Show answer

Each additional week of gestational period is associated with an increase of approximately 3.17 ounces in birth weight, holding smoking status constant.

The coefficient for smoke is −10.68. In one sentence, describe the relationship between smoking and birthweight using this coefficient.

Show answer

Mothers who smoked during pregnancy had babies that weighed approximately 10.68 ounces less than babies of non-smoking mothers, holding gestational period constant.

Consider a mother who smoked while pregnant and gave birth at 37 weeks. What is the predicted birthweight of her baby?

Show answer

Plug the values into the regression equation: $\hat{y} = -2.33 + (-10.68)(1) + (3.17)(37)$. That is, take the constant, add the coefficient for smoke times 1 (because she smoked), and add the coefficient for gestat times 37. This gives approximately $-2.33 - 10.68 + 117.29 = 104.28$ ounces.
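The same plug-in calculation can be written as a small function. The coefficients are the ones quoted in the answer above; the function name and argument names are made up for this sketch.

```python
# Coefficients as quoted in the answer above.
INTERCEPT = -2.33
B_SMOKE = -10.68   # ounces, smokers vs. non-smokers
B_GESTAT = 3.17    # ounces per additional week of gestation

def predict_birwtoz(smoke, gestat):
    """Predicted birth weight in ounces (hypothetical helper)."""
    return INTERCEPT + B_SMOKE * smoke + B_GESTAT * gestat

smoker_37 = predict_birwtoz(smoke=1, gestat=37)
nonsmoker_37 = predict_birwtoz(smoke=0, gestat=37)

print(round(smoker_37, 2))                 # 104.28
print(round(smoker_37 - nonsmoker_37, 2))  # -10.68
```

The second line shows what "holding gestational period constant" means: at the same number of weeks, the predicted difference between smokers and non-smokers is exactly the smoke coefficient.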

(Intercept) in the output is the constant of the model. In OLS regression, the constant is the predicted value of the outcome variable for an observation in which all the explanatory variables are 0. This is only substantively meaningful if 0 is a substantively possible value for every variable.

Is the coefficient for smoke "statistically significant" by the prevailing conventions of social science? How can you tell?

Show answer

Yes. You can tell in multiple ways: the p-value (shown in the P>|t| column) is <.0001, which is well below the conventional threshold of 0.05. Equivalently, the 95% confidence interval (−12.13 to −9.23) does not include 0.

The right side of the regression output contains the lower bound and upper bound of the 95% confidence interval for the coefficients. How are those values calculated?

Show answer

The 95% confidence interval is calculated as the coefficient $\pm$ approximately $1.96 \times$ the standard error. For example, for smoke: $-10.68 \pm 1.96 \times 0.74 \approx (-12.13, -9.23)$.
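As a sketch of that calculation, using the coefficient and standard error quoted above and the large-sample multiplier 1.96 (software typically uses the exact critical value from the t distribution, which is very close to 1.96 when the sample is large):

```python
coef = -10.68   # coefficient for smoke, from the answer above
se = 0.74       # its standard error, from the answer above
z_crit = 1.96   # approximate 95% critical value

lower = coef - z_crit * se
upper = coef + z_crit * se

print(round(lower, 2), round(upper, 2))  # -12.13 -9.23

# The interval excludes 0, which matches "significant at the .05 level".
excludes_zero = not (lower <= 0 <= upper)
print(excludes_zero)  # True
```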

The output above says the Root MSE is 16.24. What does the MSE stand for? If we add additional explanatory variables to the model, will the Root MSE increase or decrease?

Show answer

MSE stands for Mean Squared Error — it is the average of the squared residuals. The Root MSE is its square root, which can be interpreted as the standard deviation of the residuals (roughly, the typical size of a prediction error in the units of the outcome). Adding useful explanatory variables to the model should decrease the Root MSE, because the model will do a better job predicting the outcome and the residuals will be smaller.

The output above says the R-squared is .1744. Explain what this means.

Show answer

$R^2 = 0.1744$ means that approximately 17.44% of the variance in birth weight is explained by the two explanatory variables in the model (smoking status and gestational period). The remaining 82.56% of the variance is unexplained by the model.

Here are some more estimates, using the General Social Survey cumulative file (through 2018). The variable WORDSUM is a 10-item vocabulary test, so its units are "correct answers." The variable age is age in years, and I(age^2) in the output refers to the coefficient for age-squared. male is interviewer-coded gender and edcat is highest educational attainment (the omitted category is less than high school).

Draw the shape of the relationship between age and WORDSUM implied by these results.

Show answer

Because the coefficient on age is positive and the coefficient on age-squared is negative, the relationship is an inverted-U shape (a downward-opening parabola). WORDSUM scores increase with age up to a peak, and then decline at older ages. The peak occurs around age $\frac{0.0461}{2 \times 0.000343} \approx 67$ years.
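One way to see where the peak formula comes from, writing the fitted curve as $\hat{y} = \beta_0 + \beta_1\,\text{age} + \beta_2\,\text{age}^2 + \ldots$ with the other variables held fixed (the $\beta$ labels here are just shorthand for the age and age-squared coefficients): the curve peaks where its slope with respect to age is zero. $$\frac{d\hat{y}}{d\,\text{age}} = \beta_1 + 2\beta_2\,\text{age} = 0 \quad\Longrightarrow\quad \text{age}^{*} = -\frac{\beta_1}{2\beta_2} = -\frac{0.0461}{2 \times (-0.000343)} \approx 67.2$$ Because $\beta_2 < 0$, this turning point is a maximum rather than a minimum.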

From these estimates, what is the predicted value of WORDSUM for a 50-year-old man with a BA-level college degree? You do not actually have to calculate it if you make clear how you would calculate it.

Show answer

Plug each value into the regression equation: $$\hat{y} = \underbrace{3.255}_{\text{intercept}} + \underbrace{0.0461}_{\text{age}} \times 50 + \underbrace{(-0.000343)}_{\text{age}^2} \times 50^2 + \underbrace{(-0.1558)}_{\text{male}} \times 1 + \underbrace{2.693}_{\text{BA-level}} \times 1$$ Working it out: $$3.255 + 2.305 - 0.858 - 0.156 + 2.693 = 7.24 \text{ correct answers}$$
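The same calculation as a short Python sketch, using the coefficients quoted above. The variable names are made up. The coefficients for the other education categories are left out because the person is in the BA category, so those dummies are 0 and contribute nothing.

```python
# Coefficients as quoted in the answer above.
intercept = 3.255
b_age = 0.0461
b_age_sq = -0.000343
b_male = -0.1558
b_ba = 2.693

age, male, ba = 50, 1, 1  # a 50-year-old man with a BA-level degree

wordsum_hat = (
    intercept
    + b_age * age
    + b_age_sq * age ** 2   # age-squared term uses 50^2 = 2500
    + b_male * male
    + b_ba * ba
)

print(round(wordsum_hat, 2))  # 7.24
```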

The coefficient for BA-level degree is 2.69. In one sentence, describe the relationship between BA-degree and vocabulary test score using this coefficient.

Show answer

People with a BA-level degree score approximately 2.69 points higher on the vocabulary test than people with less than high school education (the omitted reference category), holding age and gender constant.

The estimates below use the 2000–2018 General Social Survey. In addition to measures described above, inc here represents respondent income (2018 dollars, measured in thousands of dollars), centered so that the mean is 0. The term male*inc represents an interaction term.

The estimated constant in the model is 6.38. In this instance, how may that value of 6.38 be interpreted?

Show answer

The constant is the predicted WORDSUM score when all explanatory variables are 0. Here that means: a woman (male = 0) with average income (inc = 0, because income is centered at its mean). So 6.38 is the predicted vocabulary score for a woman at the mean income level.

The coefficient for male listed above is −.40. In one sentence, how should one interpret this coefficient?

Show answer

At mean income (where inc = 0), men score approximately 0.40 points lower on the vocabulary test than women. The reason we centered inc so that the mean = 0 is so we obtain a coefficient that can be interpreted this way.

The coefficient for the interaction term listed above is −.0031. In one sentence, how should one interpret this coefficient?

Show answer

The income slope is 0.0031 points per thousand dollars smaller for men than for women, so the positive relationship between income and vocabulary score is weaker for men. Equivalently, for each additional thousand dollars of income, the gender gap in vocabulary scores (with men scoring lower) widens by about 0.003 points.
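A sketch of how the male coefficient and the interaction term combine. In a model with a male*inc interaction, the predicted male-minus-female difference at a given income is the male coefficient plus the interaction coefficient times inc. The function name is made up, and the income values are arbitrary illustrations.

```python
b_male = -0.40          # coefficient for male, as quoted above
b_male_x_inc = -0.0031  # coefficient for the male*inc interaction

def gender_gap(inc):
    """Predicted male-minus-female WORDSUM difference at centered income
    `inc` (thousands of dollars relative to the mean). Hypothetical helper."""
    return b_male + b_male_x_inc * inc

print(round(gender_gap(0), 3))    # -0.4   (at mean income)
print(round(gender_gap(10), 3))   # -0.431 ($10,000 above the mean)
print(round(gender_gap(-10), 3))  # -0.369 ($10,000 below the mean)
```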

Standard Errors and Related Concepts

The standard error reflects a specific kind of uncertainty about a coefficient, usually conceived as the uncertainty involved in using a probability sample to draw conclusions about a population. More formally, the standard error of a coefficient estimates the standard deviation of the sampling distribution for that coefficient. What is a sampling distribution? (One approach is to explain this by explaining how the sampling distribution of an explanatory variable's regression coefficient is different from the distribution of the explanatory variable itself.)

Show answer

The sampling distribution of a statistic (such as a regression coefficient) is the distribution of values that statistic would take across all possible samples of the same size drawn from the same population. For example, the distribution of the variable age in your data describes the range and frequency of ages among the individuals in your sample. But the sampling distribution of the regression coefficient for age describes how that coefficient would vary if you drew the sample over and over — each sample would give a slightly different estimate, and those estimates form the sampling distribution. The standard error measures the spread (standard deviation) of that sampling distribution.

How do changes in sample size affect the standard error?

Show answer

As the sample size increases, the standard error decreases. Larger samples give us more precise estimates, so there is less variability in the sampling distribution.
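A small simulation can make both the sampling distribution and the effect of sample size concrete. Everything here is made up for illustration: the "population" is a simulated one in which the true slope is 2, the sample sizes are 100 and 400, and the number of repeated samples is 500. The sketch draws many samples, fits a one-variable regression to each, and looks at how much the estimated slopes vary.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope of the one-variable OLS regression of y on x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx

def simulate_slopes(n, reps, rng):
    """Draw `reps` samples of size `n` and return the slope from each."""
    slopes = []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        # Simulated population: true intercept 1, true slope 2, noisy outcome.
        y = [1.0 + 2.0 * xi + rng.gauss(0, 3) for xi in x]
        slopes.append(ols_slope(x, y))
    return slopes

rng = random.Random(383)  # fixed seed so the run is reproducible

# The spread of the estimates across samples is what the standard error estimates.
spread_n100 = statistics.stdev(simulate_slopes(100, 500, rng))
spread_n400 = statistics.stdev(simulate_slopes(400, 500, rng))

print(spread_n100, spread_n400)
```

With four times as many observations per sample, the spread of the slope estimates should come out at roughly half the size, since the standard error shrinks in proportion to the square root of the sample size.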

When the standard error decreases, what happens to the t-statistic (does it increase or decrease?), the p-value (does it increase or decrease?), and the confidence interval (does it narrow or widen?)?

Show answer

When the standard error decreases (and the coefficient itself stays the same), the t-statistic increases in absolute value, the p-value decreases, and the confidence interval narrows.

How does the value of the t-statistic relate to the value of the coefficient and standard error?

Show answer

The t-statistic is the coefficient divided by its standard error: $t = \frac{\text{coefficient}}{\text{standard error}}$. It measures how many standard errors the coefficient is from zero.
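For example, using the smoke coefficient and standard error from the birth weight model earlier (a quick sketch; 1.96 is the approximate large-sample cutoff for the .05 level):

```python
coef = -10.68  # coefficient for smoke
se = 0.74      # its standard error

t_stat = coef / se

print(round(t_stat, 2))    # -14.43
print(abs(t_stat) > 1.96)  # True
```

The coefficient is more than 14 standard errors away from zero, which is why its p-value is so small.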

What is the null hypothesis being "tested" by the p-value for a given regression coefficient?

Show answer

The null hypothesis is that the true population coefficient is zero — that is, the explanatory variable has no effect on (no linear relationship with) the outcome, after accounting for the other variables in the model.

One will sometimes hear an estimator (like, for example, OLS) described as efficient. What does it mean for an estimator to be efficient in terms of the standard errors?

Show answer

An efficient estimator produces estimates with the smallest possible standard errors (among some class of estimators). In other words, it makes the most precise use of the data — there is less variability in its estimates across repeated samples compared to other estimators.

Regression Miscellany

What does it mean to say that an outcome is homoskedastic with respect to an explanatory variable?

Show answer

Homoskedasticity means the variance of the outcome variable (or equivalently, the variance of the residuals) is constant across all values of the explanatory variable. If the spread of the residuals is wider for some values of $x$ than for others, the data are heteroskedastic.

What is the range of possible values of $R^2$?

Show answer

$R^2$ ranges from 0 to 1. A value of 0 means the model explains none of the variance in the outcome; a value of 1 means it explains all of the variance.

The estimates we get when we fit an OLS model represent the slopes and intercept for the best-fitting regression line through the data. In what sense is it the best fitting? (Hint: I'm fishing for an answer that has to do with the "LS" part of "OLS.")

Show answer

OLS finds the line that minimizes the sum of the squared residuals — that is, the sum of the squared differences between the observed values and the predicted values. The "Least Squares" in OLS refers to this criterion: among all possible lines, OLS chooses the one for which the total squared error is as small as possible.
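The least-squares criterion can be illustrated with a tiny hypothetical dataset and one explanatory variable. The sketch uses the standard closed-form formulas for the one-variable case and then checks that nudging the slope or the intercept away from the OLS values makes the sum of squared residuals larger. The data and function names are made up.

```python
def fit_line(x, y):
    """Closed-form OLS intercept and slope for one explanatory variable."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum over observations of (observed y - predicted y) squared."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Hypothetical toy data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

intercept, slope = fit_line(x, y)
ssr_ols = sum_squared_residuals(x, y, intercept, slope)

# Any other line does worse on this criterion, for example:
ssr_steeper = sum_squared_residuals(x, y, intercept, slope + 0.1)
ssr_shifted = sum_squared_residuals(x, y, intercept + 0.1, slope)

print(round(intercept, 2), round(slope, 2))  # 2.2 0.6
print(round(ssr_ols, 2))                     # 2.4
print(ssr_ols < ssr_steeper, ssr_ols < ssr_shifted)  # True True
```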

What is the shape of the sampling distribution of a regression coefficient? (Hint: you may have had this point presented in the context of the Central Limit Theorem)

Show answer

The sampling distribution of a regression coefficient is approximately normal (bell-shaped), especially with large samples. This follows from the Central Limit Theorem, which tells us that the distribution of sample-based estimates tends toward a normal distribution as the sample size grows, regardless of the shape of the underlying data distribution.