y-hat and x-beta

The big point here is that what I will refer to as \(\mathbf{x\beta}\) (“x-beta”) is:

\[ \mathbf{x\beta} = \beta_0 + \beta_1x_1 + \text{ ... } + \beta_kx_k \]

or the result of what you are already used to seeing on the right-hand side of a regression equation.

Linear functions

Regression models use what in mathematics is called a linear function. A linear function has the following form:

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_kx_k \]

That is, the value on the left-hand side of the equation is a function of a set of \(k\) explanatory variables, usually denoted \(x\). Each explanatory variable is multiplied by a coefficient, usually denoted \(\beta\). There is also a constant term, sometimes called the intercept. I will write the constant term as \(\beta_0\), but \(\alpha\) is also often used.

Linear functions are very useful for modeling, and not just in OLS. They are even handy when we want to model processes that are not linear! This matters because, for our purposes, we want to stretch the use of linear functions as far as we usefully can, so that we keep building on what we already understand rather than trotting off into something else.

The linear predictor

In a linear function, the entire right-hand side, that is:

\[\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_kx_k\]

is sometimes called the linear predictor.
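If it helps to see the linear predictor as a computation, here is a minimal Python sketch for a single observation. Python is my choice for illustration only, and the function name `linear_predictor` and all of the numbers are made up for this example.

```python
def linear_predictor(beta, x):
    """Return beta_0 + beta_1*x_1 + ... + beta_k*x_k for one observation.

    beta: coefficients [beta_0, beta_1, ..., beta_k], constant first.
    x:    explanatory variable values [x_1, ..., x_k] (no constant).
    """
    if len(beta) != len(x) + 1:
        raise ValueError("beta must have one more element than x (the constant)")
    constant, slopes = beta[0], beta[1:]
    # Multiply each explanatory variable by its coefficient and add the constant.
    return constant + sum(b * v for b, v in zip(slopes, x))


# Hypothetical coefficients and values, chosen only for illustration.
beta = [1.0, 2.0, -0.5]   # beta_0, beta_1, beta_2
x = [3.0, 4.0]            # x_1, x_2

xb = linear_predictor(beta, x)
print(xb)  # 1.0 + 2.0*3.0 + (-0.5)*4.0 = 5.0
```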

In OLS regression, the linear predictor gives you the predicted value of the outcome: \(\hat{y}\).

But we want to keep using linear functions and the linear predictor outside of OLS. When we do, even though we will still be using the linear predictor, it will often not equal the predicted value of the outcome. So it would be wrong to keep calling it \(\hat{y}\).

My preferred term for the linear predictor is \(\mathbf{x\beta}\), pronounced “x-beta.” I am going to be saying “x beta” so much in this course that you will be repeatedly and irredeemably lost if you do not understand what I mean by it.

\(\mathbf{x\beta}\) is a reference to matrix algebra. A matrix is a set of values arranged as row(s) and column(s), and matrix algebra is algebra for operations upon matrices. \(\mathbf{x\beta}\) is the result of matrix \(\mathbf{x}\) being multiplied by matrix \(\mathbf{\beta}\), following the rules of matrix multiplication.

You do not need to know one iota of matrix algebra to understand what I mean when I say \(\mathbf{x\beta}\). We will be mostly avoiding matrix algebra in this course to keep things as familiar as possible.

Why we need \(\mathbf{x\beta}\)

You have seen the way of computing the predicted value of \(y\), \(\hat{y}\), in the OLS regression model written as:

\[ \hat{y} = \beta_0 + \beta_1x_1 + \text{ ... } + \beta_kx_k \]

For our purposes, life would be easier if you had been taught this as two separate equations. First:

\[ \mathbf{x\beta} = \beta_0 + \beta_1x_1 + \text{ ... } + \beta_kx_k \]

and, second, for linear regression:

\[ \hat{y} = \mathbf{x\beta} \]

The problem is that \(\hat{y}\) means something very specific, which is estimated \(y\). Putting a hat over something in statistics always means “estimated” (which, for an outcome, is the same as saying “predicted”).

However, once we depart from the familiar linear regression model for continuous variables, we run into situations where we will still use \(\mathbf{x\beta}\), but where it no longer equals the predicted value of \(y\). Why not?

- Sometimes we use the linear predictor in models for which there is no \(\hat{y}\) per se, but instead the outcome is something like \(\hat{\Pr}(y=1)\).
- Sometimes we use the linear predictor in models for which there needs to be an additional transformation of \(\mathbf{x\beta}\) to arrive at \(\hat{y}\). For example, one could fit a model in which \(\hat{y}=e^{\mathbf{x\beta}}\).
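To make the distinction concrete, here is a small Python sketch that starts from one value of the linear predictor and produces three different predictions. The value of `xb` is made up, and the logistic function used for the probability is an assumption on my part (it is what a logit model would use); the text above only says the outcome is "something like" a predicted probability.

```python
import math

# Hypothetical value of the linear predictor for one observation.
xb = 0.0

# Linear regression: the prediction is the linear predictor itself.
y_hat_ols = xb

# Assumed example (logit model): predicted probability via the logistic function.
pr_hat = 1.0 / (1.0 + math.exp(-xb))

# Exponential transformation from the text: y-hat = e^(x-beta).
y_hat_exp = math.exp(xb)

print(y_hat_ols, pr_hat, y_hat_exp)  # 0.0 0.5 1.0
```

The same `xb` gives three different answers, which is why it would be misleading to label the linear predictor itself as \(\hat{y}\) outside of linear regression.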

You’ll notice that throughout this page, I am taking the trouble to put \(\mathbf{x\beta}\) in boldface. That’s not for emphasis. Instead, I am following the common math convention of using bold to indicate a matrix. \(\mathbf{x\beta}\) is \(\mathbf{x\beta}\) because \(\beta_0 + \beta_1x_1 + \text{ ... } + \beta_kx_k\) can instead be expressed as the product of two matrices, \(\mathbf{x}\) and \(\mathbf{\beta}\).

Aside: Why isn’t the beta in boldface? The \(\mathbf{\beta}\) in \(\mathbf{x\beta}\) absolutely should be in boldface. Indeed, in the code used to make this webpage the beta is marked as boldface, but the math font used may not render it that way. After no small consternation, I have made my peace with this.

Summary of key points

Here is a quick summary of the terminology you do need to know surrounding \(\mathbf{x\beta}\):

- When I say \(\mathbf{x\beta}\), I am always referring to the linear predictor.
  - In OLS, \(\mathbf{x\beta} = \hat{y}\).
  - For most of what we will cover, however, \(\mathbf{x\beta} \neq \hat{y}\).
- I will sometimes use \(\mathbf{x}\) to refer to the whole set of explanatory variables. The boldface \(\mathbf{x}\) means a matrix, allowing us to refer to a set of (explanatory) variables.
- I will sometimes use \(\mathbf{\beta}\) (note: no subscript) to refer to the entire set of coefficients, including the constant.
- When I use \(\mathbf{x}_i\mathbf{\beta}\) or \(\mathbf{x}_i\), I am including the subscript \(i\) to make clear that we are talking about the specific values of the explanatory variables \(\mathbf{x}\) belonging to a particular observation.
- When I write \(\mathbf{x}_i\mathbf{\beta} + \mathbf{\varepsilon}_i\), this is just the linear predictor plus an error term.

Matrix addition and multiplication

Note: This section can be skipped.

We will mostly avoid matrix algebra in this course, and you can follow all the main ideas of the course without understanding matrix algebra, so long as you know what I mean when I say \(\mathbf{x\beta}\). But this section is provided as a basic introduction to matrix addition and multiplication.

Matrix multiplication is all you need to understand why \(\mathbf{x\beta}\) is the matrix way of saying:

\[ \mathbf{x\beta} = \beta_0 + \beta_1x_1 + \text{ ... } + \beta_kx_k \]

You were probably introduced to regression analysis using scalar algebra, which was probably just called “algebra.” A scalar is a single value, like \(4\) or \(y_i\). So the basic linear regression model in scalar algebra is commonly written as:

\[ y_i = \beta_0 + \beta_1x_{i,1} + \dots + \beta_kx_{i,k} + \epsilon_i \]

or:

\[{y_i = \beta_0 + \sum_{j=1}^{k}\beta_j x_{i,j} + \epsilon_i}\]

Statisticians generally do not work in scalar algebra; instead, they use matrix algebra. A matrix is a set of numbers, like [1, 2, 3, 9, 10].

Matrices can have rows and columns:

\[ \begin{bmatrix} 1 & 3 \\ 2 & 1 \\ 3 & 4 \\ 5 & 1 \\ 9 & 5 \\ \end{bmatrix} \]

This means that we can refer to an entire set of observations and variables just as a single matrix, without the need for the awkward notation of \(x_1\) … \(x_k\) as in scalar algebra. Specifically, we can refer to all of our explanatory variables together just as \(\mathbf{x}\). (I try to be consistent in using \(\mathbf{x}\) for a matrix and \(x\) for a scalar.)

\[ \mathbf{x} = \begin{bmatrix} 1 & x_{1, 1} & x_{1, \dots} & x_{1, k} \\ 1 & x_{\dots, 1} & \dots & x_{\dots, k} \\ 1 & x_{n, 1} & x_{n, \dots} & x_{n, k} \\ \end{bmatrix} \]
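Notice the leading column of ones in \(\mathbf{x}\): it is there so that the constant \(\beta_0\) gets multiplied by 1 for every observation. As a rough Python sketch, with made-up data and variable names of my own choosing, building such a matrix looks like this:

```python
# Hypothetical data: three observations on two explanatory variables.
observations = [
    [3.0, 4.0],
    [0.0, 2.0],
    [5.0, 1.0],
]

# Prepend a 1 to each row so that the constant beta_0 is multiplied by 1.
x_matrix = [[1.0] + row for row in observations]

for row in x_matrix:
    print(row)
```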

We can also use a vector (a matrix with only one row or one column) to represent our outcome:¹

\[ \mathbf{y} = \begin{bmatrix} y_{1} \\ y_{\dots} \\ y_{n} \\ \end{bmatrix} \]

and a separate vector to represent our coefficients:²

\[ \boldsymbol{\beta} = \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \beta_{\dots} \\ \beta_{k} \\ \end{bmatrix} \]

This means that the basic regression model can be written in matrix algebra:

\[\mathbf{y} = \mathbf{x}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]

This uses the two most basic operations, matrix addition and matrix multiplication.

Matrix addition works the way you would expect. You can add two matrices if they are the same size. Each element in the sum \(\mathbf{a} + \mathbf{b}\) is just the corresponding cell of \(\mathbf{a}\) added to the corresponding cell of \(\mathbf{b}\).

\[ \begin{bmatrix} 1 \\ 2 \\ 3 \\ \end{bmatrix} + \begin{bmatrix} 7 \\ 5 \\ 2 \\ \end{bmatrix} = \begin{bmatrix} 8 \\ 7 \\ 5 \\ \end{bmatrix} \]
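The same addition can be sketched in a few lines of Python. Here a matrix is stored as a list of rows, and `matrix_add` is a helper name I made up for this example, not something from a library.

```python
def matrix_add(a, b):
    """Add two matrices (lists of rows) of the same size, cell by cell."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must be the same size")
    return [[ca + cb for ca, cb in zip(ra, rb)] for ra, rb in zip(a, b)]


a = [[1], [2], [3]]   # 3 x 1 column vector
b = [[7], [5], [2]]   # 3 x 1 column vector
print(matrix_add(a, b))  # [[8], [7], [5]]
```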

Matrix multiplication, on the other hand, is not immediately analogous to scalar multiplication. For example, in matrix multiplication \(\mathbf{a}\times\mathbf{b}\) is not generally equal to \(\mathbf{b}\times\mathbf{a}\). Instead, matrix multiplication involves the same set of scalar operations that you are already familiar with as the way to compute the predicted value of \(y\) in the linear regression model.

To multiply \(\mathbf{a}\) by \(\mathbf{b}\), the number of columns of \(\mathbf{a}\) needs to be the same as the number of rows in \(\mathbf{b}\). Multiplying \(\mathbf{a}\) by \(\mathbf{b}\) yields an answer that has the same number of rows as \(\mathbf{a}\) and the same number of columns as \(\mathbf{b}\).
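As a quick illustration of the size rule, with numbers made up for this example: \(\mathbf{a}\) below has 3 columns and \(\mathbf{b}\) has 3 rows, so the product exists, and it has 2 rows (like \(\mathbf{a}\)) and 1 column (like \(\mathbf{b}\)).

\[ \underbrace{\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ \end{bmatrix}}_{2 \times 3} \underbrace{\begin{bmatrix} 1 \\ 0 \\ 2 \\ \end{bmatrix}}_{3 \times 1} = \underbrace{\begin{bmatrix} 1\cdot 1 + 2\cdot 0 + 3\cdot 2 \\ 4\cdot 1 + 5\cdot 0 + 6\cdot 2 \\ \end{bmatrix}}_{2 \times 1} = \begin{bmatrix} 7 \\ 16 \\ \end{bmatrix} \]

Reversing the order does not work here: \(\mathbf{b}\) has 1 column but \(\mathbf{a}\) has 2 rows, so \(\mathbf{b}\times\mathbf{a}\) is not defined.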

You multiply each element in a row of \(\mathbf{a}\) by the corresponding element in a column of \(\mathbf{b}\), and add up these products. For \(\mathbf{x}\) and \(\boldsymbol{\beta}\), this gives:

\[ \mathbf{x\beta} = \begin{bmatrix} \beta_{0}+\beta_{1}x_{(i=1)1}+\dots+\beta_{k}x_{(i=1)k} \\ \beta_{0}+\beta_{1}x_{(i=2)1}+\dots+\beta_{k}x_{(i=2)k} \\ \beta_{0}+\beta_{1}x_{(i=\dots)1}+\dots+\beta_{k}x_{(i=\dots)k} \\ \beta_{0}+\beta_{1}x_{(i=n)1}+\dots+\beta_{k}x_{(i=n)k} \\ \end{bmatrix} \]

That is:

\[ \mathbf{x}_i\boldsymbol{\beta} = \beta_0 + \beta_1x_{i1} + \text{ ... } + \beta_kx_{ik} \]

The subscript \(i\) in \(\mathbf{x}_i\boldsymbol{\beta}\) allows us to refer to just the \(i\)th row of the larger matrix \(\mathbf{x}\boldsymbol{\beta}\).
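For anyone who would like to check this by computer, here is a minimal Python sketch of the multiplication described above. The helper `matmul` and all of the numbers are made up for illustration, and matrices are stored as plain lists of rows rather than with any particular matrix library.

```python
def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    n_rows_a, n_cols_a = len(a), len(a[0])
    n_rows_b, n_cols_b = len(b), len(b[0])
    if n_cols_a != n_rows_b:
        raise ValueError("columns of a must equal rows of b")
    # Each cell is a row of a times a column of b, with the products added up.
    return [
        [sum(a[i][j] * b[j][c] for j in range(n_cols_a)) for c in range(n_cols_b)]
        for i in range(n_rows_a)
    ]


# Hypothetical data: 3 observations, a leading 1 for the constant, then x_1 and x_2.
x = [
    [1.0, 3.0, 4.0],
    [1.0, 0.0, 2.0],
    [1.0, 5.0, 1.0],
]
# Hypothetical coefficients as a column vector: beta_0, beta_1, beta_2.
beta = [[1.0], [2.0], [-0.5]]

xb = matmul(x, beta)      # 3 x 3 times 3 x 1 gives 3 x 1
print(xb)                 # [[5.0], [0.0], [10.5]]
print(xb[0])              # linear predictor for the first observation

try:
    matmul(beta, x)       # 3 x 1 times 3 x 3: columns of beta (1) != rows of x (3)
except ValueError as err:
    print("beta times x is not defined:", err)
```

Each row of `xb` is the linear predictor for one observation, and the failed second call shows why the order of multiplication matters.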

Footnotes

¹ A vector with one row is a row vector, and a vector with one column is a column vector.
² The error term can also be written as a column vector.