Lecture 9
Simple regression:
Response variable Y depends on one predictor X
The goal is to learn the distribution of
Y | X
Knowing Y | X allows us to predict values of Y from values of X
General regression:
The response variable Y depends on p predictors Z_1 , \ldots, Z_p
The goal is to learn the distribution of Y | Z_1, \ldots, Z_p
Knowing Y | Z_1 , \ldots, Z_p allows us to predict Y from the values of Z_1 , \ldots, Z_p
Suppose given random variables Z_1 , \ldots, Z_p and Y
Notation: We denote points of \mathbb{R}^p by z = (z_1, \ldots, z_p)
R \colon \mathbb{R}^p \to \mathbb{R}\,, \qquad \quad R(z) := {\rm I\kern-.3em E}[Y | Z_1 = z_1, \ldots , Z_p = z_p]
Notation: We use the shorthand {\rm I\kern-.3em E}[Y | z_1, \ldots, z_p] := {\rm I\kern-.3em E}[Y | Z_1 = z_1, \ldots , Z_p = z_p]
Assumption: For each i =1 , \ldots, n we observe a data point (z_{i1}, \ldots, z_{ip}, y_i)
Key assumption: The regression function R(z) = {\rm I\kern-.3em E}[Y | z_1, \ldots, z_p] is linear
{\rm I\kern-.3em E}[Y | z_1, \ldots, z_p] = \beta_1 z_1 + \ldots + \beta_p z_p \,, \quad \forall \, z \in \mathbb{R}^p
\beta_1, \ldots, \beta_p are called regression coefficients
Note:
In simple regression we have 2 coefficients \alpha and \beta
In multiple regression we have p coefficients \beta_1, \ldots, \beta_p
Suppose that for each i =1 , \ldots, n we observe the predictor values z_{i1}, \ldots, z_{ip} and a random variable Y_i
Assumptions:
Predictor is known: The values z_{i1}, \ldots, z_{ip} are known
Normality: The distribution of Y_i is normal
Linear mean: There are parameters \beta_1,\ldots,\beta_p such that {\rm I\kern-.3em E}[Y_i] = \beta_1 z_{i1} + \ldots + \beta_p z_{ip}
Common variance (Homoscedasticity): There is a parameter \sigma^2 such that {\rm Var}[Y_i] = \sigma^2
Independence: The random variables Y_1 \,, \ldots \,, Y_n are independent
Proof: Similar to the proof of Proposition in Slide 63 Lecture 8. Exercise
Introduce the column vectors \beta := (\beta_1, \ldots, \beta_p)^T \,, \qquad y := (y_1,\ldots,y_n)^T
Y_i \sim N(\beta_1 z_{i1} + \ldots + \beta_p z_{ip} , \sigma^2 )
f_{Y_i} (y_i) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left( -\frac{(y_i - \beta_1 z_{i1} - \ldots - \beta_p z_{ip})^2}{2\sigma^2} \right)
Since Y_1,\ldots, Y_n are independent we obtain \begin{align*} L(\beta, \sigma^2 | y) & = f(y_1,\ldots,y_n) \\ & = \prod_{i=1}^n f_{Y_i}(y_i) \\ & = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left( -\frac{(y_i - \beta_1 z_{i1} - \ldots - \beta_p z_{ip})^2}{2\sigma^2} \right) \\ & = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( -\frac{\sum_{i=1}^n(y_i- \beta_1 z_{i1} - \ldots - \beta_p z_{ip})^2}{2\sigma^2} \right) \end{align*}
This concludes the proof
For each i = 1 , \ldots, n suppose given p observations z_{i1}, \ldots, z_{ip}
\begin{align*} y - Z\beta & = \left( \begin{array}{c} y_1 \\ y_2 \\ \ldots \\ y_n \\ \end{array} \right) - \left( \begin{array}{ccc} z_{11} & \ldots & z_{1p} \\ z_{21} & \ldots & z_{2p} \\ \ldots & \ldots & \ldots \\ z_{n1} & \ldots & z_{np} \\ \end{array} \right) \left( \begin{array}{c} \beta_1 \\ \ldots \\ \beta_p \\ \end{array} \right) \\[20pt] & = \left( \begin{array}{c} y_1 \\ y_2 \\ \ldots \\ y_n \\ \end{array} \right) - \left( \begin{array}{c} \beta_1 z_{11} + \ldots + \beta_p z_{1p}\\ \beta_1 z_{21} + \ldots + \beta_p z_{2p} \\ \ldots \\ \beta_1 z_{n1} + \ldots + \beta_p z_{np} \\ \end{array} \right) \\[20pt] & = \left( \begin{array}{c} y_1 - \beta_1 z_{11} - \ldots - \beta_p z_{1p}\\ y_2 - \beta_1 z_{21} - \ldots - \beta_p z_{2p} \\ \ldots \\ y_n - \beta_1 z_{n1} - \ldots - \beta_p z_{np} \\ \end{array} \right) \, \in \, \mathbb{R}^n \end{align*}
From the previous calculation we get
(y - Z \beta)^T \, (y - Z\beta) = \sum_{i=1}^n \left( y_i - \beta_1 z_{i1} - \ldots - \beta_p z_{ip} \right)^2
Using vectorial notation the likelihood function can be re-written as
\begin{align*} L(\beta, \sigma^2 | y ) & = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( -\frac{\sum_{i=1}^n(y_i - \beta_1 z_{i1} - \ldots - \beta_p z_{ip})^2}{2\sigma^2} \right) \\[20pt] & = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( - \frac{ (y - Z \beta)^T \, (y - Z\beta) }{2 \sigma^2} \right) \\[20pt] & = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( - \frac{\mathop{\mathrm{RSS}}(\beta) }{2 \sigma^2} \right) \end{align*}
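As a small illustration (a sketch only; Z, y, beta and sigma2 stand for any compatible R objects, and the function names are my own), the \mathop{\mathrm{RSS}} and this likelihood can be evaluated in R as follows:
# RSS(beta) = (y - Z beta)^T (y - Z beta)
rss <- function(beta, Z, y) {
  r <- y - Z %*% beta          # residual vector
  sum(r^2)                     # same as t(r) %*% r
}
# Logarithm of the likelihood displayed above
loglik <- function(beta, sigma2, Z, y) {
  n <- length(y)
  - n / 2 * log(2 * pi) - n / 2 * log(sigma2) - rss(beta, Z, y) / (2 * sigma2)
}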
General linear regression of Y on Z_1, \ldots, Z_p is the function {\rm I\kern-.3em E}[Y | z_1,\ldots,z_p] = \beta_1 z_{1} + \ldots + \beta_p z_p
For each i=1, \ldots, n suppose given the observations (z_{i1}, \ldots, z_{ip}, y_i)
Denote by Y_i the random variable
Y | z_{i1} ,\ldots, z_{ip}
Y_i = \beta_1 z_{i1} + \ldots + \beta_p z_{ip} + \varepsilon_i
where the errors satisfy
\varepsilon_1,\ldots, \varepsilon_n \,\, \text{ iid } \,\, N(0,\sigma^2)
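As a concrete illustration (not from the lecture; the sample size, coefficients and noise level below are arbitrary choices), data from this model can be simulated in R as follows:
# Simulate from Y_i = beta_1 z_i1 + ... + beta_p z_ip + eps_i  (illustrative values)
set.seed(1)
n <- 100                               # number of observations
beta <- c(2, -1, 0.5)                  # true coefficients (p = 3)
sigma <- 0.3                           # true error standard deviation
Z <- matrix(rnorm(n * 3), nrow = n)    # predictor values z_ij
eps <- rnorm(n, mean = 0, sd = sigma)  # errors: iid N(0, sigma^2)
Y <- Z %*% beta + eps                  # simulated responses Y_i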
For each i = 1 , \ldots, n suppose given data points (z_{i1}, \ldots, z_{ip}, y_i)
Denote by Y_i the random variable Y | z_{i1}, \ldots, z_{ip}
{\rm I\kern-.3em E}[Y | z_{i1}, \ldots, z_{ip}] = {\rm I\kern-.3em E}[Y_i]
We look for parameters under which the model is compatible with the observed data, that is
\begin{equation} \tag{6} {\rm I\kern-.3em E}[Y_i] \ \approx \ y_i \,, \qquad \forall \, i = 1 , \ldots, n \end{equation}
Heuristically, condition (6) means choosing parameters which make the joint probability P(Y_1 \approx y_1, \ldots, Y_n \approx y_n) as large as possible
Recall that the likelihood for fixed y \in \mathbb{R}^n is defined as L(\beta, \sigma^2 | y) = f(y_1,\ldots,y_n)
Thus the joint probability is maximized if and only if the likelihood is maximized
Hence we need to find parameters \hat \beta, \hat \sigma which maximize the likelihood function
\max_{\beta,\sigma} \ L(\beta, \sigma^2 | y )
The proof of previous Theorem relies on the following Lemma
To prove the Lemma we need some auxiliary results
In the following x denotes an n \times 1 column vector
x = \left( \begin{array}{c} x_1 \\ x_2 \\ \vdots \\ x_n \end{array} \right)
Suppose given an n \times 1 column vector a
Define the scalar function
f \colon \mathbb{R}^n \to \mathbb{R}\,, \qquad f(x) := a^T x
\nabla f (x) = a^T
Suppose given an m \times n matrix A
Define the vectorial affine function
g \colon \mathbb{R}^n \to \mathbb{R}^m \,, \qquad g(x) := A x
J g (x) = A
Suppose given an n \times n matrix M
Define the quadratic form
h \colon \mathbb{R}^n \to \mathbb{R}\,, \qquad h(x) := x^T M x
J h(x) = x^T (M + M^T)
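As a quick sanity check of the quadratic-form rule, take the 2 \times 2 example
M = \left( \begin{array}{cc} 1 & 2 \\ 3 & 4 \end{array} \right) \,, \qquad h(x) = x^T M x = x_1^2 + 5 x_1 x_2 + 4 x_2^2
Differentiating directly gives \partial h / \partial x_1 = 2 x_1 + 5 x_2 and \partial h / \partial x_2 = 5 x_1 + 8 x_2, in agreement with
x^T (M + M^T) = (x_1 , x_2) \left( \begin{array}{cc} 2 & 5 \\ 5 & 8 \end{array} \right) = (2 x_1 + 5 x_2 \,, \ 5 x_1 + 8 x_2)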
For any m \times n matrix A (A^T)^T= A
Let M be a square n \times n matrix. Then M is symmetric iff M^T = M
Scalars can be seen as 1 \times 1 symmetric matrices
Therefore for any scalar c \in \mathbb{R} c^T = c
For any m \times n matrix A and n \times p matrix B (AB)^T=B^T A^T
The matrix A^T A is symmetric by point (2) (A^T A)^T = A^T (A^T)^T = A^T A
We also need the n-dimensional version of the Lemma in Slide 18 Lecture 8: if \nabla^2 f is positive semi-definite, then
f( \hat x ) = \min_{x \in \mathbb{R}^n} f(x)
if and only if
\nabla f (\hat x) = 0
\min_{\beta} \ \mathop{\mathrm{RSS}}(\beta)
\mathop{\mathrm{RSS}}(\beta) = (y - Z \beta)^T (y - Z \beta)
It is not too difficult to show that \nabla^2 \mathop{\mathrm{RSS}} is positive-semidefinite
The proof of this fact is left as an exercise
\beta \,\, \text{ minimizes } \,\, \mathop{\mathrm{RSS}}\qquad \iff \qquad \nabla \mathop{\mathrm{RSS}}(\beta) = 0
\begin{align*} \mathop{\mathrm{RSS}}(\beta) & = (y-Z \beta)^T (y-Z \beta) \\[5pt] & = (y^T - (Z\beta)^T ) (y-Z \beta) \\[5pt] & = (y^T - \beta^T Z^T ) (y-Z \beta) \\[5pt] & = y^T y - y^T Z \beta - \beta^T Z^T y + \beta^T Z^T Z \beta \end{align*}
\left( 1 \times p \right) \times \left( p \times n \right) \times \left( n \times 1 \right) = 1 \times 1
\left( \beta^T Z^T y \right)^T = \beta^T Z^T y
\left( \beta^T Z^T y \right)^T = y^T Z \beta
y^T Z \beta = \beta^T Z^T y
\begin{align*} \mathop{\mathrm{RSS}}(\beta) & = y^T y - y^T Z \beta - \textcolor{magenta}{\beta^T Z^T y} + \beta^T Z^T Z \beta \\[5pt] & = y^T y - y^T Z \beta - \textcolor{magenta}{y^T Z \beta} + \beta^T Z^T Z \beta \\[5pt] & = y^T y - 2 y^T Z \beta + \beta^T Z^T Z \beta \end{align*}
\begin{align*} \nabla \mathop{\mathrm{RSS}}(\beta) & = \nabla ( y^T y ) - 2 \nabla (y^T Z \beta ) + \nabla (\beta^T Z^T Z \beta ) \\[5pt] & = - 2 \nabla (y^T Z \beta ) + \nabla (\beta^T Z^T Z \beta ) \end{align*}
\nabla (y^T Z \beta ) \qquad \quad \text{ and } \qquad \quad \nabla (\beta^T Z^T Z \beta )
( 1 \times n ) \times ( n \times p ) = 1 \times p
f \colon \mathbb{R}^p \to \mathbb{R}\,, \qquad f(\beta) := y^T Z \beta
\nabla f = \nabla ( \textcolor{magenta}{y^T Z} \beta) = \textcolor{magenta}{y^T Z}
Note that Z^T Z has dimension ( p \times n ) \times ( n \times p ) = p \times p
Since \beta has dimension p \times 1 we can define the function
h \colon \mathbb{R}^p \to \mathbb{R}\,, \qquad h(\beta) := \beta^T Z^T Z \beta
\begin{align*} \nabla h & = \nabla (\beta^T \textcolor{magenta}{Z^T Z} \beta) = \beta^T (\textcolor{magenta}{Z^T Z} + (\textcolor{magenta}{Z^T Z})^T ) \\[5pt] & = \beta^T ( Z^T Z + Z^T Z ) = 2 \beta^T (Z^T Z) \end{align*}
\begin{align*} \nabla \mathop{\mathrm{RSS}}(\beta) & = - 2 \nabla (y^T Z \beta ) + \nabla (\beta^T Z^T Z \beta ) \\[5pt] & = -2 y^T Z + 2 \beta^T (Z^T Z) \end{align*}
\begin{align*} \nabla \mathop{\mathrm{RSS}}(\beta) = 0 \qquad & \iff \qquad -2 y^T Z + 2 \beta^T (Z^T Z) = 0 \\[5pt] & \iff \qquad \beta^T (Z^T Z) = y^T Z \end{align*}
\beta^T (Z^T Z) = y^T Z
(Z^T Z) \beta = Z^T y
\det (Z^T Z) \neq 0
(Z^T Z) \beta = Z^T y
\beta = (Z^T Z)^{-1} Z^T y
\nabla \mathop{\mathrm{RSS}}(\beta) = 0 \qquad \iff \qquad \beta = (Z^T Z)^{-1} Z^T y
\beta = (Z^T Z)^{-1} Z^T y \,\, \text{ minimizes } \,\, \mathop{\mathrm{RSS}}
We are finally in position to prove the main Theorem. We recall the statement
The \log function is strictly increasing
Therefore the problem \max_{\beta,\sigma} \ L(\beta, \sigma^2 | y ) is equivalent to \max_{\beta,\sigma} \ \log L( \beta, \sigma^2 | y )
L(\beta, \sigma^2 | y ) = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( - \frac{ \mathop{\mathrm{RSS}}(\beta) }{2\sigma^2} \right)
\log L(\beta, \sigma^2 | y ) = - \frac{n}{2} \log (2 \pi) - \frac{n}{2} \log \sigma^2 - \frac{ \mathop{\mathrm{RSS}}(\beta) }{2 \sigma^2}
Suppose \sigma is fixed. In this case the problem \max_{\beta} \ \left\{ - \frac{n}{2} \log (2 \pi) - \frac{n}{2} \log \sigma^2 - \frac{ \mathop{\mathrm{RSS}}(\beta) }{2 \sigma^2} \right\} is equivalent to \min_{\beta} \ \mathop{\mathrm{RSS}}(\beta)
We have already seen that the \mathop{\mathrm{RSS}} is minimized at \hat \beta = (Z^T Z)^{-1} Z^T y
\begin{align*} \max_{\beta,\sigma} \ & \log L(\beta, \sigma^2 | y ) = \max_{\sigma} \ \log L(\hat \beta, \sigma^2 | y ) \\[5pt] & = \max_{\sigma} \ \left\{ - \frac{n}{2} \log (2 \pi) - \frac{n}{2} \log \sigma^2 - \frac{ \mathop{\mathrm{RSS}}(\hat \beta) }{2 \sigma^2} \right\} \end{align*}
It can be shown that the unique solution to the above problem is \hat{\sigma}^2 = \frac{1}{n} \mathop{\mathrm{RSS}}(\hat \beta)
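For completeness, the computation behind this claim: differentiating the log-likelihood in \sigma^2 (with \beta = \hat \beta fixed) and setting the derivative to zero gives
\frac{\partial}{\partial \sigma^2} \log L(\hat \beta, \sigma^2 | y) = - \frac{n}{2 \sigma^2} + \frac{ \mathop{\mathrm{RSS}}(\hat \beta) }{2 \sigma^4} = 0 \qquad \iff \qquad \sigma^2 = \frac{1}{n} \mathop{\mathrm{RSS}}(\hat \beta)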
This concludes the proof
Y_i = \beta_1 z_{i1} + \ldots + \beta_p z_{ip} + \varepsilon_i \,, \qquad i = 1 , \ldots, n
Z = \left( \begin{array}{ccc} z_{11} & \ldots & z_{1p} \\ \ldots & \ldots & \ldots \\ z_{n1} & \ldots & z_{np} \\ \end{array} \right)
y = \left( \begin{array}{c} y_1 \\ \ldots \\ y_{n} \\ \end{array} \right) \qquad \quad \beta = \left( \begin{array}{c} \beta_1 \\ \ldots \\ \beta_{p} \\ \end{array} \right)
\hat \beta = (Z^T Z)^{-1} Z^T y
Y_i = \alpha + \beta x_i + \varepsilon_i \,, \qquad i = 1 , \ldots, n
x = \left( \begin{array}{c} x_1 \\ \ldots \\ x_{n} \\ \end{array} \right) \qquad \quad y = \left( \begin{array}{c} y_1 \\ \ldots \\ y_{n} \\ \end{array} \right)
\hat \alpha = \overline{y} - \hat \beta \, \overline{x} \,, \qquad \quad \hat \beta = \frac{S_{xy}}{S_{xx}}
The general linear regression model includes simple linear regression as a special case
To see this, consider a general linear model with p = 2 parameters
Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \varepsilon_i \,, \qquad i = 1 , \ldots, n
Y_i = \alpha + \beta x_i + \varepsilon_i \,, \qquad i = 1 , \ldots, n
z_{i1} = 1 \,, \qquad z_{i2} = x_i \,, \qquad \forall \, i = 1 , \ldots, n
Z = \left( \begin{array}{cc} 1 & x_{1} \\ \ldots & \ldots \\ 1 & x_{n} \\ \end{array} \right)
n \times p = n \times 2
Show that the multiple linear regression estimator (Z^T Z)^{-1} Z^T y recovers the simple linear regression estimators \hat \alpha = \overline{y} - \hat \beta \overline{x} \,, \qquad \quad \hat \beta = \frac{S_{xy}}{S_{xx}}
We expect (Z^T Z)^{-1} Z^T y to contain the estimators \hat \alpha, \hat \beta
(Z^T Z)^{-1} Z^T y = \left( \begin{array}{cc} \hat \alpha \\ \hat \beta \end{array} \right)
( 2 \times n ) \times (n \times 1) = 2 \times 1
\begin{align*} Z^T y & = \left( \begin{array}{cc} 1 & x_1 \\ \ldots & \ldots\\ 1 & x_n \end{array} \right)^T \left( \begin{array}{l} y_1 \\ \ldots\\ y_n \end{array}\right) \\[15pt] & = \left( \begin{array}{ccc} 1 & \ldots & 1 \\ x_1 & \ldots & x_n \end{array} \right) \left( \begin{array}{l} y_1 \\ \ldots\\ y_n \end{array}\right) \\[15pt] & = \left(\begin{array}{c} 1 \times y_1 + 1 \times y_2 + \ldots + 1 \times y_n \\ x_1 y_1 + x_2 y_2 + \ldots + x_n y_n \end{array} \right) \\[15pt] & = \left( \begin{array}{c} n \overline{y} \\ \sum_{i=1}^n x_i y_i \end{array} \right) \end{align*}
( 2 \times n ) \times (n \times 2) = 2 \times 2
\begin{align*} Z^T Z & = \left( \begin{array}{cc} 1 & x_1 \\ \ldots & \ldots\\ 1 & x_n \end{array} \right)^T \left( \begin{array}{cc} 1 & x_1 \\ \ldots & \ldots\\ 1 & x_n \end{array} \right) \\[15pt] & = \left( \begin{array}{ccc} 1 & \ldots & 1 \\ x_1 & \ldots & x_n \end{array} \right) \left( \begin{array}{cc} 1 & x_1 \\ \ldots & \ldots\\ 1 & x_n \end{array} \right) \\[15pt] & = \left( \begin{array}{cc} 1 \times 1 + \ldots + 1 \times 1 & 1 \times x_1 + \ldots + 1 \times x_n \\ x_1 \times 1 + \ldots + x_n \times 1 & x_1 \times x_1 + \ldots + x_n \times x_n \end{array} \right) \\[15pt] & = \left( \begin{array}{cc} n & n \overline{x} \\ n \overline{x} & \sum_{i=1}^n x_i^2 \end{array} \right) \end{align*}
M = \left( \begin{array}{cc} a & b \\ c & d \end{array} \right)
\det M = ad - bc \neq 0
M^{-1} = \frac{1}{ ad - bc} \left( \begin{array}{cc} d & -b \\ -c & a \end{array}\right)
\begin{align*} \det \left( Z^T Z \right) & = \det \left( \begin{array}{cc} n & n \overline{x} \\ n \overline{x} & \sum_{i=1}^n x_i^2 \end{array} \right) \\[15pt] & = n \sum_{i=1}^n x^2_i - n^2 \overline{x}^2 \\[15pt] & = n S_{xx} \end{align*}
\det \left( Z^T Z \right) = n S_{xx} \geq 0
S_{xx} = \sum_{i=1}^n (x_i - \overline{x})^2 = 0 \qquad \iff \qquad x_1 = \ldots = x_n = \overline{x}
Therefore we assume S_{xx} > 0
This way we have
\det \left( Z^T Z \right) = n S_{xx} > 0
\begin{align*} (Z^T Z)^{-1} & = \left( \begin{array}{cc} n & n \overline{x} \\ n \overline{x} & \sum_{i=1}^n x_i^2 \end{array} \right)^{-1} \\[15pt] & = \frac{1}{n S_{xx} } \left( \begin{array}{cc} \sum_{i=1}^n x^2_i & -n \overline{x}\\ -n\overline{x} & n \end{array} \right) \end{align*}
(Z^T Z)^{-1} has size 2 \times 2
Z^T y has size 2 \times 1
(Z^T Z)^{-1} Z^T y therefore has dimensions (2 \times 2) \times (2 \times 1) = (2 \times 1)
We expect (Z^T Z)^{-1} Z^T y to contain the simple regression estimators \hat \alpha, \hat \beta
(Z^T Z)^{-1} Z^T y = \left( \begin{array}{cc} \hat \alpha \\ \hat \beta \end{array} \right)
We have already computed (Z^T Z)^{-1} and Z^T y
Their product is
(Z^T Z)^{-1} Z^T y = \frac{1}{n S_{xx} } \left( \begin{array}{cc} \sum_{i=1}^n x^2_i & -n \overline{x}\\ -n\overline{x} & n \end{array} \right) \left( \begin{array}{c} n \overline{y} \\ \sum_{i=1}^n x_i y_i \end{array} \right)
\begin{align*} \frac{ \sum_{i=1}^n x_i^2 n \overline{y} - n \overline{x} \sum_{i=1}^n x_i y_i }{n S_{xx}} & = \frac{ \sum_{i=1}^n x_i^2 \overline{y} - \overline{x} \sum_{i=1}^n x_i y_i }{ S_{xx}} \\[10pt] & = \frac{ \sum_{i=1}^n x_i^2 \overline{y} \textcolor{magenta}{- \overline{y} n \overline{x}^2 + \overline{y} n \overline{x}^2} - \overline{x} \sum_{i=1}^n x_i y_i }{ S_{xx}} \\[10pt] & = \overline{y} \ \frac{ \sum_{i=1}^n x_i^2 - n \overline{x}^2}{ S_{xx} } - \overline{x} \ \frac{ \sum_{i=1}^n x_i y_i - n \overline{x} \overline{y} }{ S_{xx} } \\[10pt] & = \overline{y} \ \frac{ S_{xx} }{ S_{xx} } - \overline{x} \ \frac{ S_{xy} }{ S_{xx} } \\[10pt] & = \overline{y} - \hat \beta \, \overline{x} = \hat{\alpha} \end{align*}
\frac{ - n \overline{x} n \overline{y} + n \sum_{i=1}^n x_i y_i}{ n S_{xx} } = \frac{ \sum_{i=1}^n x_i y_i - n\overline{x} \overline{y} }{ S_{xx} } = \frac{ S_{xy} }{S_{xx}} = \hat{\beta}
(Z^T Z)^{-1} Z^T y = \left( \begin{array}{cc} \hat \alpha \\ \hat \beta \end{array} \right)
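The equivalence can also be checked numerically. A minimal R sketch with made-up data (any x and y vectors of equal length would do):
# Compare (Z^T Z)^{-1} Z^T y with the simple regression formulas (toy data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
Z <- cbind(1, x)                              # design matrix with intercept column
beta_hat <- solve(t(Z) %*% Z) %*% t(Z) %*% y  # (Z^T Z)^{-1} Z^T y
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
slope <- Sxy / Sxx                            # hat beta
intercept <- mean(y) - slope * mean(x)        # hat alpha
print(beta_hat)                               # matches c(intercept, slope)
print(c(intercept, slope))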
The response variable Y depends on p-1 predictors
X_2, \ldots, X_p
Note: We start counting from 2 for notational convenience
Assumption: For each i=1, \ldots, n we observe a data point (x_{i2}, \ldots, x_{ip}, y_i)
x_{ij} is an observation from X_j
y_i is an observation from Y
Denote by Y_i the random variable
Y | x_{i2} ,\ldots, x_{ip}
Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i
where the errors satisfy
\varepsilon_1,\ldots, \varepsilon_n \,\, \text{ iid } \,\, N(0,\sigma^2)
Multiple linear regression is a particular case of general linear regression
To see this, consider a general linear model with p parameters
Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \ldots + \beta_p z_{ip} + \varepsilon_i \,, \qquad i = 1 , \ldots, n
Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i
z_{i1} = 1 \,, \qquad z_{i2} = x_{i2} \,, \qquad \ldots \qquad z_{ip} = x_{ip} \qquad \forall \, i = 1 , \ldots, n
Z = \left( \begin{array}{cccc} 1 & x_{12} & \ldots & x_{1p} \\ 1 & x_{22} & \ldots & x_{2p} \\ \ldots & \ldots & \ldots & \ldots \\ 1 & x_{n2} & \ldots & x_{np} \\ \end{array} \right) \,, \qquad y = \left( \begin{array}{c} y_1 \\ \ldots \\ y_n \\ \end{array} \right)
\hat \beta = (Z^T Z)^{-1} Z^T y
Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p \, x_{ip} + \varepsilon_i
\hat \beta = (\hat \beta_1, \ldots, \hat \beta_p ) = (Z^T Z)^{-1} Z^T y
The fitted values (predictions) of the observed data y_1 , \ldots, y_n are defined as
\begin{align*} \hat y_i & := {\rm I\kern-.3em E}[ Y | x_{i2}, \ldots , x_{ip} ] \\ & = \hat \beta_1 + \hat \beta_2 \, x_{i2} + \ldots + \hat \beta_p \, x_{ip} \end{align*}
\mathop{\mathrm{RSS}}:= \sum_{i=1}^n ( y_i - \hat y_i )^2
\mathop{\mathrm{ESS}}:= \sum_{i=1}^n ( \hat y_i - \overline{y} )^2
\mathop{\mathrm{TSS}}:= \sum_{i=1}^n ( y_i - \overline{y} )^2
\mathop{\mathrm{TSS}} measures deviation of observed values from the grand mean
Hence \mathop{\mathrm{TSS}} measures variability of the observed data y_1 , \ldots, y_n
Suppose Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_{i} for errors \varepsilon_1 , \ldots, \varepsilon_n iid N(0,\sigma^2). Then
\mathop{\mathrm{TSS}}= \mathop{\mathrm{ESS}}+ \mathop{\mathrm{RSS}}
Proof: Left as an exercise
R^2 := \frac{ \mathop{\mathrm{ESS}}}{ \mathop{\mathrm{TSS}}} = \frac{ \sum_{i=1}^n ( \hat y_i - \overline{y} )^2 }{ \sum_{i=1}^n ( y_i - \overline{y} )^2 }
R^2 measures proportion of total data variability (\mathop{\mathrm{TSS}}) which is explained by the regression model (\mathop{\mathrm{ESS}})
By the Theorem we have
R^2 = \frac{ \mathop{\mathrm{ESS}}}{ \mathop{\mathrm{TSS}}} = 1 - \frac{ \mathop{\mathrm{RSS}}}{ \mathop{\mathrm{TSS}}}
0 \leq R^2 \leq 1
Moreover
R^2 = 1 \qquad \iff \qquad y_i = \hat y_i \quad \,\, \forall \, i = 1 , \ldots, n
\mathop{\mathrm{TSS}}, \, \mathop{\mathrm{ESS}}, \, \mathop{\mathrm{RSS}}\geq 0
R^2 := \frac{\mathop{\mathrm{ESS}}}{\mathop{\mathrm{TSS}}} \geq 0
R^2 = \frac{ \mathop{\mathrm{ESS}}}{ \mathop{\mathrm{TSS}}} = 1 - \frac{ \mathop{\mathrm{RSS}}}{ \mathop{\mathrm{TSS}}} \leq 1
\mathop{\mathrm{RSS}}= \sum_{i=1}^n ( y_i - \hat y_i )^2 = 0 \qquad \iff \qquad y_i = \hat y_i \quad \,\, \forall \, i = 1 , \ldots, n
\begin{align*} R^2 = 1 \quad & \iff \quad 1 - \frac{ \mathop{\mathrm{RSS}}}{ \mathop{\mathrm{TSS}}} = 1 \\[15pt] & \iff \quad \frac{ \mathop{\mathrm{RSS}}}{ \mathop{\mathrm{TSS}}} = 0 \\[15pt] & \iff \quad \mathop{\mathrm{RSS}}= 0 \\[15pt] & \iff \quad y_i = \hat y_i \quad \,\, \forall \, i = 1 , \ldots, n \end{align*}
We have shown that
0 \leq R^2 \leq 1
R^2 = 1 \quad \iff \quad y_i = \hat y_i \quad \,\, \forall \, i = 1 , \ldots, n
Conclusion: R^2 measures how well the model fits the data
Goal: Predict unemployment rates
Unemp. | Product. | Year |
---|---|---|
3.1 | 113 | 1950 |
1.9 | 123 | 1951 |
1.7 | 127 | 1952 |
1.6 | 138 | 1953 |
3.2 | 130 | 1954 |
2.7 | 146 | 1955 |
2.6 | 151 | 1956 |
2.9 | 152 | 1957 |
4.7 | 141 | 1958 |
3.8 | 159 | 1959 |
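Before plotting we store the table in three R vectors; the names unemployment, productivity and year match the ones used in the code below.
# Store the data from the table above
unemployment <- c(3.1, 1.9, 1.7, 1.6, 3.2, 2.7, 2.6, 2.9, 4.7, 3.8)
productivity <- c(113, 123, 127, 138, 130, 146, 151, 152, 141, 159)
year <- seq(1950, 1959, by = 1)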
We plot productivity and year against unemployment
# Set up the layout with two plot panels side by side
par(mfrow = c(1, 2))
# Scatter plot: Productivity vs Unemployment
plot(productivity, unemployment,
xlab = "Productivity", ylab = "Unemployment",
main = "Productivity vs Unemployment",
pch = 16)
# Scatter plot: Year vs Unemployment
plot(year, unemployment,
xlab = "Year", ylab = "Unemployment",
main = "Year vs Unemployment",
pch = 16)
To better understand the data we can draw a 3D scatter plot using either
the scatterplot3d package, installed with install.packages("scatterplot3d")
the plotly package, installed with install.packages("plotly")
Below we use plotly
# Load the plotly library
library(plotly)
# Create a scatter plot using plot_ly
plot_ly(x = productivity,
y = year,
z = unemployment,
type = "scatter3d",
mode = "markers",
marker = list(size = 5, color = "black")) %>%
layout(scene = list(xaxis = list(title = "Productivity"),
yaxis = list(title = "Year"),
zaxis = list(title = "Unemployment")),
width = 1000,
height = 500)
We saw that the data points (x_i, y_i, z_i) = ( \text{productivity}, \text{year}, \text{unemployment} ) roughly lie on a plane
Goal: Use multiple regression to model this plane
Multiple regression model is {\rm I\kern-.3em E}[Y | x_2, x_3] = \beta_1 + \beta_2 \, x_2 + \beta_3 \, x_3
We need to store the design matrix Z into R
Recall the relevant matrix operations in R:
the matrix product is %*%
the transpose of A is t(A)
the inverse of A is solve(A)
Unemp. | Product. | Year |
---|---|---|
y_i | x_{i2} | x_{i3} |
3.1 | 113 | 1950 |
1.9 | 123 | 1951 |
1.7 | 127 | 1952 |
1.6 | 138 | 1953 |
3.2 | 130 | 1954 |
2.7 | 146 | 1955 |
2.6 | 151 | 1956 |
2.9 | 152 | 1957 |
4.7 | 141 | 1958 |
3.8 | 159 | 1959 |
The design matrix is Z=\left(\begin{array}{cccc} 1 & 113 & 1950 \\ 1 & 123 & 1951 \\ 1 & 127 & 1952 \\ 1 & 138 & 1953 \\ 1 & 130 & 1954 \\ 1 & 146 & 1955 \\ 1 & 151 & 1956 \\ 1 & 152 & 1957 \\ 1 & 141 & 1958 \\ 1 & 159 & 1959 \\ \end{array}\right)
The data vector is y = \left(\begin{array}{c} 3.1 \\ 1.9 \\ 1.7 \\ 1.6 \\ 3.2 \\ 2.7 \\ 2.6 \\ 2.9 \\ 4.7 \\ 3.8 \\ \end{array}\right)
The data vector y is stored in R as usual
The design matrix Z is stored as a 10 \times 3 matrix
To compute \hat \beta = (Z^T Z)^{-1} Z^T y we use t(A), A %*% B and solve(A) as recalled above:
we store Z^T Z in m1, its inverse solve(m1) in m2, and then multiply with %*% to obtain the estimator, stored in beta (see the sketch below)
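A sketch of this computation, assuming the vectors unemployment, productivity and year stored earlier (object names m1, m2 and beta as above):
# Data vector and design matrix
y <- unemployment
Z <- cbind(1, productivity, year)   # 10 x 3 design matrix
# Compute beta_hat = (Z^T Z)^{-1} Z^T y step by step
m1 <- t(Z) %*% Z                    # Z^T Z
m2 <- solve(m1)                     # (Z^T Z)^{-1}
beta <- m2 %*% t(Z) %*% y           # the estimator
cat("The estimator is", beta)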
The estimator is -1271.75 -0.1033386 0.6594171
\begin{align*} {\rm I\kern-.3em E}[Y | x_2, x_3] & = \hat\beta_1 + \hat\beta_2 \, x_2 + \hat\beta_3 \, x_3 \\[15pt] & = -1271.75 - 0.1033386 \times x_2 + 0.6594171 \times x_3 \end{align*}
# Load the plotly library
library(plotly)
# Data
unemployment <- c(3.1, 1.9, 1.7, 1.6, 3.2, 2.7, 2.6, 2.9, 4.7, 3.8)
productivity <- c(113, 123, 127, 138, 130, 146, 151, 152, 141, 159)
year <- seq(1950, 1959, by = 1)
# Generate a grid of x and y values
x_grid <- seq(min(productivity), max(productivity), length.out = 50)
y_grid <- seq(min(year), max(year), length.out = 50)
grid <- expand.grid(productivity = x_grid, year = y_grid)
# Calculate z values for the regression plane
z_values <- -1271.75 - 0.1033386 * grid$productivity + 0.6594171 * grid$year
# Create a scatter plot using plot_ly
plot_ly(x = productivity,
y = year,
z = unemployment,
type = "scatter3d",
mode = "markers",
marker = list(size = 5, color = "black")) %>%
layout(scene = list(xaxis = list(title = "Productivity"),
yaxis = list(title = "Year"),
zaxis = list(title = "Unemployment")),
width = 1000,
height = 500) %>%
add_surface(
x = x_grid,
y = y_grid,
z = matrix(z_values, nrow = length(x_grid)),
opacity = 0.5,
colorscale = list(c(0, 'rgb(255,255,255)'), c(1, 'rgb(255,0,0)')))
How well does this multiple linear model fit our data?
To answer this question we compute the coefficient of determination
R^2 = \frac{ \mathop{\mathrm{ESS}}}{ \mathop{\mathrm{TSS}}} = \frac{ \sum_{i=1}^n ( \hat y_i - \overline{y} )^2 }{ \sum_{i=1}^n (y_i - \overline{y})^2 }
\mathop{\mathrm{TSS}}:= \sum_{i=1}^n (y_i - \overline{y})^2
\mathop{\mathrm{ESS}}:= \sum_{i=1}^n (\hat y_i - \overline{y})^2
\hat y_i := \hat \beta_1 + \hat \beta_2 x_{i2} + \hat \beta_3 x_{i3}
\hat y = \left(\begin{array}{c} \hat y_1 \\ \ldots \\ \hat y_i \\ \ldots \\ \hat y_n \\ \end{array}\right) \,, \qquad Z = \left(\begin{array}{ccc} 1 & x_{12} & x_{13} \\ \ldots & \ldots & \ldots \\ 1 & x_{i2} & x_{i3} \\ \ldots & \ldots & \ldots \\ 1 & x_{n2} & x_{n3} \\ \end{array}\right) \,, \qquad \hat \beta = \left(\begin{array}{c} \hat \beta_1 \\ \hat \beta_2 \\ \hat \beta_3 \\ \end{array}\right) \,, \qquad \text{so that} \quad \hat y = Z \hat \beta
Recall that \hat \beta is stored in beta
Therefore we can compute \hat y, and from it \mathop{\mathrm{TSS}} and \mathop{\mathrm{ESS}}, with the R commands sketched below
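A minimal sketch of these computations, assuming Z, y and beta are stored as in the previous code:
# Fitted values: y_hat = Z beta_hat
y_hat <- Z %*% beta
# Total and explained sums of squares
TSS <- sum((y - mean(y))^2)
ESS <- sum((y_hat - mean(y))^2)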
# Compute coefficient of determination R^2
R2 <- ESS / TSS
# Print ESS, RSS and R^2
cat("\nThe Estimated Squared Error ESS is", ESS)
cat("\nThe Total Squared Error TSS is", TSS)
cat("\nThe coefficient of determination R^2 is", R2)
The estimator is -1271.75 -0.1033386 0.6594171
The explained sum of squares ESS is 7.249764
The total sum of squares TSS is 8.376
The coefficient of determination R^2 is 0.8655401
We obtain R^2 = 0.8655401
The coefficient of determination is quite close to 1, showing that the multiple regression model fits the data well
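As a cross-check (not part of the original computation), the same fit and the same R^2 can be obtained with R's built-in lm function:
# Sanity check with the built-in linear model fit
fit <- lm(unemployment ~ productivity + year)
coef(fit)               # matches the estimator beta computed above
summary(fit)$r.squared  # matches R2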
The two-sample t-test is a particular case of general linear regression
This example is important for two reasons
We have 2 populations A and B distributed like N(\mu_A, \sigma_A^2) and N(\mu_B, \sigma_B^2)
We have two samples a_1 , \ldots, a_{n_A} from population A and b_1 , \ldots, b_{n_B} from population B
The two-sample t-test compares the means \mu_A and \mu_B by computing the t-statistic t = \frac{ \overline{a} - \overline{b} }{\mathop{\mathrm{e.s.e.}}}
We want a regression model that can describe the sample means \overline{a} and \overline{b}
To this end, we fit a general linear model with p = 2 to the data
{\rm I\kern-.3em E}[Y | Z_1 = z_1 , Z_2 = z_2 ] = \beta_1 z_1 + \beta_2 z_2
y = \left(\begin{array}{c} y_1 \\ \ldots \\ y_{n_A} \\ y_{n_A + 1}\\ \ldots \\ y_{n_A + n_B} \end{array}\right) = \left(\begin{array}{c} a_1 \\ \ldots \\ a_{n_A} \\ b_{1}\\ \ldots \\ b_{n_B} \end{array}\right)
Z_1 = 1_{A}(i) := \begin{cases} 1 & \text{ if i-th observation is from population A} \\ 0 & \text{ otherwise} \end{cases}
\begin{align*} z_1 & = (\underbrace{1_A(i) , \ldots, 1_A(i)}_{n_A + n_B} ) \\[15pt] & = (\underbrace{1 , \ldots, 1}_{n_A}, \underbrace{0 , \ldots, 0}_{n_B} ) \end{align*}
Z_2 = 1_{B}(i) := \begin{cases} 1 & \text{ if i-th observation is from population B} \\ 0 & \text{ otherwise} \end{cases}
\begin{align*} z_2 & = (\underbrace{1_B(i), \ldots, 1_B (i)}_{n_A + n_B} ) \\[15pt] & = (\underbrace{0 , \ldots, 0}_{n_A}, \underbrace{1 , \ldots, 1}_{n_B} ) \end{align*}
Y_i = \beta_1 \, 1_{A}(i) + \beta_2 \, 1_{B}(i) + \varepsilon_i
Variables 1_A and 1_B are called dummy variables (more on this later)
The design matrix is therefore
Z = (1_A | 1_B) = \left(\begin{array}{cc} 1_A(i) & 1_B (i) \\ \ldots & \ldots\\ 1_A(i) & 1_B (i) \\ \end{array}\right) = \left(\begin{array}{cc} 1 & 0 \\ \ldots & \ldots\\ 1 & 0\\ 0 & 1 \\ \ldots & \ldots\\ 0 & 1 \end{array}\right)
\hat \beta = (Z^T Z)^{-1} Z^T y
[ 2 \times (n_A + n_B)] \times [ (n_A + n_B) \times 1 ] = 2 \times 1
\begin{align*} Z^T y & = \left(\begin{array}{cc} 1 & 0 \\ \ldots & \ldots\\ 1 & 0\\ 0 & 1\\ \ldots & \ldots\\ 0 & 1 \end{array}\right)^T \left(\begin{array}{c} y_1 \\ \ldots\\ y_{n_A}\\ y_{n_A+1}\\ \ldots\\ y_{n_A+n_B} \end{array}\right) = \left(\begin{array}{cccccc} 1 & \ldots & 1 & 0 & \ldots & 0 \\ 0 & \ldots & 0 & 1 & \ldots & 1 \\ \end{array}\right) \left(\begin{array}{c} y_1 \\ \ldots\\ y_{n_A}\\ y_{n_A+1}\\ \ldots\\ y_{n_A+n_B} \end{array}\right) \\[35pt] & = \left(\begin{array}{c} y_1 \cdot 1 + \ldots + y_{n_A} \cdot 1 + y_{n_A + 1} \cdot 0 + \ldots + y_{n_A + n_B} \cdot 0 \\ y_1 \cdot 0 + \ldots + y_{n_A} \cdot 0 + y_{n_A + 1} \cdot 1 + \ldots + y_{n_A + n_B} \cdot 1 \end{array}\right) \\[20pt] & = \left(\begin{array}{c} y_1 + \ldots + y_{n_A} \\ y_{n_A + 1} + \ldots + y_{n_A + n_B} \end{array}\right) = \left(\begin{array}{c} a_1 + \ldots + a_{n_A} \\ b_{1} + \ldots + b_{n_B} \end{array}\right) = \left(\begin{array}{c} n_A \ \overline{a} \\ n_B \ \overline{b} \end{array}\right) \end{align*}
\begin{align*} Z^T Z & = \left(\begin{array}{cc} 1 & 0 \\ \ldots & \ldots\\ 1 & 0\\ 0 & 1\\ \ldots & \ldots\\ 0 & 1 \end{array}\right)^T \left(\begin{array}{cc} 1 & 0 \\ \ldots & \ldots\\ 1 & 0\\ 0 & 1\\ \ldots & \ldots\\ 0 & 1 \end{array}\right) \\[35pt] & = \left(\begin{array}{cccccc} 1 & \ldots & 1 & 0 & \ldots & 0 \\ 0 & \ldots & 0 & 1 & \ldots & 1 \\ \end{array}\right) \left(\begin{array}{cc} 1 & 0 \\ \ldots & \ldots\\ 1 & 0\\ 0 & 1\\ \ldots & \ldots\\ 0 & 1 \end{array}\right) = \left(\begin{array}{cc} n_A & 0 \\ 0 & n_B \end{array}\right) \end{align*}
(Z^T Z)^{-1} = \left(\begin{array}{cc} n_A & 0 \\ 0 & n_B \end{array}\right)^{-1} =\left(\begin{array}{cc} \frac{1}{n_A} & 0 \\ 0 & \frac{1}{n_B} \end{array}\right)
\hat \beta = (Z^T Z)^{-1} Z^Ty = \left(\begin{array}{cc} \frac{1}{n_A} & 0 \\ 0 & \frac{1}{n_B} \end{array}\right) \left(\begin{array}{c} n_A \ \overline{a} \\ n_B \ \overline{b} \end{array}\right) = \left(\begin{array}{c} \overline{a} \\ \overline{b} \end{array}\right)
\hat \beta = \left(\begin{array}{c} \overline{a} \\ \overline{b} \end{array}\right)
\begin{align*} {\rm I\kern-.3em E}[Y | 1_A = z_1 , 1_B = z_2] & = \hat\beta_1 \ z_1 + \hat\beta_2 \ z_2 \\[20pt] & = \overline{a} \ z_1 + \overline{b} \ z_2 \end{align*}
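A small numerical sketch of this fact (toy samples; the values and sample sizes are arbitrary): building the dummy design matrix and applying the general formula returns exactly the two sample means.
# Toy two-sample data
a <- c(5.1, 4.8, 5.6, 5.0)   # sample from population A
b <- c(6.2, 5.9, 6.5)        # sample from population B
y <- c(a, b)
# Dummy variables 1_A and 1_B as columns of the design matrix
Z <- cbind(c(rep(1, length(a)), rep(0, length(b))),
           c(rep(0, length(a)), rep(1, length(b))))
beta_hat <- solve(t(Z) %*% Z) %*% t(Z) %*% y
print(beta_hat)              # equals c(mean(a), mean(b))
print(c(mean(a), mean(b)))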