Lecture 8
10 patients are treated with Drug A and Drug B; for patient i, x_i denotes the reaction to Drug A and y_i the reaction to Drug B
i | x_i | y_i |
---|---|---|
1 | 1.9 | 0.7 |
2 | 0.8 | -1.0 |
3 | 1.1 | -0.2 |
4 | 0.1 | -1.2 |
5 | -0.1 | -0.1 |
6 | 4.4 | 3.4 |
7 | 4.6 | 0.0 |
8 | 1.6 | 0.8 |
9 | 5.5 | 3.7 |
10 | 3.4 | 2.0 |
Goal: Study the relation between the reaction x_i to Drug A and the reaction y_i to Drug B
Plot: Scatter plot of the data points (x_i, y_i)
Linear relation: The scatter plot suggests an approximately linear relation between x_i and y_i
Least Squares Line:
We would like the predicted and actual values to be close: \hat{y}_i \approx y_i
Hence the vertical difference has to be small: y_i - \hat{y}_i \approx 0
We want \hat{y}_i \approx y_i \,, \qquad \forall \, i
A way to ensure the above is by minimizing the residual sum of squares \min_{\alpha, \beta} \ \mathop{\mathrm{RSS}}(\alpha,\beta) \,, \qquad \quad \mathop{\mathrm{RSS}}(\alpha,\beta) := \sum_{i} (y_i - \hat{y}_i)^2 \,, \qquad \quad \hat{y}_i = \alpha + \beta x_i
Note: \mathop{\mathrm{RSS}} can be seen as a function \mathop{\mathrm{RSS}}\colon \mathbb{R}^2 \to \mathbb{R}\qquad \quad \mathop{\mathrm{RSS}}= \mathop{\mathrm{RSS}}(\alpha,\beta)
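As a quick illustration (this helper is not part of the lecture code), the RSS can be written in R as a function of (\alpha, \beta):

# Residual sum of squares as a function of (alpha, beta) for given data x, y
rss <- function(alpha, beta, x, y) {
  sum((y - alpha - beta * x)^2)
}

# Example: RSS of the line y = 0 + 1 * x on the drug data above
x <- c(1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 4.6, 1.6, 5.5, 3.4)
y <- c(0.7, -1.0, -0.2, -1.2, -0.1, 3.4, 0.0, 0.8, 3.7, 2.0)
rss(alpha = 0, beta = 1, x, y)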
Suppose we are given the sample (x_1,y_1), \ldots, (x_n, y_n)
We want to minimize the associated RSS \min_{\alpha,\beta} \ \mathop{\mathrm{RSS}}(\alpha,\beta) = \min_{\alpha,\beta} \ \sum_{i=1}^n (y_i-\alpha-{\beta}x_i)^2
To this end we define the following quantities
Sample Means: \overline{x} := \frac{1}{n} \sum_{i=1}^n x_i \qquad \quad \overline{y} := \frac{1}{n} \sum_{i=1}^n y_i
Sums of squares: S_{xx} := \sum_{i=1}^n ( x_i - \overline{x} )^2 \qquad \quad S_{yy} := \sum_{i=1}^n ( y_i - \overline{y} )^2 \qquad \quad S_{xy} := \sum_{i=1}^n ( x_i - \overline{x} )( y_i - \overline{y} )
Theorem (RSS Minimization): Let (x_1,y_1), \ldots, (x_n, y_n) be a set of n points. Consider the minimization problem \begin{equation} \tag{M} \min_{\alpha,\beta } \ \mathop{\mathrm{RSS}}= \min_{\alpha,\beta} \ \sum_{i=1}^n (y_i-\alpha-{\beta}x_i)^2 \end{equation} Then (M) is solved by \hat\alpha = \overline{y}- \hat\beta \, \overline{x} \,, \qquad \quad \hat\beta = \frac{S_{xy}}{ S_{xx} }
To prove the Theorem we need some background results
A symmetric matrix is positive semi-definite if all the eigenvalues \lambda_i satisfy \lambda_i \geq 0
Proposition: A 2 \times 2 symmetric matrix M is positive semi-definite iff \det M \geq 0 \,, \qquad \quad \operatorname{Tr}(M) \geq 0
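As a quick numerical illustration (not part of the lecture), the eigenvalue criterion and the determinant/trace criterion can be compared in R on an example matrix:

# Example symmetric 2x2 matrix (hypothetical choice)
M <- matrix(c(2, 1, 1, 3), nrow = 2)

# Eigenvalue criterion: positive semi-definite iff all eigenvalues are >= 0
eigen(M)$values

# Determinant / trace criterion from the Proposition
det(M) >= 0 & sum(diag(M)) >= 0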
Consider a twice differentiable function f \colon \mathbb{R}^2 \to \mathbb{R}\,, \qquad \quad f = f (x,y) with Hessian matrix \nabla^2 f = \left( \begin{array}{cc} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \\ \end{array} \right)
By the Proposition, \nabla^2 f is positive semi-definite iff \det \nabla^2 f = f_{xx} f_{yy} - f_{xy}^2 \geq 0 \qquad \quad \text{and} \qquad \quad f_{xx} + f_{yy} \geq 0
Moreover, \nabla^2 f \, \text{ is positive semi-definite} \qquad \iff \qquad f \, \text{ is convex}
Lemma: Suppose f \colon \mathbb{R}^2 \to \mathbb{R} has positive semi-definite Hessian. Then the following are equivalent:
The point (\hat{x},\hat{y}) is a minimizer of f, that is, f(\hat{x}, \hat{y}) = \min_{x,y} \ f(x,y)
The point (\hat{x},\hat{y}) satisfies the optimality conditions \nabla f (\hat{x},\hat{y}) = 0
Note: The proof of the above Lemma can be found in [1]
Example: f(x,y) = x^2 + y^2
\min_{x,y} \ f(x,y) = \min_{x,y} \ x^2 + y^2 = 0 with the only minimizer being (0,0)
\nabla f = (f_x,f_y) = (2x, 2y)
\nabla f = 0 \qquad \iff \qquad x = y = 0
\nabla^2 f = \left( \begin{array}{cc} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{array} \right) = \left( \begin{array}{cc} 2 & 0 \\ 0 & 2 \end{array} \right)
\det \nabla^2 f = 4 > 0 \qquad \qquad f_{xx} + f_{yy} = 4 > 0
Hence \nabla^2 f is positive semi-definite and, by the Lemma, (0,0) is indeed the minimizer: 0 = f(0,0) = \min_{x,y} \ f(x,y)
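The same conclusion can be checked numerically in R (an illustration, not part of the lecture code):

# Minimize f(x, y) = x^2 + y^2 numerically
f <- function(p) p[1]^2 + p[2]^2
optim(par = c(1, 1), fn = f)$par   # returns a point close to the minimizer (0, 0)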
We go back to proving the RSS Minimization Theorem
Suppose we are given data points (x_1,y_1), \ldots, (x_n, y_n)
We want to solve the minimization problem
\begin{equation} \tag{M} \min_{\alpha,\beta } \ \mathop{\mathrm{RSS}}= \min_{\alpha,\beta} \ \sum_{i=1}^n (y_i-\alpha-{\beta}x_i)^2 \end{equation}
To apply the Lemma we need to compute \nabla \mathop{\mathrm{RSS}}\quad \text{ and } \quad \nabla^2 \mathop{\mathrm{RSS}}
We first compute \nabla \mathop{\mathrm{RSS}} and solve the optimality conditions \nabla \mathop{\mathrm{RSS}}(\alpha,\beta) = 0
To this end, recall that \overline{x} := \frac{\sum_{i=1}^nx_i}{n} \qquad \implies \qquad \sum_{i=1}^n x_i = n \overline{x}
Similarly we have \sum_{i=1}^n y_i = n \overline{y}
\begin{align*} \mathop{\mathrm{RSS}}_{\alpha} & = -2\sum_{i=1}^n(y_i- \alpha- \beta x_i) \\[10pt] & = - 2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} \\[20pt] \mathop{\mathrm{RSS}}_{\beta} & = -2\sum_{i=1}^n x_i (y_i- \alpha - \beta x_i) \\[10pt] & = - 2 \sum_{i=1}^n x_i y_i + 2 \alpha n \overline{x} + 2 \beta \sum_{i=1}^n x_i^2 \end{align*}
The optimality conditions \nabla \mathop{\mathrm{RSS}}(\alpha,\beta) = 0 therefore read \begin{align} - 2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} & = 0 \tag{1} \\[20pt] - 2 \sum_{i=1}^n x_i y_i + 2 \alpha n \overline{x} + 2 \beta \sum_{i=1}^n x_i^2 & = 0 \tag{2} \end{align}
From Equation (1) we get -2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} = 0 \qquad \iff \qquad \alpha = \overline{y}- \beta \overline{x}
Dividing by -2, Equation (2) reads \sum_{i=1}^n x_i y_i - \alpha n \overline{x} - \beta \sum_{i=1}^n x^2_i = 0
Substituting \alpha = \overline{y}- \beta \overline{x} gives \sum_{i=1}^n x_i y_i - n \overline{x} \, \overline{y} = \beta \left( \sum_{i=1}^n x^2_i - n \overline{x}^2 \right) \,, \qquad \text{that is,} \qquad S_{xy} = \beta \, S_{xx} where we used S_{xy} = \sum_{i=1}^n x_i y_i - n \overline{x} \, \overline{y} and S_{xx} = \sum_{i=1}^n x_i^2 - n \overline{x}^2
Hence Equation (2) is equivalent to \beta = \frac{S_{xy}}{ S_{xx} } (assuming S_{xx} \neq 0, i.e., that the x_i are not all equal)
Also recall that Equation (1) is equivalent to \alpha = \overline{y}- \beta \overline{x}
Therefore (\hat\alpha, \hat\beta) solves the optimality conditions \nabla \mathop{\mathrm{RSS}}= 0 iff \hat\alpha = \overline{y}- \hat\beta \overline{x} \,, \qquad \quad \hat\beta = \frac{S_{xy}}{ S_{xx} }
We need to compute \nabla^2 \mathop{\mathrm{RSS}}
To this end recall that \mathop{\mathrm{RSS}}_{\alpha} = - 2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} \,, \quad \mathop{\mathrm{RSS}}_{\beta} = - 2 \sum_{i=1}^n x_i y_i + 2 \alpha n \overline{x} + 2 \beta \sum_{i=1}^n x_i^2
Therefore we have \begin{align*} \mathop{\mathrm{RSS}}_{\alpha \alpha} & = 2n \qquad & \mathop{\mathrm{RSS}}_{\alpha \beta} & = 2 n \overline{x} \\ \mathop{\mathrm{RSS}}_{\beta \alpha } & = 2 n \overline{x} \qquad & \mathop{\mathrm{RSS}}_{\beta \beta} & = 2 \sum_{i=1}^{n} x_i^2 \end{align*}
\begin{align*} \det \nabla^2 \mathop{\mathrm{RSS}}& = \mathop{\mathrm{RSS}}_{\alpha \alpha}\mathop{\mathrm{RSS}}_{\beta \beta} - \mathop{\mathrm{RSS}}_{\alpha \beta}^2 \\[10pt] & = 4n \sum_{i=1}^{n} x_i^2 - 4 n^2 \overline{x}^2 \\[10pt] & = 4n \left( \sum_{i=1}^{n} x_i^2 - n \overline{x}^2 \right) \\[10pt] & = 4n S_{xx} \end{align*}
S_{xx} = \sum_{i=1}^n (x_i - \overline{x})^2 = \sum_{i=1}^{n} x_i^2 - n \overline{x}^2 \geq 0
\det \nabla^2 \mathop{\mathrm{RSS}}= 4n S_{xx} \geq 0
\mathop{\mathrm{RSS}}_{\alpha \alpha} + \mathop{\mathrm{RSS}}_{\beta \beta} = 2n + 2 \sum_{i=1}^{n} x_i^2 \geq 0
Therefore we have proven \det \nabla^2 \mathop{\mathrm{RSS}}\geq 0 \,, \qquad \quad \mathop{\mathrm{RSS}}_{\alpha \alpha} + \mathop{\mathrm{RSS}}_{\beta \beta} \geq 0
As the Hessian is symmetric, we conclude that \nabla^2 \mathop{\mathrm{RSS}} is positive semi-definite
By the Lemma we have that all the solutions (\alpha,\beta) to the optimality conditions \nabla \mathop{\mathrm{RSS}}(\alpha,\beta) = 0 are minimizers
Therefore (\hat \alpha,\hat\beta) with \hat\alpha = \overline{y}- \hat\beta \overline{x} \,, \qquad \quad \hat\beta = \frac{S_{xy}}{ S_{xx} } is a minimizer of \mathop{\mathrm{RSS}}, ending the proof
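To make the result concrete, here is a small R check (not part of the lecture code) that the closed-form coefficients agree with a direct numerical minimization of the RSS on the drug data:

# Drug data from the table above
x <- c(1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 4.6, 1.6, 5.5, 3.4)
y <- c(0.7, -1.0, -0.2, -1.2, -0.1, 3.4, 0.0, 0.8, 3.7, 2.0)

# Closed-form least-squares coefficients
beta_hat  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha_hat <- mean(y) - beta_hat * mean(x)

# Direct numerical minimization of the RSS
rss <- function(p) sum((y - p[1] - p[2] * x)^2)
optim(par = c(0, 0), fn = rss)$par   # approximately (alpha_hat, beta_hat)
c(alpha_hat, beta_hat)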
The previous Theorem allows us to give the following definition
Definition: The least-squares line associated with the data (x_1,y_1), \ldots, (x_n, y_n) is the line y = \hat \alpha + \hat \beta x \,, \qquad \quad \hat\alpha = \overline{y}- \hat\beta \, \overline{x} \,, \qquad \hat\beta = \frac{S_{xy}}{ S_{xx} }
In R do the following:
- Input the data into a data-frame
- Plot the data points (x_i,y_i)
- Compute the least-squares line coefficients \hat{\alpha} = \overline{y} - \hat{\beta} \ \overline{x} \qquad \qquad \hat{\beta} = \frac{S_{xy}}{S_{xx}}
- Plot the least-squares line
We give a first solution using elementary R functions
The code to input the data into a data-frame is as follows
# Input blood pressure changes data into data-frame
changes <- data.frame(drug_A = c(1.9, 0.8, 1.1, 0.1, -0.1,
4.4, 4.6, 1.6, 5.5, 3.4),
drug_B = c(0.7, -1.0, -0.2, -1.2, -0.1,
3.4, 0.0, 0.8, 3.7, 2.0)
)
We then assign the data-frame columns drug_A and drug_B to the vectors x and y and compute the least-squares coefficients \hat{\alpha} and \hat{\beta}
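A minimal sketch of this step (a reconstruction rather than the original script, assuming the data-frame changes defined above) is:

# Extract the columns of the data-frame into vectors
x <- changes$drug_A
y <- changes$drug_B

# Sums of squares and least-squares coefficients
S_xy <- sum((x - mean(x)) * (y - mean(y)))
S_xx <- sum((x - mean(x))^2)
beta <- S_xy / S_xx
alpha <- mean(y) - beta * mean(x)

cat("Coefficient alpha =", alpha, "\n")
cat("Coefficient beta =", beta, "\n")

Running these lines prints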
Coefficient alpha = -0.7861478
Coefficient beta = 0.685042
# Plot the data
plot(x, y, xlab = "", ylab = "", pch = 16, cex = 2)
# Add labels
mtext("Drug A reaction x_i", side = 1, line = 3, cex = 2.1)
mtext("Drug B reaction y_i", side = 2, line = 2.5, cex = 2.1)
- pch = 16 plots points with black filled circles
- cex = 2 stands for character expansion and specifies the size of the points
- xlab = "" and ylab = "" add empty axis labels
- mtext is used to fine-tune the axis labels
- side = 1 stands for the x-axis
- side = 2 stands for the y-axis
- line specifies the distance of the label from the axis

# Compute least-squares line on grid
x_grid <- seq(from = -1, to = 6, by = 0.1)
y_grid <- beta * x_grid + alpha
# Plot the least-squares line
lines(x_grid, y_grid, col = "red", lwd = 3)
- col specifies the color of the plot
- lwd specifies the line width

The previous code can be downloaded here: least_squares_1.R
Running the code we obtain a scatter plot of the data with the least-squares line superimposed
The R command lm stands for linear model. We now use lm to fit the least-squares line.

The basic syntax of lm is

lm(formula, data)
- data expects a data-frame in input
- formula stands for the relation to fit

In the case of least-squares the formula is

formula = y ~ x

The symbol y ~ x can be read as "y is modelled as a function of x", where x and y are the names of two variables in the data-frame. To fit the least-squares line of y against x we therefore call

lm(y ~ x)
In our case the data-frame is changes and the variables are drug_A and drug_B. We store the fitted model in least_squares and inspect it with print(least_squares)
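A minimal sketch of the fitting step, consistent with the Call shown in the output below (a reconstruction rather than the original script):

least_squares <- lm(drug_B ~ drug_A, data = changes)
print(least_squares)

The output is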
Call:
lm(formula = drug_B ~ drug_A, data = changes)
Coefficients:
(Intercept) drug_A
-0.7861 0.6850
The Intercept coefficient is \hat\alpha and the drug_A coefficient is \hat\beta, hence \hat \alpha = -0.7861 \,, \qquad \quad \hat \beta = 0.6850 in agreement with the values computed earlier
To plot the data points we access the columns of the data-frame with changes$drug_A and changes$drug_B
The least-squares line is currently stored in least_squares
To add this line to the current plot we use abline
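A minimal sketch of this plotting step (a reconstruction, assuming the objects changes and least_squares defined above):

# Scatter plot of the data
plot(changes$drug_A, changes$drug_B,
     xlab = "Drug A reaction", ylab = "Drug B reaction",
     pch = 16, cex = 2)

# Add the fitted least-squares line to the plot
abline(least_squares, col = "red", lwd = 3)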
The previous code can be downloaded here: least_squares_2.R
Running the code we obtain a scatter plot of the data with the fitted least-squares line superimposed
Note: To learn the conditional distribution Y|X one would need the joint distribution of (X,Y)
Problem: The joint distribution of (X,Y) is unknown
Data: We have partial knowledge of (X,Y) in the form of the sample (x_1,y_1), \ldots, (x_n, y_n)
Goal: Use the data to learn Y|X
Least-Squares:
- Naive solution to the regression problem
- Find a line of best fit y = \hat \alpha + \hat \beta x
- Such a line explains the data, i.e., y_i \ \approx \ \hat \alpha + \hat \beta x_i

Drawbacks of least squares:
- Only predicts values of y such that (x,y) \, \in \, \text{ Line}
- Ignores that (x_i,y_i) comes from the joint distribution of (X,Y)
Linear Regression:
- Find a regression line R(x) = \alpha + \beta x
- R(x) predicts the most likely value of Y when X = x
- We will see that the regression line coincides with the line of best fit R(x) = \hat \alpha + \hat \beta x
- Hence regression gives statistical meaning to the line of best fit
Suppose we are given two random variables X and Y. The regression function of Y on X is R \colon \mathbb{R}\to \mathbb{R}\,, \qquad \quad R(x) := {\rm I\kern-.3em E}[Y | X = x]
Notation: We use the shorthand {\rm I\kern-.3em E}[Y|x] := {\rm I\kern-.3em E}[Y | X = x]
Assumption: Suppose we have n observations (x_1,y_1) \,, \ldots , (x_n, y_n)
The regression problem is difficult without prior knowledge of {\rm I\kern-.3em E}[Y | x]
A popular model is to assume that {\rm I\kern-.3em E}[Y | x] is linear, that is, {\rm I\kern-.3em E}[Y | x ] = \alpha + \beta x
\alpha and \beta are called regression coefficients
The above regression is called simple because only 2 variables are involved
Note: We said that the regression is linear if {\rm I\kern-.3em E}[Y | x ] = \alpha + \beta x. Here linearity is intended with respect to the parameters \alpha and \beta, not necessarily with respect to x
Examples:
Linear regression of Y on X^2 is {\rm I\kern-.3em E}[Y | x^2 ] = \alpha + \beta x^2
Linear regression of \log Y on 1/X is {\rm I\kern-.3em E}[ \log Y | x ] = \alpha + \beta \frac{1}{ x }
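In R such models can be fitted with the same lm syntax, wrapping the transformations in I() (an illustration with simulated data, not part of the lecture code):

# Simulated data (hypothetical) to illustrate the two example formulas
x <- runif(50, min = 1, max = 5)
y <- exp(1 + 0.5 / x + rnorm(50, sd = 0.1))

lm(y ~ I(x^2))        # linear regression of Y on X^2
lm(log(y) ~ I(1/x))   # linear regression of log(Y) on 1/X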
Suppose we have n observations (x_1,y_1) \,, \ldots , (x_n, y_n)
Assumptions:
1. Conditional response: Y_i denotes the conditional random variable Y | X = x_i, of which y_i is an observation
2. Normality: The distribution of Y_i is normal
3. Linear mean: There are parameters \alpha and \beta such that {\rm I\kern-.3em E}[Y_i] = \alpha + \beta x_i
4. Common variance (Homoscedasticity): There is a parameter \sigma^2 such that {\rm Var}[Y_i] = \sigma^2
5. Independence: The random variables Y_1 \,, \ldots \,, Y_n are independent
By Assumption 2 we have that Y_i is normal
By Assumptions 3 and 4 we have {\rm I\kern-.3em E}[Y_i] = \alpha + \beta x_i \,, \qquad \quad {\rm Var}[Y_i] = \sigma^2
Therefore Y_i \sim N(\alpha + \beta x_i, \sigma^2)
Define the errors \varepsilon_i := Y_i - (\alpha + \beta x_i)
By Assumption 5 we have that Y_1,\ldots,Y_n are independent
Therefore \varepsilon_1,\ldots,\varepsilon_n are independent
Since Y_i \sim N(\alpha + \beta x_i, \sigma^2) we conclude that
\varepsilon_i \sim N(0,\sigma^2)
Since Y_i \sim N( \alpha + \beta x_i , \sigma^2 ), its density is f_{Y_i} (y_i) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left( -\frac{(y_i-\alpha-\beta{x_i})^2}{2\sigma^2} \right)
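A short R simulation (an illustration with hypothetical parameter values, not part of the lecture) shows data generated under the above assumptions and the coefficients recovered with lm:

# Simulate data satisfying the model assumptions (hypothetical parameters)
set.seed(1)
n <- 200
alpha <- -0.8; beta <- 0.7; sigma <- 1
x <- runif(n, min = 0, max = 5)
y <- alpha + beta * x + rnorm(n, mean = 0, sd = sigma)   # Y_i ~ N(alpha + beta * x_i, sigma^2)

coef(lm(y ~ x))   # estimates close to the true (alpha, beta)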
Suppose the regression of Y on X is linear, that is, {\rm I\kern-.3em E}[Y | x] = \alpha + \beta x, and that we observe the data points (x_1,y_1) , \ldots , (x_n, y_n)

Under the above assumptions, the conditional random variable Y | X = x_i is modelled by Y_i = \alpha + \beta x_i + \varepsilon_i so that {\rm I\kern-.3em E}[Y | x_i] = {\rm I\kern-.3em E}[Y_i] = \alpha + \beta x_i

Since y_i is an observation of Y | x_i, we would like \begin{equation} \tag{4} {\rm I\kern-.3em E}[Y_i] \ \approx \ y_i \,, \qquad \forall \, i = 1 , \ldots, n \end{equation}

A way to achieve (4) is to maximize the probability P(Y_1 \approx y_1, \ldots, Y_n \approx y_n), that is, to maximize the likelihood of the observed data, which by independence is L(\alpha,\beta, \sigma^2 | y_1, \ldots, y_n ) = \prod_{i=1}^n f_{Y_i}(y_i) = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( - \frac{\sum_{i=1}^n(y_i-\alpha - \beta x_i)^2}{2\sigma^2} \right)

We are therefore led to the maximum likelihood problem \max_{\alpha,\beta,\sigma} \ L(\alpha,\beta, \sigma^2 | y_1, \ldots, y_n )

Theorem: The maximum likelihood problem is solved by \hat\alpha = \overline{y} - \hat\beta \, \overline{x} \,, \qquad \hat \beta = \frac{S_{xy}}{S_{xx}} \,, \qquad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \left( y_i - \hat \alpha - \hat \beta x_i \right)^2

Note: The coefficients \hat \alpha and \hat \beta are the same as those of the least-squares line!
Proof: The \log function is strictly increasing
Therefore the problem \max_{\alpha,\beta,\sigma} \ L(\alpha,\beta, \sigma^2 | y_1, \ldots, y_n ) is equivalent to \max_{\alpha,\beta,\sigma} \ \log L( \alpha,\beta, \sigma^2 | y_1, \ldots, y_n )
Recall that the likelihood is L(\alpha,\beta, \sigma^2 | y_1, \ldots, y_n ) = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( - \frac{\sum_{i=1}^n(y_i-\alpha - \beta x_i)^2}{2\sigma^2} \right)
Hence the log–likelihood is \log L(\alpha,\beta, \sigma^2 | y_1, \ldots, y_n ) = - \frac{n}{2} \log (2 \pi) - \frac{n}{2} \log \sigma^2 - \frac{ \sum_{i=1}^n(y_i-\alpha - \beta x_i)^2 }{2 \sigma^2}
Suppose \sigma is fixed. In this case the problem \max_{\alpha,\beta} \ \left\{ - \frac{n}{2} \log (2 \pi) - \frac{n}{2} \log \sigma^2 - \frac{ \sum_{i=1}^n(y_i-\alpha - \beta x_i)^2 }{2 \sigma^2} \right\} is equivalent to \min_{\alpha, \beta} \ \sum_{i=1}^n(y_i-\alpha - \beta x_i)^2
This is the least-squares problem! Hence the solution is \hat \alpha = \overline{y} - \hat \beta \, \overline{x} \,, \qquad \hat \beta = \frac{S_{xy}}{S_{xx}}
Substituting \hat \alpha and \hat \beta we obtain \begin{align*} \max_{\alpha,\beta,\sigma} \ & \log L(\alpha,\beta, \sigma^2 | y_1, \ldots, y_n ) = \max_{\sigma} \ \log L(\hat \alpha, \hat \beta, \sigma^2 | y_1, \ldots, y_n ) \\[10pt] & = \max_{\sigma} \ \left\{ - \frac{n}{2} \log (2 \pi) - \frac{n}{2} \log \sigma^2 - \frac{ \sum_{i=1}^n(y_i-\hat\alpha - \hat\beta x_i)^2 }{2 \sigma^2} \right\} \end{align*}
It can be shown that the unique solution to the above problem is \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \left( y_i - \hat \alpha - \hat \beta x_i \right)^2
This concludes the proof
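As a numerical sanity check (an illustration, not part of the lecture), the log-likelihood can be maximized directly in R and compared with the closed-form estimates:

# Drug data from the table above
x <- c(1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 4.6, 1.6, 5.5, 3.4)
y <- c(0.7, -1.0, -0.2, -1.2, -0.1, 3.4, 0.0, 0.8, 3.7, 2.0)
n <- length(y)

# Negative log-likelihood in the parametrization p = (alpha, beta, log(sigma))
negloglik <- function(p) {
  sigma2 <- exp(2 * p[3])
  0.5 * n * log(2 * pi * sigma2) + sum((y - p[1] - p[2] * x)^2) / (2 * sigma2)
}

fit <- optim(par = c(0, 0, 0), fn = negloglik)
fit$par[1:2]         # close to the least-squares coefficients (alpha_hat, beta_hat)
exp(2 * fit$par[3])  # close to sigma_hat^2 = RSS / n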
Linear regression and least-squares give seemingly the same answer
Least-squares line: y = \hat \alpha + \hat \beta x
Linear regression line: {\rm I\kern-.3em E}[Y | x ] = \hat \alpha + \hat \beta x
Question: Why did we define regression if it gives the same answer as least-squares?
Answer: There is actually a big difference