Statistical Models

Lecture 9: Practical regression
Outline of Lecture 9

  1. Plotting variables in R
    • Cross-check formal statistical results with graphical analyses
    • Important in practical research work


  2. Coefficient of determination R^2
    • R^2 measures proportion of variability in the data explained by the model
    • R^2 close to 1 \implies Model explains data well
    • Any R^2 larger than 0.3 is potentially interesting

Outline of Lecture 9

  3. t-test for simple regression
    • Test the significance of the slope parameter
  4. t-test for general regression
    • Test the significance of regression parameters
  5. F-test for multiple regression
    • Test the Overall Significance of the parameters
  6. Worked Example: The Longley dataset
    • Classic example of a highly collinear dataset

Part 1:
Plotting variables in R

Plotting variables in R

Interested in relationship between 2 variables

  • Want to plot the 2 variables together

  • Cross-check the results of a formal statistical analysis

  • Very important in real project work

Example: Stock and Gold prices

Dataset with 33 entries for Stock and Gold price pairs

    Stock Price  Gold Price
 1     3.230426    9.402434
 2     2.992937    8.987918
 3     2.194025   10.120387
 4     2.602475    9.367327
 5     2.963497    8.708742
 6     4.224242    8.494215
 7     7.433981    8.739684
 8     5.060836    8.609681
 9     3.903316    7.552746
10     4.260542    9.834538
11     3.469490    9.406448
12     2.948513   10.62240
13     3.354562   13.12062
14     3.930106   15.05097
15     3.693491   13.39932
16     3.076129   15.34968
17     2.934277   14.83910
18     2.658664   16.01850
19     2.450606   17.25952
20     2.489758   18.26270
21     2.591093   18.13104
22     2.520800   20.20052
23     2.471447   24.13767
24     2.062430   30.07695
25     1.805153   35.69485
26     1.673950   39.29658
27     1.620848   39.52317
28     1.547374   36.12564
29     1.721679   31.01106
30     1.974891   29.60810
31     2.168978   35.00593
32     2.277214   37.62929
33     2.993353   41.45828

Example: Stock and Gold prices

  • The data is stored in a .txt file

  • The file can be downloaded here stock_gold.txt

  • The text file has the following structure:
    • There is a header line
    • 1st column lists Stock Price
    • 2nd column lists Gold Price

Reading data into R

To read stock_gold.txt into R proceed as follows:

  1. Download stock_gold.txt and move file to Desktop

  2. Open the R Console and change working directory to Desktop

# In MacOS type
setwd("~/Desktop")

# In Windows type
setwd("C:/Users/YourUsername/Desktop")

Reading data into R

  3. Read stock_gold.txt into R and store it in data-frame prices with code
prices = read.table(file = "stock_gold.txt",
                    header = TRUE)


Note: We are telling read.table() that

  • stock_gold.txt has a header
  • Headers are optional
  • Headers are good practice to describe data

Reading data into R

  4. Double check that we loaded the correct data file
print(prices)
   stock_price gold_price
1     3.230426   9.402434
2     2.992937   8.987918
3     2.194025  10.120387
4     2.602475   9.367327
5     2.963497   8.708742
6     4.224242   8.494215
7     7.433981   8.739684
8     5.060836   8.609681
9     3.903316   7.552746
10    4.260542   9.834538
11    3.469490   9.406449
12    2.948513  10.622398
13    3.354562  13.120620
14    3.930106  15.050968
15    3.693491  13.399324
16    3.076129  15.349677
17    2.934277  14.839097
18    2.658664  16.018502
19    2.450606  17.259515
20    2.489758  18.262699
21    2.591093  18.131039
22    2.520801  20.200525
23    2.471447  24.137667
24    2.062430  30.076947
25    1.805153  35.694847
26    1.673950  39.296579
27    1.620848  39.523171
28    1.547374  36.125635
29    1.721679  31.011062
30    1.974891  29.608098
31    2.168978  35.005929
32    2.277215  37.629288
33    2.993353  41.458284

Store data into vectors

  • We now store Stock and Gold prices in 2 vectors
    • Stock prices are in 1st column of prices
    • Gold prices are in 2nd column of prices
stock.price <- prices[ , 1]
gold.price <- prices[ , 2]


  • Alternatively the same can be achieved with
stock.price <- prices$stock_price
gold.price <- prices$gold_price

Plot Stock Price vs Gold Price

plot(stock.price, 
     gold.price, 
     xlab = "Stock Price", 
     ylab = "Gold Price",
     pch = 16
    )

Plot Stock Price vs Gold Price

  • xlab and ylab specify the axis labels

  • pch specifies the plotting symbol

  • Axis limits can be set with

    • xlim = c(lower, upper)
    • ylim = c(lower, upper)
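
For example, a minimal sketch (the axis limits below are chosen purely for illustration):

# Same plot with explicit axis limits (values chosen for illustration)
plot(stock.price, 
     gold.price, 
     xlab = "Stock Price", 
     ylab = "Gold Price",
     pch = 16,
     xlim = c(0, 8),
     ylim = c(0, 45)
    )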

Examining the graph

  • Graph suggests that the 2 variables are negatively correlated

  • Need to cross-check with the results of a formal statistical regression analysis

Part 2:
Coefficient of
determination R^2

Coefficient of determination R^2

  • R^2 is defined as

R^2 = 1 - \frac{ \mathop{\mathrm{RSS}}}{ \mathop{\mathrm{TSS}}}

  • R^2 measures proportion of variability in the data explained by the model

  • Recall that R^2 \leq 1

Important
  • R^2 is automatically computed by R when using lm
  • High values of R^2 are better! (that is, values of R^2 close to 1)

Some observations about R^2

Warning
R^2 increases as more X variables are added to a regression model

This is not necessarily good

  • One can add lots of variables and make R^2 \approx 1

  • This way the model explains the data really well y_i \approx \hat y_i \,, \quad \forall \,\, i = 1 , \ldots, n

  • Problem: the model will not make good predictions on new data

  • This is known as overfitting and it should be avoided
    (more on this later)

Some observations about R^2

  • For multiple regression R^2 lies between 0 and 1

    • R^2 = 0 model explains nothing
    • R^2 = 1 model explains everything
  • Warning: R^2 can be negative for general linear regression (no intercept)

  • Generally: the higher the value of R^2 the better the model

    • Textbook examples often have high values R^2 \geq 0.7
    • Example: In the Unemployment example of Lecture 8 we found R^2 = 0.8655401

Some observations about R^2

Important
In practice values R^2 \geq 0.3 imply there is a nontrivial amount of variation in the data explained by the model

Example: In the Stock Price Vs Gold Price example we have R^2 = 0.395325

  • This shows that Stock Price affects Gold Price
  • Since R^2 is not too large, also other factors may affect Gold Price

Regression in R

  • The basic R command used to run regression is

lm(formula)


  • lm stands for linear model

Simple linear regression in R

For simple linear regression

Y_i = \alpha + \beta x_i + \varepsilon_i

the command is

lm(y ~ x)


  • Symbol y ~ x reads as y modelled as function of x

  • y is vector containing the data y_1, \ldots, y_n

  • x is vector containing the data x_1, \ldots, x_n

Multiple linear regression in R

For multiple linear regression

Y_i = \beta_1 + \beta_2 \, x_{i2} + \ldots + \beta_p \, x_{ip} + \varepsilon_i

the command is

lm(y ~ x2 + x3 + ... + xp)


  • y is vector containing the data y_1, \ldots, y_n

  • xj is vector containing the data x_{1j}, \ldots , x_{nj}

General commands for regression in R

The best way to run regression is

  1. Run the regression analysis and store the results in a variable
fit.model <- lm(formula)


  2. Use command summary to read output of regression
summary(fit.model)

Note: If you are running the code from a .R file, you need to print the output

print( summary(fit.model) )

Example: Stock and Gold prices

  • Stock price is stored in vector
    • stock.price
  • Gold price is stored in vector
    • gold.price
  • We want to fit the simple linear model \text{gold.price } = \alpha + \beta \, \times ( \text{ stock.price }) + \text{ error}
# Fit simple linear regression model
fit.model <- lm(gold.price ~ stock.price)

# Print result to screen
summary(fit.model)

  • There is a lot of information here!

  • We will make sense of most of it in this lecture

Interesting parts of Output

  • The MLEs are listed under \texttt{Estimate}
    • (Intercept) refers to the coefficient \hat \alpha \qquad \implies \qquad \hat \alpha = 37.917
    • stock.price refers to the coefficient \hat \beta \qquad \implies \qquad \hat \beta = - 6.169

Interesting parts of Output

  • Other quantities of interest are:
Coefficient R^2:               \texttt{Multiple R-squared: 0.3953}
t-statistic for stock.price:   \texttt{stock.price t-value: -4.502}
F-statistic:                   \texttt{F-statistic: 20.27}
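
These quantities can also be extracted from the fitted model programmatically; a short sketch, assuming the fit.model object defined above:

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef(summary(fit.model))

# Coefficient of determination R^2
summary(fit.model)$r.squared

# F-statistic and its degrees of freedom
summary(fit.model)$fstatistic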

Plotting the regression line

# Data stored in stock.price 
# and gold.price
# Plot the data
plot(stock.price, gold.price, 
     xlab = "Stock Price", 
     ylab= "Gold Price",
     pch = 16)

# Model stored in fit.model
# Plot the regression line
abline(fit.model, 
       col = "red", 
       lwd = 3)

Conclusion

  • We fit a simple linear model to Stock Price Vs Gold Price

  • We obtained the regression line

{\rm I\kern-.3em E}[Y | x] = \hat \alpha + \hat \beta x = 37.917 - 6.169 \times x

  • The coefficient of determination is

R^2 = 0.395325 \geq 0.3

  • Hence the linear model explains the data to a reasonable extent:
    • Stock Price affects Gold Price
    • Since R^2 is not too large, also other factors may affect Gold Price

t-test and F-test for regression

  • From lm we also obtained
t-statistic for stock.price \texttt{stock.price t-value: -4.502}
F-statistic \texttt{F-statistic: 20.27}


  • The t-statistic and F-statistic for regression are mathematically DIFFICULT topics

  • In the next two parts we explain what they mean

    • We however omit mathematical details
    • If interested check out Section 11.3 of [1] and Chapter 11 of [2]

Part 3:
t-test for simple regression

t-test for simple regression

  • Consider the simple linear regression model

Y_i = \alpha + \beta X_i + \varepsilon_i

  • We have that

X \,\, \text{ affects } \,\, Y \qquad \iff \qquad \beta \neq 0

  • The estimate \hat \beta is a random quantity which depends on the sample

  • To see if X affects Y, we can test the hypothesis

\begin{align*} H_0 \colon & \beta = 0 \\ H_1 \colon & \beta \neq 0 \end{align*}

Construction of t-test

  • More in general, consider the hypothesis

\begin{align*} H_0 \colon & \beta = b \\ H_1 \colon & \beta \neq b \end{align*}

  • Our best guess for \beta is the ML Estimator \hat \beta = \frac{ S_{xy} }{ S_{xx} }

  • To test above hypotheses, we therefore need to

    • Know the distribution of \hat \beta
    • Construct t-statistic involving \hat \beta

Distribution of \hat \beta

Theorem
Consider the linear regression model

Y_i = \alpha + \beta x_i + \varepsilon_i \,, \qquad \varepsilon_i \,\, \text{ iid } \,\, N(0, \sigma^2)

The MLE \hat{\beta} is normally distributed

\hat \beta \sim N \left(\beta , \frac{ \sigma^2 }{ S_{xx} } \right)

Proof: Quite difficult. If interested see Theorem 11.3.3 in [1]

Construction of t-statistic for \hat \beta

  • From the previous Theorem, we know that \hat \beta \sim N \left(\beta , \frac{ \sigma^2 }{ S_{xx} } \right)

  • In particular \hat \beta is an unbiased estimator for \beta {\rm I\kern-.3em E}[ \hat \beta ] = \beta

  • This means \hat \beta is the Estimate for the unknown parameter \beta

  • The t-statistic is therefore t = \frac{\text{Estimate } - \text{ Hypothesised Value}}{\mathop{\mathrm{e.s.e.}}} = \frac{ \hat \beta - \beta }{ \mathop{\mathrm{e.s.e.}}}

Estimated Standard Error for \hat \beta

  • From the Theorem on the distribution of \hat \beta, we know that {\rm Var}[\hat \beta] = \frac{ \sigma^2 }{ S_{xx}} \quad \implies \quad {\rm SD}[\hat \beta] = \frac{ \sigma }{ \sqrt{S_{xx}} }

  • The standard deviation {\rm SD} cannot be used as error, since \sigma^2 is unknown
    (Recall that \sigma^2 is the unknown variance of the error \varepsilon_i)

  • We however have an estimate for \sigma^2 \hat \sigma^2 = \frac{1}{n} \mathop{\mathrm{RSS}}= \frac1n \sum_{i=1}^n (y_i - \hat y_i)^2

  • \hat \sigma^2 was obtained from maximization of the likelihood function (Lecture 8)

  • It can be shown that {\rm I\kern-.3em E}[ \hat\sigma^2 ] = \frac{n-2}{n} \, \sigma^2 (for a proof, see Section 11.3.4 in [1])

  • Therefore \hat\sigma^2 is not unbiased estimator of \sigma^2

  • To obtain an unbiased estimator, we rescale \hat\sigma^2 and introduce S^2 S^2 := \frac{n}{n-2} \, \hat\sigma^2 = \frac{\mathop{\mathrm{RSS}}}{n-2}

  • This way, S^2 is unbiased estimator for \sigma^2 {\rm I\kern-.3em E}[S^2] = \frac{n}{n-2} \, {\rm I\kern-.3em E}[\hat\sigma^2] = \frac{n}{n-2} \,\cdot \, \frac{n-2}{n} \, \cdot \, \sigma^2 = \sigma^2

  • Recall that the standard deviation of \hat \beta is

{\rm SD}[\hat \beta] = \frac{ \sigma }{ \sqrt{S_{xx}} }

  • We replace the unknown \sigma with its unbiased estimator S

  • This gives the estimated standard error

\mathop{\mathrm{e.s.e.}}:= \frac{S}{\sqrt{S_{xx}}} \,, \qquad S = \sqrt{\frac{\mathop{\mathrm{RSS}}}{n-2}}

t-statistic for testing \hat \beta

The t-statistic for \hat \beta is then

t = \frac{\text{Estimate } - \text{ Hypothesised Value}}{\mathop{\mathrm{e.s.e.}}} = \frac{ \hat \beta - \beta }{ S / \sqrt{S_{xx}} }

Theorem
Consider the linear regression model

Y_i = \alpha + \beta x_i + \varepsilon_i \,, \qquad \varepsilon_i \,\, \text{ iid } \,\, N(0, \sigma^2)

The t-statistic for \hat{\beta} follows a t-distribution

t = \frac{ \hat \beta - \beta }{ S / \sqrt{S_{xx}} } \, \sim \, t_{n-2}

How to prove the Theorem

  • Proof of this Theorem is quite difficult and we omit it

  • If you are interested in the proof, see Section 11.3.4 in [1]

  • The main idea is that t-statistic can be rewritten as

t = \frac{ \hat \beta - \beta }{ S / \sqrt{S_{xx}} } = \frac{ U }{ \sqrt{ V/(n-2) } }

  • Here we defined

U := \frac{ \hat \beta - \beta }{ \sigma / \sqrt{S_{xx}} } \,, \qquad \quad V := \frac{ (n-2) S^2 }{ \sigma^2 }

  • We know that \hat \beta \sim N \left(\beta , \frac{ \sigma^2 }{ S_{xx} } \right)

  • Therefore U = \frac{ \hat \beta - \beta }{ \sigma / \sqrt{S_{xx}} } \, \sim \, N(0,1)

  • Moreover, it can be shown that V = \frac{(n-2) S^2}{\sigma^2} \, \sim \, \chi_{n-2}^2

  • It can also be shown that U and V are independent

  • In summary, we have

t = \frac{ \hat \beta - \beta }{ S / \sqrt{S_{xx}} } = \frac{ U }{ \sqrt{ V/(n-2) } }

  • U and V are independent, with

U \sim N(0,1) \,, \qquad \quad V \sim \chi_{n-2}^2

  • From the Theorem on t-distribution in Lecture 2, we conclude

t \sim t_{n-2}

The t-test for \beta

Assumption: Given data points (x_1, y_1), \ldots, (x_n,y_n), consider the simple linear regression model Y_i = \alpha + \beta x_i + \varepsilon_i \,, \qquad \varepsilon_i \,\, \text{ iid } \,\, N(0,\sigma^2)

Goal: Statistical inference on the slope \beta

Hypotheses: If b is a guess for \beta, the two-sided hypothesis is H_0 \colon \beta = b \,, \quad \qquad H_1 \colon \beta \neq b The one-sided alternative hypotheses are H_1 \colon \beta < b \quad \text{ or } \quad H_1 \colon \beta > b

Procedure: 3 Steps

  1. Calculation: Compute the MLE \hat{\alpha}, \hat{\beta} and the predictions \hat{y}_i \hat \beta = \frac{ S_{xy} }{ S_{xx} } \,, \qquad \hat \alpha = \overline{y} - \hat \beta \overline{x} \,, \qquad \hat{y}_i = \hat{\alpha} + \hat{\beta} x_i Compute the Residual Sum of Squares and the estimator S^2 \mathop{\mathrm{RSS}}= \sum_{i=1}^n (y_i - \hat{y}_i)^2 \,, \qquad S^2 := \frac{\mathop{\mathrm{RSS}}}{n-2} Finally, compute the \mathop{\mathrm{e.s.e.}} and t-statistic \mathop{\mathrm{e.s.e.}}= \frac{S}{\sqrt{S_{xx}} } \,, \qquad t = \frac{\hat \beta - b }{ \mathop{\mathrm{e.s.e.}}} \ \sim \ t_{n-2}

  2. Statistical Tables or R: Find either
    • Critical value t^* in Table 1
    • p-value in R


  3. Interpretation: Reject H_0 when either p < 0.05 \qquad \text{ or } \qquad t \in \,\,\text{Rejection Region} \qquad \qquad \qquad \qquad (T \, \sim \, t_{n-2})
Alternative Rejection Region t^* p-value
\beta \neq b |t| > t^* t_{n-2}(0.025) 2P(T > |t|)
\beta < b t < - t^* t_{n-2}(0.05) P(T < t)
\beta > b t > t^* t_{n-2}(0.05) P(T > t)

Worked Example: Stock and Gold prices

  • Recall that

    • X = Stock Price
    • Y = Gold Price
  • Goal: Test if Stock Price affects Gold Price at level 0.05

  • Procedure: Consider the linear model Y_i = \alpha + \beta x_i + \varepsilon_i \,, \quad \varepsilon_i \,\, \text{ iid } \,\, N(0,\sigma^2) Test the hypotheses with two-sided alternative \begin{align*} H_0 & \colon \beta = 0 \\ H_1 & \colon \beta \neq 0 \end{align*}

Testing for \beta = 0

  • Recall that Stock Prices and Gold Prices are stored in R vectors
    • stock.price
    • gold.price
  • Fit the simple linear model with the following commands

\text{gold.price } = \alpha + \beta \, \times (\text{ stock.price }) + \text{ error}

# Fit simple linear regression model
fit.model <- lm(gold.price ~ stock.price)

# Print result to screen
summary(fit.model)

Output: \mathop{\mathrm{e.s.e.}}, t-statistic, p-value

  • \texttt{(Intercept)}: 1st row of the table contains information related to \hat \alpha

  • \texttt{stock.price}: 2nd row of the table contains information related to \hat \beta

  • For larger models, there will be additional rows below the 2nd

    • These will be informative about additional regression parameters

Output: \mathop{\mathrm{e.s.e.}}, t-statistic, p-value

  • The 2nd row of the table has to be interpreted as follows
\texttt{Estimate} \text{The value of } \hat \beta
\texttt{Std. Error} Estimated standard error \mathop{\mathrm{e.s.e.}} for \hat \beta
\texttt{t value} t-statistic \, t = \dfrac{\hat \beta - 0 }{\mathop{\mathrm{e.s.e.}}}
\texttt{Pr(>|t|)} p-value \, p = 2 P( t_{n-2} > |t| )
\texttt{***}, \, \texttt{**}, \, \texttt{*}, \, \texttt{.} Significance codes (how small the p-value is) – more stars means stronger evidence against H_0

Output: \mathop{\mathrm{e.s.e.}}, t-statistic, p-value

  • The above table then gives

\hat \beta = -6.169 \,, \qquad \mathop{\mathrm{e.s.e.}}= 1.37 \, , \qquad t = - 4.502 \,, \qquad p = 8.9 \times 10^{-5} \

  • The t-statistic computed by R can also be computed by hand

t = \frac{\hat \beta - 0}{ \mathop{\mathrm{e.s.e.}}} = \frac{-6.169}{1.37} = -4.502

Output: \mathop{\mathrm{e.s.e.}}, t-statistic, p-value

  • The p-value computed by R is p = 8.9 \times 10^{-5}

  • We can also compute it by hand: The degrees of freedom are

n = \text{No. of data points} = 33 \,, \qquad \text{df} = n - 2 = 31

t <- - 4.502 
p <- 2 - 2 * pt(abs(t), df = 31)

cat("The p-value is", p)
The p-value is 8.901425e-05

Output: \mathop{\mathrm{e.s.e.}}, t-statistic, p-value

  • Alternatively, we can find the critical value in Table 1

n = \text{No. of data points} = 33 \,, \qquad \text{df} = n - 2 = 31

  • For two-sided test, the critical value is \, t_{31}(0.025) = 2.040

|t| = 4.502 > 2.040 = t_{31}(0.025) \quad \implies \quad p < 0.05
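
The tabulated critical value can also be checked in R with qt:

# Critical value t_31(0.025) for the two-sided test
qt(0.975, df = 31)
# approximately 2.040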

Output: \mathop{\mathrm{e.s.e.}}, t-statistic, p-value

Interpretation: \, p is very small (hence the \, \texttt{***} \, rating)

  • Therefore, we reject the null hypothesis H_0, and the real parameter is \beta \neq 0

  • Since \beta \neq 0, we have that Stock Prices affect Gold Prices

  • The best estimate for \beta is the MLE \hat \beta = -6.169

  • \hat \beta < 0 and statistically significant:
    \,\, As Stock Prices increase Gold Prices decrease

Warning

The t-statistic and p-value in the summary refer to the two-sided test H_0 \colon \beta = 0 \,, \qquad H_1 \colon \beta \neq 0 In such case the t-statistic and p-value given in summary are t = \frac{\hat \beta - 0}{\mathop{\mathrm{e.s.e.}}} \,, \qquad p = 2P(t_{n- 2} > |t|)

If instead you are required to test H_0 \colon \beta = b \,, \qquad H_1 \colon \beta \neq b \,, \quad \beta < b\,, \quad \text{ or } \,\, \beta > b

  1. The t-statistic has to be computed by hand t = \frac{\hat \beta - b}{\mathop{\mathrm{e.s.e.}}}
    • \, \mathop{\mathrm{e.s.e.}} is in the 2nd row under \,\, \texttt{Std. Error}
  2. The p-value must be computed by hand, according to the table
Alternative p-value R command
\beta \neq b 2P(t_{n-2} > |t|) 2 - 2 * pt(abs(t), df = n - 2)
\beta < b P(t_{n-2} < t) pt(t, df = n - 2)
\beta > b P(t_{n-2} > t) 1 - pt(t, df = n - 2)

Part 4:
t-test for general regression

Recall: The t-test for simple regression

  • In the previous Part, we have considered the simple linear regression model

Y_i = \alpha + \beta x_i + \varepsilon_i \,, \qquad \varepsilon_i \,\, \text{ iid } \,\, N(0,\sigma^2)

  • We have that

X \,\, \text{ affects } \,\, Y \qquad \iff \qquad \beta \neq 0

  • To see if X affects Y, we have developed a t-test for the hypothesis

\begin{align*} H_0 \colon & \beta = b \\ H_1 \colon & \beta \neq b \end{align*}

The t-test for general regression

  • Now, consider the general linear regression model

Y_i = \beta_1 z_{i1} + \ldots + \beta_p z_{ip} + \varepsilon_i \,, \qquad \varepsilon_i \, \text{ iid } \, N(0, \sigma^2)

  • We have that

Z_j \,\, \text{ affects } \,\, Y \qquad \iff \qquad \beta_j \neq 0

  • To see if Z_j affects Y we can test the hypothesis

\begin{align*} H_0 \colon & \beta_j = 0 \\ H_1 \colon & \beta_j \neq 0 \end{align*}

Construction of t-test

  • More in general, consider the hypothesis

\begin{align*} H_0 \colon & \beta_j = b_j \\ H_1 \colon & \beta_j \neq b_j \end{align*}

  • Our best guess for \beta_j is the ML Estimator \hat{\beta}_j = (\hat \beta)_j \,, \qquad \hat \beta = (Z^T Z)^{-1} Z^T y (\hat{\beta}_j is the j-th component of the vector \hat \beta)

  • To test above hypotheses, we therefore need to

    • Know the distribution of \hat{\beta}_j
    • Construct t-statistic involving \hat{\beta}_j

Distribution of \hat \beta_j

Theorem
Consider the general linear regression model Y_i = \beta_1 z_{i1} + \ldots + \beta_p z_{ip} + \varepsilon_i \,, \qquad \varepsilon_i \, \text{ iid } \, N(0, \sigma^2) The MLE \hat{\beta}_j is normally distributed \hat \beta_j \sim N \left( \beta_j , \xi_{jj} \sigma^2 \right) \,, where the numbers \xi_{jj} are the diagonal entries of the p \times p matrix (Z^T Z)^{-1} = \left( \begin{array}{ccc} \xi_{11} & \ldots & \xi_{1p} \\ \ldots & \ldots & \ldots \\ \xi_{p1} & \ldots & \xi_{pp} \\ \end{array} \right)

Proof: Quite difficult. If interested, see Section 11.5 in [2]

Construction of the t-statistic for \hat{\beta}_j

  • From the previous Theorem, we know that \hat \beta_j \sim N \left( \beta_j , \xi_{jj} \sigma^2 \right)

  • In particular, \hat{\beta}_j is an unbiased estimator for \beta_j {\rm I\kern-.3em E}[ \hat{\beta}_j ] = \beta_j

  • This means \hat{\beta}_j is the Estimate for the unknown parameter \beta_j

  • The t-statistic is therefore t = \frac{\text{Estimate } - \text{ Hypothesised Value}}{\mathop{\mathrm{e.s.e.}}} = \frac{ \hat{\beta}_j - \beta_j }{ \mathop{\mathrm{e.s.e.}}}

Estimated Standard Error for \hat{\beta}_j

  • From the Theorem, we know that {\rm Var}[\hat{\beta}_j] = \xi_{jj} \, \sigma^2 \qquad \implies \qquad {\rm SD}[\hat \beta_j] = \xi_{jj}^{1/2} \, \sigma

  • The standard deviation {\rm SD} cannot be used as error, since \sigma^2 is unknown

  • We however have an estimate for \sigma^2 \hat \sigma^2 = \frac{1}{n} \mathop{\mathrm{RSS}}= \frac1n \sum_{i=1}^n (y_i - \hat y_i)^2

  • \hat \sigma^2 was obtained from maximization of the likelihood function (Lecture 8)

  • It can be shown that (see Section 11.5 in [2]) {\rm I\kern-.3em E}[ \hat\sigma^2 ] = \frac{n-p}{n} \, \sigma^2

  • Therefore \hat\sigma^2 is not unbiased estimator of \sigma^2

  • To obtain unbiased estimator, we rescale \hat\sigma^2 and introduce S^2 S^2 := \frac{n}{n-p} \, \hat\sigma^2 = \frac{\mathop{\mathrm{RSS}}}{n-p}

  • This way, S^2 is unbiased estimator for \sigma^2

{\rm I\kern-.3em E}[S^2] = \frac{n}{n-p} \, {\rm I\kern-.3em E}[\hat\sigma^2] = \frac{n}{n-p} \, \frac{n-p}{n} \, \sigma^2 = \sigma^2

  • Recall that the standard deviation of \hat{\beta}_j is

{\rm SD}[\hat{\beta}_j] = \xi_{jj}^{1/2} \, \sigma

  • We replace the unknown \sigma with its unbiased estimator S

  • This gives the estimated standard error

\mathop{\mathrm{e.s.e.}}=\xi_{jj}^{1/2} \, S \,, \qquad S = \sqrt{\frac{\mathop{\mathrm{RSS}}}{n-p}}

t-statistic for testing \beta_j

Theorem
Consider the general linear regression model

Y_i = \beta_1 z_{i1} + \ldots +\beta_p z_{ip} + \varepsilon_i \,, \qquad \varepsilon_i \,\, \text{ iid } \,\, N(0, \sigma^2)

The t-statistic for \hat{\beta}_j follows a t-distribution

t = \frac{\text{Estimate } - \text{ Hypothesised Value}}{\mathop{\mathrm{e.s.e.}}} = \frac{ \hat \beta_j - \beta_j }{ \xi_{jj}^{1/2} \, S} \, \sim \, t_{n-p}

Proof: See section 11.5 in [2]

The t-test for \beta_j

Assumption: Given data points (z_{i1}, \ldots, z_{ip}, y_i), consider the general model Y_i = \beta_1 z_{i1} + \ldots + \beta_p z_{ip} + \varepsilon_i \,, \qquad \varepsilon_i \, \text{ iid } \, N(0,\sigma^2)

Goal: Statistical inference on the coefficients \beta_j

Hypotheses: If b_j is a guess for \beta_j, the two-sided hypothesis is H_0 \colon \beta_j = b_j \,, \quad \qquad H_1 \colon \beta_j \neq b_j The one-sided alternative hypotheses are H_1 \colon \beta_j < b_j \quad \text{ or } \quad H_1 \colon \beta_j > b_j

Procedure: 3 Steps

  1. Calculation: Write down the design matrix Z and compute (Z^TZ)^{-1} Z := \left( \begin{array}{ccc} z_{11} & \ldots & z_{1p} \\ \ldots & \ldots & \ldots \\ z_{n1} & \ldots & z_{np} \\ \end{array} \right) \,, \qquad (Z^T Z)^{-1} = \left( \begin{array}{ccc} \xi_{11} & \ldots & \xi_{1p} \\ \ldots & \ldots & \ldots \\ \xi_{p1} & \ldots & \xi_{pp} \\ \end{array} \right) Compute the MLE \hat{\beta}, predictions \hat{y}, RSS and S^2 \hat \beta = (Z^TZ)^{-1} Z^T y \,, \qquad \hat{y} = Z \hat{\beta} \,, \qquad \mathop{\mathrm{RSS}}= \sum_{i=1}^n (y_i - \hat{y}_i)^2 \,, \qquad S^2 := \frac{\mathop{\mathrm{RSS}}}{n-p} Finally, compute the \mathop{\mathrm{e.s.e.}} and the t-statistic \mathop{\mathrm{e.s.e.}}= \xi_{jj}^{1/2} \, S\,, \qquad t = \frac{\hat \beta_j - b_j }{ \mathop{\mathrm{e.s.e.}}} \ \sim \ t_{n-p}

  2. Statistical Tables or R: Find either
    • Critical value t^* in Table 1
    • p-value in R


  3. Interpretation: Reject H_0 when either p < 0.05 \qquad \text{ or } \qquad t \in \,\,\text{Rejection Region} \qquad \qquad \qquad \qquad (T \, \sim \, t_{n-p})
Alternative Rejection Region t^* p-value
\beta_j \neq b_j |t| > t^* t_{n-p}(0.025) 2P(T > |t|)
\beta_j < b_j t < - t^* t_{n-p}(0.05) P(T < t)
\beta_j > b_j t > t^* t_{n-p}(0.05) P(T > t)

The t-test for \beta_j in R

  • First, fit general regression model with lm

  • Then read the summary

Assume the hypotheses to test are H_0 \colon \beta_j = 0 \,, \qquad H_1 \colon \beta_j \neq 0 In such case, the t-statistic and p-value are explicitly given in the summary t = \frac{\hat \beta_j - 0}{\mathop{\mathrm{e.s.e.}}} \,, \qquad p = 2P(t_{n- p} > |t|)

  • Read the t-statistic in j-th variable row under \,\, \texttt{t value}

  • Read the p-value in j-th variable row under \,\, \texttt{Pr(>|t|)}

If instead you are required to test H_0 \colon \beta_j = b_j \,, \qquad H_1 \colon \beta_j \neq b_j \,, \quad \beta_j < b_j \,, \quad \text{ or } \,\, \beta_j > b_j

  1. The t-statistic has to be computed by hand t = \frac{\hat{\beta}_j - b_j}{\mathop{\mathrm{e.s.e.}}}
    • \,\, \hat \beta_j is in j-th variable row under \,\, \texttt{Estimate}
    • \,\, \mathop{\mathrm{e.s.e.}} for \hat \beta_j is in j-th variable row under \,\, \texttt{Std. Error}
  2. The p-value must be computed by hand, according to the table
Alternative p-value R command
\beta_j \neq b_j 2P(t_{n-p} > |t|) 2 - 2 * pt(abs(t), df = n - p)
\beta_j < b_j P(t_{n-p} < t) pt(t, df = n - p)
\beta_j > b_j P(t_{n-p} > t) 1 - pt(t, df = n - p)
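
As an illustration of this recipe, a sketch only: it reuses the simple-regression fit.model from earlier, and the hypothesised value b = -5 is chosen purely for illustration; the same steps apply row-by-row to any model fitted with lm.

# t-test of H0: beta = -5 (two-sided) by hand, from the coefficient table
tab <- coef(summary(fit.model))             # Estimate, Std. Error, t value, Pr(>|t|)
est <- tab["stock.price", "Estimate"]
ese <- tab["stock.price", "Std. Error"]

b <- -5                                     # hypothesised value (illustrative)
t <- (est - b) / ese
p <- 2 * (1 - pt(abs(t), df = df.residual(fit.model)))

c(t = t, p.value = p)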

Worked Example: Stock and Gold prices

  • Recall that

    • X = Stock Price
    • Y = Gold Price
  • Goal: Test if an intercept is present, at level 0.05

  • Procedure: Consider the linear model Y_i = \alpha + \beta x_i + \varepsilon_i \,, \quad \varepsilon_i \,\, \text{ iid } \,\, N(0,\sigma^2) Test the hypotheses with two-sided alternative \begin{align*} H_0 & \colon \alpha = 0 \\ H_1 & \colon \alpha \neq 0 \end{align*}

Testing for \alpha= 0

  • Recall that Stock Prices and Gold Prices are stored in R vectors
    • stock.price
    • gold.price
  • Fit the simple linear model with the following commands

\text{gold.price } = \alpha + \beta \, \times (\text{ stock.price }) + \text{ error}

# Fit simple linear regression model
fit.model <- lm(gold.price ~ stock.price)

# Print result to screen
summary(fit.model)

Output: \mathop{\mathrm{e.s.e.}}, t-statistic, p-value

  • \texttt{(Intercept)}: 1st row of the table contains information related to \hat \alpha
\texttt{Estimate} \text{The value of } \hat \alpha
\texttt{Std. Error} Estimated standard error \mathop{\mathrm{e.s.e.}} for \hat \alpha
\texttt{t value} t-statistic \, t = \dfrac{\hat \alpha - 0 }{\mathop{\mathrm{e.s.e.}}}
\texttt{Pr(>|t|)} p-value \, p = 2 P( t_{n-2} > |t| )
\texttt{***}, \, \texttt{**}, \, \texttt{*}, \, \texttt{.} Significance codes (how small the p-value is) – more stars means stronger evidence against H_0

Output: \mathop{\mathrm{e.s.e.}}, t-statistic, p-value

  • The above table then gives

\hat \alpha = 37.917 \,, \qquad \mathop{\mathrm{e.s.e.}}= 4.336 \, , \qquad t = 8.744 \,, \qquad p = 7.14 \times 10^{-10} \

  • The t-statistic computed by R can also be computed by hand

t = \frac{\hat \alpha - 0}{ \mathop{\mathrm{e.s.e.}}} = \frac{37.917}{4.336} = 8.744

Output: \mathop{\mathrm{e.s.e.}}, t-statistic, p-value

  • The p-value computed by R is p = 7.14 \times 10^{-10}

  • We can also compute it by hand: The degrees of freedom are

n = \text{No. of data points} = 33 \,, \qquad \text{df} = n - 2 = 31

t <- 8.744
p <- 2 - 2 * pt(abs(t), df = 31)

cat("The p-value is", p)
The p-value is 7.135255e-10

Output: \mathop{\mathrm{e.s.e.}}, t-statistic, p-value

  • Alternatively, we can find the critical value in Table 1

n = \text{No. of data points} = 33 \,, \qquad \text{df} = n - 2 = 31

  • For two-sided test, the critical value is \, t_{31}(0.025) = 2.040

|t| = 8.744 > 2.040 = t_{31}(0.025) \quad \implies \quad p < 0.05

Output: \mathop{\mathrm{e.s.e.}}, t-statistic, p-value

Interpretation: \, p is very small (hence the \, \texttt{***} \, rating)

  • Therefore, we reject the null hypothesis H_0, and the real parameter is \alpha \neq 0

  • Since \alpha \neq 0, we deduce that the model has to include an intercept

  • The best estimate for \alpha is the MLE \hat \alpha = 37.917

Theoretical Example

Deriving \mathop{\mathrm{e.s.e.}} for simple linear regression

  • Consider the simple regression model

Y_i = \alpha + \beta x_i + \varepsilon_i

  • We have seen in the previous part that the \mathop{\mathrm{e.s.e.}} for \hat{\beta} is

\mathop{\mathrm{e.s.e.}}= \frac{S}{\sqrt{S_{xx}}} \,, \qquad S = \sqrt{ \frac{\mathop{\mathrm{RSS}}}{n-2} }

  • However, we have not computed the \mathop{\mathrm{e.s.e.}} for \hat{\alpha}

Goal: Compute \mathop{\mathrm{e.s.e.}} for \hat{\alpha} using the theory developed for general regression

Theory developed so far:

  • Consider the general linear model Y_i = \beta_1 z_{i1} + \ldots +\beta_p z_{ip} + \varepsilon_i \,, \qquad \varepsilon_i \,\, \text{ iid } \,\, N(0, \sigma^2)

  • We have shown that the estimated standard error for \hat{\beta}_j is \mathop{\mathrm{e.s.e.}}= \xi_{jj}^{1/2} \, S \,, \qquad S = \sqrt{ \frac{\mathop{\mathrm{RSS}}}{n-p} } where \xi_{jj} are the diagonal entries of the matrix (Z^T Z)^{-1} = \left( \begin{array}{ccc} \xi_{11} & \ldots & \xi_{1p} \\ \ldots & \ldots & \ldots \\ \xi_{p1} & \ldots & \xi_{pp} \\ \end{array} \right)

  • Let us go back to the linear regression model

Y_i = \alpha + \beta x_i + \varepsilon_i

  • The design matrix is

Z = \left( \begin{array}{cc} 1 & x_1 \\ \ldots & \ldots \\ 1 & x_n \\ \end{array} \right)

  • We have computed in Lecture 8 that

(Z^T Z)^{-1} = \frac{1}{n S_{xx} } \left( \begin{array}{cc} \sum_{i=1}^n x^2_i & -n \overline{x}\\ -n\overline{x} & n \end{array} \right)

  • Hence the \mathop{\mathrm{e.s.e.}} for \hat \alpha and \hat \beta are

\begin{align*} \mathop{\mathrm{e.s.e.}}(\hat \alpha) & = \xi_{11}^{1/2} \, S = \sqrt{ \frac{ \sum_{i=1}^n x_i^2 }{ n S_{xx} } } \, S \\[7pt] \mathop{\mathrm{e.s.e.}}(\hat \beta) & = \xi_{22}^{1/2} \, S = \frac{ S }{ \sqrt{ S_{xx} } } \end{align*}

  • Note: \, \mathop{\mathrm{e.s.e.}}(\hat\beta) coincides with the \mathop{\mathrm{e.s.e.}} derived in Part 3 for simple regression
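
A minimal R sketch of this computation (assuming the stock.price and gold.price vectors from earlier); the resulting values can be compared with the Std. Error column of summary(fit.model):

# e.s.e. of alpha.hat and beta.hat via the design matrix,
# using the stock.price and gold.price vectors loaded earlier
x <- stock.price
y <- gold.price
n <- length(y)

Z <- cbind(1, x)                    # design matrix: intercept column and x
ZtZ.inv <- solve(t(Z) %*% Z)        # (Z^T Z)^{-1}

beta.hat <- ZtZ.inv %*% t(Z) %*% y  # MLE (alpha.hat, beta.hat)
RSS <- sum( (y - Z %*% beta.hat)^2 )
S   <- sqrt( RSS / (n - 2) )        # here p = 2 parameters

sqrt(diag(ZtZ.inv)) * S             # e.s.e. of alpha.hat and beta.hat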

Part 5:
F-test for multiple regression

F-test for Overall Significance

  • Want to test the Overall Significance of the model

Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i

  • This means answering the question:

\text{ Does at least one } X_i \text{ affect } Y \text{ ?}

  • How to do this?
    • Could perform a sequence of t-tests on the \beta_j
    • For statistical reasons this is not really desirable
    • To assess overall significance we can perform F-test

F-test for Overall Significance

The F-test for Overall Significance has 3 steps:

  1. Define a larger Full Model (with more parameters)

  2. Define a smaller nested Reduced Model (with fewer parameters)

  3. Use an F-statistic to decide between larger or smaller model

Overall Significance for multiple regression

  • Model 1 is the smaller Reduced Model
  • Model 2 is the larger Full Model

\begin{align*} \textbf{Model 1:} & \quad Y_i = \beta_1 + \varepsilon_i \\[15pt] \textbf{Model 2:} & \quad Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i \end{align*}

  • Choosing the Full Model 2 is equivalent to rejecting H_0

\begin{align*} H_0 & \colon \, \beta_2 = \beta_3 = \ldots = \beta_p = 0 \\ H_1 & \colon \text{ At least one of the } \beta_i \text{ is non-zero} \end{align*}

Overall Significance for simple regression

  • Model 1 is the smaller Reduced Model
  • Model 2 is the larger Full Model

\begin{align*} \textbf{Model 1:} & \quad Y_i = \alpha + \varepsilon_i \\[15pt] \textbf{Model 2:} & \quad Y_i = \alpha + \beta x_i + \varepsilon_i \end{align*}

  • Choosing the Full Model 2 is equivalent to rejecting H_0

\begin{align*} H_0 & \colon \beta = 0 \\ H_1 & \colon \beta \neq 0 \end{align*}

RSS for the Full Model

  • Consider the Full Model with p parameters

\textbf{Model 2:} \quad Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i

  • Denote by \hat \beta the MLE for the Full Model. Therefore, predictions are

\hat y_i := \hat{\beta}_1 + \hat{\beta}_2 x_{i2} + \ldots + \hat{\beta}_p x_{ip} \,, \quad \forall \, i = 1 , \ldots, n \,,

  • Define the Residual Sum of Squares for the Full Model

\mathop{\mathrm{RSS}}(p) := \sum_{i=1}^n (y_i - \hat y_i)^2

RSS for the Reduced Model

  • Consider now the Reduced Model

\textbf{Model 1:} \quad Y_i = \beta_1 + \varepsilon_i

  • Denote by \hat{\beta}_1 the MLE for the Reduced Model. Therefore, predictions are

\hat y_i = \hat{\beta}_1 \,, \quad \forall \, i = 1 , \ldots, n

  • Define the Residual Sum of Squares for the Reduced Model

\mathop{\mathrm{RSS}}(1) := \sum_{i=1}^n (y_i - \hat{\beta}_1)^2

Property of the RSS

  • Recall that \mathop{\mathrm{RSS}} is defined via minimization \mathop{\mathrm{RSS}}(k) := \min_{\beta_1 , \ldots , \beta_k} \ \sum_{i=1}^n ( y_i - \hat y_i)^2 \,, \qquad \hat y_i := \beta_1 + \beta_2 x_{i2} + \ldots + \beta_k x_{ik} with minimum achieved when \beta coincides with the MLE \hat \beta = (Z^TZ)^{-1}Z^Ty

  • Therefore \mathop{\mathrm{RSS}} cannot increase if we add parameters to the model

\mathop{\mathrm{RSS}}(1) \geq \mathop{\mathrm{RSS}}(p)

  • In particular, we have that

\frac{\mathop{\mathrm{RSS}}(1) - \mathop{\mathrm{RSS}}(p)}{\mathop{\mathrm{RSS}}(p)} \geq 0

Deciding between the two models

The hypothesis for deciding between the two models is \begin{align*} H_0 & \colon \, \beta_2 = \beta_3 = \ldots = \beta_p = 0 \\ H_1 & \colon \text{ At least one of the } \beta_i \text{ is non-zero} \end{align*}

  • Suppose the null hypothesis H_0 holds

  • In this case, the Reduced and Full Models are the same

  • Therefore, predictions of Full and Reduced Model will be similar

  • Hence, the \mathop{\mathrm{RSS}} for the 2 models are similar

\mathop{\mathrm{RSS}}(1) \, \approx \, \mathop{\mathrm{RSS}}(p)

Deciding between the two models

The hypothesis for deciding between the two models is \begin{align*} H_0 & \colon \, \beta_2 = \beta_3 = \ldots = \beta_p = 0 \\ H_1 & \colon \text{ At least one of the } \beta_i \text{ is non-zero} \end{align*}

  • Suppose instead that the alternative hypothesis H_1 holds

  • In this case the predictions of Full and Reduced Model will be different

  • As already noted, in general it holds that \mathop{\mathrm{RSS}}(1) \geq \mathop{\mathrm{RSS}}(p)

  • From H_1, we know that some of the extra parameters \beta_2 , \ldots, \beta_p are non zero

  • Thus, Full Model will give better predictions \implies \mathop{\mathrm{RSS}}(p) is much smaller \mathop{\mathrm{RSS}}(1) \gg \mathop{\mathrm{RSS}}(p)

Conclusion: 2 cases

  1. Choose the Reduced Model:

\begin{align*} H_0 \,\, \text{ holds } & \, \iff \, \beta_2 = \ldots = \beta_p = 0 \\[15pt] & \, \iff \, \mathop{\mathrm{RSS}}(1) \approx \mathop{\mathrm{RSS}}(p) \, \iff \, \frac{\mathop{\mathrm{RSS}}(1) - \mathop{\mathrm{RSS}}(p)}{\mathop{\mathrm{RSS}}(p)} \approx 0 \end{align*}

  1. Choose the Full Model:

\begin{align*} H_1 \,\, \text{ holds } & \, \iff \, \exists \,\, i \, \text{ s.t. } \, \beta_i \neq 0 \\[15pt] & \, \iff \, \mathop{\mathrm{RSS}}(1) \gg \mathop{\mathrm{RSS}}(p) \, \iff \, \frac{\mathop{\mathrm{RSS}}(1) - \mathop{\mathrm{RSS}}(p)}{\mathop{\mathrm{RSS}}(p)} \gg 0 \end{align*}

Construction of F-statistic

  • Thus, the quantity below could be used as statistic to decide between the 2 models

\frac{\mathop{\mathrm{RSS}}(1) - \mathop{\mathrm{RSS}}(p)}{\mathop{\mathrm{RSS}}(p)}

  • However, in order to obtain a known distribution, we need to rescale

  • Note that the degrees of freedom of Reduced Model (1 parameter) are

\mathop{\mathrm{df}}(1) = n - 1

  • The degrees of freedom of the Full Model (p parameters) are

\mathop{\mathrm{df}}(p) = n - p

F-statistic for Overall Significance

Definition
The F-statistic for Overall Significance is

\begin{align*} F & := \frac{\mathop{\mathrm{RSS}}(1) - \mathop{\mathrm{RSS}}(p)}{ \mathop{\mathrm{df}}(1) - \mathop{\mathrm{df}}(p) } \bigg/ \frac{\mathop{\mathrm{RSS}}(p)}{\mathop{\mathrm{df}}(p)} \\[15pt] & = \frac{\mathop{\mathrm{RSS}}(1) - \mathop{\mathrm{RSS}}(p)}{ p - 1 } \bigg/ \frac{\mathop{\mathrm{RSS}}(p)}{n - p} \end{align*}

Theorem: The F-statistic for Overall Significance has F-distribution

F \, \sim \, F_{p-1,n-p}

Rewriting the F-statistic

Proposition
The F-statistic can be rewritten as

\begin{align*} F & = \frac{\mathop{\mathrm{RSS}}(1) - \mathop{\mathrm{RSS}}(p)}{ p - 1 } \bigg/ \frac{\mathop{\mathrm{RSS}}(p)}{n - p} \\[15pt] & = \frac{R^2}{1 - R^2} \, \cdot \, \frac{n - p}{p - 1} \end{align*}

Proof of Proposition

  • Recall the definition of Total Sum of Squares \mathop{\mathrm{TSS}}= \sum_{i=1}^n (y_i - \overline{y})^2

  • Notice that the definition of \mathop{\mathrm{TSS}} does not depend on p

  • Recall the definition of R^2 for the Full Model (which has p parameters) R^2 = 1 - \frac{\mathop{\mathrm{RSS}}(p)}{\mathop{\mathrm{TSS}}}

  • From the above, we obtain

\mathop{\mathrm{RSS}}(p) = (1 - R^2) \mathop{\mathrm{TSS}}

  • By definition, we have that

\mathop{\mathrm{RSS}}(1) = \min_{\beta_1} \ \sum_{i=1}^n (y_i - \beta_1)^2

  • Exercise: Check that the unique solution to the above problem is

\beta_1 = \overline{y}

  • Therefore, we have

\mathop{\mathrm{RSS}}(1) = \sum_{i=1}^n (y_i - \overline{y})^2 = \mathop{\mathrm{TSS}}

  • We just obtained the two identities

\mathop{\mathrm{RSS}}(p) = (1 - R^2) \mathop{\mathrm{TSS}}\,, \qquad \quad \mathop{\mathrm{RSS}}(1) = \mathop{\mathrm{TSS}}

  • From the above, we conclude the proof

\begin{align*} F & = \frac{\mathop{\mathrm{RSS}}(1) - \mathop{\mathrm{RSS}}(p)}{ p - 1 } \bigg/ \frac{\mathop{\mathrm{RSS}}(p)}{n - p} \\[15pt] & = \frac{\mathop{\mathrm{TSS}}- (1 - R^2) \mathop{\mathrm{TSS}}}{ p - 1 } \bigg/ \frac{(1 - R^2) \mathop{\mathrm{TSS}}}{n - p}\\[15pt] & = \frac{R^2}{1 - R^2} \, \cdot \, \frac{n - p}{p - 1} \end{align*}

Testing Overall Significance

  • Recall: We want to test the Overall Significance of the parameters \beta_2,\ldots, \beta_p

  • This means deciding which model gives better predictions between \begin{align*} \textbf{Model 1:} & \quad Y_i = \beta_1 + \varepsilon_i & \text{Reduced Model}\\[15pt] \textbf{Model 2:} & \quad Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i & \text{Full Model} \end{align*}

  • To make a decision, we have formulated the hypothesis \begin{align*} H_0 & \colon \, \beta_2 = \beta_3 = \ldots = \beta_p = 0 \\ H_1 & \colon \text{ At least one of the } \beta_i \text{ is non-zero} \end{align*}

  • H_0 favors Model 1

  • H_1 favors Model 2

We use the F-statistic to decide between the two models F = \frac{\mathop{\mathrm{RSS}}(1) - \mathop{\mathrm{RSS}}(p)}{ p - 1 } \bigg/ \frac{\mathop{\mathrm{RSS}}(p)}{n - p}

  1. Choose the Reduced Model: \begin{align*} H_0 \,\, \text{ holds } \, \iff \, \beta_2 = \ldots = \beta_p = 0 \, \iff \, \mathop{\mathrm{RSS}}(1) \approx \mathop{\mathrm{RSS}}(p) \, \iff \, F \approx 0 \end{align*}

  2. Choose the Full Model: \begin{align*} H_1 \,\, \text{ holds } \, \iff \, \exists \,\, i \, \text{ s.t. } \, \beta_i \neq 0 \, \iff \, \mathop{\mathrm{RSS}}(1) \gg \mathop{\mathrm{RSS}}(p) \, \iff \, F \gg 0 \end{align*}

Therefore, the test is one-sided: \,\, Reject H_0 \iff F \gg 0

The F-test for Overall Significance

Assumption: Given data points (x_{i2},\ldots , x_{ip}, y_i), consider the nested models

\begin{align*} \textbf{Model 1:} & \quad Y_i = \beta_1 + \varepsilon_i & \text{Reduced Model}\\[15pt] \textbf{Model 2:} & \quad Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i & \text{Full Model} \end{align*}

Goal: Testing the Overall Significance of the parameters \beta_2,\ldots, \beta_p
(i.e. decide which model gives better predictions)

Hypothesis: Choosing the Full Model 2 is equivalent to rejecting H_0

\begin{align*} H_0 & \colon \, \beta_2 = \beta_3 = \ldots = \beta_p = 0 \\ H_1 & \colon \text{ At least one of the } \beta_i \text{ is non-zero} \end{align*}

Procedure: 3 Steps

  1. Calculation: Write design matrix Z, and compute MLE \hat{\beta} and predictions \hat{y}_i Z = \left( \begin{array}{cccc} 1 & x_{12} & \ldots & x_{1p} \\ \ldots & \ldots & \ldots & \ldots \\ 1 & x_{n2} & \ldots & x_{np} \\ \end{array} \right) \,, \qquad \hat{\beta} = (Z^TZ)^{-1} Z^T y \,, \qquad \hat{y} = Z \hat{\beta} Compute the \mathop{\mathrm{RSS}}, \mathop{\mathrm{TSS}} and R^2 coefficient for the Full Model \mathop{\mathrm{RSS}}(p) = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \,, \qquad \mathop{\mathrm{TSS}}= \sum_{i=1}^n (y_i - \overline{y})^2\,, \qquad R^2 = 1 - \frac{\mathop{\mathrm{RSS}}(p)}{\mathop{\mathrm{TSS}}} Finally, compute the F-statistic for Overall Significance F = \frac{R^2}{1 - R^2} \, \cdot \, \frac{n - p}{p - 1} \ \sim \ F_{p-1, n - p}

  2. Statistical Tables or R: Find either
    • Critical value F^* in Table 3
    • p-value in R


  3. Interpretation: Reject H_0 when either p < 0.05 \qquad \text{ or } \qquad F \in \,\,\text{Rejection Region}
Alternative Rejection Region F^* p-value
At least one of the \beta_i is non-zero F > F^* F_{p-1,n-p}(0.05) P(F_{p-1,n-p} > F)

The F-test for Overall Significance in R

  1. Fit the Full Model with lm

  2. Read the Summary

    • F-statistic is listed in the summary
    • p-value is listed in the summary
# Fit the Full Model
full.model <- lm(y ~ x2 + x3 + ... + xp)

# Read the summary
summary(full.model)




F-statistic for simple regression

Proposition

  1. The F-statistic for Overall Significance in simple regression is F = t^2 \,, \qquad \quad t = \frac{\hat \beta}{ S / \sqrt{S_{xx}}} where t is the t-statistic for \hat \beta

  2. In particular, the p-values for t-test and F-test coincide p = 2P( t_{n-2} > |t| ) = P( |t_{n-2}| > |t| ) = P( F_{1,n-2} > F )

Proof: Will be left as an exercise

Worked Example: Stock and Gold prices

  • Recall that

    • X = Stock Price
    • Y = Gold Price
  • Consider the simple linear regression model

Y_i = \alpha + \beta x_i + \varepsilon_i

  • We want to test the Overall Significance of the parameter \beta

  • This means deciding which of the two models below gives better predictions

\begin{align*} \textbf{Model 1:} & \quad Y_i = \alpha + \varepsilon_i & \text{Reduced Model}\\[15pt] \textbf{Model 2:} & \quad Y_i = \alpha + \beta x_{i} + \varepsilon_i & \text{Full Model} \end{align*}

  • Hypothesis: Choosing the Full Model 2 is equivalent to rejecting H_0

\begin{align*} H_0 & \colon \, \beta = 0 \\ H_1 & \colon \, \beta \neq 0 \end{align*}

  • Let us perform the F-test in R

  • Recall that Stock Prices and Gold Prices are stored in R vectors

    • stock.price
    • gold.price
  • Fit the simple linear model with the following commands

\text{gold.price } = \alpha + \beta \, \times ( \text{ stock.price } ) + \text{ error}

# Fit simple linear regression model
fit.model <- lm(gold.price ~ stock.price)

# Print result to screen
summary(fit.model)

Output: F-statistic and p-value

  • The computed F-statistic is F = 20.27

  • There are n = 33 data points, and p = 2 parameters in the Full Model

  • Therefore, the degrees of freedom are {\rm df}_1 = p - 1 = 2 - 1 = 1 \,, \qquad {\rm df}_2 = n - p = 33 - 2 = 31 These are listed in the output, after the F-statistic

  • In particular, we have that the F-statistic has distribution F \ \sim \ F_{{\rm df}_1, {\rm df}_2} = F_{1,31}

Output: F-statistic and p-value

  • The computed p-value is p = P( F_{1,31} > F ) = 8.904 \times 10^{-5}

  • Conclusion: Strong evidence (p < 0.001) that the parameter \beta is significant

    • Going back to the model, this means that Stock Price affects Gold Price

Output: F-statistic and p-value

We could also compute F-statistic by hand

  • From the output we see that R^2 = 0.3953

  • We have p = 2 and n =33

  • Therefore

F = \frac{ R^2 }{ 1 - R^2 } \, \cdot \, \frac{n-p}{p-1} = \frac{0.395325}{0.604675} \, \cdot \, \frac{31}{1} = 20.267

Output: F-statistic and p-value

  • Similarly, the p-value could be computed by hand

p = P( F_{1,31} > F ) = 8.904 \times 10^{-5}

F <- 20.267    # Input the precise value, rather than the rounded one given in Summary
n <- 33
p <- 2

p.value <- 1 - pf(F, df1 = p - 1, df2 = n - p)

cat("The p-value is", p.value)
The p-value is 8.904244e-05

Output: F-statistic and p-value

  • We can also find critical value F_{1,31} (0.05), by finding its closest values in Table 3 F_{1, 30} (0.05) = 4.17 \,, \qquad \quad F_{1, 40} (0.05) = 4.08

  • Approximate F_{1,31} (0.05) by averaging the above values F_{1,31}(0.05) \, \approx \, \frac{F_{1, 30} (0.05) + F_{1, 40} (0.05)}{2} = 4.125

  • We reject H_0, since F = 20.267 > F_{1,31}(0.05) = 4.125 \quad \implies \quad p < 0.05
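
The exact critical value can also be obtained in R with qf (a quick check of the approximation above):

# Exact critical value F_{1,31}(0.05)
qf(0.95, df1 = 1, df2 = 31)
# approximately 4.16 (= 2.040^2, the square of the t critical value)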

Part 6:
Worked Example:
The Longley dataset

Longley US macroeconomic dataset

7 economic variables, observed yearly 1947–1962 (16 years)

  • The Longley dataset was analyzed by J.W. Longley in his 1967 paper [3]

  • It is stored in R by default, in the variable longley

  • The first line of help(longley) reads:

    • A macroeconomic data set which provides a well-known example for a highly collinear regression
  • We will see next time what highly collinear means

  • For convenience, you can also download the Longley dataset here longley.txt

  • Download the file and place it in current working directory

  • Let us load the dataset and read the first 3 lines

longley <- read.table(file = "longley.txt", header = TRUE)

head(longley, n = 3)     # Prints the first 3 rows of dataset
     GNP.deflator     GNP Unemployed Armed.Forces Population Year Employed
1947         83.0 234.289      235.6        159.0    107.608 1947   60.323
1948         88.5 259.426      232.5        145.6    108.632 1948   61.122
1949         88.2 258.054      368.2        161.6    109.773 1949   60.171



# Name Description
X_2 GNP.deflator Coefficient for adjusting GNP for inflation (1954 = 100) (1954 is reference year: other years need adjusting)
X_3 GNP Yearly Gross National Product (in Billion Dollars)
X_4 Unemployed Number of unemployed people (in thousands)
X_5 Armed.Forces Number of people in the Armed Forces (in thousands)
X_6 Population Non-institutionalized Population \geq age 14 (in thousands) (People \geq 14 not in Armed Forces or care of institutions)
X_7 Year The year in which the data is recorded. Range: 1947–1962
Y Employed Number of employed people (in thousands)

Analyzing the Longley Dataset

Goal: Explain the number of Employed people Y in terms of X_i variables

  • Longley dataset available here longley.txt

  • Download the file and place it in current working directory

# Read data file
longley <- read.table(file = "longley.txt", header = TRUE)

# Store columns in vectors
x2 <- longley[ , 1]        # GNP Deflator
x3 <- longley[ , 2]        # GNP
x4 <- longley[ , 3]        # Unemployed
x5 <- longley[ , 4]        # Armed Forces
x6 <- longley[ , 5]        # Population
x7 <- longley[ , 6]        # Year
y <- longley[ , 7]         # Employed

Fitting multiple regression

  • We want to fit the multiple regression model

Y = \beta_1 + \beta_2 \, X_2 + \beta_3 \, X_3 + \beta_4 \, X_4 + \beta_5 \, X_5 + \beta_6 \, X_6 + \beta_7 \, X_7 + \varepsilon

# Fit multiple regression model
model <- lm(y ~ x2 + x3 + x4 + x5 + x6 + x7)

# Print summary
summary(model)
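
As a cross-check (a sketch, assuming the vectors defined above), the overall F-test can also be obtained by explicitly comparing the intercept-only model with the full model:

# Overall F-test via explicit comparison of nested models
reduced.model <- lm(y ~ 1)      # Model 1: intercept only
anova(reduced.model, model)     # F-statistic and p-value for the overall test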


Interpreting R output

We are interested in the following information:

  1. Coefficient of determination R^2

  2. Individual t-statistics for the parameters \beta_i

  3. F-statistic to assess Overall Significance of parameters \beta_i

1. Coefficient of determination R^2

  • We have that R^2=0.9955

  • R^2 is very high, which suggests we might have quite a good model

  • This means the model explains around 99.6% of the variability in the data

  • There is a chance R^2 is too high to be true
    (This means the model is overfitting - More on this later)

2. Individual t-statistics

  • Look at p-values for each variable X_j

  • Recall: Such p-values refer to the two-sided t-test

H_0 \colon \, \beta_j = 0 \qquad \quad H_1 \colon \, \beta_j \neq 0

  • Find X_j for which p < 0.05
    • For X_j we reject H_0
    • Therefore \beta_j \neq 0
    • Hence X_j influences Y
    • X_j is statistically significant
  • In dissertations p < 0.1 is OK
    • Weak evidence against H_0
    • X_j has some weak effect on Y


Stars in output help you find the significant p-values


Significance code p-value Coefficient
No stars 0.1 \leq p \leq 1 \beta_j = 0
\texttt{.} 0.05 \leq p < 0.1 \beta_j = 0
\texttt{*} 0.01 \leq p < 0.05 \beta_j \neq 0
\texttt{**} 0.001 \leq p < 0.01 \beta_j \neq 0
\texttt{***} p < 0.001 \beta_j \neq 0

Interpretation: Not all variables are statistically significant

  • This is because some variables have no stars
  • Significant variables have at least one star

  1. Intercept has two stars \, \texttt{**} and hence is significant
    • The p-value is p = 0.00356
    • Since p < 0.05, we conclude that the real intercept is \beta_1 \neq 0
    • Estimated intercept is \hat \beta_1 = -3.482 \times 10^{3}

  2. X_2, X_3, X_6 have no stars
    • p-values are p \geq 0.1 and therefore the real \beta_2 = \beta_3 = \beta_6 = 0
    • Notice that the ML Estimators \hat{\beta}_2, \hat{\beta}_3, \hat{\beta}_6 are not zero
    • However, these estimates have to be ignored as t-test is not significant
    • X_2, X_3, X_6 have no effect on Y
    • Hence GNP Deflator, GNP and Population do not affect Number of Employed

  3. X_4 has two stars \, \texttt{**}, and hence is significant
    • Since p<0.05 we have that \beta_4 \neq 0
    • Estimated coefficient is \hat \beta_4 < 0
    • Since \hat \beta_4 < 0, we have that X_4 negatively affects Y
    • As the Number of Unemployed increases, the Number of Employed decreases

  4. X_5 has three stars \, \texttt{***}, and hence is significant
    • Since p<0.05, we have that \beta_5 \neq 0
    • Estimated coefficient is \hat \beta_5 < 0
    • Therefore X_5 negatively affects Y
    • As the size of Armed Forces increases, the Number of Employed decreases

  5. X_7 has two stars \, \texttt{**}, and hence is significant
    • Since p<0.05 we have that \beta_7 \neq 0
    • Estimated coefficient is \hat \beta_7 > 0
    • Therefore X_7 positively affects Y
    • Remember that X_7 is the Year
    • The Number of Employed is generally increasing from 1947 to 1962

3. F-statistic to assess Overall Significance

  • The F-statistic is F = 330.3, with distribution F_{6, 9}

  • The p-value for F-test is p = 4.984 \times 10^{-10} < 0.05

  • Recall: F-statistic and p-value refer to the F-test for

H_0 \colon \, \beta_2 = \ldots = \beta_7 = 0 \,, \qquad H_1 \colon \, \text{At least one } \beta_j \neq 0

  • Since p < 0.05, we reject H_0

  • There is evidence that at least one of the X-variables affects Y

Conclusions

  • From F-test: There is evidence that at least one of the X-variables affects Y

  • From individual t-tests, we have:

    • X_2, X_3, X_6 do not affect Y
    • X_4 and X_5 negatively affect Y
    • X_7 positively affects Y
  • Coefficient of determination R^2 is really high:

    • The model explains around 99.6% of the variability in the data

Warning: The extremely high R^2 looks very suspicious

  • Indeed, this is too good to be true, and it happens for 2 reasons

Reason 1: Overfitting

  • There are 16 sample points (16 years observed)

  • There are 6 variables

  • The sample is small and the variables are many: the model can easily overfit

  • Generally, adding more variables to the model improves the fit on the observed data

  • In turn, this can artificially inflate R^2

\hat{y}_i \approx y_i \quad \implies \quad \mathop{\mathrm{RSS}}\approx 0 \quad \implies \quad R^2 = 1 - \frac{\mathop{\mathrm{RSS}}}{\mathop{\mathrm{TSS}}} \approx 1

  • Despite higher R^2, the model might not make good predictions

Overfitting Example: Think of a polynomial passing through all the data points

  • Polynomial explains data perfectly \hat{y}_i = y_i \quad \implies \quad \mathop{\mathrm{RSS}}= 0 \quad \implies \quad R^2 = 1 - \frac{\mathop{\mathrm{RSS}}}{\mathop{\mathrm{TSS}}} = 1

  • However, the trend is clearly linear

  • Polynomial model artificially increases R^2, but predictions are not better
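
A minimal simulated illustration (hypothetical data, not the Longley dataset): with 10 points and a truly linear trend, a degree-9 polynomial interpolates the data and drives R^2 to 1.

# Hypothetical overfitting illustration with simulated data
set.seed(1)
x <- 1:10
y <- 2 * x + rnorm(10)              # linear trend plus noise

fit.linear <- lm(y ~ x)             # sensible model
fit.poly   <- lm(y ~ poly(x, 9))    # degree-9 polynomial: interpolates all 10 points

summary(fit.linear)$r.squared       # high, but below 1
summary(fit.poly)$r.squared         # essentially 1: the model "explains" everything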

Reason 2: Multicollinearity

  • An assumption of regression is independence of the variables X_i

  • In particular, this implies the X_i are uncorrelated

  • However, the Longley dataset contains several correlated variables

  • Example: GNP, Population and GNP Deflator are highly correlated

    • Growing economy (higher Population and GNP) increases inflation (GNP Deflator)
  • Multicollinearity leads to a strange phenomenon:

    • We have seen that GNP, Population and GNP Deflator do not affect Employed
    • However, common sense says they should
    • At the same time, high R^2 suggests a good model
    • What is going on?
  • Multicollinearity breaks the model: t-tests for individual coefficients are unreliable
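
The correlations mentioned above can be checked directly; a quick sketch, using the vectors loaded earlier:

# Correlation matrix of some X variables (vectors loaded earlier)
# Values close to 1 indicate strong collinearity
cor( cbind(GNP.deflator = x2, GNP = x3, Population = x6, Year = x7) )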

References

[1]
Casella, George, Berger, Roger L., Statistical inference, Second Edition, Brooks/Cole, 2002.
[2]
DeGroot, Morris H., Schervish, Mark J., Probability and statistics, Fourth Edition, Addison-Wesley, 2012.
[3]
J.W. Longley, An appraisal of least-squares programs from the point of view of the user, Journal of the American Statistical Association. 62 (1967) 819–841.