Lecture 11
In Lecture 9 we introduced the general linear regression model
Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \ldots + \beta_p z_{ip} + \varepsilon_i
Z_1 \, , \,\, \ldots \, , \, Z_p
Y | Z_1 = z_{i1} \,, \,\, \ldots \,, \,\, Z_p = z_{ip}
Predictor is known: The values z_{i1}, \ldots, z_{ip} are known
Normality: The distribution of Y_i is normal
Linear mean: There are parameters \beta_1,\ldots,\beta_p such that {\rm I\kern-.3em E}[Y_i] = \beta_1 z_{i1} + \ldots + \beta_p z_{ip}
Homoscedasticity: There is a parameter \sigma^2 such that {\rm Var}[Y_i] = \sigma^2
Independence: The random variables Y_1 , \ldots , Y_n are independent and thus uncorrelated
{\rm Cor}(Y_i,Y_j) = 0 \qquad \forall \,\, i \neq j
Predictor is known: The values z_{i1}, \ldots, z_{ip} are known
Normality: The distribution of \varepsilon_i is normal
Linear mean: The errors have zero mean {\rm I\kern-.3em E}[\varepsilon_i] = 0
Homoscedasticity: There is a parameter \sigma^2 such that {\rm Var}[\varepsilon_i] = \sigma^2
Independence: Errors \varepsilon_1 , \ldots , \varepsilon_n are independent and thus uncorrelated
{\rm Cor}(\varepsilon_i, \varepsilon_j) = 0 \qquad \forall \,\, i \neq j
Z^T Z \, \text{ is invertible}
\beta = (\beta_1, \ldots, \beta_p)
\hat \beta = (Z^T Z)^{-1} Z^T y
{\rm Var}[\varepsilon_i] \neq {\rm Var}[\varepsilon_j] \qquad \text{ for some } \,\, i \neq j
{\rm Cor}( \varepsilon_i, \varepsilon_j ) \neq 0 \qquad \text{ for some } \,\, i \neq j
Z^T Z \, \text{ is (almost) not invertible}
Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \ldots + \beta_p z_{ip} + \varepsilon_i
{\rm Var}[\varepsilon_i] \neq {\rm Var}[\varepsilon_j] \qquad \text{ for some } \,\, i \neq j
In Lecture 10 we presented a few methods to assess linear models
The above methods rely heavily on homoscedasticity
For example the maximum likelihood estimation relied on the calculation \begin{align*} L & = \prod_{i=1}^n f_{Y_i} (y_i) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left( -\frac{(y_i - \hat y_i)^2}{2\sigma^2} \right) \\[15pts] & = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( -\frac{\sum_{i=1}^n(y_i- \hat y_i)^2}{2\sigma^2} \right) \\[15pts] & = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( -\frac{ \mathop{\mathrm{RSS}}}{2\sigma^2} \right) \end{align*}
The calculation is only possible thanks to homoscedasticity
{\rm Var}[Y_i] = \sigma^2 \qquad \forall \,\, i
L = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( -\frac{ \mathop{\mathrm{RSS}}}{2\sigma^2} \right)
\min_{\beta} \ \mathop{\mathrm{RSS}}
\hat \beta = (Z^T Z)^{-1} Z^T y
L \neq \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( -\frac{ \mathop{\mathrm{RSS}}}{2\sigma^2} \right)
Therefore \hat \beta would no longer maximize the likelihood!
In this case \hat \beta would still be an unbiased estimator for \beta
{\rm I\kern-.3em E}[\hat \beta ] = \beta
However the quantity S^2 = \frac{ \mathop{\mathrm{RSS}}(\hat \beta) }{n-p} is no longer an unbiased estimator of the population variance \sigma^2: \quad {\rm I\kern-.3em E}[S^2] \neq \sigma^2
This is a problem because the estimated standard error for \beta_j involves S^2: \quad \mathop{\mathrm{e.s.e.}}(\beta_j) = \xi_{jj}^{1/2} \, S
Therefore \mathop{\mathrm{e.s.e.}} becomes unreliable
Then the t-statistic for the significance of \beta_j also becomes unreliable
This is because the t-statistic depends on \mathop{\mathrm{e.s.e.}}
t = \frac{ \hat\beta_j - \beta_j }{ \mathop{\mathrm{e.s.e.}}}
Heteroscedasticity in linear regression is no longer a big problem
This is thanks to 1980s research on robust standard errors (more info here)
Moreover heteroscedasticity only becomes a problem when it is severe
e_i := y_i - \hat y_i
By definition e_i is sampled from \varepsilon_i
We have heteroscedasticity if
{\rm Var}[\varepsilon_i] \neq {\rm Var}[\varepsilon_j] \, \quad \, \text{ for some } \, i \neq j
Funnelling out of residuals
Funnelling in of residuals
Linear residuals – Proportional to \hat y_i
Quadratic residuals – Proportional to \hat{y}^2_i
In these special cases we can transform the data to avoid heteroscedasticity
Heteroscedasticity can be associated with some of the X-variables
The book [1] discusses two cases
In each case divide through by the square root of the offending X-term
Case 1: error variance proportional to X_i^2, i.e. {\rm Var}[\varepsilon_i] = \sigma^2 X_i^2, in the model Y_i = \beta_1 + \beta_2 X_{i} + \varepsilon_i. Dividing through by X_i yields homoscedastic errors, since {\rm Var}[\varepsilon_i / X_i] = \sigma^2:
\begin{equation} \tag{1} \frac{Y_i}{X_i} = \frac{\beta_1}{X_i}+\beta_2+\frac{\varepsilon_i}{X_i} \end{equation}
Case 2: error variance proportional to X_i, i.e. {\rm Var}[\varepsilon_i] = \sigma^2 X_i, in the same model Y_i = \beta_1 + \beta_2 X_{i} + \varepsilon_i. Dividing through by \sqrt{X_i} yields homoscedastic errors, since {\rm Var}[\varepsilon_i / \sqrt{X_i}] = \sigma^2:
\begin{equation} \tag{2} \frac{Y_i}{\sqrt{X_i}} = \frac{\beta_1}{\sqrt{X_i}}+\beta_2 \sqrt{X_i} + \frac{\varepsilon_i}{\sqrt{X_i}} \end{equation}
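As a hedged aside, transformation (2) is equivalent to weighted least squares with weights 1/X_i, so in R it could be fitted as follows (a sketch; Y and X are placeholder names for positive numeric vectors):

# Weighted least squares with weights 1/X minimizes sum( (Y_i - b1 - b2*X_i)^2 / X_i ),
# which is exactly OLS applied to the transformed equation (2)
fit.wls <- lm(Y ~ X, weights = 1 / X)
summary(fit.wls)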
We need R commands for residuals and fitted values
Fit a linear model as usual
The full code for the example is available here residual_graphs.R
Stock Vs Gold prices data is available here stock_gold.txt
Read data into R and fit simple regression
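A hedged sketch of these steps and of the residual diagnostics that follow (the column names of stock_gold.txt are assumptions; the actual script is residual_graphs.R):

# Read the Stock vs Gold prices data (column names below are assumptions)
prices <- read.table("stock_gold.txt", header = TRUE)
# Fit the simple regression of stock price on gold price
model <- lm(stock.price ~ gold.price, data = prices)
# Residuals e_i and fitted values hat(y)_i
e <- residuals(model)
y.hat <- fitted(model)
# Plot residuals against fitted values to look for funnelling patterns
plot(y.hat, e, xlab = "Fitted values", ylab = "Residuals")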
\log Y_i = \alpha + \beta X_i + \varepsilon_i
Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \ldots + \beta_p z_{ip} + \varepsilon_i
{\rm Cor}(\varepsilon_i , \varepsilon_j) \neq 0 \qquad \text{ for some } \,\, i \neq j
Recall the methods to assess linear models
The above methods rely heavily on independence
Once again let us consider the likelihood calculation \begin{align*} L & = f(y_1, \ldots, y_n) = \prod_{i=1}^n f_{Y_i} (y_i) \\[15pts] & = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( -\frac{\sum_{i=1}^n(y_i- \hat y_i)^2}{2\sigma^2} \right) \\[15pts] & = \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( -\frac{ \mathop{\mathrm{RSS}}}{2\sigma^2} \right) \end{align*}
The second equality is only possible thanks to independence of
Y_1 , \ldots, Y_n
{\rm Cor}(\varepsilon_i,\varepsilon_j) \neq 0 \quad \text{ for some } \, i \neq j
\varepsilon_i \, \text{ and } \, \varepsilon_j \, \text{ dependent } \quad \implies \quad Y_i \, \text{ and } \, Y_j \, \text{ dependent }
L \neq \frac{1}{(2\pi \sigma^2)^{n/2}} \, \exp \left( -\frac{ \mathop{\mathrm{RSS}}}{2\sigma^2} \right)
In this case \hat \beta no longer maximizes the likelihood!
As already seen, this implies that
\mathop{\mathrm{e.s.e.}}(\beta_j) \,\, \text{ is unreliable}
{\rm Cor}(\varepsilon_i,\varepsilon_j) \neq 0 \quad \text{ for some } \, i \neq j
Autocorrelation is often unavoidable
Typically associated with time series data
Economic time series tend to exhibit cyclical behaviour
Examples include GNP, price indices, production figures, employment statistics etc.
Since these series tend to be quite slow moving, successive observations are likely to be correlated
This is an extremely common phenomenon in financial and economic time series
Characteristic of industries in which a large amount of time passes between the decision to produce and the arrival of the product on the market
Cobweb phenomenon is common with agricultural commodities
Economic agents (e.g. farmers) decide this period's supply on the basis of the price observed in the previous period
\begin{equation} \tag{3} {\rm Supply}_t = \beta_1 + \beta_2 \, {\rm Price}_{t-1} + \varepsilon_t \end{equation}
Errors \varepsilon_t in equation (3) are unlikely to be completely random and patternless
This is because they represent actions of intelligent economic agents (e.g. farmers)
Error terms are likely to be autocorrelated
Examples:
Quarterly data may smooth out the wild fluctuations in monthly sales figures
Low frequency economic survey data may be interpolated
However: Such data transformations may be inevitable
In social sciences data quality may be variable
This may induce systematic patterns and autocorrelation
No magic solution – Autocorrelation is unavoidable and must be considered
Graphical and statistical methods can be useful cross-check of each other!
Code for this example is available here autocorrelation_graph_tests.R
Stock Vs Gold prices data is available here stock_gold.txt
Read data into R and fit simple regression
[1] 33
Y_i = \alpha + \beta x_i + \varepsilon_i
{\rm Cor}(\varepsilon_i, \varepsilon_j ) \neq 0 \quad \text{ for some } \, i \neq j
e_t \, \approx \, a + b \, e_{t-1} \quad \text{ for some } \,\, a , \, b \in \mathbb{R}
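A sketch of how this relation can be inspected, assuming e is the vector of residuals in time order:

# Plot e_t against e_{t-1} and fit the rough linear relation e_t = a + b * e_{t-1}
n <- length(e)
plot(e[-n], e[-1], xlab = "Lagged residual", ylab = "Residual")
lm(e[-1] ~ e[-n])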
In this case the simple linear model is not the right thing to consider
The right thing to do is consider Autoregressive linear models
Y_t = \alpha + \beta x_{t} + \varepsilon_t \,, \qquad \text{with errors following e.g. an AR(1) process } \,\, \varepsilon_t = \rho \, \varepsilon_{t-1} + u_t
These models couple regression with time-series analysis (ARIMA models)
Good reference is book by Shumway and Stoffer [3]
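A minimal sketch of such a model, using base R's arima with a regression term and AR(1) errors (y and x are assumed to be time-ordered numeric vectors):

# Regression with AR(1) errors: Y_t = alpha + beta * x_t + eps_t, eps_t = rho * eps_{t-1} + u_t
fit.ar1 <- arima(y, order = c(1, 0, 0), xreg = x)
fit.ar1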
Y_i = \beta_1 z_{i1} + \beta_2 z_{i2} + \ldots + \beta_p z_{ip} + \varepsilon_i
\det(Z^T Z ) \, \approx \, 0 \, \quad \implies \quad Z^T Z \, \text{ is (almost) not invertible}
\text{Multicollinearity = multiple (linear) relationships between the Z-variables}
Z-variables inter-related \quad \implies \quad hard to isolate individual influence on Y
Perfect collinearity for Z_1 and Z_2
No perfect collinearity for Z_1 and Z_3
But Z_3 is small perturbation of Z_2
Z_1 | Z_2 | Z_3 |
---|---|---|
10 | 50 | 52 |
15 | 75 | 75 |
18 | 90 | 97 |
24 | 120 | 129 |
30 | 150 | 152 |
Z = (Z_1 | Z_2 | \ldots | Z_p) = \left( \begin{array}{cccc} z_{11} & z_{12} & \ldots & z_{1p} \\ z_{21} & z_{22} & \ldots & z_{2p} \\ \ldots & \ldots & \ldots & \ldots \\ z_{n1} & z_{n2} & \ldots & z_{np} \\ \end{array} \right)
{\rm rank} (Z) < p
{\rm rank} \left( Z^T Z \right) = {\rm rank} \left( Z \right)
{\rm rank} \left( Z^T Z \right) < p \qquad \implies \qquad Z^T Z \,\, \text{ is NOT invertible}
\hat{\beta} = (Z^T Z)^{-1} Z^T y
Multicollinearity is a big problem!
Z_1, Z_2, Z_3 as before
Exact Multicollinearity since Z_2 = 5 Z_1
Thus Z^T Z is not invertible
Let us check with R
Z_1 | Z_2 | Z_3 |
---|---|---|
10 | 50 | 52 |
15 | 75 | 75 |
18 | 90 | 97 |
24 | 120 | 129 |
30 | 150 | 152 |
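A sketch of the check (the exact script is not shown here; the three columns are taken from the table above):

# Enter the three columns from the table
Z1 <- c(10, 15, 18, 24, 30)
Z2 <- c(50, 75, 90, 120, 150)
Z3 <- c(52, 75, 97, 129, 152)
# Assemble the design matrix and examine Z^T Z
Z <- matrix(c(Z1, Z2, Z3), ncol = 3)
det(t(Z) %*% Z)       # essentially 0
solve(t(Z) %*% Z)     # fails: computationally singular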
R computed that \,\, {\rm det} ( Z^T Z) = -3.531172 \times 10^{-7}
Therefore the determinant of Z^T Z is almost 0
Z^T Z \text{ is not invertible!}
Error in solve.default(t(Z) %*% Z) :
system is computationally singular: reciprocal condition number = 8.25801e-19
In practice one almost never has exact Multicollinearity
If Multicollinearity is present, it is likely to be Approximate Multicollinearity
In case of approximate Multicollinearity it holds that \det(Z^T Z) \approx 0
Approximate Multicollinearity is still a big problem!
This is because of:
\text{Small perturbations in } Z \quad \implies \quad \text{large variations in } (Z^T Z)^{-1}
\text{Small perturbations in } Z \quad \implies \quad \text{large variations in } \xi_{ij}
\mathop{\mathrm{e.s.e.}}(\beta_j) = \xi_{jj}^{1/2} \, S \,, \qquad \quad S^2 = \frac{\mathop{\mathrm{RSS}}}{n-p}
\begin{align*} \text{Multicollinearity} & \quad \implies \quad \text{Numerical instability} \\[5pts] & \quad \implies \quad \text{potentially larger } \xi_{jj} \\[5pts] & \quad \implies \quad \text{potentially larger } \mathop{\mathrm{e.s.e.}}(\beta_j) \end{align*}
t = \frac{\hat\beta_j}{ \mathop{\mathrm{e.s.e.}}(\beta_j) }
But Multicollinearity increases the \mathop{\mathrm{e.s.e.}}(\beta_j)
Therefore the t-statistic reduces in size:
It becomes harder to reject incorrect hypotheses!
Z_1, Z_2, Z_3 as before
We know we have exact Multicollinearity, since Z_2 = 5 Z_1
Therefore Z^T Z is not invertible
Z_1 | Z_2 | Z_3 |
---|---|---|
10 | 50 | 52 |
15 | 75 | 75 |
18 | 90 | 97 |
24 | 120 | 129 |
30 | 150 | 152 |
To get rid of exact Multicollinearity we can add a small perturbation to Z_1: \quad Z_1 \,\, \leadsto \,\, Z_1 + 0.01
The new dataset Z_1 + 0.01, Z_2, Z_3 is
Z_1 + 0.01 | Z_2 | Z_3 |
---|---|---|
10.01 | 50 | 52 |
15.01 | 75 | 75 |
18.01 | 90 | 97 |
24.01 | 120 | 129 |
30.01 | 150 | 152 |
Define the new design matrix Z = (Z_1 + 0.01 | Z_2 | Z_3)
Data is approximately Multicollinear
Therefore the inverse of Z^T Z exists
Let us compute this inverse in R
Z_1 + 0.01 | Z_2 | Z_3 |
---|---|---|
10.01 | 50 | 52 |
15.01 | 75 | 75 |
18.01 | 90 | 97 |
24.01 | 120 | 129 |
30.01 | 150 | 152 |
Z = (Z_1 + 0.01 | Z_2 | Z_3)
# Consider perturbation Z1 + 0.01
PZ1 <- Z1 + 0.01
# Assemble perturbed design matrix
Z <- matrix(c(PZ1, Z2, Z3), ncol = 3)
# Compute the inverse of Z^T Z
solve ( t(Z) %*% Z )
[,1] [,2] [,3]
[1,] 17786.804216 -3556.4700048 -2.4186075
[2,] -3556.470005 711.1358432 0.4644159
[3,] -2.418608 0.4644159 0.0187805
\text{ consider } \, Z_1 + 0.02 \, \text{ instead of } \, Z_1 + 0.01
# Consider perturbation Z1 + 0.02
PZ1 <- Z1 + 0.02
# Assemble perturbed design matrix
Z <- matrix(c(PZ1, Z2, Z3), ncol = 3)
# Compute the inverse of Z^T Z
solve ( t(Z) %*% Z )
[,1] [,2] [,3]
[1,] 4446.701047 -888.8947902 -1.2093038
[2,] -888.894790 177.7098841 0.2225551
[3,] -1.209304 0.2225551 0.0187805
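The value printed below appears to be the sample mean of Z_1, used as the reference level in the percentage-change calculation (a sketch of the call):

# Mean of Z1: (10 + 15 + 18 + 24 + 30) / 5 = 19.4
mean(Z1)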
[1] 19.4
\begin{align*} \text{Percentage Change} & = \left( \frac{\text{New Value} - \text{Old Value}}{\text{Old Value}} \right) \times 100\% \\[15pts] & = \left( \frac{(19.4 + 0.02) - (19.4 + 0.01)}{19.4 + 0.01} \right) \times 100 \% \ \approx \ 0.05 \% \end{align*}
\text{Percentage Change in } \, \xi_{11} = \frac{4446 - 17786}{17786} \times 100 \% \ \approx \ −75 \%
\text{perturbation of } \, 0.05 \% \, \text{ in the data } \quad \implies \quad \text{change of } \, - 75 \% \, \text{ in } \, \xi_{11}
\text{Small perturbations in } Z \quad \implies \quad \text{large variations in } (Z^T Z)^{-1}
Often one can know in advance when Multicollinearity might be an issue
This is a big sign of potential Multicollinearity
Why is this contradictory?
Numerical instabilities:
Parameter estimates \hat \beta_j become very sensitive to small changes in the data
The \mathop{\mathrm{e.s.e.}} become very sensitive to small changes in the data
Parameter estimates \hat \beta_j take the wrong sign or otherwise look strange
Parameter estimates may be highly correlated
Consider the following illustrative example
Want to explain expenditure Y in terms of income X_2 and wealth X_3
It is intuitively clear that income and wealth are highly correlated
To detect Multicollinearity look out for warning signs such as a high R^2 combined with low t-values, coefficients with unexpected signs, and highly correlated regressors
Expenditure Y | Income X_2 | Wealth X_3 |
---|---|---|
70 | 80 | 810 |
65 | 100 | 1009 |
90 | 120 | 1273 |
95 | 140 | 1425 |
110 | 160 | 1633 |
115 | 180 | 1876 |
120 | 200 | 2052 |
140 | 220 | 2201 |
155 | 240 | 2435 |
150 | 260 | 2686 |
# Enter data
y <- c(70, 65, 90, 95, 110, 115, 120, 140, 155, 150)
x2 <- c(80, 100, 120, 140, 160, 180, 200, 220, 240, 260)
x3 <- c(810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686)
# Fit model
model <- lm(y ~ x2 + x3)
# We want to display only part of summary
# First capture the output into a vector
temp <- capture.output(summary(model))
# Then print only the lines of interest
cat(paste(temp[9:20], collapse = "\n"))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.77473 6.75250 3.669 0.00798 **
x2 0.94154 0.82290 1.144 0.29016
x3 -0.04243 0.08066 -0.526 0.61509
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.808 on 7 degrees of freedom
Multiple R-squared: 0.9635, Adjusted R-squared: 0.9531
F-statistic: 92.4 on 2 and 7 DF, p-value: 9.286e-06
Three basic statistics to inspect: R^2, the t-values, and the F-statistic
Main red flag for Multicollinearity:
High R^2 value suggests model is really good
However the low t-values suggest that neither income nor wealth individually affects expenditure
The F-statistic is high, meaning that at least one of income and wealth affects expenditure
The wealth variable has the wrong sign (\hat \beta_3 < 0)
Multicollinearity is definitely present!
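The correlation reported below can be computed directly (x2 and x3 as entered earlier):

# Sample correlation between income and wealth
cor(x2, x3)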
[1] 0.9989624
This once again confirms Multicollinearity is present
Conclusion: The variables Income and Wealth are highly correlated
Klein’s rule of thumb: Multicollinearity will be a serious problem if the R^2 obtained by regressing one X-variable on the remaining X-variables exceeds the overall R^2 of the regression of Y on all the X-variables
Example: In the Expenditure vs Income and Wealth dataset we have
Regressing Y against X_2 and X_3 gives R^2=0.9635
Regressing X_2 against X_3 gives R^2 = 0.9979
Klein’s rule of thumb suggests that Multicollinearity will be a serious problem
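A sketch of the auxiliary regression behind the second number (x2 and x3 as entered earlier):

# R^2 from regressing income X2 on wealth X3, for Klein's comparison
summary(lm(x2 ~ x3))$r.squared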
Multicollinearity is essentially a data-deficiency problem
Sometimes we have no control over the dataset available
Important point:
Multicollinearity is a sample feature
Possible that another sample involving the same variables will have less Multicollinearity
Acquiring more data might reduce severity of Multicollinearity
More data can be collected by either
To do this properly would require advanced Bayesian statistical methods
This is beyond the scope of this module
Multicollinearity may be reduced by transforming variables
This may be possible in various different ways
Further reading in Chapter 10 of [1]
Simplest approach to tackle Multicollinearity is to drop one or more of the collinear variables
Goal: Find the best combination of X variables which reduces Multicollinearity
We present 2 alternatives
Consider again the Expenditure vs Income and Wealth dataset
The variables Income and Wealth are highly correlated
Intuitively we expect both Income and Wealth to affect Expenditure
A solution can be to drop either the Income or the Wealth variable
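The two reduced models reported below can be fitted as follows (a sketch; y, x2, x3 as entered earlier):

# Drop Wealth: regress expenditure on income only
summary(lm(y ~ x2))
# Drop Income: regress expenditure on wealth only
summary(lm(y ~ x3))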
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.45455 6.41382 3.813 0.00514 **
x2 0.50909 0.03574 14.243 5.75e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.493 on 8 degrees of freedom
Multiple R-squared: 0.9621, Adjusted R-squared: 0.9573
F-statistic: 202.9 on 1 and 8 DF, p-value: 5.753e-07
Strong evidence that Expenditure increases as Income increases
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 24.411045 6.874097 3.551 0.0075 **
x3 0.049764 0.003744 13.292 9.8e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.938 on 8 degrees of freedom
Multiple R-squared: 0.9567, Adjusted R-squared: 0.9513
F-statistic: 176.7 on 1 and 8 DF, p-value: 9.802e-07
Strong evidence that Expenditure increases as Wealth increases
Stepwise regression: Method of comparing regression models
Involves iterative selection of predictor variables X to use in the model
It can be achieved through forward selection, backward elimination, or a combination of both (stepwise selection)
Note: Significance criterion for X_j is in terms of AIC
Note: Stepwise Selection ensures that at each step all the variables are significant
# null.model and full.model are assumed to be fitted beforehand, e.g.
#   null.model <- lm(y ~ 1)                       # intercept-only model
#   full.model <- lm(y ~ x2 + x3 + x4 + x5 + x6)  # model with all predictors
# Forward selection
best.model <- step(null.model,
                   direction = "forward",
                   scope = formula(full.model))
GNP Unemployed Armed.Forces Population Year Employed
1 234.289 235.6 159.0 107.608 1947 60.323
2 259.426 232.5 145.6 108.632 1948 61.122
3 258.054 368.2 161.6 109.773 1949 60.171
Goal: Explain the number of Employed people Y in the US in terms of GNP, Unemployed, Armed.Forces, Population, and Year
Code for this example is available here longley_stepwise.R
Longley dataset available here longley.txt
Download the data file and place it in current working directory
Y = \beta_1 + \beta_2 \, X_2 + \beta_3 \, X_3 + \beta_4 \, X_4 + \beta_5 \, X_5 + \beta_6 \, X_6 + \varepsilon
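The logical matrix below flags pairs of highly correlated predictors. One hedged way to produce such a matrix, assuming the five predictors occupy the first five columns of a data frame called longley and using a 0.9 cutoff (both assumptions):

# Flag pairs of predictors whose absolute correlation exceeds 0.9
abs(cor(longley[, 1:5])) > 0.9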
GNP Unemployed Armed.Forces Population Year
GNP TRUE FALSE FALSE TRUE TRUE
Unemployed FALSE TRUE FALSE FALSE FALSE
Armed.Forces FALSE FALSE TRUE FALSE FALSE
Population TRUE FALSE FALSE TRUE TRUE
Year TRUE FALSE FALSE TRUE TRUE
Call:
lm(formula = y ~ x2 + x3 + x4 + x6)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.599e+03 7.406e+02 -4.859 0.000503 ***
x2 -4.019e-02 1.647e-02 -2.440 0.032833 *
x3 -2.088e-02 2.900e-03 -7.202 1.75e-05 ***
x4 -1.015e-02 1.837e-03 -5.522 0.000180 ***
x6 1.887e+00 3.828e-01 4.931 0.000449 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2794 on 11 degrees of freedom
Multiple R-squared: 0.9954, Adjusted R-squared: 0.9937
F-statistic: 589.8 on 4 and 11 DF, p-value: 9.5e-13
R^2 = 1 - \frac{\mathop{\mathrm{RSS}}}{\mathop{\mathrm{TSS}}}
Drawback: R^2 increases when number of predictors increases
We saw this phenomenon in the Galileo example in Lecture 10
Fitting the model \rm{distance} = \beta_1 + \beta_2 \, \rm{height} + \varepsilon yielded R^2 = 0.9264
In contrast the quadratic, cubic, and quartic models yielded, respectively R^2 = 0.9903 \,, \qquad R^2 = 0.9994 \,, \qquad R^2 = 0.9998
Fitting a higher degree polynomial gives higher R^2
Conclusion: If the degree of the polynomial is sufficiently high, we will get R^2 = 1
(\rm{height}_1, \rm{distance}_1) \,, \ldots , (\rm{height}_n, \rm{distance}_n)
\hat y_i = y_i \,, \qquad \forall \,\, i = 1 , \ldots, n
\mathop{\mathrm{RSS}}= \sum_{i=1}^n (y_i - \hat y_i )^2 = 0 \qquad \implies \qquad R^2 = 1
Warning: Adding an ever larger number of parameters is not good, as it leads to overfitting
The best model seems to be Linear y = \beta_1 + \beta_2 x + \varepsilon
Linear model interpretation: the risk of divorce by adultery decreases roughly linearly with the years of marriage
# Divorces data
year <- c(1, 2, 3, 4, 5, 6,7, 8, 9, 10, 15, 20, 25, 30)
percent <- c(3.51, 9.5, 8.91, 9.35, 8.18, 6.43, 5.31,
5.07, 3.65, 3.8, 2.83, 1.51, 1.27, 0.49)
# Fit linear model
linear <- lm(percent ~ year)
# Fit order 6 model
order_6 <- lm(percent ~ year + I( year^2 ) + I( year^3 ) +
I( year^4 ) + I( year^5 ) +
I( year^6 ))
# Scatter plot of data
plot(year, percent, xlab = "", ylab = "", pch = 16, cex = 2)
# Add labels
mtext("Years of marriage", side = 1, line = 3, cex = 2.1)
mtext("Risk of divorce by adultery", side = 2, line = 2.5, cex = 2.1)
# Plot Linear Vs Quadratic
polynomial <- Vectorize(function(x, ps) {
n <- length(ps)
sum(ps*x^(1:n-1))
}, "x")
curve(polynomial(x, coef(linear)), add=TRUE, col = "red", lwd = 2)
curve(polynomial(x, coef(order_6)), add=TRUE, col = "blue", lty = 2, lwd = 3)
legend("topright", legend = c("Linear", "Order 6"),
col = c("red", "blue"), lty = c(1,2), cex = 3, lwd = 3)
The AIC is a number which measures how well a regression model fits the data
Also R^2 measures how well a regression model fits the data
The difference is that AIC also penalizes the number of parameters, thereby accounting for overfitting
The formal definition of AIC is
{\rm AIC} := 2p - 2 \log ( \hat{L} )
p = number of parameters in the model
\hat{L} = maximum value of the likelihood function
\ln(\hat L)= -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\hat\sigma^2) - \frac{1}{2\hat\sigma^2} \mathop{\mathrm{RSS}}\,, \qquad \hat \sigma^2 := \frac{\mathop{\mathrm{RSS}}}{n}
\ln(\hat L)= - \frac{n}{2} \log \left( \frac{\mathop{\mathrm{RSS}}}{n} \right) + C
C is constant depending only on the number of sample points
Thus C does not change if the data does not change
{\rm AIC} = 2p + n \log \left( \frac{ \mathop{\mathrm{RSS}}}{n} \right) - 2C
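As a hedged cross-check of this formula (with the constant dropped), R's extractAIC computes the same quantity for linear models. A sketch, assuming model is a fitted lm object:

# Compare the formula above (dropping 2C) with R's extractAIC
n   <- length(residuals(model))
p   <- length(coef(model))        # number of regression parameters
RSS <- sum(residuals(model)^2)
2 * p + n * log(RSS / n)          # formula above without the constant
extractAIC(model)[2]              # same value as computed by R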
Years of Marriage | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
% divorces adultery | 3.51 | 9.50 | 8.91 | 9.35 | 8.18 | 6.43 | 5.31 |
Years of Marriage | 8 | 9 | 10 | 15 | 20 | 25 | 30 |
---|---|---|---|---|---|---|---|
% divorces adultery | 5.07 | 3.65 | 3.80 | 2.83 | 1.51 | 1.27 | 0.49 |
# Divorces data
year <- c(1, 2, 3, 4, 5, 6,7, 8, 9, 10, 15, 20, 25, 30)
percent <- c(3.51, 9.5, 8.91, 9.35, 8.18, 6.43, 5.31,
5.07, 3.65, 3.8, 2.83, 1.51, 1.27, 0.49)
{\rm percent} = \beta_1 + \varepsilon
\rm{percent} = \beta_1 + \beta_2 \, {\rm year} + \beta_3 \, {\rm year}^2 + \ldots + \beta_7 \, {\rm year}^6 + \varepsilon
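A sketch of how the comparison between these two extremes could be run with step, mirroring the forward-selection call shown earlier (the exact script is not shown):

# Forward selection from the intercept-only model towards the order-6 polynomial
null.model <- lm(percent ~ 1)
full.model <- lm(percent ~ year + I(year^2) + I(year^3) +
                   I(year^4) + I(year^5) + I(year^6))
step(null.model, direction = "forward", scope = formula(full.model))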
Call:
lm(formula = percent ~ year)
.....
Dummy variable: Variables X which are qualitative in nature
ANOVA: refers to situations where regression models contain only qualitative (dummy) explanatory variables
ANCOVA: refers to situations where regression models contain a combination of quantitative and qualitative explanatory variables
Dummy variable:
Regression models can include dummy variables
Qualitative binary variables can be represented by a variable X taking only the values 0 and 1
An example of a binary qualitative variable is sex (male / female), as in the factor example later in this lecture
fridge.sales durable.goods.sales Q1 Q2 Q3 Q4 Quarter
1 1317 252.6 1 0 0 0 1
2 1615 272.4 0 1 0 0 2
3 1662 270.9 0 0 1 0 3
4 1295 273.9 0 0 0 1 4
5 1271 268.9 1 0 0 0 1
6 1555 262.9 0 1 0 0 2
7 1639 270.9 0 0 1 0 3
8 1238 263.4 0 0 0 1 4
9 1277 260.6 1 0 0 0 1
10 1258 231.9 0 1 0 0 2
11 1417 242.7 0 0 1 0 3
12 1185 248.6 0 0 0 1 4
13 1196 258.7 1 0 0 0 1
14 1410 248.4 0 1 0 0 2
15 1417 255.5 0 0 1 0 3
16 919 240.4 0 0 0 1 4
17 943 247.7 1 0 0 0 1
18 1175 249.1 0 1 0 0 2
19 1269 251.8 0 0 1 0 3
20 973 262.0 0 0 0 1 4
21 1102 263.3 1 0 0 0 1
22 1344 280.0 0 1 0 0 2
23 1641 288.5 0 0 1 0 3
24 1225 300.5 0 0 0 1 4
25 1429 312.6 1 0 0 0 1
26 1699 322.5 0 1 0 0 2
27 1749 324.3 0 0 1 0 3
28 1117 333.1 0 0 0 1 4
29 1242 344.8 1 0 0 0 1
30 1684 350.3 0 1 0 0 2
31 1764 369.1 0 0 1 0 3
32 1328 356.4 0 0 0 1 4
Fridge Sales | Durable Goods Sales | Q1 | Q2 | Q3 | Q4 | Quarter |
---|---|---|---|---|---|---|
1317 | 252.6 | 1 | 0 | 0 | 0 | 1 |
1615 | 272.4 | 0 | 1 | 0 | 0 | 2 |
1662 | 270.9 | 0 | 0 | 1 | 0 | 3 |
1295 | 273.9 | 0 | 0 | 0 | 1 | 4 |
Two alternative approaches: enter the dummy variables explicitly, or use R factors
Differences between the two approaches:
Entering dummy variables explicitly is a good way to first understand dummy variable regression
Using factors is the most efficient way of organising categorical data in R
Question: Why only m - 1 dummy variables?
Answer: To avoid the dummy variable trap
To illustrate the dummy variable trap consider the following
D_i = \begin{cases} 1 & \text{ if data belongs to Quarter i} \\ 0 & \text{ otherwise} \\ \end{cases}
Y = \beta_0 \cdot (1) + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \beta_4 D_4 + \varepsilon
D_1 + D_2 + D_3 + D_4 = 1
Dummy variable trap: Variables are collinear (linearly dependent)
Indeed the design matrix is
Z = \left( \begin{array}{ccccc} 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ \dots & \dots & \dots & \dots & \dots \\ \end{array} \right)
We want to avoid Multicollinearity (or dummy variable trap)
How? Drop one dummy variable (e.g. the first) and consider the model
Y = \beta_1 \cdot (1) + \beta_2 D_2 + \beta_3 D_3 + \beta_4 D_4 + \varepsilon
If data belongs to Q1 then D_2 = D_3 = D_4 = 0
Therefore, in general, we have
D_2 + D_3 + D_4 \not\equiv 1
Y = \beta_1 \cdot (1) + \beta_2 D_2 + \beta_3 D_3 + \beta_4 D_4 + \varepsilon\qquad ?
D_1 + D_2 + D_3 + D_4 = 1
\begin{align*} Y & = \beta_1 \cdot ( D_1 + D_2 + D_3 + D_4 ) + \beta_2 D_2 + \beta_3 D_3 + \beta_4 D_4 + \varepsilon\\[10pts] & = \beta_1 D_1 + (\beta_1 + \beta_2) D_2 + (\beta_1 + \beta_3) D_3 + (\beta_1 + \beta_4 ) D_4 + \varepsilon \end{align*}
Conclusion: When fitting regression model with dummy variables
The coefficient for the first dummy variable D_1 is the intercept term \beta_1
The coefficient for a successive dummy variable D_i with i > 1 is computed as \beta_1 + \beta_i
Intercept coefficient acts as base reference point
D_i = \begin{cases} 1 & \text{ if X has level i} \\ 0 & \text{ otherwise} \\ \end{cases}
D_1 + D_2 + \ldots + D_m = 1
Y = \beta_1 \cdot (1) + \beta_2 D_2 + \beta_3 D_3 + \ldots + \beta_m D_m + \varepsilon
D_2 = D_3 = \ldots = D_m = 0
D_2 + D_3 + \ldots + D_m \not \equiv 1
Y = \beta_1 \cdot (1) + \beta_2 D_2 + \beta_3 D_3 + \ldots + \beta_m D_m + \varepsilon\quad ?
D_1 + D_2 + \ldots + D_m = 1
Y = \beta_1 D_1 + (\beta_1 + \beta_2) D_2 + \ldots + (\beta_1 + \beta_m) D_m + \varepsilon
Conclusion: When fitting regression model with dummy variables
The coefficient for the first dummy variable D_1 is the intercept term \beta_1
The coefficient for a successive dummy variable D_i with i > 1 is computed as \beta_1 + \beta_i
Intercept coefficient acts as base reference point
Before proceeding we need to introduce factors in R
Factors: A way to represent discrete variables taking a finite number of values
Example: Suppose we have a vector of people’s names
# Turn sex.num into a factor
sex.num.factor <- factor(sex.num)
# Print the factor obtained
print(sex.num.factor)
[1] 1 1 1 0 1 0
Levels: 0 1
The factor \, \texttt{sex.num.factor} looks like the original vector \, \texttt{sex.num}
The difference is that the factor \, \texttt{sex.num.factor} contains levels
# Turn sex.char into a factor
sex.char.factor <- factor(sex.char)
# Print the factor obtained
print(sex.char.factor)
[1] female female female male female male
Levels: female male
Again, the factor \, \texttt{sex.char.factor} looks like the original vector \, \texttt{sex.char}
Again, the difference is that the factor \, \texttt{sex.char.factor} contains levels
[1] 1 1 0 1
Levels: 0 1
[1] female female female female
Levels: female male
The levels of \, \texttt{subset.factor} are still \, \texttt{"female"} and \, \texttt{"male"}
This is despite \, \texttt{subset.factor} only containing \, \texttt{"female"}
[1] "0" "1"
[1] "0" "1"
The levels of \, \texttt{sex.num.factor} are the strings \, \texttt{"0"} and \, \texttt{"1"}
This is despite the original vector \, \texttt{sex.num} being numeric
The command \, \texttt{factor} converted numeric levels into strings
The function \, \texttt{levels} can also be used to relabel factors
For example we can relabel the levels \, \texttt{"female"} and \, \texttt{"male"} of \, \texttt{sex.char.factor} as \, \texttt{"f"} and \, \texttt{"m"}
# Relabel levels of sex.char.factor
levels(sex.char.factor) <- c("f", "m")
# Print relabelled factor
print(sex.char.factor)
[1] f f f m f m
Levels: f m
Logical subsetting is done exactly like in the case of vectors
Important: Need to remember that levels are always strings
Example: To identify all the men in \, \texttt{sex.num.factor} we do
[1] FALSE FALSE FALSE TRUE FALSE TRUE
[1] "Boris" "Tim"
Code for this example is available here anova.R
Data is in the file fridge_sales.txt
The first 4 rows of the data-set are given below
fridge.sales durable.goods.sales Q1 Q2 Q3 Q4 Quarter
1 1317 252.6 1 0 0 0 1
2 1615 272.4 0 1 0 0 2
3 1662 270.9 0 0 1 0 3
4 1295 273.9 0 0 0 1 4
[1] 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4
Levels: 1 2 3 4
As already mentioned, there are 2 ways of fitting regression and ANOVA models in R
We can proceed in 2 equivalent ways
# Need to convert quarter to a factor
quarter.f <- factor(quarter)
factor.lm <- lm(fridge ~ quarter.f)
summary(factor.lm)
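The first of the two outputs below comes from the explicit dummy-variable version of the model; a sketch of that fit, assuming q2, q3, q4 hold the 0/1 columns Q2, Q3, Q4 of the data:

# Equivalent fit using the dummy variables directly
dummy.lm <- lm(fridge ~ q2 + q3 + q4)
summary(dummy.lm)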
Call:
lm(formula = fridge ~ q2 + q3 + q4)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1222.12 59.99 20.372 < 2e-16 ***
q2 245.37 84.84 2.892 0.007320 **
q3 347.63 84.84 4.097 0.000323 ***
q4 -62.12 84.84 -0.732 0.470091
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 169.7 on 28 degrees of freedom
Multiple R-squared: 0.5318, Adjusted R-squared: 0.4816
F-statistic: 10.6 on 3 and 28 DF, p-value: 7.908e-05
Call:
lm(formula = fridge ~ quarter.f)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1222.12 59.99 20.372 < 2e-16 ***
quarter.f2 245.37 84.84 2.892 0.007320 **
quarter.f3 347.63 84.84 4.097 0.000323 ***
quarter.f4 -62.12 84.84 -0.732 0.470091
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 169.7 on 28 degrees of freedom
Multiple R-squared: 0.5318, Adjusted R-squared: 0.4816
F-statistic: 10.6 on 3 and 28 DF, p-value: 7.908e-05
The two outputs are essentially the same (the only difference is the variable names)
\, \texttt{quarter.f} is a factor with four levels \, \texttt{1}, \texttt{2}, \texttt{3}, \texttt{4}
Variables \, \texttt{quarter.f2}, \texttt{quarter.f3}, \texttt{quarter.f4} refer to the levels \, \texttt{2}, \texttt{3}, \texttt{4} in \, \texttt{quarter.f}
Call:
lm(formula = fridge ~ quarter.f)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1222.12 59.99 20.372 < 2e-16 ***
quarter.f2 245.37 84.84 2.892 0.007320 **
quarter.f3 347.63 84.84 4.097 0.000323 ***
quarter.f4 -62.12 84.84 -0.732 0.470091
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 169.7 on 28 degrees of freedom
Multiple R-squared: 0.5318, Adjusted R-squared: 0.4816
F-statistic: 10.6 on 3 and 28 DF, p-value: 7.908e-05
\, \texttt{lm} is treating \, \texttt{quarter.f2}, \texttt{quarter.f3}, \texttt{quarter.f4} as if they were dummy variables
Note that \, \texttt{lm} automatically drops \, \texttt{quarter.f1} to prevent dummy variable trap
Thus \, \texttt{lm} behaves the same way as if we passed dummy variables \, \texttt{q2}, \texttt{q3}, \texttt{q4}
Call:
lm(formula = fridge ~ q2 + q3 + q4)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1222.12 59.99 20.372 < 2e-16 ***
q2 245.37 84.84 2.892 0.007320 **
q3 347.63 84.84 4.097 0.000323 ***
q4 -62.12 84.84 -0.732 0.470091
Recall that \, \texttt{Intercept} refers to the coefficient for \, \texttt{q1}
Coefficients for \, \texttt{q2}, \texttt{q3}, \texttt{q4} are obtained by adding \, \texttt{Intercept} to the coefficient in the appropriate row
Call:
lm(formula = fridge ~ q2 + q3 + q4)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1222.12 59.99 20.372 < 2e-16 ***
q2 245.37 84.84 2.892 0.007320 **
q3 347.63 84.84 4.097 0.000323 ***
q4 -62.12 84.84 -0.732 0.470091
Dummy variable | Coefficient formula | Estimated coefficient |
---|---|---|
\texttt{q1} | \beta_1 | 1222.12 |
\texttt{q2} | \beta_1 + \beta_2 | 1222.12 + 245.37 = 1467.49 |
\texttt{q3} | \beta_1 + \beta_3 | 1222.12 + 347.63 = 1569.75 |
\texttt{q4} | \beta_1 + \beta_4 | 1222.12 - 62.12 = 1160 |
\begin{align*} {\rm I\kern-.3em E}[\text{ Fridge sales } ] = & \,\, 1222.12 \times \, \text{Q1} + 1467.49 \times \, \text{Q2} + \\[15pts] & \,\, 1569.75 \times \, \text{Q3} + 1160 \times \, \text{Q4} \end{align*}
\text{Q1} + \text{Q2} + \text{Q3} + \text{Q4} = 1
Therefore the expected sales for each quarter are
\begin{align*} {\rm I\kern-.3em E}[\text{ Fridge sales } | \text{ Q1 } = 1] & = 1222.12 \\[15pts] {\rm I\kern-.3em E}[\text{ Fridge sales } | \text{ Q2 } = 1] & = 1467.49 \\[15pts] {\rm I\kern-.3em E}[\text{ Fridge sales } | \text{ Q3 } = 1] & = 1569.75 \\[15pts] {\rm I\kern-.3em E}[\text{ Fridge sales } | \text{ Q4 } = 1] & = 1160 \\[15pts] \end{align*}
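These conditional means can be cross-checked directly from the data (a sketch; fridge and quarter.f as defined earlier):

# Sample mean of fridge sales within each quarter; matches the coefficients table above
tapply(fridge, quarter.f, mean)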
First of all: What is ANOVA?
ANOVA stands for Analysis of Variance
ANOVA is used to compare means of independent populations:
ANOVA is a generalization of two-sample t-test to multiple populations
\begin{align*} H_0 & \colon \mu_{1} = \mu_{2} = \ldots = \mu_{M} \\ H_1 & \colon \mu_i \neq \mu_j \, \text{ for at least one pair i and j} \end{align*}
To decide on the above hypothesis we use the ANOVA F-test
We omit mathematical details. All you need to know is that
If p < 0.05 we reject H_0 and conclude that there is a difference in the population averages
We clearly see 4 populations
Averages appear different
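The ANOVA table below can be obtained with aov (a sketch; the exact call is not shown in this excerpt):

# One-way ANOVA of fridge sales by quarter
summary(aov(fridge ~ quarter.f))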
Df Sum Sq Mean Sq F value Pr(>F)
quarter.f 3 915636 305212 10.6 7.91e-05 ***
Residuals 28 806142 28791
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The F-statistic for the ANOVA F-test is \,\, F = 10.6
The p-value for ANOVA F-test is \,\, p = 7.91 \times 10^{-5}
Therefore p < 0.05 and we reject H_0
Alternative way to get ANOVA F-statistic: Look at output of dummy variable model
\texttt{lm(fridge} \, \sim \, \texttt{q2 + q3 + q4)}
Residual standard error: 169.7 on 28 degrees of freedom
Multiple R-squared: 0.5318, Adjusted R-squared: 0.4816
F-statistic: 10.6 on 3 and 28 DF, p-value: 7.908e-05
These always coincide with F-statistic and p-value for ANOVA F-test! \quad Why?
Y_i = \alpha + \beta \, 1_B (i) + \varepsilon_i
In more detail:
A and B are two normal populations N(\mu_A, \sigma^2) and N(\mu_B, \sigma^2)
We have two samples
y = (a,b) = (a_1, \ldots, a_{n_A}, b_1, \ldots, b_{n_B} )
Y_i = \alpha + \beta \, 1_B (i) + \varepsilon_i
1_B(i) = \begin{cases} 1 & \text{ if i-th sample belongs to population B} \\ 0 & \text{ otherwise} \end{cases}
{\rm I\kern-.3em E}[Y | 1_B = x] = \alpha + \beta x
Y | \text{sample belongs to population A} \ \sim \ N(\mu_A, \sigma^2)
{\rm I\kern-.3em E}[Y | 1_B = 0] = {\rm I\kern-.3em E}[Y | \text{ sample belongs to population A}] = \mu_A
{\rm I\kern-.3em E}[Y | 1_B = x] = \alpha + \beta x
{\rm I\kern-.3em E}[Y | 1_B = 0] = \alpha + \beta \cdot 0 = \alpha
Recall that \, {\rm I\kern-.3em E}[Y | 1_B = 0] = \mu_A
Hence we get
\alpha = \mu_A
{\rm I\kern-.3em E}[Y | 1_B = 1] = {\rm I\kern-.3em E}[Y | \text{ sample belongs to population B}] = \mu_B
{\rm I\kern-.3em E}[Y | 1_B = 1] = \alpha + \beta \cdot 1 = \alpha + \beta
\alpha + \beta = \mu_B
\alpha + \beta = \mu_B \quad \implies \quad \beta = \mu_B - \mu_A
\begin{align*} Y_i & = \alpha + \beta \, 1_{B} (i) + \varepsilon_i \\[10pts] & = \mu_A + (\mu_B - \mu_A) \, 1_{B} (i) + \varepsilon_i \end{align*}
\begin{align*} H_0 \colon \beta = 0 \\ H_1 \colon \beta \neq 0 \end{align*}
\begin{align*} H_0 \colon \mu_A = \mu_B \\ H_1 \colon \mu_A \neq \mu_B \end{align*}
F-test for overall significance is equivalent to two-sample t-test
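A small simulated illustration of this equivalence (all numbers below are made up for the sketch):

# Two samples from normal populations with common variance
set.seed(1)
a <- rnorm(10, mean = 5)                        # population A
b <- rnorm(12, mean = 6)                        # population B
y <- c(a, b)
d <- c(rep(0, length(a)), rep(1, length(b)))    # dummy variable 1_B
# t-value of d in the regression equals the two-sample t statistic
summary(lm(y ~ d))
t.test(b, a, var.equal = TRUE)                  # classical two-sample t-test (equal variances)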
We now consider the general ANOVA case
\begin{align*} H_0 & \colon \mu_1 = \mu_2 = \ldots = \mu_M \\ H_1 & \colon \mu_i \neq \mu_j \text{ for at least one pair i and j} \end{align*}
We want to introduce dummy variable regression model which models ANOVA
To each population A_j associate a dummy variable
1_{A_j}(i) = \begin{cases} 1 & \text{ if the i-th sample belongs to population } A_j \\ 0 & \text{ otherwise} \\ \end{cases}
Suppose we have a sample a_j of size n_{A_j} from each population A_j, for j = 1, \ldots, M
Concatenate these samples into a long vector (length n_{A_1} + \ldots + n_{A_M})
y = (a_1, \ldots, a_M)
Y_i = \beta_1 + \beta_2 \, 1_{A_2} (i) + \beta_3 \, 1_{A_3} (i) + \ldots + \beta_M \, 1_{A_M} (i) + \varepsilon_i
{\rm I\kern-.3em E}[Y | 1_{A_2} = x_2 , \, \ldots, \, 1_{A_M} = x_M ] = \beta_1 + \beta_2 \, x_2 + \ldots + \beta_M \, x_M
Y | \text{sample belongs to population } A_i \ \sim \ N(\mu_i , \sigma^2)
1_{A_2} = 1_{A_3} = \ldots = 1_{A_M} = 0
{\rm I\kern-.3em E}[ Y | 1_{A_2} = 0 , \, \ldots, \, 1_{A_M} = 0] = {\rm I\kern-.3em E}[Y | \text{sample belongs to population } A_1] = \mu_1
{\rm I\kern-.3em E}[ Y | 1_{A_2} = 0 , \, \ldots, \, 1_{A_M} = 0] = \beta_1 + \beta_2 \cdot 0 + \ldots + \beta_M \cdot 0 = \beta_1
1_{A_2} = 1 \quad \text{and} \quad 1_{A_1} = 1_{A_3} = \ldots = 1_{A_M} = 0
{\rm I\kern-.3em E}[ Y | 1_{A_2} = 1 , \, \ldots, \, 1_{A_M} = 0] = {\rm I\kern-.3em E}[Y | \text{sample belongs to population } A_2] = \mu_2
{\rm I\kern-.3em E}[ Y | 1_{A_2} = 1 , \, \ldots, \, 1_{A_M} = 0] = \beta_1 + \beta_2 \cdot 1 + \ldots + \beta_M \cdot 0 = \beta_1 + \beta_2
\mu_1 = \beta_1 \quad \text{ and } \quad \mu_2 = \beta_1 + \beta_2
\beta_1 = \mu_1 \quad \text{ and } \quad \beta_2 = \mu_2 - \mu_1
\beta_i = \mu_i - \mu_1 \qquad \forall \, i > 1
\begin{align*} Y_i & = \beta_1 + \beta_2 \, 1_{A_2} (i) + \ldots + \beta_M \, 1_{A_M} (i) + \varepsilon_i \end{align*}
\begin{align*} H_0 & \colon \beta_2 = \beta_3 = \ldots = \beta_M = 0 \\ H_1 & \colon \text{ At least one of the above } \beta_j \neq 0 \end{align*}
\beta_1 = \mu_1 \,, \qquad \beta_i = \mu_i - \mu_1 \quad \forall \, i > 1
\begin{align*} \beta_2 = \ldots = \beta_M = 0 & \quad \iff \quad \mu_2 - \mu_1 = \ldots = \mu_M - \mu_1 = 0 \\[10pts] & \quad \iff \quad \mu_2 = \ldots = \mu_M = \mu_1 \end{align*}
\begin{align*} H_0 & \colon \mu_1 = \mu_2 = \ldots = \mu_M \\ H_1 & \colon \mu_i \neq \mu_j \text{ for at least one pair i and j} \end{align*}
F-test for overall significance is equivalent to ANOVA F-test