Statistical Models

Lecture 5:
Two-sample F-test and
goodness-of-fit test

Outline of Lecture 5

  1. The F-distribution
  2. Two-sample F-test
  3. Worked Example

Part 1:
The F-distribution

Recall

The chi-squared distribution with p degrees of freedom is \chi_p^2 = Z_1^2 + \ldots + Z_p^2 \qquad \text{where} \qquad Z_1, \ldots, Z_p \,\,\, \text{iid} \,\,\, N(0, 1)

The chi-squared distribution was used to:

  • Describe distribution of sample variance S^2: \frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2

  • Define the t-distribution with p degrees of freedom: \frac{U}{\sqrt{V/p}} \sim t_p \qquad \text{where} \qquad U \sim N(0,1) \,, \quad V \sim \chi_p^2 \quad \text{ independent}

The F-distribution

Definition
The r.v. F has the F-distribution with p and q degrees of freedom if its pdf is f_F(x) = \frac{ \Gamma \left(\frac{p+q}{2} \right) }{ \Gamma \left( \frac{p}{2} \right) \Gamma \left( \frac{q}{2} \right) } \left( \frac{p}{q} \right)^{p/2} \, \frac{ x^{ (p/2) - 1 } }{ [ 1 + (p/q) x ]^{(p+q)/2} } \,, \quad x > 0

Notation: F-distribution with p and q degrees of freedom is denoted by F_{p,q}

Remark: Used to describe variance estimators for independent samples
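
As a quick sanity check, the pdf above can be compared with R's built-in F density df(). This is a minimal sketch; the function name f_F and the choice p = 4, q = 7 are ours.

# Density of F_{p,q} implemented from the formula above
f_F <- function(x, p, q) {
  gamma((p + q) / 2) / (gamma(p / 2) * gamma(q / 2)) *
    (p / q)^(p / 2) * x^(p / 2 - 1) / (1 + (p / q) * x)^((p + q) / 2)
}

# Compare with R's built-in density df() on a grid
x <- seq(0.1, 5, by = 0.1)
max(abs(f_F(x, p = 4, q = 7) - df(x, df1 = 4, df2 = 7)))   # essentially 0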

Characterization of F-distribution

The F-distribution is obtained as the ratio of two independent chi-squared random variables, each divided by its degrees of freedom

Theorem
Suppose that U \sim \chi_p^2 and V \sim \chi_q^2 are independent. Then X := \frac{U/p}{V/q} \sim F_{p,q}
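
A quick simulation illustrates the Theorem (a sketch; the values of p, q and the number of draws N are arbitrary):

# Simulate X = (U/p)/(V/q) with U ~ chi^2_p and V ~ chi^2_q independent
set.seed(1)
p <- 5; q <- 10; N <- 100000
U <- rchisq(N, df = p)
V <- rchisq(N, df = q)
X <- (U / p) / (V / q)

# Empirical quantiles of X should match the theoretical F_{p,q} quantiles
quantile(X, probs = c(0.5, 0.95))
qf(c(0.5, 0.95), df1 = p, df2 = q)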

Idea of Proof

  • This is similar to the proof (seen in Homework 2) that \frac{U}{\sqrt{V/p}} \sim t_p where U \sim N(0,1) and V \sim \chi_p^2 are independent

  • In our case we need to prove X := \frac{U/p}{V/q} \sim F_{p,q} where U \sim \chi_p^2 and V \sim \chi_q^2 are independent

Idea of Proof

  • U \sim \chi_{p}^2 and V \sim \chi_q^2 are independent. Therefore \begin{align*} f_{U,V} (u,v) & = f_U(u) f_V(v) \\ & = \frac{ 1 }{ \Gamma \left( \frac{p}{2} \right) \Gamma \left( \frac{q}{2} \right) 2^{(p+q)/2} } u^{\frac{p}{2} - 1} v^{\frac{q}{2} - 1} e^{-(u+v)/2} \end{align*}

  • Consider the change of variables x(u,v) := \frac{u/p}{v/q} \,, \quad y(u,v) := u + v

Idea of Proof

  • This way we have X = \frac{U/p}{V/q} \,, \qquad Y = U + V

  • To conclude the proof, we need to compute the pdf of X, that is f_X

  • This can be computed as the X-marginal of f_{X,Y}: f_{X}(x) = \int_{0}^\infty f_{X,Y}(x,y) \, dy

Idea of Proof

  • The joint pdf f_{X,Y} can be computed by inverting the change of variables x(u,v) := \frac{u/p}{v/q} \,, \quad y(u,v) := u + v and using the formula f_{X,Y}(x,y) = f_{U,V}(u(x,y),v(x,y)) \, |\det J| where J is the Jacobian of the inverse transformation (x,y) \mapsto (u(x,y),v(x,y))
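
  • Explicitly (a short computation omitted on the slide), inverting the change of variables gives u(x,y) = \frac{p x y}{q + p x} \,, \qquad v(x,y) = \frac{q y}{q + p x} \,, \qquad |\det J| = \frac{p q \, y}{(q + p x)^2}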

Idea of Proof

  • Since f_{U,V} is known, f_{X,Y} is also known

  • Moreover the integral f_{X}(x) = \int_{0}^\infty f_{X,Y}(x,y) \, dy can be computed explicitly, yielding the claim f_{X}(x) = \frac{ \Gamma \left(\frac{p+q}{2} \right) }{ \Gamma \left( \frac{p}{2} \right) \Gamma \left( \frac{q}{2} \right) } \left( \frac{p}{q} \right)^{p/2} \, \frac{ x^{ (p/2) - 1 } }{ [ 1 + (p/q) x ]^{(p+q)/2} }

Properties of F-distribution

Theorem
  1. Suppose X \sim F_{p,q} with q>2. Then {\rm I\kern-.3em E}[X] = \frac{q}{q-2}

  2. If X \sim F_{p,q} then 1/X \sim F_{q,p}

  3. If X \sim t_q then X^2 \sim F_{1,q}

Properties of F-distribution

Proof of Theorem

  1. Requires a bit of work. It will be left as a homework assignment

  2. By the characterization Theorem of the F-distribution, we have X \sim F_{p,q} \quad \implies \quad X = \frac{U/p}{V/q} with U \sim \chi_p^2 and V \sim \chi_q^2 independent. Therefore \frac{1}{X} = \frac{V/q}{U/p} \sim \frac{\chi^2_q/q}{\chi^2_p/p} \sim F_{q,p}

Properties of F-distribution

Proof of Theorem

  3. Suppose X \sim t_q. The Theorem in Slide 118 of Lecture 2 guarantees that X = \frac{U}{\sqrt{V/q}} where U \sim N(0,1) and V \sim \chi_q^2 are independent. Therefore X^2 = \frac{U^2}{V/q}

Properties of F-distribution

Proof of Theorem

  • Since U \sim N(0,1), by definition U^2 \sim \chi_1^2
  • Moreover U^2 and V are independent, since U and V are independent
  • Finally, the characterization Theorem of the F-distribution implies X^2 = \frac{U^2}{V/q} \sim \frac{\chi_1^2/1}{\chi_q^2/q} \sim F_{1,q}
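
All three properties can be checked numerically in R (a sketch; the values p = 5, q = 10 are arbitrary):

# 1. E[X] = q/(q-2) for X ~ F_{p,q} with q > 2
set.seed(1)
p <- 5; q <- 10
mean(rf(100000, df1 = p, df2 = q))   # close to q/(q-2) = 1.25

# 2. 1/X ~ F_{q,p}: quantiles satisfy F_{p,q}(alpha) = 1/F_{q,p}(1 - alpha)
qf(0.05, df1 = p, df2 = q)
1 / qf(0.95, df1 = q, df2 = p)

# 3. X ~ t_q implies X^2 ~ F_{1,q}: compare the squared t quantile
qt(0.975, df = q)^2
qf(0.95, df1 = 1, df2 = q)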

Part 2:
Two-sample F-test

Variance estimators

Suppose we are given independent random samples from 2 normal populations:

  • X_1, \ldots, X_n iid random sample from N(\mu_X, \sigma_X^2)
  • Y_1, \ldots, Y_m iid random sample from N(\mu_Y, \sigma_Y^2)

Problem:

  • We want to compare the variances of the 2 populations
  • We do this by studying the variance ratio \frac{\sigma_X^2}{\sigma_Y^2}

Variance estimators

Question:

  • Suppose the variances \sigma_X^2 and \sigma_Y^2 are unknown
  • How can we estimate the ratio \sigma_X^2 /\sigma_Y^2 \, ?

Answer:

  • Estimate the ratio \sigma_X^2 /\sigma_Y^2 \, using sample variances S^2_X / S^2_Y

  • The F-distribution allows us to compare the quantities \sigma_X^2 /\sigma_Y^2 \qquad \text{and} \qquad S^2_X / S^2_Y

Variance ratio distribution

Theorem
Suppose we are given independent random samples from 2 normal populations:

  • X_1, \ldots, X_n iid random sample from N(\mu_X, \sigma_X^2)
  • Y_1, \ldots, Y_m iid random sample from N(\mu_Y, \sigma_Y^2)

Then the random variable F = \frac{ S_X^2 / \sigma_X^2 }{ S_Y^2 / \sigma_Y^2 } satisfies F \sim F_{n-1,m-1} \,, that is, F is F-distributed with n-1 and m-1 degrees of freedom
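
Before turning to the proof, a simulation sketch of the Theorem (sample sizes and standard deviations chosen arbitrarily):

# Simulate F = (S_X^2/sigma_X^2) / (S_Y^2/sigma_Y^2) from two normal samples
set.seed(1)
n <- 10; m <- 13
sd_X <- 2; sd_Y <- 3
F_sim <- replicate(20000, {
  x <- rnorm(n, mean = 0, sd = sd_X)
  y <- rnorm(m, mean = 0, sd = sd_Y)
  (var(x) / sd_X^2) / (var(y) / sd_Y^2)
})

# Empirical quantiles should match the theoretical F_{n-1,m-1} quantiles
quantile(F_sim, probs = c(0.5, 0.95))
qf(c(0.5, 0.95), df1 = n - 1, df2 = m - 1)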

Variance ratio distribution

Proof

  • We need to prove F = \frac{ S_X^2 / \sigma_X^2 }{ S_Y^2 / \sigma_Y^2 } \sim F_{n-1,m-1}

  • By the Theorem in Slide 48 Lecture 3 we have that \frac{S_X^2}{ \sigma_X^2} \sim \frac{\chi_{n-1}^2}{n-1} \,, \qquad \frac{S_Y^2}{ \sigma_Y^2} \sim \frac{\chi_{m-1}^2}{m-1}

Variance ratio distribution

Proof

  • Therefore F = \frac{ S_X^2 / \sigma_X^2 }{ S_Y^2 / \sigma_Y^2 } = \frac{U/p}{V/q} where we have U \sim \chi_{p}^2 \,, \qquad V \sim \chi_q^2 \,, \qquad p = n-1 \,, \qquad q = m - 1

  • By the characterization Theorem of the F-distribution, we conclude that F = \frac{U/p}{V/q} \sim F_{n-1,m-1}

Unbiased estimation of variance ratio

Question: Why is S_X^2/S_Y^2 a good estimator of \sigma_X^2/\sigma_Y^2?

Answer:

  • Because S_X^2/S_Y^2 is an asymptotically unbiased estimator of \sigma_X^2/\sigma_Y^2
  • This is shown in the following Theorem

Unbiased estimation of variance ratio

Theorem
Suppose we are given independent random samples from 2 normal populations:

  • X_1, \ldots, X_n iid random sample from N(\mu_X, \sigma_X^2)
  • Y_1, \ldots, Y_m iid random sample from N(\mu_Y, \sigma_Y^2)

It holds that {\rm I\kern-.3em E}\left[ \frac{S_X^2}{S_Y^2} \right] = \frac{m-1}{m-3} \frac{\sigma_X^2}{\sigma_Y^2} \,, \qquad \lim_{m \to \infty} {\rm I\kern-.3em E}\left[ \frac{S_X^2}{S_Y^2} \right] = \frac{\sigma_X^2}{\sigma_Y^2}

Proof: Will be left as an exercise
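
The identity can nevertheless be checked by Monte Carlo (a sketch with arbitrary parameters):

# Monte Carlo check of E[S_X^2/S_Y^2] = (m-1)/(m-3) * sigma_X^2/sigma_Y^2
set.seed(1)
n <- 10; m <- 13
sigma2_X <- 4; sigma2_Y <- 9
ratio <- replicate(20000, {
  x <- rnorm(n, sd = sqrt(sigma2_X))
  y <- rnorm(m, sd = sqrt(sigma2_Y))
  var(x) / var(y)
})
mean(ratio)                                # Monte Carlo estimate
(m - 1) / (m - 3) * sigma2_X / sigma2_Y    # exact value: (12/10) * (4/9)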

Two-sample one-sided F-test

Assumptions: Suppose we are given independent samples from 2 normal populations

  • X_1, \ldots, X_n iid with distribution N(\mu_X, \sigma_X^2)
  • Y_1, \ldots, Y_m iid with distribution N(\mu_Y, \sigma_Y^2)

Goal: Compare variances \sigma_X^2 and \sigma_Y^2. We consider the test H_0 \colon \sigma_X^2 = \sigma_Y^2 \qquad \qquad H_1 \colon \sigma_X^2 > \sigma_Y^2

Two-sample one-sided F-test

Statistic: For the variance test we will use the F-statistic F = \frac{ S_X^2 / \sigma_X^2 }{ S_Y^2 / \sigma_Y^2 } \sim F_{n-1,m-1}

Note: Under the null hypothesis \sigma_X^2 = \sigma_Y^2, the F-statistic simplifies to F = \frac{ S_X^2 }{ S_Y^2 } \sim F_{n-1,m-1}

Computing F-statistic with tables

  • F-distribution tables are often one-sided

  • This is because, in general, F-tests associated with regression are one-sided tests

  • This means that in our hand calculation examples we construct the F-statistic as F = \frac{s^2_X}{s^2_Y} \quad\quad \text{where} \quad\quad s^2_X > s^2_Y

  • The above guarantees F>1 and we can use one-sided F-tables

Computing F-statistic with tables

  • The values F_{\nu_1,\nu_2}(0.05) listed are such that P(F_{\nu_1,\nu_2} > F_{\nu_1,\nu_2}(0.05)) = 0.05

  • For example F_{4,3}(0.05) = 9.12

Computing F-statistic with tables

  • Sometimes the value F_{\nu_1,\nu_2}(0.05) is missing from the F-table

  • In that case, approximate F_{\nu_1,\nu_2}(0.05) by the average of the closest entries available

  • Example: F_{21,5}(0.05) is missing. We can approximate it by F_{21,5}(0.05) \approx \frac{F_{20,5}(0.05) + F_{24,5}(0.05)}{2} = \frac{ 4.56 + 4.53 }{ 2 } = 4.545
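
When R is available the approximation is unnecessary: the quantile function qf returns the exact critical value (a one-line check of the interpolation above):

# Exact critical value F_{21,5}(0.05), to compare with the approximation 4.545
qf(0.95, df1 = 21, df2 = 5)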

The two-sample F-test

Procedure

Suppose given two independent samples

  • sample x_1, \ldots, x_n from N(\mu_X,\sigma_X^2) of size n
  • sample y_1, \ldots, y_m from N(\mu_Y,\sigma_Y^2) of size m

The one-sided hypothesis test for difference in variances is H_0 \colon \sigma^2_X = \sigma^2_Y \quad \qquad H_1 \colon \sigma^2_X > \sigma^2_Y

The two-sample F-test consists of 3 steps

The two-sample F-test

Procedure

  1. Calculation: Compute the two-sample F-statistic F = \frac{ s_X^2}{ s_Y^2} where the sample variances are s_X^2 = \frac{\sum_{i=1}^n x_i^2 - n \overline{x}^2}{n-1} \qquad \quad s_Y^2 = \frac{\sum_{i=1}^m y_i^2 - m \overline{y}^2}{m-1}
  • s_X^2 refers to the sample with the larger variance
  • This way s_X^2 > s_Y^2, so that F > 1

The two-sample F-test

Procedure

  2. Statistical Tables or R: Find either
    • Critical value in Tables 13.2, 13.3 F_{n-1,m-1} (0.05)
    • p-value in R p := P( F_{n - 1, m-1} > F )

The two-sample F-test

Procedure

  3. Interpretation:
    • Reject H_0 if F > F_{n - 1, m-1} (0.05) \qquad \text{ or } \qquad p < 0.05
    • Do not reject H_0 if F \leq F_{n -1, m-1} (0.05) \qquad \text{ or } \qquad p \geq 0.05

The two-sample F-test in R

Procedure using var.test

  1. Store the samples x_1,\ldots,x_n and y_1,\ldots,y_m in two R vectors
    • x_sample <- c(x1, ..., xn)
    • y_sample <- c(y1, ..., ym)
  2. Perform a two-sample one-sided F-test on x_sample and y_sample
    • var.test(x_sample, y_sample, alternative = "greater")
  3. Read output
    • Output is similar to that of the two-sample t-test
    • The main quantity of interest is the p-value

Note: alternative = "greater" specifies the alternative hypothesis \sigma_X^2 > \sigma_Y^2

Part 3:
Worked Example

Two-sample F-test

Example

  • Samples: Back to the example of the wages of 10 Mathematicians and 13 Accountants

  • Assumptions: Wages are independent and normally distributed

  • Goal: Compare variance of wages for the 2 professions

    • Is there evidence of differences in how spread out the wages are?
Mathematicians x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_{10}
Wages 36 40 46 54 57 58 59 60 62 63
Accountants y_1 y_2 y_3 y_4 y_5 y_6 y_7 y_8 y_9 y_{10} y_{11} y_{12} y_{13}
Wages 37 37 42 44 46 48 54 56 59 60 60 64 64

Calculations: First sample

Repeat calculations of Slide 43

  • Sample size: \ n = No. of Mathematicians = 10

  • Mean: \bar{x} = \frac{\sum_{i=1}^n x_i}{n} = \frac{36+40+46+ \ldots +62+63}{10}=\frac{535}{10}=53.5

  • Variance: \begin{align*} \sum_{i=1}^n x_i^2 & = 36^2+40^2+46^2+ \ldots +62^2+63^2 = 29435 \\ s^2_X & = \frac{\sum_{i=1}^n x_i^2 - n \bar{x}^2}{n -1 } = \frac{29435-10(53.5)^2}{9} = 90.2778 \end{align*}

Calculations: Second sample

Repeat calculations of Slide 44

  • Sample size: \ m = No. of Accountants = 13

  • Mean: \bar{y} = \frac{37+37+42+ \dots +64+64}{13} = \frac{671}{13} = 51.6154

  • Variance: \begin{align*} \sum_{i=1}^m y_i^2 & = 37^2+37^2+42^2+ \ldots +64^2+64^2 = 35783 \\ s^2_Y & = \frac{\sum_{i=1}^m y_i^2 - m \bar{y}^2}{m - 1} = \frac{35783-13(51.6154)^2}{12} = 95.7547 \end{align*}

Calculations: F-statistic

  1. Calculation:

    • Notice that s^2_Y = 95.7547 > 90.2778 = s_X^2

    • Hence the F-statistic is F = \frac{s^2_Y}{s_X^2} = \frac{95.7547}{90.2778} = 1.061 \quad (3 \text{ d.p.})

    • Note: We have swapped the roles of s^2_X and s^2_Y, since s^2_Y > s^2_X

Completing the F-test

  2. Referencing Tables:
    • Degrees of freedom are n - 1 = 10 - 1 = 9 \,, \qquad m - 1 = 13 - 1 = 12

    • Note: Since we have swapped the roles of s^2_X and s^2_Y, we have F = \frac{s^2_Y}{s_X^2} \sim F_{m-1,n-1} = F_{12,9}

    • Find the corresponding critical value in Tables 13.2, 13.3 F_{12, 9}(0.05) = 3.07

Completing the F-test

  3. Interpretation:
    • We have that F = 1.061 < 3.07 = F_{12, 9}(0.05)
    • Therefore the p-value satisfies p > 0.05
    • There is no evidence (p > 0.05) in favor of H_1
    • Hence we do not reject H_0 \colon \sigma_X^2 = \sigma_Y^2

Conclusion: Wage levels for the two groups appear to be equally well spread out

The F-test in R

We present two F-test solutions in R

  1. Simple solution using the command var.test
  2. A first-principles construction closer to our earlier hand calculation

Simple solution: Code

# Enter Wages data in 2 vectors using function c()

mathematicians <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)
accountants <- c(37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64)


# Perform one-sided F-test using var.test
# Store result of var.test in ans

ans <- var.test(accountants, mathematicians, alternative = "greater")


# Print answer
print(ans)
  • Note: accountants is first because it has the larger variance
  • Code can be downloaded here F_test.R

Simple solution: Output


    F test to compare two variances

data:  accountants and mathematicians
F = 1.0607, num df = 12, denom df = 9, p-value = 0.4753
alternative hypothesis: true ratio of variances is greater than 1
95 percent confidence interval:
 0.3451691       Inf
sample estimates:
ratio of variances 
          1.060686 

Comments:

  1. First line: R tells us that an F-test is performed
  2. Second line: Data for the F-test are accountants and mathematicians
  3. Third line: The F-statistic computed is F = 1.0607, with 12 degrees of freedom in the numerator, 9 degrees of freedom in the denominator, and p-value p = 0.4753
    • Note: The F-statistic coincides with the one computed by hand (up to rounding)
  4. Fourth line: The alternative hypothesis is that the ratio of variances is \, > 1
    • Since accountants is the first argument, this translates to H_1 \colon \sigma_Y^2 > \sigma_X^2
    • Warning: This is not saying to reject H_0 – R is just stating H_1
  5. Fifth line: R computes a 95 \% confidence interval for the ratio \sigma_Y^2/\sigma_X^2
    • Based on the data, the set of plausible values is (\sigma_Y^2/\sigma_X^2) \in [0.3451691, \infty)
  6. Last line: R reports the ratio of sample variances
    • We have that s_Y^2/s_X^2 = 1.060686
    • By definition this coincides with the F-statistic (up to rounding)

Conclusion: The p-value is p = 0.4753

  • Since p > 0.05 we do not reject H_0
  • Hence \sigma^2_X and \sigma^2_Y appear to be similar
  • Wage levels for the two groups appear to be equally well spread out

First principles solution: Code

  • Start by entering data into R
# Enter Wages data in 2 vectors using function c()

mathematicians <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)
accountants <- c(37, 37, 42, 44, 46, 48, 54, 56, 59, 60, 60, 64, 64)

First principles solution: Code

  • Check which sample has the higher variance
  • In our case accountants has the higher variance
# Check which variance is higher

cat("\n Variance of accountants is", var(accountants))
cat("\n Variance of mathematicians is", var(mathematicians))

First principles solution: Code

  • Compute sample sizes
# Calculate sample sizes

n <- length(mathematicians)
m <- length(accountants)

First principles solution: Code

  • Compute the F-statistic F = \frac{s_Y^2}{s_X^2}

  • Recall: The numerator must have the larger variance

  • In our case accountants is the numerator

# Compute F-statistic
# Numerator is the sample with the larger variance

F <- var(accountants) / var(mathematicians)

First principles solution: Code

  • Compute the p-value p = P(F_{m-1, n-1} > F) = 1 - P(F_{m-1, n-1} \leq F)
# Compute p-value
p_value <- 1 - pf(F, df1 = m - 1, df2 = n - 1)

# Print p-value
cat("\n The p-value for one-sided F-test is", p_value)


  • Note: The command pf(f, df1 = n, df2 = m) computes probability P(F_{n,m} \leq f)
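
  • The p-value can equivalently (and, for very small p-values, more accurately) be computed from the upper tail directly; this is a minor variant of the code above

# Same p-value via the upper tail, avoiding the subtraction 1 - pf(...)
p_value <- pf(F, df1 = m - 1, df2 = n - 1, lower.tail = FALSE)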

First principles solution: Output


 Variance of accountants is 95.75641

 Variance of mathematicians is 90.27778

 The p-value for one-sided F-test is 0.4752684


  • Since p > 0.05 we do not reject H_0
