Lecture 7
Data: in the form of numerical counts
Test: whether observed counts differ from the predictions of a theoretical model
Example: Blood counts
Suppose we have counts of blood types for a group of people
A | B | AB | O |
---|---|---|---|
2162 | 738 | 228 | 2876 |
We want to compare the above to the probability model
A | B | AB | O |
---|---|---|---|
1/3 | 1/8 | 1/24 | 1/2 |
Example: Relative performance of Manchester United managers
Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | 27 | 9 | 15 |
Van Gaal | 54 | 25 | 24 |
Mourinho | 84 | 32 | 28 |
Solskjaer | 91 | 37 | 40 |
Rangnick | 11 | 10 | 8 |
ten Hag | 61 | 12 | 28 |
Questions like these motivate hypothesis testing for counts, which can be done using the chi-squared test
Chi-squared test is simple to use for real-world project datasets (e.g. dissertations)
Potentially applicable to a whole range of different models
Easy to compute by hand/software
Motivates the more advanced study of contingency tables
Categorical Data:
Observed counts:
Type 1 | Type 2 | \ldots | Type n |
---|---|---|---|
O_1 | O_2 | \ldots | O_n |
Expected counts:
Type 1 | Type 2 | \ldots | Type n |
---|---|---|---|
E_1 | E_2 | \ldots | E_n |
Goal: Compare expected counts E_i with observed counts O_i
Method: Compute the chi-squared statistic \chi^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2 }{ E_i } and reject the model when \chi^2 is too large
Note: We have that \chi^2 = 0 \qquad \iff \qquad O_i = E_i \,\,\,\, \text{ for all } \,\,\,\, i = 1 , \, \ldots , \, n
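As a quick illustration (not from the slides; the function name chi_squared_stat is ours), the statistic can be computed in R as follows
# Chi-squared statistic: squared deviations of observed from expected counts, scaled by the expected counts
chi_squared_stat <- function(obs, expected) {
  sum((obs - expected)^2 / expected)
}
# The statistic is 0 exactly when observed and expected counts agree
chi_squared_stat(c(10, 20, 30), c(10, 20, 30))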
Remarks:
Multinomial distribution: models an experiment consisting of m independent trials, where each trial results in exactly one of n outcome types, with type i occurring with probability p_i
Multinomial distribution: Schematic visualization
Outcome type | 1 | \ldots | n | Total |
---|---|---|---|---|
Counts | X_1 | \ldots | X_n | X_1 + \ldots + X_n = m |
Probabilities | p_1 | \ldots | p_n | p_1 + \ldots + p_n = 1 |
For n = 2 the multinomial reduces to a binomial:
Outcome types | 1 | 2 |
---|---|---|
Counts | X_1 | X_2 = m - X_1 |
Probabilities | p_1 | p_2 = 1 - p_1 |
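In R, multinomial counts can be simulated with rmultinom; a small illustration with made-up values of m and p:
# One multinomial draw: m = 100 trials, n = 3 outcome types
rmultinom(1, size = 100, prob = c(0.2, 0.3, 0.5))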
Suppose that \mathbf{X}= (X_1, \ldots, X_n) \sim \mathop{\mathrm{Mult}}(m,p_1,\ldots,p_n). Then each marginal is binomial: X_i \sim \mathop{\mathrm{Bin}}(m, p_i)
O_i refers to observed count of category i
E_i refers to expected count of category i
We suppose that Type i is observed with probability p_i and 0 \leq p_i \leq 1 \,, \qquad \quad p_1 + \ldots + p_n = 1
Total number of observations is m
The counts are modelled by (O_1, \ldots, O_n) \sim \mathop{\mathrm{Mult}}(m, p_1, \ldots, p_n)
The expected counts are modelled by E_i := {\rm I\kern-.3em E}[ O_i ] = m p_i
Consider counts and expected counts (O_1, \ldots, O_n) \sim \mathop{\mathrm{Mult}}(m, p_1, \ldots, p_n) \qquad \qquad E_i := m p_i
Question: What is the distribution of \chi^2 \,?
Since O_i \sim \mathop{\mathrm{Bin}}(m, p_i), the Central Limit Theorem implies \frac{O_i - m p_i }{ \sqrt{m p_i(1 - p_i) } } \ \stackrel{{\rm d}}{\longrightarrow} \ N(0,1) as m \to \infty
In particular, since (1-p_i) is constant, we have \frac{O_i - m p_i }{ \sqrt{m p_i } } \ \approx \ \frac{O_i - m p_i }{ \sqrt{m p_i(1 - p_i) } } \ \approx \ N(0,1)
Squaring the above we get \frac{(O_i - m p_i)^2 }{ m p_i } \ \approx \ N(0,1)^2 = \chi_1^2
If the above random variables were independent, we would obtain \chi^2 = \sum_{i=1}^n \frac{(O_i - m p_i)^2 }{ m p_i } \ \approx \ \sum_{i=1}^n \chi_1^2 = \chi_n^2
However the O_i are not independent, because of the linear constraint O_1 + \ldots + O_n = m
The idea is that a priori there should be n degrees of freedom
However the linear constraint reduces degrees of freedom by 1
This is because one can choose the first n-1 counts and the last one is given by O_n = m - O_1 - \ldots - O_{n-1}
A rather technical proof is needed to actually prove that \chi^2 = \sum_{i=1}^n \frac{(O_i - m p_i)^2 }{ m p_i } \ \approx \ \chi_{n-1}^2
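This approximation can also be checked empirically; below is a small simulation sketch (not from the slides; the values of m, p and the number of simulations are illustrative)
set.seed(1)
m <- 1000                  # sample size
p <- c(0.2, 0.3, 0.5)      # cell probabilities
n <- length(p)
# Simulate the chi-squared statistic many times
sims <- replicate(5000, {
  counts <- rmultinom(1, m, p)
  sum((counts - m * p)^2 / (m * p))
})
# The simulated 95% quantile should be close to the chi-squared quantile with n - 1 degrees of freedom
quantile(sims, 0.95)
qchisq(0.95, df = n - 1)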
Define expected counts E_i := m p_i
Consider approximation from Theorem: \chi^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2 }{ E_i } \ \approx \ \chi_{n-1}^2
The approximation is:
Good | E_i \geq 5 \, \text{ for all } \, i = 1 , \ldots, n |
---|---|
Bad | E_i < 5 \, \text{ for some } \, i = 1 , \ldots, n |
Good: Use \chi_{n-1}^2 approximation of \chi^2
Bad: Use Monte Carlo simulations (more on this later)
Setting:
Population consists of items of n different types
p_i is probability that an item selected at random is of type i
p_i is unknown and needs to be estimated
As a guess for p_1,\ldots,p_n we take p_1^0, \ldots, p_n^0 such that 0 \leq p_i^0 \leq 1 \qquad \qquad \sum_{i=1}^n p_i^0 = 1
Hypothesis Test: We test for equality of p_i to p_i^0 \begin{align*} & H_0 \colon p_i = p_i^0 \qquad \text{ for all } \, i = 1, \ldots, n \\ & H_1 \colon p_i \neq p_i^0 \qquad \text{ for at least one } \, i \end{align*}
Sample: We draw m items from the population and record o_i, the number of items of type i observed
Data: Vector of counts (o_1,\ldots,o_n)
Schematically: We can represent probabilities and counts in a table
Type | 1 | \ldots | n | Total |
---|---|---|---|---|
Counts | o_1 | \ldots | o_n | m |
Probabilities | p_1 | \ldots | p_n | 1 |
counts <- c(o1, ..., on)
null_probs <- c(p1, ..., pn)
The goodness-of-fit test of counts against null_probs is then performed with chisq.test(counts, p = null_probs)
Suppose we have counts of blood types for a group of people
A | B | AB | O |
---|---|---|---|
2162 | 738 | 228 | 2876 |
We want to compare the above to the probability model
A | B | AB | O |
---|---|---|---|
1/3 | 1/8 | 1/24 | 1/2 |
Hence the null hypothesis probabilities are p_1^0 = \frac{1}{3} \qquad \quad p_2^0 = \frac{1}{8} \qquad \quad p_3^0 = \frac{1}{24} \qquad \quad p_4^0 = \frac{1}{2}
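The corresponding R commands are not shown on the slides; a minimal sketch would be
# Observed blood type counts (A, B, AB, O) and null hypothesis probabilities
counts <- c(2162, 738, 228, 2876)
null_probs <- c(1/3, 1/8, 1/24, 1/2)
# Goodness-of-fit test
chisq.test(counts, p = null_probs)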
Example: Testing whether a die is fair
A die is rolled m = 100 times; under H_0 each face has probability p_i^0 = 1/6. The observed counts are
1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|
13 | 17 | 9 | 17 | 18 | 26 |
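The test can be run as follows (a sketch consistent with the output below)
# Observed counts of each face and fair-die null probabilities
counts <- c(13, 17, 9, 17, 18, 26)
null_probs <- rep(1/6, 6)
chisq.test(counts, p = null_probs)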
Chi-squared test for given probabilities
data: counts
X-squared = 9.68, df = 5, p-value = 0.08483
Chi-squared statistic is \, \chi^2 = 9.68
Degrees of freedom are \, {\rm df} = 5
p-value is p \approx 0.08
Therefore p > 0.05
We cannot reject H_0
The die appears to be fair
Example: Election poll
A sample of 100 voters gives the following counts
Republican | Democrat | Undecided |
---|---|---|
35 | 40 | 25 |
Under the null hypothesis, Republican and Democrat are tied (p_1^0 = p_2^0) and the proportion of Undecided voters is p_3^0 = 0.3
Recall that p_1^0 + p_2^0 + p_3^0 = 1
Hence we deduce p_1^0 = 0.35 \qquad \quad p_2^0 = 0.35 \qquad \quad p_3^0 = 0.3
We test the following hypothesis \begin{align*} & H_0 \colon p_i = p_i^0 \qquad \text{ for all } \, i = 1, \ldots, n \\ & H_1 \colon p_i \neq p_i^0 \qquad \text{ for at least one } \, i \end{align*}
We want to perform the goodness-of-fit test without using chisq.test
First we enter the data
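The data-entry and test-statistic steps are not reproduced here; a sketch consistent with the p-value below (defining the variables chi_squared and degrees used in the next snippet) is
# Observed counts and null hypothesis probabilities
counts <- c(35, 40, 25)
null_probs <- c(0.35, 0.35, 0.3)
# Expected counts under the null hypothesis
exp_counts <- sum(counts) * null_probs
# Chi-squared statistic and degrees of freedom
chi_squared <- sum((counts - exp_counts)^2 / exp_counts)
degrees <- length(counts) - 1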
# Compute p-value
p_value <- 1 - pchisq(chi_squared, df = degrees)
# Print p-value
cat("The p-value is:", p_value)
The p-value is: 0.4612526
Therefore the p-value satisfies p > 0.05
We do not reject H_0
There is no evidence that Republican and Democrat are not tied
Recall the approximation \chi^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2 }{ E_i } \ \approx \ \chi_{n-1}^2
We said that this approximation is good when E_i \geq 5 for all i, and bad when E_i < 5 for some i; in the latter case the p-value is computed by Monte Carlo simulation
Example: Monte Carlo approximation of \pi
Draw x_1, \ldots, x_N and y_1, \ldots, y_N from {\rm Uniform(-1,1)}
Count the number of points (x_i,y_i) falling inside circle of radius 1
These are points satisfying condition x_i^2 + y_i^2 \leq 1
The area of the circle is estimated by the proportion of points falling inside the circle: \text{Area} \ = \ \pi \ \approx \ \frac{\text{Number of points } (x_i,y_i) \text{ inside circle}}{N} \ \times \ 4
Note: 4 is the area of square of side 2
N <- 10000
total <- 0 # Counts total number of points
in_circle <- 0 # Counts points falling in circle
for (j in 1:10) {
for (i in 1:N) {
x <- runif(1, -1, 1); y <- runif(1, -1, 1); # sample point (x,y)
if (x^2 + y^2 <= 1) {
in_circle <- in_circle + 1 # If (x,y) in circle increase counter
}
total <- total + 1 # Increase total counter
}
pi_approx <- ( in_circle / total ) * 4 # Compute approximate area
cat(sprintf("After %8d iterations pi is %.08f, error is %.08f\n",
(j * N), pi_approx, abs(pi_approx - pi)))
}
After 10000 iterations pi is 3.14000000, error is 0.00159265
After 20000 iterations pi is 3.13640000, error is 0.00519265
After 30000 iterations pi is 3.13933333, error is 0.00225932
After 40000 iterations pi is 3.13740000, error is 0.00419265
After 50000 iterations pi is 3.13824000, error is 0.00335265
After 60000 iterations pi is 3.13953333, error is 0.00205932
After 70000 iterations pi is 3.14097143, error is 0.00062123
After 80000 iterations pi is 3.14105000, error is 0.00054265
After 90000 iterations pi is 3.13977778, error is 0.00181488
After 100000 iterations pi is 3.13976000, error is 0.00183265
Goal: use Monte Carlo simulations to compute p-value of goodness-of-fit test
Type | 1 | \ldots | n | Total |
---|---|---|---|---|
Observed counts | o_1 | \ldots | o_n | m |
Null hypothesis probabilities | p_1^0 | \ldots | p_n^0 | 1 |
The expected counts are E_i = m p_i^0
Under the null hypothesis, the observed counts (o_1, \ldots, o_n) come from (O_1 , \ldots, O_n) \sim \mathop{\mathrm{Mult}}(m, p_1^0, \ldots, p_n^0)
The chi-squared statistic random variable is \chi^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2 }{ E_i }
The observed chi-squared statistic is \chi^2_{\rm obs} = \sum_{i=1}^n \frac{(o_i - E_i)^2 }{ E_i }
The p-value is defined as p = P( \chi^2 > \chi^2_{\rm obs} )
Exercise: Think of Monte Carlo simulation to approximate p
Procedure: Repeat the following steps a large number of times:
Simulate counts (o_1^{\rm sim},\ldots,o_n^{\rm sim}) from \mathop{\mathrm{Mult}}(m, p_1^0, \ldots, p_n^0)
Compute the corresponding simulated chi-squared statistic \chi^2_{\rm sim} = \sum_{i=1}^n \frac{ (o_i^{\rm sim} - E_i)^2 }{E_i}
The simulated chi-squared statistic is extreme if \, \chi^2_{\rm sim} > \chi^2_{\rm obs}
We estimate theoretical p-value in the following way p = P( \chi^2 > \chi^2_{\rm obs} ) \ \approx \ \frac{ \# \text{ of extreme simulated statistics} }{ \text{Total number of simulations} }
Data: Number of defects in printed circuit boards
# Defects | 0 | 1 | 2 | 3 |
---|---|---|---|---|
Counts | 32 | 15 | 9 | 4 |
Null hypothesis probabilities are
p_1^0 | p_2^0 | p_3^0 | p_4^0 |
---|---|---|---|
0.5 | 0.3 | 0.15 | 0.05 |
The expected counts E_i = m p_i^0 (with m = 60) are
E_1 | E_2 | E_3 | E_4 |
---|---|---|---|
30 | 18 | 9 | 3 |
Since E_4 = 3 < 5, the chi-squared approximation is unreliable; we therefore compute the p-value by Monte Carlo simulation, first with custom code and then with chisq.test
# Enter counts and null hypothesis probabilities
counts <- c(32, 15, 9, 4)
null_probs <- c(0.5, 0.3, 0.15, 0.05)
# Compute total counts
m <- sum(counts)
# Compute expected counts
exp_counts <- m * null_probs
# Compute the observed chi-squared statistic
obs_chi_squared <- sum( (counts - exp_counts)^2 / exp_counts )
# Number of Monte Carlo simulations (value not stated on the slides;
# N <- 100000 is consistent with the printed p-value below)
N <- 100000
# Counter for extreme simulated statistics
count_extreme_statistics <- 0
# Perform Monte Carlo simulations
for (i in 1:N) {
# Simulate multinomial counts under null hypothesis
simul_counts <- rmultinom(1, m, null_probs)
# Compute chi-squared statistic for the simulated counts
simul_chi_squared <- sum( (simul_counts - exp_counts )^2 / exp_counts)
# Check if simulated chi-squared statistic is extreme
if (simul_chi_squared >= obs_chi_squared) {
count_extreme_statistics <- count_extreme_statistics + 1
}
}
# Estimate the p-value as the proportion of extreme simulated statistics
monte_carlo_p_value <- count_extreme_statistics / N
For comparison, we also compute the Monte Carlo p-value with the built-in chisq.test
# Perform chi-squared test using built-in R function
chi_squared_test <- chisq.test(counts, p = null_probs,
simulate.p.value = TRUE)
# Extract p-value from chi-squared test result
chisq_p_value <- chi_squared_test$p.value
# Print p-values for comparison
cat("\nCustom Monte Carlo p-value:", monte_carlo_p_value)
cat("\nR Monte Carlo p-value:", chisq_p_value)
The previous code can be downloaded here monte_carlo_p_value.R
Running the code yields the following output
Custom Monte Carlo p-value: 0.82036
R Monte Carlo p-value: 0.822089
Note: chisq.test runs a more refined Monte Carlo algorithm to compute the p-value
Two-way Contingency Tables: tables in which each observation is classified in 2 ways
Example: Relative performance of Man Utd managers
Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | 27 | 9 | 15 |
Van Gaal | 54 | 25 | 24 |
Mourinho | 84 | 32 | 28 |
Solskjaer | 91 | 37 | 40 |
Rangnick | 11 | 10 | 8 |
ten Hag | 61 | 12 | 28 |
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | O_{11} | O_{12} | \ldots | O_{1C} | O_{1+} |
X = 2 | O_{21} | O_{22} | \ldots | O_{2C} | O_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | O_{R1} | O_{R2} | \ldots | O_{RC} | O_{R+} |
Totals | O_{+1} | O_{+2} | \ldots | O_{+C} | m |
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | O_{11} | O_{12} | \ldots | O_{1C} | O_{1+} |
X = 2 | O_{21} | O_{22} | \ldots | O_{2C} | O_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | O_{R1} | O_{R2} | \ldots | O_{RC} | O_{R+} |
Totals | O_{+1} | O_{+2} | \ldots | O_{+C} | m |
O_{R+} = \sum_{j=1}^C O_{Rj} \qquad \quad O_{+2} = \sum_{i=1}^R O_{i2}
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | O_{11} | O_{12} | \ldots | O_{1C} | O_{1+} |
X = 2 | O_{21} | O_{22} | \ldots | O_{2C} | O_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | O_{R1} | O_{R2} | \ldots | O_{RC} | O_{R+} |
Totals | O_{+1} | O_{+2} | \ldots | O_{+C} | m |
m = O_{++} = \sum_{i=1}^R \sum_{j=1}^C O_{ij}
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | O_{11} | O_{12} | \ldots | O_{1C} | O_{1+} |
X = 2 | O_{21} | O_{22} | \ldots | O_{2C} | O_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | O_{R1} | O_{R2} | \ldots | O_{RC} | O_{R+} |
Totals | O_{+1} | O_{+2} | \ldots | O_{+C} | m |
\sum_{i=1}^R O_{i+} = \sum_{i=1}^R \sum_{j=1}^C O_{ij} = m
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | O_{11} | O_{12} | \ldots | O_{1C} | O_{1+} |
X = 2 | O_{21} | O_{22} | \ldots | O_{2C} | O_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | O_{R1} | O_{R2} | \ldots | O_{RC} | O_{R+} |
Totals | O_{+1} | O_{+2} | \ldots | O_{+C} | m |
\sum_{j=1}^C O_{+j} = \sum_{j=1}^C \sum_{i=1}^R O_{ij} = m
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | p_{11} | p_{12} | \ldots | p_{1C} | p_{1+} |
X = 2 | p_{21} | p_{22} | \ldots | p_{2C} | p_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | p_{R1} | p_{R2} | \ldots | p_{RC} | p_{R+} |
Totals | p_{+1} | p_{+2} | \ldots | p_{+C} | 1 |
p_{ij} := P(X = i , Y = j)
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | p_{11} | p_{12} | \ldots | p_{1C} | p_{1+} |
X = 2 | p_{21} | p_{22} | \ldots | p_{2C} | p_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | p_{R1} | p_{R2} | \ldots | p_{RC} | p_{R+} |
Totals | p_{+1} | p_{+2} | \ldots | p_{+C} | 1 |
P(X = i) = \sum_{j=1}^C P(X = i , Y = j) = \sum_{j=1}^C p_{ij} = p_{i+}
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | p_{11} | p_{12} | \ldots | p_{1C} | p_{1+} |
X = 2 | p_{21} | p_{22} | \ldots | p_{2C} | p_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | p_{R1} | p_{R2} | \ldots | p_{RC} | p_{R+} |
Totals | p_{+1} | p_{+2} | \ldots | p_{+C} | 1 |
P(Y = j) = \sum_{i=1}^R P(X = i , Y = j) = \sum_{i=1}^R p_{ij} = p_{+j}
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | p_{11} | p_{12} | \ldots | p_{1C} | p_{1+} |
X = 2 | p_{21} | p_{22} | \ldots | p_{2C} | p_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | p_{R1} | p_{R2} | \ldots | p_{RC} | p_{R+} |
Totals | p_{+1} | p_{+2} | \ldots | p_{+C} | 1 |
\sum_{i=1}^R p_{i+} = \sum_{i=1}^R P(X = i) = \sum_{i=1}^R \sum_{j=1}^C P(X = i , Y = j) = 1
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | p_{11} | p_{12} | \ldots | p_{1C} | p_{1+} |
X = 2 | p_{21} | p_{22} | \ldots | p_{2C} | p_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | p_{R1} | p_{R2} | \ldots | p_{RC} | p_{R+} |
Totals | p_{+1} | p_{+2} | \ldots | p_{+C} | 1 |
\sum_{j=1}^C p_{+j} = \sum_{j=1}^C P(Y = j) = \sum_{j=1}^C \sum_{i=1}^R P(X = i , Y = j) = 1
Counts O_{ij} and probabilities p_{ij} can be assembled into R \times C matrices O = \left( \begin{array}{ccc} O_{11} & \ldots & O_{1C} \\ \vdots & \ddots & \vdots \\ O_{R1} & \ldots & O_{RC} \\ \end{array} \right) \qquad \qquad p = \left( \begin{array}{ccc} p_{11} & \ldots & p_{1C} \\ \vdots & \ddots & \vdots \\ p_{R1} & \ldots & p_{RC} \\ \end{array} \right)
This way the random matrix O has multinomial distribution O \sim \mathop{\mathrm{Mult}}(m,p)
The counts O_{ij} are therefore binomial O_{ij} \sim \mathop{\mathrm{Bin}}(m,p_{ij})
We can also consider the marginal random vectors (O_{1+}, \ldots, O_{R+}) \qquad \qquad (O_{+1}, \ldots, O_{+C})
These also have multinomial distributions (O_{1+}, \ldots, O_{R+}) \sim \mathop{\mathrm{Mult}}(m, p_{1+}, \ldots, p_{R+}) (O_{+1}, \ldots, O_{+C}) \sim \mathop{\mathrm{Mult}}(m, p_{+1}, \ldots, p_{+C})
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | O_{11} | O_{12} | \ldots | O_{1C} | O_{1+} |
X = 2 | O_{21} | O_{22} | \ldots | O_{2C} | O_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | O_{R1} | O_{R2} | \ldots | O_{RC} | O_{R+} |
Totals | O_{+1} | O_{+2} | \ldots | O_{+C} | m |
The expected counts E_{ij} := m \, p_{ij} can be arranged in a similar table
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | E_{11} | E_{12} | \ldots | E_{1C} | E_{1+} |
X = 2 | E_{21} | E_{22} | \ldots | E_{2C} | E_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | E_{R1} | E_{R2} | \ldots | E_{RC} | E_{R+} |
Totals | E_{+1} | E_{+2} | \ldots | E_{+C} | m |
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | E_{11} | E_{12} | \ldots | E_{1C} | E_{1+} |
X = 2 | E_{21} | E_{22} | \ldots | E_{2C} | E_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | E_{R1} | E_{R2} | \ldots | E_{RC} | E_{R+} |
Totals | E_{+1} | E_{+2} | \ldots | E_{+C} | m |
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | O_{11} | O_{12} | \ldots | O_{1C} | O_{1+} |
X = 2 | O_{21} | O_{22} | \ldots | O_{2C} | O_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | O_{R1} | O_{R2} | \ldots | O_{RC} | O_{R+} |
Totals | O_{+1} | O_{+2} | \ldots | O_{+C} | m |
Suppose the counts O \sim \mathop{\mathrm{Mult}}(m,p). Then \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (O_{ij} - m p_{ij})^2 }{ m p_{ij} } \ \stackrel{ {\rm d} }{ \longrightarrow } \ \chi_{RC - 1 - {\rm fitted} }^2 when the sample size m \to \infty. The convergence is in distribution, and {\rm fitted} denotes the number of parameters estimated (fitted) from the data
Quality of approximation: The chi-squared approximation is good if E_{ij} = m p_{ij} \geq 5 \quad \text{ for all } \,\, i = 1 , \ldots , R \, \text{ and } j = 1, \ldots, C
Setting:
Population consists of items of n = R \times C different types
p_{ij} is probability that an item selected at random is of type (i,j)
p_{ij} is unknown and needs to be estimated
As a guess for p_{ij} we take p_{ij}^0 such that 0 \leq p_{ij}^0 \leq 1 \qquad \qquad \sum_{i=1}^R \sum_{j=1}^C p_{ij}^0 = 1
Hypothesis Test: We test for equality of p_{ij} to p_{ij}^0 \begin{align*} & H_0 \colon p_{ij} = p_{ij}^0 \qquad \text{ for all } \, i = 1, \ldots, R \,, \text{ and } \, j = 1 , \ldots, C \\ & H_1 \colon p_{ij} \neq p_{ij}^0 \qquad \text{ for at least one pair } \, (i,j) \end{align*}
Sample: We draw m items from the population and record o_{ij}, the number of items of type (i,j) observed
Data: Matrix of counts o = (o_{ij})_{ij}
counts <- c(o_11, o_12, ..., o_RC)
null_probs <- c(p_11, p_12, ..., p_RC)
The goodness-of-fit test of counts against null_probs is then performed with chisq.test(counts, p = null_probs)
Note: Compute Monte Carlo p-value with option simulate.p.value = TRUE
Background story:
Question: Are the 6 managers to blame for worse Team Performance?
Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | 27 | 9 | 15 |
Van Gaal | 54 | 25 | 24 |
Mourinho | 84 | 32 | 28 |
Solskjaer | 91 | 37 | 40 |
Rangnick | 11 | 10 | 8 |
ten Hag | 61 | 12 | 28 |
Table contains Man Utd games since 2014 season
To each Man Utd game in the sample we associate two categories: the Manager in charge and the Result (Won, Drawn or Lost)
Question: Is the number of Wins, Draws and Losses uniformly distributed?
Hypothesis Test: To answer the question we perform a goodness-of-fit test on \begin{align*} & H_0 \colon p_{ij} = p_{ij}^0 \quad \text { for all pairs } \, (i,j) \\ & H_1 \colon p_{ij} \neq p_{ij}^0 \quad \text { for at least one pair } \, (i,j) \\ \end{align*}
Null Probabilities:
Manager | Won | Drawn | Lost | Total |
---|---|---|---|---|
Moyes | 27 | 9 | 15 | o_{1+} = 51 |
Van Gaal | 54 | 25 | 24 | o_{2+} = 103 |
Mourinho | 84 | 32 | 28 | o_{3+} = 144 |
Solskjaer | 91 | 37 | 40 | o_{4+} = 168 |
Rangnick | 11 | 10 | 8 | o_{5+} = 29 |
ten Hag | 61 | 12 | 28 | o_{6+} = 101 |
If the results are uniformly distributed, the expected counts are E_{ij} = \frac{o_{i+}}{3}
Recall that the expected counts are modelled as E_{ij} = m p_{ij}^0
We hence obtain p_{ij}^0 = \frac{o_{i+}}{3 m}
\begin{align*} \text{Moyes: } \quad & p_{1j}^0 = \frac{ o_{1+} }{ 3m } = 51 \big/ 1788 \\ \text{Van Gaal: }\quad & p_{2j}^0 = \frac{ o_{2+} }{ 3m } = 103 \big/ 1788 \\ \text{Mourinho: } \quad& p_{3j}^0 = \frac{ o_{3+} }{ 3m } = 144 \big/ 1788 \\ \text{Solskjaer: } \quad& p_{4j}^0 = \frac{ o_{4+} }{ 3m } = 168 \big/ 1788 \\ \text{Rangnick: }\quad & p_{5j}^0 = \frac{ o_{5+} }{ 3m } = 29 \big/ 1788 \\ \text{ten Hag: }\quad & p_{6j}^0 = \frac{ o_{6+} }{ 3m } = 101 \big/ 1788 \end{align*}
We are now ready to perform the goodness-of-fit test in R
We start by storing the matrix of counts into a single R vector, row-by-row
The null probabilities, which are constant within each row, can be built with rep; the test on counts and null_probs is then performed with chisq.test(counts, p = null_probs)
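The exact code is not shown here; a sketch consistent with the output below is
# Matrix of counts stored row-by-row as a single vector
counts <- c(27, 9, 15, 54, 25, 24, 84, 32, 28,
            91, 37, 40, 11, 10, 8, 61, 12, 28)
# Row totals and total number of games
row_totals <- c(51, 103, 144, 168, 29, 101)
m <- sum(row_totals)
# Null probabilities p_ij = o_i+ / (3m), constant within each row
null_probs <- rep(row_totals / (3 * m), each = 3)
# Goodness-of-fit test
chisq.test(counts, p = null_probs)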
Chi-squared test for given probabilities
data: counts
X-squared = 137.93, df = 17, p-value < 2.2e-16
Since p < 0.05, we reject H_0: the results do not appear to be uniformly distributed between Won, Drawn and Lost
Owned | Rented | Total | |
---|---|---|---|
North West | 2180 | 871 | 3051 |
London | 1820 | 1400 | 3220 |
South West | 1703 | 614 | 2317 |
Total | 5703 | 2885 | 8588 |
Consider a two-way contingency table as above
To each person we associate two categories: Region of residence (North West, London, South West) and Residential Status (Owned or Rented)
Owned | Rented | Total | |
---|---|---|---|
North West | 2180 | 871 | 3051 |
London | 1820 | 1400 | 3220 |
South West | 1703 | 614 | 2317 |
Total | 5703 | 2885 | 8588 |
Owned | Rented | Total | |
---|---|---|---|
North West | 2180 | 871 | 3051 |
London | 1820 | 1400 | 3220 |
South West | 1703 | 614 | 2317 |
Total | 5703 | 2885 | 8588 |
Y = 1 | \ldots | Y = C | Totals | |
---|---|---|---|---|
X = 1 | O_{11} | \ldots | O_{1C} | O_{1+} |
\cdots | \cdots | \cdots | \cdots | \cdots |
X = R | O_{R1} | \ldots | O_{RC} | O_{R+} |
Totals | O_{+1} | \ldots | O_{+C} | m |
Consider the general two-way contingency table as above
Question: are the row variable X and the column variable Y associated? Equivalently: are X and Y independent?
Y = 1 | \ldots | Y = C | Totals | |
---|---|---|---|---|
X = 1 | O_{11} | \ldots | O_{1C} | O_{1+} |
\cdots | \cdots | \cdots | \cdots | \cdots |
X = R | O_{R1} | \ldots | O_{RC} | O_{R+} |
Totals | O_{+1} | \ldots | O_{+C} | m |
Hence, testing for independence means testing the following hypotheses: \begin{align*} & H_0 \colon X \, \text{ and } \, Y \, \text{ are independent } \\ & H_1 \colon X \, \text{ and } \, Y \, \text{ are not independent } \end{align*}
We need to express H_0 and H_1 in terms of the cell probabilities p_{ij}
Y = 1 | \ldots | Y = C | Totals | |
---|---|---|---|---|
X = 1 | p_{11} | \ldots | p_{1C} | p_{1+} |
\cdots | \cdots | \cdots | \cdots | \cdots |
X = R | p_{R1} | \ldots | p_{RC} | p_{R+} |
Totals | p_{+1} | \ldots | p_{+C} | 1 |
The random variables X and Y are independent if and only if the cell probabilities factor into the marginals p_{ij} = P(X = i , Y = j) = P(X = i) P (Y = j) = p_{i+}p_{+j}
Therefore the hypothesis test for independence becomes \begin{align*} & H_0 \colon p_{ij} = p_{i+}p_{+j} \quad \text{ for all } \, i = 1, \ldots , R \, \text{ and } \, j = 1 ,\ldots C \\ & H_1 \colon p_{ij} \neq p_{i+}p_{+j} \quad \text{ for some } \, (i,j) \end{align*}
Assume the null hypothesis is true H_0 \colon p_{ij} = p_{i+}p_{+j} \quad \text{ for all } \, i = 1, \ldots , R \, \text{ and } \, j = 1 ,\ldots C
Under H_0 the expected counts become E_{ij} = m p_{ij} = m \, p_{i+}p_{+j}
We need a way to estimate the marginal probabilities p_{i+} and p_{+j}
Goal: Estimate marginal probabilities p_{i+} and p_{+j}
By definition we have p_{i+} = P( X = i )
Hence p_{i+} is the probability that an observation is classified in the i-th row
Estimate p_{i+} with the proportion of observations classified in the i-th row p_{i+} := \frac{o_{i+}}{m} = \frac{ \text{Number of observations in } \, i\text{-th row} }{ \text{ Total number of observations} }
Goal: Estimate marginal probabilities p_{i+} and p_{+j}
Similarly, by definition p_{+j} = P( Y = j )
Hence p_{+j} is the probability that an observation is classified in the j-th column
Estimate p_{+j} with the proportion of observations classified in the j-th column p_{+j} := \frac{o_{+j}}{m} = \frac{ \text{Number of observations in } \, j\text{-th column} }{ \text{ Total number of observations} }
Summary: The marginal probabilities are estimated with p_{i+} := \frac{o_{i+}}{m} \qquad \qquad p_{+j} := \frac{o_{+j}}{m}
Therefore the expected counts become E_{ij} = m p_{ij} = m p_{i+} p_{+j} = m \left( \frac{o_{i+}}{m} \right) \left( \frac{o_{+j}}{m} \right) = \frac{o_{i+} \, o_{+j}}{m}
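In R, the matrix of expected counts under independence can be computed directly from the matrix of observed counts; a small self-contained sketch (the values in counts are illustrative):
# Example matrix of observed counts
counts <- rbind(c(10, 20), c(30, 40))
# Expected counts E_ij = (row total x column total) / grand total
exp_counts <- outer(rowSums(counts), colSums(counts)) / sum(counts)
exp_counts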
By the Theorem in Slide 88 the chi-squared statistics satisfies \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (O_{ij} - E_{ij})^2 }{ E_{ij} } \ \approx \ \chi^2_{RC - 1 - {\rm fitted}}
We need to compute the number of fitted parameters
We have to estimate the row marginals p_{i+}
We estimate the first R-1 row marginals by p_{i+} := \frac{o_{i+}}{m} \,, \qquad i = 1 , \ldots, R - 1
Since the marginals p_{i+} sum to 1, we can obtain p_{R+} by p_{R+} = 1 - p_{1+} - \ldots - p_{(R-1)+} = 1 - \frac{o_{1+}}{m} - \ldots - \frac{o_{(R-1)+}}{m}
Similarly, we estimate the first C-1 column marginals by p_{+j} = \frac{o_{+j}}{m} \,, \qquad j = 1, \ldots, C - 1
Since the marginals p_{+j} sum to 1, we can obtain p_{+C} by p_{+C} = 1 - p_{+1} - \ldots - p_{+(C-1)} = 1 - \frac{o_{+1}}{m} - \ldots - \frac{o_{+(C-1)}}{m}
In total we only have to estimate (R - 1) + (C - 1 ) = R + C - 2 parameters, so that {\rm fitted} = R + C - 2
Therefore the degrees of freedom are \begin{align*} RC - 1 - {\rm fitted} & = RC - 1 - R - C + 2 \\ & = RC - R - C + 1 \\ & = (R - 1)(C - 1) \end{align*}
Assume the null hypothesis of row and column independence H_0 \colon p_{ij} = p_{i+}p_{+j} \quad \text{ for all } \, i = 1, \ldots , R \, \text{ and } \, j = 1 ,\ldots C
Suppose the counts are O \sim \mathop{\mathrm{Mult}}(m,p) and expected counts are E_{ij} = \frac{o_{i+} \, o_{+j}}{m}
By the previous slides and Theorem in Slide 88 we have that \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (O_{ij} - E_{ij})^2 }{ E_{ij} } \approx \ \chi_{RC - 1 - {\rm fitted} }^2 = \chi_{(R-1)(C-1)}^2
The chi-squared approximation \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (O_{ij} - E_{ij})^2 }{ E_{ij} } \approx \ \chi_{(R-1)(C-1)}^2 is good if E_{ij} \geq 5 \quad \text{ for all } \,\, i = 1 , \ldots , R \, \text{ and } j = 1, \ldots, C
Sample:
We draw m items from population
Each item can be of type (i,j), where i = 1, \ldots, R and j = 1, \ldots, C
O_{ij} denotes the number of items of type (i,j) drawn
p_{ij} is probability of observing type (i,j)
p = (p_{ij})_{ij} is probability matrix
The matrix of counts O = (o_{ij})_{ij} has multinomial distribution \mathop{\mathrm{Mult}}(m,p)
Data: Matrix of counts, represented by a two-way contingency table
Y = 1 | \ldots | Y = C | Totals | |
---|---|---|---|---|
X = 1 | o_{11} | \ldots | o_{1C} | o_{1+} |
\cdots | \cdots | \cdots | \cdots | \cdots |
X = R | o_{R1} | \ldots | o_{RC} | o_{R+} |
Totals | o_{+1} | \ldots | o_{+C} | m |
Hypothesis test: We test for independence of rows and columns \begin{align*} & H_0 \colon X \, \text{ and } \, Y \, \text{ are independent } \\ & H_1 \colon X \, \text{ and } \, Y \, \text{ are not independent } \end{align*}
row_i <- c(o_i1, o_i2, ..., o_iC)
counts <- rbind(row_1, ..., row_R)
The matrix counts models the matrix of observed counts o = (o_{ij})_{ij}
The chi-squared test of independence is then performed with chisq.test(counts)
Note: Compute Monte Carlo p-value with option simulate.p.value = TRUE
Owned | Rented | Total | |
---|---|---|---|
North West | 2180 | 871 | 3051 |
London | 1820 | 1400 | 3220 |
South West | 1703 | 614 | 2317 |
Total | 5703 | 2885 | 8588 |
People are sampled at random in NW, London and SW
To each person we associate two categories: Region of residence (North West, London, South West) and Residential Status (Owned or Rented)
Exercise: Test for independence of Residential Status and Region
Owned | Rented | Tot | |
---|---|---|---|
NW | 2180 | 871 | o_{1+} = 3051 |
Lon | 1820 | 1400 | o_{2+} = 3220 |
SW | 1703 | 614 | o_{3+} = 2317 |
Tot | o_{+1} = 5703 | o_{+2} = 2885 | m = 8588 |
Owned | Rented | Tot | |
---|---|---|---|
NW | \frac{3051{\times}5703}{8588} \approx 2026 | \frac{3051{\times}2885}{8588} \approx 1025 | o_{1+} = 3051 |
Lon | \frac{3220{\times}5703}{8588} \approx 2138 | \frac{3220{\times}2885}{8588} \approx 1082 | o_{2+} = 3220 |
SW | \frac{2317{\times}5703}{8588} \approx 1539 | \frac{2317{\times}2885}{8588} \approx 778 | o_{3+} = 2317 |
Tot | o_{+1} = 5703 | o_{+2} = 2885 | m = 8588 |
The observed proportions of Owned and Rented within each Region are
Owned | Rented | |
---|---|---|
North West | 71.5\% | 28.5\% |
London | 56.5\% | 43.5\% |
South West | 73.5\% | 26.5\% |
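The test can be performed as follows (a minimal sketch; rows and columns follow the table above)
# Contingency table: rows = Region (NW, London, SW), columns = Status (Owned, Rented)
counts <- rbind(c(2180, 871),
                c(1820, 1400),
                c(1703, 614))
# Chi-squared test of independence
chisq.test(counts)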
Pearson's Chi-squared test
data: counts
X-squared = 228.11, df = 2, p-value < 2.2e-16
Since p < 0.05, we reject H_0: Residential Status and Region do not appear to be independent
Recap of Background story:
Question: Are the 6 managers to blame for worse Team Performance?
Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | 27 | 9 | 15 |
Van Gaal | 54 | 25 | 24 |
Mourinho | 84 | 32 | 28 |
Solskjaer | 91 | 37 | 40 |
Rangnick | 11 | 10 | 8 |
ten Hag | 61 | 12 | 28 |
Table contains Man Utd games since 2014 season
To each Man Utd game in the sample we associate two categories: the Manager in charge (rows, X) and the match Result (columns, Y): Won, Drawn or Lost
Question: Is there association between Manager and Team Performance?
To answer the question we perform a chi-squared test of independence in R
The test can be performed by suitably modifying the code independence_test.R
# Store each row into an R vector
row_1 <- c(27, 9, 15)
row_2 <- c(54, 25, 24)
row_3 <- c(84, 32, 28)
row_4 <- c(91, 37, 40)
row_5 <- c(11, 10, 8)
row_6 <- c(61, 12, 28)
# Assemble the rows into an R matrix
counts <- rbind(row_1, row_2, row_3, row_4, row_5, row_6)
# Perform chi-squared test of independence
ans <- chisq.test(counts)
# Print answer
print(ans)
Pearson's Chi-squared test
data: counts
X-squared = 12.678, df = 10, p-value = 0.2422
Since p = 0.2422 > 0.05, we do not reject H_0: there is no evidence of association between Manager and Team Performance
Comments on chisq.test:
R performs a goodness-of-fit test on counts with null_probs via chisq.test(counts, p = null_probs)
R implicitly assumes the null hypothesis is H_0 \colon p_i = p_i^0 \qquad \text{ for all } \, i = 1 , \ldots, n
By default R computes p-value using \chi_{n-1}^2 approximation
R can compute the p-value with Monte Carlo simulation using the option simulate.p.value = TRUE