Lecture 6
Two-way Contingency Tables: Tables in which each observation is classified in 2 ways
Example:
Relative performance of Man Utd managers
Two classifications for each game: Result, and Manager in charge
Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | 27 | 9 | 15 |
Van Gaal | 54 | 25 | 24 |
Mourinho | 84 | 32 | 28 |
Solskjaer | 91 | 37 | 40 |
Rangnick | 11 | 10 | 8 |
ten Hag | 61 | 12 | 28 |
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | O_{11} | O_{12} | \ldots | O_{1C} | O_{1+} |
X = 2 | O_{21} | O_{22} | \ldots | O_{2C} | O_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | O_{R1} | O_{R2} | \ldots | O_{RC} | O_{R+} |
Totals | O_{+1} | O_{+2} | \ldots | O_{+C} | m |
O_{R+} = \sum_{j=1}^C O_{Rj} \qquad \quad O_{+2} = \sum_{i=1}^R O_{i2}
m = O_{++} = \sum_{i=1}^R \sum_{j=1}^C O_{ij}
\sum_{i=1}^R O_{i+} = \sum_{i=1}^R \sum_{j=1}^C O_{ij} = m
\sum_{j=1}^C O_{+j} = \sum_{j=1}^C \sum_{i=1}^R O_{ij} = m
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | p_{11} | p_{12} | \ldots | p_{1C} | p_{1+} |
X = 2 | p_{21} | p_{22} | \ldots | p_{2C} | p_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | p_{R1} | p_{R2} | \ldots | p_{RC} | p_{R+} |
Totals | p_{+1} | p_{+2} | \ldots | p_{+C} | 1 |
p_{ij} := P(X = i , Y = j)
P(X = i) = \sum_{j=1}^C P(X = i , Y = j) = \sum_{j=1}^C p_{ij} = p_{i+}
P(Y = j) = \sum_{i=1}^R P(X = i , Y = j) = \sum_{i=1}^R p_{ij} = p_{+j}
\sum_{i=1}^R p_{i+} = \sum_{i=1}^R P(X = i) = \sum_{i=1}^R \sum_{j=1}^C P(X = i , Y = j) = 1
\sum_{j=1}^C p_{+j} = \sum_{j=1}^C P(Y = j) = \sum_{j=1}^C \sum_{i=1}^R P(X = i , Y = j) = 1
Counts O_{ij} and probabilities p_{ij} can be assembled into R \times C matrices O = \left( \begin{array}{ccc} O_{11} & \ldots & O_{1C} \\ \vdots & \ddots & \vdots \\ O_{R1} & \ldots & O_{RC} \\ \end{array} \right) \qquad \qquad p = \left( \begin{array}{ccc} p_{11} & \ldots & p_{1C} \\ \vdots & \ddots & \vdots \\ p_{R1} & \ldots & p_{RC} \\ \end{array} \right)
This way the random matrix O has multinomial distribution O \sim \mathop{\mathrm{Mult}}(m,p)
The counts O_{ij} are therefore binomial O_{ij} \sim \mathop{\mathrm{Bin}}(m,p_{ij})
We can also consider the marginal random vectors (O_{1+}, \ldots, O_{R+}) \qquad \qquad (O_{+1}, \ldots, O_{+C})
These also have multinomial distributions (O_{1+}, \ldots, O_{R+}) \sim \mathop{\mathrm{Mult}}(m, p_{1+}, \ldots, p_{R+}) (O_{+1}, \ldots, O_{+C}) \sim \mathop{\mathrm{Mult}}(m, p_{+1}, \ldots, p_{+C})
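For instance, one table of counts with this distribution can be simulated in R with rmultinom. This is a minimal sketch; the 2 x 3 probability matrix below is purely illustrative, not taken from the example.
# Simulate counts O ~ Mult(m, p) for a 2 x 3 table (illustrative probabilities)
p <- matrix(c(0.10, 0.20, 0.30,
              0.10, 0.20, 0.10), nrow = 2, byrow = TRUE)
m <- 100
O <- matrix(rmultinom(1, size = m, prob = c(t(p))), nrow = 2, byrow = TRUE)
rowSums(O)   # row totals O_{i+}
colSums(O)   # column totals O_{+j}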
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | O_{11} | O_{12} | \ldots | O_{1C} | O_{1+} |
X = 2 | O_{21} | O_{22} | \ldots | O_{2C} | O_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | O_{R1} | O_{R2} | \ldots | O_{RC} | O_{R+} |
Totals | O_{+1} | O_{+2} | \ldots | O_{+C} | m |
Y = 1 | Y = 2 | \ldots | Y = C | Totals | |
---|---|---|---|---|---|
X = 1 | E_{11} | E_{12} | \ldots | E_{1C} | E_{1+} |
X = 2 | E_{21} | E_{22} | \ldots | E_{2C} | E_{2+} |
\cdots | \cdots | \cdots | \ldots | \cdots | \cdots |
X = R | E_{R1} | E_{R2} | \ldots | E_{RC} | E_{R+} |
Totals | E_{+1} | E_{+2} | \ldots | E_{+C} | m |
Suppose the counts O \sim \mathop{\mathrm{Mult}}(m,p). Then \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (O_{ij} - m p_{ij})^2 }{ m p_{ij} } \ \stackrel{ {\rm d} }{ \longrightarrow } \ \chi_{RC - 1 - {\rm fitted} }^2 as the sample size m \to \infty. The convergence is in distribution, and {\rm fitted} denotes the number of fitted parameters.
Quality of approximation: The chi-squared approximation is good if E_{ij} = m p_{ij} \geq 5 \quad \text{ for all } \,\, i = 1 , \ldots , R \, \text{ and } j = 1, \ldots, C
Setting:
Population consists of items of n = R \times C different types
p_{ij} is probability that an item selected at random is of type (i,j)
p_{ij} is unknown and needs to be estimated
As guess for p_{ij} we take p_{ij}^0 such that 0 \leq p_{ij}^0 \leq 1 \qquad \qquad \sum_{i=1}^R \sum_{j=1}^C p_{ij}^0 = 1
Hypothesis Test: We test for equality of p_{ij} to p_{ij}^0 \begin{align*} & H_0 \colon p_{ij} = p_{ij}^0 \qquad \text{ for all } \, i = 1, \ldots, R \,, \text{ and } \, j = 1 , \ldots, C \\ & H_1 \colon p_{ij} \neq p_{ij}^0 \qquad \text{ for at least one pair } \, (i,j) \end{align*}
Sample: We draw m items at random from the population
Data: Matrix of counts o = (o_{ij})_{ij}
Alternative | Rejection Region | p-value |
---|---|---|
\exists \, (i,j) \,\, s.t. \,\, p_{ij} \neq p_{ij}^0 | \chi^2 > \chi^2_{RC-1}(0.05) | P(\chi_{RC-1}^2 > \chi^2) |
counts <- c(o_11, o_12, ..., o_RC)
null.p <- c(p_11, p_12, ..., p_RC)
Test counts against null.p with the R command below:
Alternative | R command |
---|---|
\exists \, (i,j) \,\, s.t. \,\, p_{ij} \neq p_{ij}^0 | chisq.test(counts, p = null.p) |
Monte Carlo p-value: compute with the option simulate.p.value = T
Note: The command is exactly the same as for the goodness-of-fit test on simple counts
Background story:
Question: Are the 6 managers to blame for worse Team Performance?
Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | 27 | 9 | 15 |
Van Gaal | 54 | 25 | 24 |
Mourinho | 84 | 32 | 28 |
Solskjaer | 91 | 37 | 40 |
Rangnick | 11 | 10 | 8 |
ten Hag | 61 | 12 | 28 |
Table contains Man Utd games since 2014 season (updated to 2024 season)
To each Man Utd game in the sample we associate two categories: the Result (Won, Drawn or Lost) and the Manager in charge
Question: Is the number of Wins, Draws and Losses uniformly distributed?
Hypothesis Test: To answer the question we perform a goodness-of-fit test on \begin{align*} & H_0 \colon p_{ij} = p_{ij}^0 \quad \text { for all pairs } \, (i,j) \\ & H_1 \colon p_{ij} \neq p_{ij}^0 \quad \text { for at least one pair } \, (i,j) \\ \end{align*}
Null Probabilities:
Manager | Won | Drawn | Lost | Total |
---|---|---|---|---|
Moyes | 27 | 9 | 15 | o_{1+} = 51 |
Van Gaal | 54 | 25 | 24 | o_{2+} = 103 |
Mourinho | 84 | 32 | 28 | o_{3+} = 144 |
Solskjaer | 91 | 37 | 40 | o_{4+} = 168 |
Rangnick | 11 | 10 | 8 | o_{5+} = 29 |
ten Hag | 61 | 12 | 28 | o_{6+} = 101 |
The results are uniformly distributed if each manager wins, draws and loses the same number of games
Since Manager i plays o_{i+} games, the expected scores are E_{ij} = \frac{o_{i+}}{3} \,, \qquad j=1,2,3
Recall that the expected scores are modelled as E_{ij} = m p_{ij}^0
Thus, the null probabilities under the assumption of uniform distribution are
p_{ij}^0 = \frac13 \times \frac{o_{i+}}{m} \quad \left( = \frac{1}{3} \times \text{Proportion of total games played by Manager } \, i \right)
\begin{align*} \text{Moyes: } \quad & p_{1j}^0 = \frac13 \times \frac{ o_{1+} }{ m } = 51 \big/ 1788 \\ \text{Van Gaal: }\quad & p_{2j}^0 = \frac13 \times \frac{ o_{2+} }{ m } = 103 \big/ 1788 \\ \text{Mourinho: } \quad& p_{3j}^0 = \frac13 \times \frac{ o_{3+} }{ m } = 144 \big/ 1788 \\ \text{Solskjaer: } \quad& p_{4j}^0 = \frac13 \times \frac{ o_{4+} }{ m } = 168 \big/ 1788 \\ \text{Rangnick: }\quad & p_{5j}^0 = \frac13 \times \frac{ o_{5+} }{ m } = 29 \big/ 1788 \\ \text{ten Hag: }\quad & p_{6j}^0 = \frac13 \times \frac{ o_{6+} }{ m } = 101 \big/ 1788 \end{align*}
Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | 27 | 9 | 15 |
Van Gaal | 54 | 25 | 24 |
Mourinho | 84 | 32 | 28 |
Solskjaer | 91 | 37 | 40 |
Rangnick | 11 | 10 | 8 |
ten Hag | 61 | 12 | 28 |
Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | \frac{51}{1788} | \frac{51}{1788} | \frac{51}{1788} |
Van Gaal | \frac{103}{1788} | \frac{103}{1788} | \frac{103}{1788} |
Mourinho | \frac{144}{1788} | \frac{144}{1788} | \frac{144}{1788} |
Solskjaer | \frac{168}{1788} | \frac{168}{1788} | \frac{168}{1788} |
Rangnick | \frac{29}{1788} | \frac{29}{1788} | \frac{29}{1788} |
ten Hag | \frac{101}{1788} | \frac{101}{1788} | \frac{101}{1788} |
The null probabilities can be entered with the help of the R function rep:
# Store the matrix of null probabilities into a single R vector
# To avoid copy-pasting, first store the probability of each Manager
manager.null.p <- c(51/1788, 103/1788, 144/1788,
168/1788, 29/1788, 101/1788)
# Repeat each entry 3 times
null.p <- rep(manager.null.p, c(3, 3, 3, 3, 3, 3))
[1] 0.03 0.03 0.03 0.06 0.06 0.06 0.08 0.08 0.08 0.09 0.09 0.09 0.02 0.02 0.02
[16] 0.06 0.06 0.06
Finally, we run chisq.test on counts with null.p:
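A minimal sketch of the full call, with the counts vector entered row by row (an assumed ordering, consistent with null.p above), followed by the R output:
# Observed counts, row by row: (Won, Drawn, Lost) for each manager
counts <- c(27, 9, 15,  54, 25, 24,  84, 32, 28,
            91, 37, 40,  11, 10, 8,  61, 12, 28)
# Goodness-of-fit test against the null probabilities
print(chisq.test(counts, p = null.p))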
Chi-squared test for given probabilities
data: counts
X-squared = 137.93, df = 17, p-value < 2.2e-16
Owned | Rented | Total | |
---|---|---|---|
North West | 2180 | 871 | 3051 |
London | 1820 | 1400 | 3220 |
South West | 1703 | 614 | 2317 |
Total | 5703 | 2885 | 8588 |
Consider a two-way contingency table as above
To each person we associate two categories: the Region (North West, London or South West) and the Residential Status (Owned or Rented)
Y = 1 | \ldots | Y = C | Totals | |
---|---|---|---|---|
X = 1 | O_{11} | \ldots | O_{1C} | O_{1+} |
\cdots | \cdots | \cdots | \cdots | \cdots |
X = R | O_{R1} | \ldots | O_{RC} | O_{R+} |
Totals | O_{+1} | \ldots | O_{+C} | m |
Consider the general two-way contingency table as above
The following questions are equivalent: Is there an association between X and Y? Are X and Y not independent?
Hence, testing for independence means testing the following hypotheses: \begin{align*} & H_0 \colon X \, \text{ and } \, Y \, \text{ are independent } \\ & H_1 \colon X \, \text{ and } \, Y \, \text{ are not independent } \end{align*}
We need to quantify H_0 and H_1
Y = 1 | \ldots | Y = C | Totals | |
---|---|---|---|---|
X = 1 | p_{11} | \ldots | p_{1C} | p_{1+} |
\cdots | \cdots | \cdots | \cdots | \cdots |
X = R | p_{R1} | \ldots | p_{RC} | p_{R+} |
Totals | p_{+1} | \ldots | p_{+C} | 1 |
R.v. X and Y are independent iff cell probabilities factor into marginals p_{ij} = P(X = i , Y = j) = P(X = i) P (Y = j) = p_{i+}p_{+j}
Therefore the hypothesis test for independence becomes \begin{align*} & H_0 \colon p_{ij} = p_{i+}p_{+j} \quad \text{ for all } \, i = 1, \ldots , R \, \text{ and } \, j = 1 ,\ldots C \\ & H_1 \colon p_{ij} \neq p_{i+}p_{+j} \quad \text{ for some } \, (i,j) \end{align*}
Assume the null hypothesis is true H_0 \colon p_{ij} = p_{i+}p_{+j} \quad \text{ for all } \, i = 1, \ldots , R \, \text{ and } \, j = 1 ,\ldots C
Under H_0, the expected counts become E_{ij} = m p_{ij} = m p_{i+}p_{+j}
We need a way to estimate the marginal probabilities p_{i+} and p_{+j}
Goal: Estimate marginal probabilities p_{i+} and p_{+j}
By definition we have p_{i+} = P( X = i )
Hence p_{i+} is the probability of an observation being classified in the i-th row
Estimate p_{i+} with the proportion of observations classified in i-th row p_{i+} := \frac{o_{i+}}{m} = \frac{ \text{Number of observations in } \, i\text{-th row} }{ \text{ Total number of observations} }
Goal: Estimate marginal probabilities p_{i+} and p_{+j}
Similarly, by definition p_{+j} = P( Y = j )
Hence p_{+j} is the probability of an observation being classified in the j-th column
Estimate p_{+j} with the proportion of observations classified in j-th column p_{+j} := \frac{o_{+j}}{m} = \frac{ \text{Number of observations in } \, j\text{-th column} }{ \text{ Total number of observations} }
Summary: The marginal probabilities are estimated with p_{i+} := \frac{o_{i+}}{m} \qquad \qquad p_{+j} := \frac{o_{+j}}{m}
Therefore the expected counts become E_{ij} = m p_{ij} = m p_{i+} p_{+j} = m \left( \frac{o_{i+}}{m} \right) \left( \frac{o_{+j}}{m} \right) = \frac{o_{i+} \, o_{+j}}{m}
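In R, assuming the observed counts are stored in a matrix counts (an assumption; no such object is defined at this point), the whole matrix of expected counts could be computed in one line:
# Expected counts under independence: E_ij = o_{i+} * o_{+j} / m
E <- outer(rowSums(counts), colSums(counts)) / sum(counts)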
By the Theorem in Slide 20, the chi-squared statistic satisfies \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (O_{ij} - E_{ij})^2 }{ E_{ij} } \ \approx \ \chi^2_{RC - 1 - {\rm fitted}}
We need to compute the number of fitted parameters
We estimate the first R-1 row marginals by p_{i+} := \frac{o_{i+}}{m} \,, \qquad i = 1 , \ldots, R - 1
Since the marginals p_{i+} sum to 1, we can obtain p_{R+} by p_{R+} = 1 - p_{1+} - \ldots - p_{(R-1)+} = 1 - \frac{o_{1+}}{m} - \ldots - \frac{o_{(R-1)+}}{m}
Similarly, we estimate the first C-1 column marginals by p_{+j} = \frac{o_{+j}}{m} \,, \qquad j = 1, \ldots, C - 1
Since the marginals p_{+j} sum to 1, we can obtain p_{+C} by p_{+C} = 1 - p_{+1} - \ldots - p_{+(C-1)} = 1 - \frac{o_{+1}}{m} - \ldots - \frac{o_{+(C-1)}}{m}
In total, we only have to estimate (R - 1) + (C - 1 ) = R + C - 2 parameters
Therefore, the fitted parameters are {\rm fitted} = R + C - 2
Consequently, the degrees of freedom are \begin{align*} RC - 1 - {\rm fitted} & = RC - 1 - R - C + 2 \\ & = RC - R - C + 1 \\ & = (R - 1)(C - 1) \end{align*}
Assume the null hypothesis of row and column independence H_0 \colon p_{ij} = p_{i+}p_{+j} \quad \text{ for all } \, i = 1, \ldots , R \, \text{ and } \, j = 1 ,\ldots C
Suppose the counts are O \sim \mathop{\mathrm{Mult}}(m,p), and expected counts are E_{ij} = \frac{o_{i+} \, o_{+j}}{m}
By the previous slides, and Theorem in Slide 20 we have that \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (O_{ij} - E_{ij})^2 }{ E_{ij} } \approx \ \chi_{RC - 1 - {\rm fitted} }^2 = \chi_{(R-1)(C-1)}^2
The chi-squared approximation \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (O_{ij} - E_{ij})^2 }{ E_{ij} } \approx \ \chi_{(R-1)(C-1)}^2 is good if E_{ij} \geq 5 \quad \text{ for all } \,\, i = 1 , \ldots , R \, \text{ and } j = 1, \ldots, C
Sample:
We draw m individuals from population
Each individual can be of type (i,j), where i = 1, \ldots, R and j = 1, \ldots, C
O_{ij} denotes the number of items of type (i,j) drawn
p_{ij} is probability of observing type (i,j)
p = (p_{ij})_{ij} is probability matrix
The matrix of counts O = (O_{ij})_{ij} has multinomial distribution \mathop{\mathrm{Mult}}(m,p)
Data: Matrix of counts, represented by a two-way contingency table
Y = 1 | \ldots | Y = C | Totals | |
---|---|---|---|---|
X = 1 | o_{11} | \ldots | o_{1C} | o_{1+} |
\cdots | \cdots | \cdots | \cdots | \cdots |
X = R | o_{R1} | \ldots | o_{RC} | o_{R+} |
Totals | o_{+1} | \ldots | o_{+C} | m |
Hypothesis test: We test for independence of rows and columns \begin{align*} & H_0 \colon X \, \text{ and } \, Y \, \text{ are independent } \\ & H_1 \colon X \, \text{ and } \, Y \, \text{ are not independent } \end{align*}
Compute marginal and total counts o_{i+} := \sum_{j=1}^C o_{ij} \,, \qquad o_{+j} := \sum_{i=1}^R o_{ij} \,, \qquad m = \sum_{i=1}^R o_{i+} = \sum_{j=1}^C o_{+j}
Compute the expected counts E_{ij} = \frac{o_{i+} \, o_{+j} }{ m }
Compute the chi-squared statistic \chi^2 = \sum_{i=1}^R \sum_{j=1}^C \frac{ (o_{ij} - E_{ij})^2 }{E_{ij}}
Alternative | Rejection Region | p-value |
---|---|---|
X and Y not independent | \chi^2 > \chi^2_{(R-1)(C-1)}(0.05) | P(\chi_{(R-1)(C-1)}^2 > \chi^2) |
row_i <- c(o_i1, o_i2, ..., o_iC)
counts <- rbind(row_1, ..., row_R)
Then test counts with the R command below:
Alternative | R command |
---|---|
X and Y not independent | chisq.test(counts) |
Note: Compute Monte Carlo p-value with option simulate.p.value = TRUE
Owned | Rented | Total | |
---|---|---|---|
North West | 2180 | 871 | 3051 |
London | 1820 | 1400 | 3220 |
South West | 1703 | 614 | 2317 |
Total | 5703 | 2885 | 8588 |
People are sampled at random in NW, London and SW
To each person we associate two categories: the Region (North West, London or South West) and the Residential Status (Owned or Rented)
Question: Are Residential Status and Region independent?
Owned | Rented | Tot | |
---|---|---|---|
NW | 2180 | 871 | o_{1+} = 3051 |
Lon | 1820 | 1400 | o_{2+} = 3220 |
SW | 1703 | 614 | o_{3+} = 2317 |
Tot | o_{+1} = 5703 | o_{+2} = 2885 | m = 8588 |
Owned | Rented | Tot | |
---|---|---|---|
NW | \frac{3051{\times}5703}{8588} \approx 2026 | \frac{3051{\times}2885}{8588} \approx 1025 | o_{1+} = 3051 |
Lon | \frac{3220{\times}5703}{8588} \approx 2138 | \frac{3220{\times}2885}{8588} \approx 1082 | o_{2+} = 3220 |
SW | \frac{2317{\times}5703}{8588} \approx 1539 | \frac{2317{\times}2885}{8588} \approx 778 | o_{3+} = 2317 |
Tot | o_{+1} = 5703 | o_{+2} = 2885 | m = 8588 |
Owned | Rented | |
---|---|---|
North West | 71.5\% | 28.5\% |
London | 56.5\% | 43.5\% |
South West | 73.5\% | 26.5\% |
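For reference, a minimal sketch of the R code that would produce the output below, following the template given earlier (the row and column order is taken from the table above):
# Housing data: rows = Region (NW, London, SW), columns = (Owned, Rented)
row_1 <- c(2180, 871)
row_2 <- c(1820, 1400)
row_3 <- c(1703, 614)
counts <- rbind(row_1, row_2, row_3)
# Chi-squared test of independence
print(chisq.test(counts))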
Pearson's Chi-squared test
data: counts
X-squared = 228.11, df = 2, p-value < 2.2e-16
Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | 27 | 9 | 15 |
Van Gaal | 54 | 25 | 24 |
Mourinho | 84 | 32 | 28 |
Solskjaer | 91 | 37 | 40 |
Rangnick | 11 | 10 | 8 |
ten Hag | 61 | 12 | 28 |
Table contains Man Utd games since 2014 season (updated to 2024)
To each Man Utd game in the sample we associate two categories: the Result (Won, Drawn or Lost) and the Manager in charge
Question: Is there an association between Manager and Team Performance?
To answer the question, we perform chi-squared test of independence in R
Test can be performed by suitably modifying the code independence_test.R
# Store each row into an R vector
row_1 <- c(27, 9, 15)
row_2 <- c(54, 25, 24)
row_3 <- c(84, 32, 28)
row_4 <- c(91, 37, 40)
row_5 <- c(11, 10, 8)
row_6 <- c(61, 12, 28)
# Assemble the rows into an R matrix
counts <- rbind(row_1, row_2, row_3, row_4, row_5, row_6)
# Perform chi-squared test of independence
ans <- chisq.test(counts)
# Print answer
print(ans)
Pearson's Chi-squared test
data: counts
X-squared = 12.678, df = 10, p-value = 0.2422
Recall: the distribution of the chi-squared statistic \chi^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2 }{ E_i } is approximately \chi_{n-1}^2
The approximation is:
Good | E_i \geq 5 \, \text{ for all } \, i = 1 , \ldots, n |
---|---|
Bad | E_i < 5 \, \text{ for some } \, i = 1 , \ldots, n |
How to compute the p-value: when the approximation is good, use the \chi^2_{n-1} distribution; when it is bad, use Monte Carlo simulation
Question: What is a Monte Carlo simulation?
Classical example: estimating \pi. If n random points are drawn in the square [-1,1]^2 and r of them fall inside the inscribed circle of radius 1, then \frac{\pi}{4} \approx \frac{r}{n} \,\, \implies \,\, \pi \approx \frac{4r}{n}
Draw x_1, \ldots, x_N and y_1, \ldots, y_N from {\rm Uniform(-1,1)}
Count the number of points (x_i,y_i) falling inside circle of radius 1
These are points satisfying condition x_i^2 + y_i^2 \leq 1
Area of circle estimated with proportion of points falling inside circle: \text{Area} \ = \ \pi \ \approx \ \frac{\text{Number of points } (x_i,y_i) \text{ inside circle}}{N} \ \times \ 4
Note: 4 is the area of square of side 2
Plot: 1000 random points in square of side 2
N <- 10000 # We do 100000 iterations and print every 10000
total <- 0 # Counts total number of points
in_circle <- 0 # Counts points falling in circle
for (j in 1:10) { # This loop is for printing message every N iterations
for (i in 1:N) { # Loop in which points are counted
x <- runif(1, -1, 1); y <- runif(1, -1, 1); # sample point (x,y)
if (x^2 + y^2 <= 1) {
in_circle <- in_circle + 1 # If (x,y) in circle increase counter
}
total <- total + 1 # Increase total counter (in any case)
}
pi_approx <- ( in_circle / total ) * 4 # Compute approximate area
cat(sprintf("After %8d iterations pi is %.08f, error is %.08f\n",
(j * N), pi_approx, abs(pi_approx - pi)))
}
After 10000 iterations pi is 3.13600000, error is 0.00559265
After 20000 iterations pi is 3.13280000, error is 0.00879265
After 30000 iterations pi is 3.14480000, error is 0.00320735
After 40000 iterations pi is 3.14700000, error is 0.00540735
After 50000 iterations pi is 3.14280000, error is 0.00120735
After 60000 iterations pi is 3.14080000, error is 0.00079265
After 70000 iterations pi is 3.14405714, error is 0.00246449
After 80000 iterations pi is 3.14530000, error is 0.00370735
After 90000 iterations pi is 3.14217778, error is 0.00058512
After 100000 iterations pi is 3.14212000, error is 0.00052735
Note:
The error decreases overall, but there are random fluctuations
This is because we are trying to approximate \pi by a random quantity
Deriving error estimates for numerical methods is not easy (Numerical Analysis)
Going back to our example:
For every \varepsilon > 0 fixed, it can be shown that P( | X_n - \pi/4 | \geq \varepsilon) \leq {\rm Error} = \frac{1}{20 n \varepsilon^2}, where X_n denotes the Monte Carlo estimate of \pi/4 after n samples
To get small Error: \,\, \varepsilon has to be small, n has to be large
Estimate can be derived by interpreting the problem as the Monte Carlo integration of \int_0^1 \sqrt{1-x^2} \, dx = \pi /4, see stackexchange post
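A minimal sketch of this Monte Carlo integration (our own illustration, not code from the lecture):
# Monte Carlo integration of int_0^1 sqrt(1 - x^2) dx = pi / 4
n <- 100000
x <- runif(n)                      # n uniform points in [0, 1]
pi.mc <- 4 * mean(sqrt(1 - x^2))   # estimate of pi
pi.mc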
Simulation can be used to gain insight into the shape of a distribution:
Mean and variance
Related probabilities (p-values)
How to simulate in R:
A for loop (as used above to simulate \pi)
replicate(n, expr) repeats the expression n times and outputs the results
sapply(x, expr) applies the expression to the object x
Consider sample X_1, \ldots, X_n from N(\mu,\sigma^2)
Recall that the sample mean satisfies {\rm I\kern-.3em E}[\overline{X}] = \mu \,, \qquad {\rm Var}[\overline{X}] = \frac{\sigma^2}{n} (As n gets larger, {\rm Var}[\overline{X}] will be closer to 0 \implies \overline{X} \approx \mu when n is large)
For example, if the sample size is n = 10, mean \mu = 0 and sd \sigma = 1, then {\rm I\kern-.3em E}[\overline{X}] = 0 \,, \qquad {\rm Var}[\overline{X}] = \frac{1}{10} = 0.1
We can obtain the above quantities through simulation
(In fancy terms: Estimating mean and variance of \overline{X})
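The code that generates sim.xbar is not shown above; a minimal sketch of how it could be produced (n = 10, M = 1000, sampling from N(0,1)):
# Simulate the sample mean of n = 10 observations from N(0,1), M = 1000 times
M <- 1000; n <- 10
sim.xbar <- replicate(M, mean(rnorm(n)))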
The vector sim.xbar contains M = 1000 simulations of \overline{X}
It is reasonable to suppose that sim.xbar approximates the random variable \overline{X}
(This can be made rigorous with Central Limit Thm)
Therefore, we can make the following estimation
Quantity | Estimation |
---|---|
{\rm I\kern-.3em E}[\overline{X}] | Sample mean of sim.xbar |
{\rm Var}[\overline{X}] | Sample variance of sim.xbar |
# Print mean and variance of sample mean
cat("Expected value:", mean(sim.xbar), "Variance:", var(sim.xbar))
Expected value: 0.005180686 Variance: 0.1058328
These are very close to the theoretical values: \quad {\rm I\kern-.3em E}[\overline{X}] = 0 \,, \quad {\rm Var}[\overline{X}] = \frac{1}{10} = 0.1
Both sample mean and median can be used to estimate the center of a distribution
But which one is better?
Suppose the population is N(\mu,\sigma^2)
We know that the sample mean \overline{X} satisfies {\rm I\kern-.3em E}[\overline{X}] = \mu \,, \qquad {\rm Var}[\overline{X}] = \frac{\sigma^2}{n}
However, no such formulas are available for the sample median
Question: How can we compare them? Use simulation
To simulate sample mean and median from N(0,1), we do:
Generate sample x_1, \ldots, x_{10} from N(0,1), and compute sample mean and median
Repeat M = 1000 times
M <- 1000; n <- 10
sim.mean <- replicate(M, mean(rnorm(n)))
sim.median <- replicate(M, median(rnorm(n)))
To compare simulated sample mean and median, we can produce a boxplot
The one with the smallest variance will be the better estimator of the true population mean \mu = 0
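A possible way to produce such a boxplot (the group labels are our own choice):
# Side-by-side boxplots of the simulated sample means and medians
boxplot(sim.mean, sim.median, names = c("Sample mean", "Sample median"))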
Question: What is the probability of rolling 7 with two dice?
Of course this problem has a combinatorial answer: p = \frac16
But we can also answer using simulation
To simulate the roll of a die, we use the function sample:
sample(x, n, replace = T) samples n elements from the vector x with replacement
To roll 2 dice, we input sample(1:6, 2, replace = T)
To compute the probability of rolling 7 with two dice we do:
M <- 1000
sim.rolls <- replicate(M, sum( sample(1:6, 2, replace = T) ))
number.of.7s <- sum( sim.rolls == 7 )
p <- number.of.7s / M
cat("The estimated probability of rolling 7 is", p)
The estimated probability of rolling 7 is 0.166
Suppose given a normal population N(\mu,\sigma^2)
A 95 \% confidence interval for the true mean \mu is a (random) interval [a,b] s.t. P(\mu \in [a,b]) = 0.95
A confidence interval is not a probability statement about \mu – note that \mu is a constant!
It is a probability statement about the random interval [a,b]
Interpretation: The random interval [a,b] contains the mean \mu with probability 0.95
Constructing the confidence interval with t-statistic:
Suppose given a sample x_1, \ldots, x_n from N(\mu,\sigma^2)
Recall: the t-statistic has t-distribution t = \frac{\overline{x}-\mu}{\mathop{\mathrm{e.s.e.}}} \, \sim \, t_{n-1} \,, \qquad \mathop{\mathrm{e.s.e.}}= \frac{s}{\sqrt{n}}
We impose that t is observed with probability 0.95 P(- t^* \leq t \leq t^*) = 0.95 \,, \qquad t^* = t_{n-1}(0.025)
The 95\% confidence interval is obtained by solving above equation for \mu P(\mu \in [a,b] ) = 0.95 \,, \qquad a = \overline{x} - t^* \times \mathop{\mathrm{e.s.e.}}, \qquad b = \overline{x} + t^* \times \mathop{\mathrm{e.s.e.}}
Interpretation: If you sample over and over again, the interval [a,b] will contain \mu about 95\% of the times
Each row of points is a sample from the same normal distribution \qquad (Image from Wikipedia)
The colored lines are confidence intervals for the mean \mu
At the center of each interval is the sample mean (diamond)
The blue intervals contain the population mean, and the red ones do not
Simulating one confidence interval in R: use t.test and extract the component $conf.int
# Sample 100 times from N(0,1)
x <- rnorm(100)
# Compute confidence interval using t.test
interval <- t.test(x)$conf.int
interval is a vector containing the simulated confidence interval: the left extreme is in interval[1] and the right extreme in interval[2]
Simulated confidence interval: a = -0.2262375 b = 0.1689576
Testing the claim: \quad P (\mu \in [a,b]) = 0.95
We test with the following method:
Sample n = 100 times from N(0,1) and compute confidence interval
Repeat M = 10000 times
Check how many times \mu = 0 belongs to simulated confidence intervals
Estimate the probability of the CI containing \mu as P(\mu \in [a,b]) \approx \frac{\# \text{ of times } \mu = 0 \text{ belongs to simulated } [a,b]}{M}
The estimate should approach 0.95
# Simulate M confidence intervals for the mean
n <- 100; M <- 10000
intervals <- replicate(M, t.test(rnorm(n)) $ conf.int)
intervals is a matrix containing the M = 10000 simulated intervals (as columns): the left extremes are in intervals[1, ] and the right extremes in intervals[2, ]
[,1] [,2] [,3] [,4] [,5]
[1,] -0.2262375 -0.07222545 -0.2428610 -0.1788659 -0.07420973
[2,] 0.1689576 0.35116698 0.1201007 0.2189942 0.30661995
Now, we have our M = 10000 confidence intervals for \mu = 0
Check how many times \mu = 0 belongs to simulated confidence intervals
# Save left and right extremes in vectors
a <- intervals[1, ]; b <- intervals[2, ]
# Check how many times mu = 0 belongs to (a,b)
mu.belongs <- sum( (a < 0) & (0 < b) )
# Estimate the probability that mu = 0 belongs to (a,b)
p <- mu.belongs / M
cat("mu = 0 belongs to the confidence interval with probability", p)
mu = 0 belongs to the confidence interval with probability 0.9494
Suppose given a random variable X such that
The distribution of X is (possibly) unknown
A method for simulating X is available
Goal: For a given x^*, we want to estimate the p-value
p = P(X > x^*)
Goal: Estimate the p-value p = P(X>x^*)
Simulate the random variable X. You obtain a numerical value X_{\rm sim}
The simulated value is extreme if X_{\rm sim} > x^*
There are two possibilities: either X_{\rm sim} > x^* (the simulated value is extreme) or X_{\rm sim} \leq x^* (it is not)
We estimate the p-value as p = P( X > x^* ) \ \approx \ \frac{ \# \text{ of extreme simulated values} }{ \text{Total number of simulations} }
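As a generic sketch in R, assuming a hypothetical user-supplied function simulate_X() that draws one value of X and a threshold x.star:
# Monte Carlo estimate of p = P(X > x.star)
# simulate_X() is a hypothetical function returning one draw of X
M <- 10000
sims <- replicate(M, simulate_X())
p <- sum(sims > x.star) / M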
Goal: use Monte Carlo simulations to compute p-value for goodness-of-fit test
Type | 1 | \ldots | n | Total |
---|---|---|---|---|
Observed counts | o_1 | \ldots | o_n | m |
Null Probabilities | p_1^0 | \ldots | p_n^0 | 1 |
The expected counts are E_i = m p_i^0
Under the null hypothesis, the observed counts (o_1, \ldots, o_n) come from (O_1 , \ldots, O_n) \sim \mathop{\mathrm{Mult}}(m, p_1^0, \ldots, p_n^0)
The chi-squared statistic random variable is \chi^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2 }{ E_i }
The observed chi-squared statistic is \chi^2_{\rm obs} = \sum_{i=1}^n \frac{(o_i - E_i)^2 }{ E_i }
The p-value is defined as p = P( \chi^2 > \chi^2_{\rm obs} )
Suppose E_i < 5 for some i
In this case, \chi^2 is not \chi^2_{n-1} distributed
In fact, the distribution is unknown \chi^2 = \sum_{i=1}^n \frac{(O_i - E_i)^2 }{ E_i } \ \, \sim \ \, ???
If the distribution is unknown, how do we compute the p-value p = P( \chi^2 > \chi^2_{\rm obs} ) \ \ ???
Exercise: Come up with a simulation to approximate p
Simulate counts (o_1^{\rm sim},\ldots,o_n^{\rm sim}) from \mathop{\mathrm{Mult}}(m, p_1^0, \ldots, p_n^0)
Compute the corresponding simulated chi-squared statistic \chi^2_{\rm sim} = \sum_{i=1}^n \frac{ (o_i^{\rm sim} - E_i)^2 }{E_i} \,, \qquad E_i := m p_i^0
The simulated chi-squared statistic is extreme if \chi^2_{\rm sim} > \chi^2_{\rm obs} \,, \quad \text{ where \, } \chi^2_{\rm obs} = \sum_{i=1}^n \frac{(o_i - E_i)^2 }{ E_i } \,, \,\, o_i \, \text{ the observed counts}
We estimate the theoretical p-value by p = P( \chi^2 > \chi^2_{\rm obs} ) \ \approx \ \frac{ \# \text{ of extreme simulated statistics} }{ \text{Total number of simulations} }
Data: Number of defects in printed circuit boards, along with null probabilities
# Defects | 0 | 1 | 2 | 3 |
---|---|---|---|---|
Counts | 32 | 15 | 9 | 4 |
Null Probabilities | 0.5 | 0.3 | 0.15 | 0.05 |
E_1 | E_2 | E_3 | E_4 |
---|---|---|---|
30 | 18 | 9 | 3 |
Note: E_4 = 3 < 5 \quad \implies \quad the distribution of \chi^2 is unknown
Exercise: Write R code to perform a goodness-of-fit test on the above data; we will later compare the result with the Monte Carlo p-value produced by chisq.test
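A sketch of the setup for this exercise (the variable names are our own):
# Observed counts and null probabilities for the defects data
counts <- c(32, 15, 9, 4)
null.p <- c(0.5, 0.3, 0.15, 0.05)
m <- sum(counts)                                     # total boards: m = 60
expected <- m * null.p                               # expected counts: 30, 18, 9, 3
chi2.obs <- sum((counts - expected)^2 / expected)    # observed chi-squared statistic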
We now generate M = 10000 simulated chi-squared statistics:
Simulate counts (o_1^{\rm sim},\ldots,o_n^{\rm sim}) from \mathop{\mathrm{Mult}}(m, p_1^0, \ldots, p_n^0)
Compute the corresponding simulated chi-squared statistic \chi^2_{\rm sim} = \sum_{i=1}^n \frac{ (o_i^{\rm sim} - E_i)^2 }{E_i}
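A minimal sketch of how sim.chi could be generated with rmultinom, using the objects from the previous sketch:
# Simulate M = 10000 chi-squared statistics under the null multinomial model
M <- 10000
sim.counts <- rmultinom(M, size = m, prob = null.p)  # 4 x M matrix of simulated counts
sim.chi <- apply(sim.counts, 2, function(o) sum((o - expected)^2 / expected))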
The vector sim.chi contains M = 10000 simulated chi-squared statistics
We approximate the p-value by p = P(\chi^2 > \chi_{\rm obs}^2) \approx \ \frac{ \# \text{ of extreme simulated statistics} }{ M }
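In code, the estimate is the proportion of extreme simulated statistics:
# Estimate the p-value as the proportion of simulated statistics exceeding the observed one
sim.p.value <- sum(sim.chi > chi2.obs) / M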
We compare our estimate with the Monte Carlo p-value produced by the built-in chisq.test:
# Perform chi-squared test using built-in R function
ans <- chisq.test(counts, p = null.p, simulate.p.value = T)
# Extract p-value from chi-squared test result
R.p.value <- ans$p.value
# Print p-values for comparison
cat("\nOur simulated p-value is:", sim.p.value)
cat("\nR simulated p-value is:", R.p.value)
Our simulated p-value is: 0.8169
R simulated p-value is: 0.8195902
Note: The simulated p-values are the same! (up to random fluctuations)
Conclusion: We see that p > 0.05. There is not enough evidence to reject H_0