Lecture 6
Two-way Contingency Tables: tables in which each observation is classified in 2 ways
Example:
Relative performance of Man Utd managers
Two classifications for each game: Result, and Manager in charge
Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | 27 | 9 | 15 |
Van Gaal | 54 | 25 | 24 |
Mourinho | 84 | 32 | 28 |
Solskjaer | 91 | 37 | 40 |
Rangnick | 11 | 10 | 8 |
ten Hag | 61 | 12 | 28 |
In general, a two-way contingency table of counts has the following form:

| | Y=1 | Y=2 | … | Y=C | Totals |
|---|---|---|---|---|---|
| X=1 | O11 | O12 | … | O1C | O1+ |
| X=2 | O21 | O22 | … | O2C | O2+ |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ |
| X=R | OR1 | OR2 | … | ORC | OR+ |
| Totals | O+1 | O+2 | … | O+C | m |
The row and column marginal counts are obtained by summing, e.g. $$O_{R+} = \sum_{j=1}^{C} O_{Rj}, \qquad O_{+2} = \sum_{i=1}^{R} O_{i2}$$
The grand total is $$m = O_{++} = \sum_{i=1}^{R}\sum_{j=1}^{C} O_{ij}$$
The row marginals sum to the grand total: $$\sum_{i=1}^{R} O_{i+} = \sum_{i=1}^{R}\sum_{j=1}^{C} O_{ij} = m$$
Similarly for the column marginals: $$\sum_{j=1}^{C} O_{+j} = \sum_{j=1}^{C}\sum_{i=1}^{R} O_{ij} = m$$
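These marginal identities are easy to check in R on the Man Utd table above; a minimal sketch using rowSums, colSums and sum:

# Marginal and total counts for the Man Utd table
O <- rbind(c(27, 9, 15), c(54, 25, 24), c(84, 32, 28),
           c(91, 37, 40), c(11, 10, 8), c(61, 12, 28))
rowSums(O)   # O_i+ : games per manager (51, 103, 144, 168, 29, 101)
colSums(O)   # O_+j : total Won, Drawn, Lost
sum(O)       # m = 596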
The corresponding table of (unknown) cell probabilities:

| | Y=1 | Y=2 | … | Y=C | Totals |
|---|---|---|---|---|---|
| X=1 | p11 | p12 | … | p1C | p1+ |
| X=2 | p21 | p22 | … | p2C | p2+ |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ |
| X=R | pR1 | pR2 | … | pRC | pR+ |
| Totals | p+1 | p+2 | … | p+C | 1 |
The cell probabilities are defined as $$p_{ij} := P(X=i,\; Y=j)$$
The row marginal probabilities are $$P(X=i) = \sum_{j=1}^{C} P(X=i, Y=j) = \sum_{j=1}^{C} p_{ij} = p_{i+}$$
The column marginal probabilities are $$P(Y=j) = \sum_{i=1}^{R} P(X=i, Y=j) = \sum_{i=1}^{R} p_{ij} = p_{+j}$$
The row marginals sum to 1: $$\sum_{i=1}^{R} p_{i+} = \sum_{i=1}^{R} P(X=i) = \sum_{i=1}^{R}\sum_{j=1}^{C} P(X=i, Y=j) = 1$$
So do the column marginals: $$\sum_{j=1}^{C} p_{+j} = \sum_{j=1}^{C} P(Y=j) = \sum_{j=1}^{C}\sum_{i=1}^{R} P(X=i, Y=j) = 1$$
Counts $O_{ij}$ and probabilities $p_{ij}$ can be assembled into $R \times C$ matrices $$O = \begin{pmatrix} O_{11} & \cdots & O_{1C} \\ \vdots & \ddots & \vdots \\ O_{R1} & \cdots & O_{RC} \end{pmatrix}, \qquad p = \begin{pmatrix} p_{11} & \cdots & p_{1C} \\ \vdots & \ddots & \vdots \\ p_{R1} & \cdots & p_{RC} \end{pmatrix}$$
This way the random matrix $O$ has multinomial distribution $O \sim \mathrm{Mult}(m, p)$
The individual counts are therefore marginally binomial: $O_{ij} \sim \mathrm{Bin}(m, p_{ij})$
We can also consider the marginal random vectors $(O_{1+},\dots,O_{R+})$ and $(O_{+1},\dots,O_{+C})$. These also have multinomial distributions: $$(O_{1+},\dots,O_{R+}) \sim \mathrm{Mult}(m, p_{1+},\dots,p_{R+}), \qquad (O_{+1},\dots,O_{+C}) \sim \mathrm{Mult}(m, p_{+1},\dots,p_{+C})$$
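A minimal sketch of these facts in R (the probability matrix p below is made up for illustration; rmultinom draws the multinomial counts):

# Simulate O ~ Mult(m, p) for a 2 x 3 table (illustrative probabilities)
p <- matrix(c(0.10, 0.20, 0.05,
              0.25, 0.30, 0.10), nrow = 2, byrow = TRUE)
m <- 1000
O <- matrix(rmultinom(1, m, as.vector(t(p))), nrow = 2, byrow = TRUE)
rowSums(O)   # one draw from Mult(m, p_1+, p_2+)
colSums(O)   # one draw from Mult(m, p_+1, p_+2, p_+3)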
Alongside the observed counts, we form the table of expected counts $E_{ij} = m\,p_{ij}$:

| | Y=1 | Y=2 | … | Y=C | Totals |
|---|---|---|---|---|---|
| X=1 | E11 | E12 | … | E1C | E1+ |
| X=2 | E21 | E22 | … | E2C | E2+ |
| ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ |
| X=R | ER1 | ER2 | … | ERC | ER+ |
| Totals | E+1 | E+2 | … | E+C | m |
Theorem. Suppose the counts $O \sim \mathrm{Mult}(m, p)$. Then $$\chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{(O_{ij} - m\,p_{ij})^2}{m\,p_{ij}} \;\xrightarrow{d}\; \chi^2_{RC-1-\mathrm{fitted}}$$ when the sample size $m \to \infty$. The convergence is in distribution, and "fitted" denotes the number of fitted parameters.
Quality of approximation: the chi-squared approximation is good if $E_{ij} = m\,p_{ij} \ge 5$ for all $i=1,\dots,R$ and $j=1,\dots,C$
Setting:
Population consists of items of n=R×C different types
pij is probability that an item selected at random is of type (i,j)
pij is unknown and needs to be estimated
As a guess for $p_{ij}$ we take $p_{ij}^0$ such that $$0 \le p_{ij}^0 \le 1, \qquad \sum_{i=1}^{R}\sum_{j=1}^{C} p_{ij}^0 = 1$$
Hypothesis Test: We test for equality of $p_{ij}$ to $p_{ij}^0$: $$H_0: p_{ij} = p_{ij}^0 \text{ for all } i=1,\dots,R \text{ and } j=1,\dots,C \qquad H_1: p_{ij} \neq p_{ij}^0 \text{ for at least one pair } (i,j)$$
Sample:
Data: matrix of counts $o = (o_{ij})$
Alternative | Rejection Region | p-value |
---|---|---|
$\exists (i,j)$ s.t. $p_{ij} \neq p_{ij}^0$ | $\chi^2 > \chi^2_{RC-1}(0.05)$ | $P(\chi^2_{RC-1} > \chi^2)$ |
Store the counts and null probabilities into R vectors:

counts <- c(o_11, o_12, ..., o_RC)
null.p <- c(p_11, p_12, ..., p_RC)

Perform the chi-squared test on counts with null.p:

Alternative | R command |
---|---|
$\exists (i,j)$ s.t. $p_{ij} \neq p_{ij}^0$ | chisq.test(counts, p = null.p) |
Monte Carlo p-value: compute with the option simulate.p.value = TRUE
Note: the command is exactly the same as for the goodness-of-fit test for simple counts
Background story:
Question: Are the 6 managers to blame for worse Team Performance?
Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | 27 | 9 | 15 |
Van Gaal | 54 | 25 | 24 |
Mourinho | 84 | 32 | 28 |
Solskjaer | 91 | 37 | 40 |
Rangnick | 11 | 10 | 8 |
ten Hag | 61 | 12 | 28 |
Table contains Man Utd games since 2014 season (updated to 2024 season)
To each Man Utd game in the sample we associate two categories: the Result (Won, Drawn or Lost) and the Manager in charge
Question: Is the number of Wins, Draws and Losses uniformly distributed?
Hypothesis Test: To answer the question we perform a goodness-of-fit test on $$H_0: p_{ij} = p_{ij}^0 \text{ for all pairs } (i,j) \qquad H_1: p_{ij} \neq p_{ij}^0 \text{ for at least one pair } (i,j)$$
Null Probabilities:
Manager | Won | Drawn | Lost | Total |
---|---|---|---|---|
Moyes | 27 | 9 | 15 | o1+=51 |
Van Gaal | 54 | 25 | 24 | o2+=103 |
Mourinho | 84 | 32 | 28 | o3+=144 |
Solskjaer | 91 | 37 | 40 | o4+=168 |
Rangnick | 11 | 10 | 8 | o5+=29 |
ten Hag | 61 | 12 | 28 | o6+=101 |
The results are uniformly distributed if each manager wins, draws and loses with equal probability 1/3
Since Manager $i$ plays $o_{i+}$ games, the expected scores are $E_{ij} = \frac{o_{i+}}{3}$ for $j = 1, 2, 3$
Recall that the expected scores are modelled as $E_{ij} = m\,p_{ij}^0$
Thus, the null probabilities under the assumption of uniform distribution are $$p_{ij}^0 = \frac{1}{3} \times \frac{o_{i+}}{m} \qquad \left(= \frac{1}{3} \times \text{proportion of total games played by Manager } i\right)$$
With $m = 596$ total games, $3m = 1788$, so: Moyes: $p_{1j}^0 = \frac{o_{1+}}{3m} = \frac{51}{1788}$; Van Gaal: $p_{2j}^0 = \frac{103}{1788}$; Mourinho: $p_{3j}^0 = \frac{144}{1788}$; Solskjaer: $p_{4j}^0 = \frac{168}{1788}$; Rangnick: $p_{5j}^0 = \frac{29}{1788}$; ten Hag: $p_{6j}^0 = \frac{101}{1788}$
Null probabilities $p_{ij}^0$:

Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | 51/1788 | 51/1788 | 51/1788 |
Van Gaal | 103/1788 | 103/1788 | 103/1788 |
Mourinho | 144/1788 | 144/1788 | 144/1788 |
Solskjaer | 168/1788 | 168/1788 | 168/1788 |
Rangnick | 29/1788 | 29/1788 | 29/1788 |
ten Hag | 101/1788 | 101/1788 | 101/1788 |
To build the vector null.p we use the R function rep:
# Store matrix of null probabilities into single R vector
# To avoid copy pasting, first store probabilities of each Manager
manager.null.p <- c(51/1788, 103/1788, 144/1788,
168/1788, 29/1788, 101/1788)
# Repeat each entry 3 times
null.p <- rep(manager.null.p, c(3, 3, 3, 3, 3, 3))
[1] 0.03 0.03 0.03 0.06 0.06 0.06 0.08 0.08 0.08 0.09 0.09 0.09 0.02 0.02 0.02
[16] 0.06 0.06 0.06
We now perform the chi-squared test on counts with null.p.
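The call producing the output below is not shown on the slide; a minimal sketch, assuming the counts are entered row by row so that they match the ordering of null.p:

# Observed counts, flattened row by row (Won, Drawn, Lost per manager)
counts <- c(27, 9, 15, 54, 25, 24, 84, 32, 28,
            91, 37, 40, 11, 10, 8, 61, 12, 28)
# Goodness-of-fit test against the null probabilities
chisq.test(counts, p = null.p)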
Chi-squared test for given probabilities
data: counts
X-squared = 137.93, df = 17, p-value < 2.2e-16
Example: Residential Status by Region

| | Owned | Rented | Total |
|---|---|---|---|
| North West | 2180 | 871 | 3051 |
| London | 1820 | 1400 | 3220 |
| South West | 1703 | 614 | 2317 |
| Total | 5703 | 2885 | 8588 |
Consider a two-way contingency table as above. To each person we associate two categories: Residential Status (Owned or Rented) and Region (North West, London, South West).
| | Y=1 | … | Y=C | Totals |
|---|---|---|---|---|
| X=1 | O11 | … | O1C | O1+ |
| ⋮ | ⋮ | ⋱ | ⋮ | ⋮ |
| X=R | OR1 | … | ORC | OR+ |
| Totals | O+1 | … | O+C | m |
Consider the general two-way contingency table as above. Asking whether the row classification X and the column classification Y are associated is equivalent to asking whether the random variables X and Y are independent.
Hence, testing for independence means testing the following hypotheses: $$H_0: X \text{ and } Y \text{ are independent} \qquad H_1: X \text{ and } Y \text{ are not independent}$$
We need to quantify $H_0$ and $H_1$
| | Y=1 | … | Y=C | Totals |
|---|---|---|---|---|
| X=1 | p11 | … | p1C | p1+ |
| ⋮ | ⋮ | ⋱ | ⋮ | ⋮ |
| X=R | pR1 | … | pRC | pR+ |
| Totals | p+1 | … | p+C | 1 |
The r.v. X and Y are independent iff the cell probabilities factor into marginals: $$p_{ij} = P(X=i, Y=j) = P(X=i)\,P(Y=j) = p_{i+}\,p_{+j}$$
Therefore the hypothesis test for independence becomes $$H_0: p_{ij} = p_{i+}p_{+j} \text{ for all } i=1,\dots,R \text{ and } j=1,\dots,C \qquad H_1: p_{ij} \neq p_{i+}p_{+j} \text{ for some } (i,j)$$
Assume the null hypothesis is true: $p_{ij} = p_{i+}p_{+j}$ for all $i=1,\dots,R$ and $j=1,\dots,C$
Under $H_0$, the expected counts become $E_{ij} = m\,p_{ij} = m\,p_{i+}p_{+j}$
We need a way to estimate the marginal probabilities pi+ and p+j
Goal: Estimate marginal probabilities pi+ and p+j
By definition we have $p_{i+} = P(X=i)$
Hence $p_{i+}$ is the probability of an observation being classified in the $i$-th row
We estimate $p_{i+}$ with the proportion of observations classified in the $i$-th row: $$\hat p_{i+} := \frac{o_{i+}}{m} = \frac{\text{number of observations in the } i\text{-th row}}{\text{total number of observations}}$$
Similarly, by definition $p_{+j} = P(Y=j)$, so $p_{+j}$ is the probability of an observation being classified in the $j$-th column
We estimate $p_{+j}$ with the proportion of observations classified in the $j$-th column: $$\hat p_{+j} := \frac{o_{+j}}{m} = \frac{\text{number of observations in the } j\text{-th column}}{\text{total number of observations}}$$
Summary: the marginal probabilities are estimated with $$\hat p_{i+} := \frac{o_{i+}}{m}, \qquad \hat p_{+j} := \frac{o_{+j}}{m}$$
Therefore the expected counts become $$E_{ij} = m\,\hat p_{i+}\hat p_{+j} = m\left(\frac{o_{i+}}{m}\right)\left(\frac{o_{+j}}{m}\right) = \frac{o_{i+}\,o_{+j}}{m}$$
By the Theorem in Slide 20, the chi-squared statistic satisfies $$\chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{(O_{ij}-E_{ij})^2}{E_{ij}} \;\approx\; \chi^2_{RC-1-\mathrm{fitted}}$$
We need to compute the number of fitted parameters
We estimate the first $R-1$ row marginals by $\hat p_{i+} := \frac{o_{i+}}{m}$, $i=1,\dots,R-1$
Since the marginals $p_{i+}$ sum to 1, we can obtain $\hat p_{R+}$ from $$\hat p_{R+} = 1 - \hat p_{1+} - \dots - \hat p_{(R-1)+} = 1 - \frac{o_{1+}}{m} - \dots - \frac{o_{(R-1)+}}{m}$$
Similarly, we estimate the first $C-1$ column marginals by $\hat p_{+j} = \frac{o_{+j}}{m}$, $j=1,\dots,C-1$
Since the marginals $p_{+j}$ sum to 1, we can obtain $\hat p_{+C}$ from $$\hat p_{+C} = 1 - \hat p_{+1} - \dots - \hat p_{+(C-1)} = 1 - \frac{o_{+1}}{m} - \dots - \frac{o_{+(C-1)}}{m}$$
In total, we only have to estimate $(R-1)+(C-1) = R+C-2$ parameters, so $\mathrm{fitted} = R+C-2$
Consequently, the degrees of freedom are $$RC - 1 - \mathrm{fitted} = RC - 1 - (R+C-2) = RC - R - C + 1 = (R-1)(C-1)$$
Assume the null hypothesis of row and column independence: $H_0: p_{ij} = p_{i+}p_{+j}$ for all $i=1,\dots,R$ and $j=1,\dots,C$
Suppose the counts are $O \sim \mathrm{Mult}(m, p)$, with expected counts $E_{ij} = \frac{o_{i+}o_{+j}}{m}$
By the previous slides and the Theorem in Slide 20, we have $$\chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{(O_{ij}-E_{ij})^2}{E_{ij}} \;\approx\; \chi^2_{RC-1-\mathrm{fitted}} = \chi^2_{(R-1)(C-1)}$$
The chi-squared approximation is good if $E_{ij} \ge 5$ for all $i=1,\dots,R$ and $j=1,\dots,C$
Sample:
We draw m individuals from the population
Each individual can be of type $(i,j)$, where $i$ is the row category and $j$ the column category
$O_{ij}$ denotes the number of items of type $(i,j)$ drawn
$p_{ij}$ is the probability of observing type $(i,j)$
$p = (p_{ij})$ is the matrix of cell probabilities
The matrix of counts $O = (O_{ij})$ has multinomial distribution $\mathrm{Mult}(m, p)$
Data: Matrix of counts, represented by a two-way contingency table
| | Y=1 | … | Y=C | Totals |
|---|---|---|---|---|
| X=1 | o11 | … | o1C | o1+ |
| ⋮ | ⋮ | ⋱ | ⋮ | ⋮ |
| X=R | oR1 | … | oRC | oR+ |
| Totals | o+1 | … | o+C | m |
Hypothesis test: We test for independence of rows and columns H0:X and Y are independent H1:X and Y are not independent
Compute marginal and total counts: $$o_{i+} := \sum_{j=1}^{C} o_{ij}, \qquad o_{+j} := \sum_{i=1}^{R} o_{ij}, \qquad m = \sum_{i=1}^{R} o_{i+} = \sum_{j=1}^{C} o_{+j}$$
Compute the expected counts $$E_{ij} = \frac{o_{i+}\,o_{+j}}{m}$$
Compute the chi-squared statistic $$\chi^2 = \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{(o_{ij}-E_{ij})^2}{E_{ij}}$$
Alternative | Rejection Region | p-value |
---|---|---|
X and Y not independent | $\chi^2 > \chi^2_{(R-1)(C-1)}(0.05)$ | $P(\chi^2_{(R-1)(C-1)} > \chi^2)$ |
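As an illustration of these steps, here is a minimal sketch computing the expected counts and the statistic by hand (using the housing data from the example below; outer builds the matrix of products $o_{i+}o_{+j}$):

# Chi-squared statistic for independence, computed by hand
o <- rbind(c(2180, 871), c(1820, 1400), c(1703, 614))
m <- sum(o)                              # m = 8588
E <- outer(rowSums(o), colSums(o)) / m   # E_ij = o_i+ * o_+j / m
chi2 <- sum((o - E)^2 / E)
chi2   # approx. 228.11 with (R-1)(C-1) = 2 degrees of freedom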
row_i <- c(o_i1, o_i2, ..., o_iC)
counts <- rbind(row_1, ..., row_R)
Perform the chi-squared test of independence on counts:
Alternative | R command |
---|---|
X and Y are not independent | chisq.test(counts) |
Note: Compute Monte Carlo p-value with option simulate.p.value = TRUE
| | Owned | Rented | Total |
|---|---|---|---|
| North West | 2180 | 871 | 3051 |
| London | 1820 | 1400 | 3220 |
| South West | 1703 | 614 | 2317 |
| Total | 5703 | 2885 | 8588 |
People are sampled at random in NW, London and SW
To each person we associate two categories: Residential Status (Owned or Rented) and Region (North West, London, South West)
Question: Are Residential Status and Region independent?
| | Owned | Rented | Tot |
|---|---|---|---|
| NW | 2180 | 871 | o1+=3051 |
| Lon | 1820 | 1400 | o2+=3220 |
| SW | 1703 | 614 | o3+=2317 |
| Tot | o+1=5703 | o+2=2885 | m=8588 |
Expected counts $E_{ij} = \frac{o_{i+}o_{+j}}{m}$:

| | Owned | Rented | Tot |
|---|---|---|---|
| NW | 3051 × 5703 / 8588 ≈ 2026 | 3051 × 2885 / 8588 ≈ 1025 | o1+=3051 |
| Lon | 3220 × 5703 / 8588 ≈ 2138 | 3220 × 2885 / 8588 ≈ 1082 | o2+=3220 |
| SW | 2317 × 5703 / 8588 ≈ 1539 | 2317 × 2885 / 8588 ≈ 778 | o3+=2317 |
| Tot | o+1=5703 | o+2=2885 | m=8588 |
Observed proportions of Owned vs Rented within each region:

| | Owned | Rented |
|---|---|---|
| North West | 71.5% | 28.5% |
| London | 56.5% | 43.5% |
| South West | 73.5% | 26.5% |
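The R call producing the output below is not shown on the slide; a minimal sketch, consistent with independence_test.R:

# Assemble the housing table and test for independence
counts <- rbind(c(2180, 871), c(1820, 1400), c(1703, 614))
chisq.test(counts)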
Pearson's Chi-squared test
data: counts
X-squared = 228.11, df = 2, p-value < 2.2e-16
Manager | Won | Drawn | Lost |
---|---|---|---|
Moyes | 27 | 9 | 15 |
Van Gaal | 54 | 25 | 24 |
Mourinho | 84 | 32 | 28 |
Solskjaer | 91 | 37 | 40 |
Rangnick | 11 | 10 | 8 |
ten Hag | 61 | 12 | 28 |
Table contains Man Utd games since 2014 season (updated to 2024)
To each Man Utd game in the sample we associate two categories: the Result (Won, Drawn or Lost) and the Manager in charge
Question: Is there association between Manager and Team Performance?
To answer the question, we perform chi-squared test of independence in R
Test can be performed by suitably modifying the code independence_test.R
# Store each row into an R vector
row_1 <- c(27, 9, 15)
row_2 <- c(54, 25, 24)
row_3 <- c(84, 32, 28)
row_4 <- c(91, 37, 40)
row_5 <- c(11, 10, 8)
row_6 <- c(61, 12, 28)
# Assemble the rows into an R matrix
counts <- rbind(row_1, row_2, row_3, row_4, row_5, row_6)
# Perform chi-squared test of independence
ans <- chisq.test(counts)
# Print answer
print(ans)
Pearson's Chi-squared test
data: counts
X-squared = 12.678, df = 10, p-value = 0.2422
Recall: the chi-squared statistic is approximately chi-squared distributed, $$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \;\approx\; \chi^2_{n-1}$$
The approximation is:
Good | $E_i \ge 5$ for all $i=1,\dots,n$ |
---|---|
Bad | $E_i < 5$ for some $i=1,\dots,n$ |
How to compute the p-value: when the approximation is good, use the $\chi^2_{n-1}$ distribution; when it is bad, use Monte Carlo simulation
Question: What is a Monte Carlo simulation?
If $r$ of $n$ uniformly random points in the square fall inside the circle, then $$\frac{\pi}{4} \approx \frac{r}{n} \;\implies\; \pi \approx \frac{4r}{n}$$
Draw x1,…,xN and y1,…,yN from Uniform(−1,1)
Count the number of points (xi,yi) falling inside circle of radius 1
These are the points satisfying the condition $x_i^2 + y_i^2 \le 1$
The area of the circle is estimated with the proportion of points falling inside the circle: $$\text{Area} = \pi \approx \frac{\#\{\text{points } (x_i, y_i) \text{ inside circle}\}}{N} \times 4$$
Note: 4 is the area of square of side 2
Plot: 1000 random points in square of side 2
N <- 10000 # 10 blocks of N iterations (100000 total), printing after each block
total <- 0 # Counts total number of points
in_circle <- 0 # Counts points falling in circle
for (j in 1:10) { # This loop is for printing message every N iterations
for (i in 1:N) { # Loop in which points are counted
x <- runif(1, -1, 1); y <- runif(1, -1, 1); # sample point (x,y)
if (x^2 + y^2 <= 1) {
in_circle <- in_circle + 1 # If (x,y) in circle increase counter
}
total <- total + 1 # Increase total counter (in any case)
}
pi_approx <- ( in_circle / total ) * 4 # Compute approximate area
cat(sprintf("After %8d iterations pi is %.08f, error is %.08f\n",
(j * N), pi_approx, abs(pi_approx - pi)))
}
After 10000 iterations pi is 3.14040000, error is 0.00119265
After 20000 iterations pi is 3.14460000, error is 0.00300735
After 30000 iterations pi is 3.14173333, error is 0.00014068
After 40000 iterations pi is 3.13920000, error is 0.00239265
After 50000 iterations pi is 3.14200000, error is 0.00040735
After 60000 iterations pi is 3.14593333, error is 0.00434068
After 70000 iterations pi is 3.14645714, error is 0.00486449
After 80000 iterations pi is 3.14190000, error is 0.00030735
After 90000 iterations pi is 3.14537778, error is 0.00378512
After 100000 iterations pi is 3.14256000, error is 0.00096735
Note:
The error decreases overall, but there are random fluctuations
This is because we are trying to approximate π by a random quantity
Deriving error estimates for numerical methods is not easy (Numerical Analysis)
Going back to our example:
For every fixed $\varepsilon > 0$, it can be shown that $$P\left(\left|X_n - \frac{\pi}{4}\right| \ge \varepsilon\right) \le \text{Error} = \frac{1}{20\,n\,\varepsilon^2}$$
To make the Error small for a fixed small $\varepsilon$, the number of simulations $n$ has to be large
The estimate can be derived by interpreting the problem as the Monte Carlo integration of $\int_0^1 \sqrt{1-x^2}\,dx = \pi/4$, see the stackexchange post
Simulation can be used to gain insight into the shape of a distribution:
Mean and variance
Related probabilities (p-values)
How to simulate in R:
A for loop, as used above to simulate π
replicate(n, expr), which repeats the expression expr n times and outputs the results
The apply family, e.g. sapply(x, fun), which applies the function fun to each element of the object x
Consider a sample $X_1,\dots,X_n$ from $N(\mu, \sigma^2)$
Recall that the sample mean satisfies $\mathbb{E}[\bar X] = \mu$ and $\mathrm{Var}[\bar X] = \frac{\sigma^2}{n}$ (as $n$ gets larger, $\mathrm{Var}[\bar X]$ gets closer to 0, so $\bar X \approx \mu$ when $n$ is large)
For example, if the sample size is $n = 10$, the mean $\mu = 0$ and the sd $\sigma = 1$, then $\mathbb{E}[\bar X] = 0$ and $\mathrm{Var}[\bar X] = \frac{1}{10} = 0.1$
We can obtain the above quantities through simulation (in fancy terms: estimating the mean and variance of $\bar X$)
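The code producing sim.xbar is not shown on the slide; a minimal sketch, assuming n = 10 draws from N(0,1) repeated M = 1000 times:

# Simulate M realisations of the sample mean
M <- 1000; n <- 10
sim.xbar <- replicate(M, mean(rnorm(n)))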
The vector sim.xbar contains M = 1000 simulated realisations of $\bar X$, so it is reasonable to suppose that sim.xbar approximates the random variable $\bar X$ (this can be made rigorous with the Central Limit Theorem).
Therefore, we can make the following estimation
Quantity | Estimation |
---|---|
$\mathbb{E}[\bar X]$ | Sample mean of sim.xbar |
$\mathrm{Var}[\bar X]$ | Sample variance of sim.xbar |
# Print mean and variance of sample mean
cat("Expected value:", mean(sim.xbar), "Variance:", var(sim.xbar))
Expected value: 0.005180686 Variance: 0.1058328
These are very close to the theoretical values $\mathbb{E}[\bar X] = 0$ and $\mathrm{Var}[\bar X] = \frac{1}{10} = 0.1$
Both sample mean and median can be used to estimate the center of a distribution
But which one is better?
Suppose the population is $N(\mu, \sigma^2)$
We know that the sample mean $\bar X$ satisfies $\mathbb{E}[\bar X] = \mu$ and $\mathrm{Var}[\bar X] = \frac{\sigma^2}{n}$
However, no such simple formulas are available for the sample median
Question: How can we compare them? Use simulation
To simulate sample mean and median from N(0,1), we do:
Generate sample x1,…,x10 from N(0,1), and compute sample mean and median
Repeat M=1000 times
M <- 1000; n <- 10
sim.mean <- replicate(M, mean(rnorm(n)))
sim.median <- replicate(M, median(rnorm(n)))
To compare simulated sample mean and median, we can produce a boxplot
The one with the smallest variance will be the better estimator of the true population mean μ = 0; see the sketch below
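A minimal sketch of the comparison (the labels are our own choice):

# Side-by-side boxplots of the simulated estimators
boxplot(sim.mean, sim.median, names = c("Sample mean", "Sample median"),
        main = "Estimating the centre of N(0,1), n = 10")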
Question: What is the probability of rolling a total of 7 with two dice?
Of course this problem has a combinatorial answer: $p = \frac{6}{36} = \frac{1}{6}$
But we can also answer using simulation
To simulate the roll of a die, we use the function sample: the call sample(x, n, replace = T) draws n elements at random, with replacement, from the vector x. To roll 2 dice, we input sample(1:6, 2, replace = T).
To compute the probability of rolling 7 with two dice we do:
M <- 1000
sim.rolls <- replicate(M, sum( sample(1:6, 2, replace = T) ))
number.of.7s <- sum( sim.rolls == 7 )
p <- number.of.7s / M
cat("The estimated probability of rolling 7 is", p)
The estimated probability of rolling 7 is 0.166
Suppose we are given a normal population $N(\mu, \sigma^2)$
A 95% confidence interval for the true mean $\mu$ is a (random) interval $[a,b]$ such that $P(\mu \in [a,b]) = 0.95$
The confidence interval is not a probability statement about $\mu$ (note that $\mu$ is a constant!); it is a probability statement about $[a,b]$
Interpretation: The random interval [a,b] contains the mean μ with probability 0.95
Constructing the confidence interval with the t-statistic:
Suppose we are given a sample $x_1,\dots,x_n$ from $N(\mu, \sigma^2)$
Recall: the t-statistic has a t-distribution, $$t = \frac{\bar x - \mu}{\text{e.s.e.}} \sim t_{n-1}, \qquad \text{e.s.e.} = \frac{s}{\sqrt{n}}$$
We impose that $t$ is observed with probability 0.95: $$P(-t^* \le t \le t^*) = 0.95, \qquad t^* = t_{n-1}(0.025)$$
The 95% confidence interval is obtained by solving the above equation for $\mu$: $$P(\mu \in [a,b]) = 0.95, \qquad a = \bar x - t^* \times \text{e.s.e.}, \quad b = \bar x + t^* \times \text{e.s.e.}$$
Interpretation: If you sample over and over again, the interval [a,b] will contain μ about 95% of the times
Each row of points is a sample from the same normal distribution (Image from Wikipedia)
The colored lines are confidence intervals for the mean μ
At the center of each interval is the sample mean (diamond)
The blue intervals contain the population mean, and the red ones do not
Simulating one confidence interval in R: the interval is stored in the $conf.int component of the t.test output.
# Sample 100 times from N(0,1)
x <- rnorm(100)
# Compute confidence interval using t.test
interval <- t.test(x)$conf.int
interval is a vector containing the simulated confidence interval: the left extreme is stored in interval[1] and the right extreme in interval[2].
Simulated confidence interval: a = -0.2262375 b = 0.1689576
Testing the claim: P(μ∈[a,b])=0.95
We test with the following method:
Sample n = 100 times from N(0,1) and compute the confidence interval
Repeat M = 10000 times
Check how many times μ = 0 belongs to the simulated confidence intervals
Estimate the probability of the CI containing μ as $$P(\mu \in [a,b]) \;\approx\; \frac{\#\text{ of times } \mu = 0 \text{ belongs to the simulated } [a,b]}{M}$$
The estimate should approach 0.95
# Simulate M confidence intervals for the mean
n <- 100; M <- 10000
intervals <- replicate(M, t.test(rnorm(n)) $ conf.int)
intervals is a matrix containing the M = 10000 simulated intervals (as columns): intervals[1, ] holds the left extremes and intervals[2, ] the right extremes.
[,1] [,2] [,3] [,4] [,5]
[1,] -0.2262375 -0.07222545 -0.2428610 -0.1788659 -0.07420973
[2,] 0.1689576 0.35116698 0.1201007 0.2189942 0.30661995
Now we have our M = 10000 confidence intervals for μ = 0
Check how many times μ=0 belongs to simulated confidence intervals
# Save left and right extremes in vectors
a <- intervals[1, ]; b <- intervals[2, ]
# Check how many times mu = 0 belongs to (a,b)
mu.belongs <- sum( (a < 0) & (0 < b) )
# Estimate the probability that mu = 0 belongs to (a,b)
p <- mu.belongs / M
cat("mu = 0 belongs to the confidence interval with probability", p)
mu = 0 belongs to the confidence interval with probability 0.9494
Suppose we are given a random variable X such that:
The distribution of X is (possibly) unknown
A method for simulating X is available
Goal: for a given $x^*$, we want to estimate the p-value $$p = P(X > x^*)$$
Simulate the random variable X; you obtain a numerical value $X^{\text{sim}}$
The simulated value is extreme if $X^{\text{sim}} > x^*$
There are two possibilities: either the simulated value is extreme, or it is not
Repeating the simulation many times, we estimate the p-value as $$p = P(X > x^*) \;\approx\; \frac{\#\text{ of extreme simulated values}}{\text{total number of simulations}}$$
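A minimal generic sketch (the choice of X as the maximum of 5 standard normals is a made-up example; any X we can simulate works):

# Estimate p = P(X > x.star) by simulation
x.star <- 2
sim.x <- replicate(10000, max(rnorm(5)))   # simulate X many times
p.hat <- mean(sim.x > x.star)              # proportion of extreme values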
Goal: use Monte Carlo simulations to compute p-value for goodness-of-fit test
Type | 1 | … | n | Total |
---|---|---|---|---|
Observed counts | $o_1$ | … | $o_n$ | m |
Null Probabilities | $p_1^0$ | … | $p_n^0$ | 1 |
The expected counts are $E_i = m\,p_i^0$
Under the null hypothesis, the observed counts $(o_1,\dots,o_n)$ come from $$(O_1,\dots,O_n) \sim \mathrm{Mult}(m, p_1^0,\dots,p_n^0)$$
The chi-squared statistic random variable is $$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
The observed chi-squared statistic is $$\chi^2_{\text{obs}} = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}$$
The p-value is defined as $p = P(\chi^2 > \chi^2_{\text{obs}})$
Suppose $E_i < 5$ for some $i$
In this case, $\chi^2$ is not $\chi^2_{n-1}$ distributed; in fact, its distribution is unknown: $$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i} \;\sim\; ???$$
If the distribution is unknown, how do we compute the p-value $p = P(\chi^2 > \chi^2_{\text{obs}})$?
Exercise: Come up with a simulation to approximate p
Simulate counts $(o_1^{\text{sim}},\dots,o_n^{\text{sim}})$ from $\mathrm{Mult}(m, p_1^0,\dots,p_n^0)$
Compute the corresponding simulated chi-squared statistic $$\chi^2_{\text{sim}} = \sum_{i=1}^{n} \frac{(o_i^{\text{sim}} - E_i)^2}{E_i}, \qquad E_i := m\,p_i^0$$
The simulated chi-squared statistic is extreme if $\chi^2_{\text{sim}} > \chi^2_{\text{obs}}$, where $$\chi^2_{\text{obs}} = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}, \qquad o_i \text{ the observed counts}$$
We estimate the theoretical p-value by $$p = P(\chi^2 > \chi^2_{\text{obs}}) \;\approx\; \frac{\#\text{ of extreme simulated statistics}}{\text{total number of simulations}}$$
Data: Number of defects in printed circuit boards, along with null probabilities
# Defects | 0 | 1 | 2 | 3 |
---|---|---|---|---|
Counts | 32 | 15 | 9 | 4 |
Null Probabilities | 0.5 | 0.3 | 0.15 | 0.05 |
E1 | E2 | E3 | E4 |
---|---|---|---|
30 | 18 | 9 | 3 |
Note: $E_4 = 3 < 5 \implies$ the distribution of $\chi^2$ is unknown
Exercise: Write R code to perform a goodness-of-fit test on the above data, estimating the p-value by simulation, and compare the result with chisq.test
We now generate M = 10000 simulated chi-squared statistics:
Simulate counts $(o_1^{\text{sim}},\dots,o_n^{\text{sim}})$ from $\mathrm{Mult}(m, p_1^0,\dots,p_n^0)$
Compute the corresponding simulated chi-squared statistic $$\chi^2_{\text{sim}} = \sum_{i=1}^{n} \frac{(o_i^{\text{sim}} - E_i)^2}{E_i}$$
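The simulation code is not shown on the slide; a minimal sketch, using rmultinom for the multinomial draws and the defect data above:

# Simulate M chi-squared statistics under the null hypothesis
counts <- c(32, 15, 9, 4)
null.p <- c(0.5, 0.3, 0.15, 0.05)
m <- sum(counts)                     # m = 60
E <- m * null.p                      # expected counts: 30, 18, 9, 3
chi.obs <- sum((counts - E)^2 / E)   # observed statistic
M <- 10000
sim.chi <- replicate(M, {
  o.sim <- rmultinom(1, m, null.p)   # simulated counts from Mult(m, p0)
  sum((o.sim - E)^2 / E)             # simulated statistic
})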
The vector sim.chi contains the M = 10000 simulated chi-squared statistics. We approximate the p-value by $$p = P(\chi^2 > \chi^2_{\text{obs}}) \;\approx\; \frac{\#\text{ of extreme simulated statistics}}{M}$$
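In code (sim.p.value is the name used in the comparison below):

# Proportion of simulated statistics exceeding the observed one
sim.p.value <- sum(sim.chi > chi.obs) / M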
For comparison, we also compute the p-value with the built-in chisq.test:
# Perform chi-squared test using built-in R function
ans <- chisq.test(counts, p = null.p, simulate.p.value = T)
# Extract p-value from chi-squared test result
R.p.value <- ans$p.value
# Print p-values for comparison
cat("\nOur simulated p-value is:", sim.p.value)
cat("\nR simulated p-value is:", R.p.value)
Our simulated p-value is: 0.8169
R simulated p-value is: 0.8195902
Note: the two simulated p-values agree (up to random fluctuations)
Conclusion: we see that p > 0.05, so there is not enough evidence to reject H0