Lecture 7
So far, our simulations used knowledge of the population distribution, e.g. we sampled from a fully specified distribution
Problem: Sometimes the population distribution f is unknown
Bootstrap: Replace the unknown distribution f with a known distribution \hat{f}
Assume we are given a sample x_1,\ldots,x_n from a population f
The Sample Distribution is the discrete distribution which puts mass \frac1n at each sample point \hat{f}(x) = \begin{cases} \frac1n & \quad \text{ if } \, x \in \{x_1, \ldots, x_n\} \\ 0 & \quad \text{ otherwise } \end{cases}
(more precisely, by the Glivenko-Cantelli Theorem the empirical cdf converges to the population cdf uniformly, almost surely: \hat{F} \to F)
Setting: Assume given the original sample x_1,\ldots,x_n from unknown population f
Bootstrap: Regard the sample as the whole population
Replace the unknown distribution f with the sample distribution \hat{f}(x) = \begin{cases} \frac1n & \quad \text{ if } \, x \in \{x_1, \ldots, x_n\} \\ 0 & \quad \text{ otherwise } \end{cases}
Any sampling will be done from \hat{f} \qquad \quad (motivated by Glivenko-Cantelli Thm)
Note: \hat{f} puts mass \frac1n at each sample point
Drawing an observation from \hat f is equivalent to drawing one point at random from the original sample \{x_1,\ldots,x_n\}
Setting: Given the original sample x_1,\ldots,x_n from unknown population f
Bootstrap Algorithm: to estimate the distribution of a statistic of interest T
1. Draw a sample x_1^*,\ldots,x_n^* from \{x_1, \ldots, x_n\} with replacement
2. Compute the statistic T on the bootstrap sample: T^* = T(x_1^*,\ldots,x_n^*)
3. Repeat Steps 1 and 2, B times, to get B bootstrap simulations of T: T^*_1, \ldots , T^*_B. These values represent the bootstrap distribution of T
Note: If population distribution f is known, we would just sample from f
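For illustration, a minimal R sketch of the algorithm for a generic statistic (here the median; the data values are hypothetical):

# Bootstrap a generic statistic T (here the median; data x is hypothetical)
x <- c(23.5, 6.4, 11.1, 3.8, 8.9, 4.8, 23.8, 3.0, 2.9, 3.2)
B <- 10000
T.star <- replicate(B, median(sample(x, length(x), replace = TRUE)))
# A histogram of T.star estimates the bootstrap distribution of the median
hist(T.star)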
Original sample \{x_1,\ldots,x_n\} is drawn from the population \qquad (Image from Wikipedia)
Resamples \{x_1^*,\ldots,x_n^*\} are generated by drawing from \{x_1,\ldots,x_n\} with replacement
Note: Data points x_i can be drawn more than once (shown in red and slightly offset)
For each resample, the statistic T^* is calculated
Therefore, a histogram of the T^* values can be drawn to estimate the bootstrap distribution of T
References?
Simulation and bootstrap are huge topics
Good introduction: the book by Efron (inventor of bootstrap) and Tibshirani [1]
Why is the Bootstrap useful?
Because the population distribution is often unknown
Despite this, the bootstrap allows us to do inference on the statistic T
Why does the Bootstrap work? We are just resampling from the original sample!
It (often) works because the original sample encodes population variability
By resampling the original sample, we are simulating this variability
How good is the bootstrap distribution of T?
What are the limitations of bootstrap?
In practice, these problems are usually mitigated by a large enough original sample (n \geq 50)
Setting: Given the original sample x_1,\ldots,x_n from unknown population f
Bootstrap Algorithm: to estimate the distribution of the sample mean \overline{X}
1. Draw a sample x_1^*,\ldots,x_n^* from \{x_1, \ldots, x_n\} with replacement
2. Compute the sample mean on the bootstrap sample: \overline{X}^* = \frac{1}{n} \sum_{i=1}^n x_i^*
3. Repeat Steps 1 and 2, B times, to get B bootstrap simulations \overline{X}^*_1, \ldots , \overline{X}^*_B. These values represent the bootstrap distribution of \overline{X}
# Bootstrap sample mean B = 10,000 times
# (x: original sample and n = length(x) are assumed already defined)
B <- 10000
xbar.star <- replicate(B, {
# Generate bootstrap sample
x.star <- sample(x, n, replace = TRUE)
# Return mean of bootstrap sample
mean(x.star)
})
The vector xbar.star contains B = 10,000 bootstrap simulations of \overline{X}

Why is this useful? When the population is normal N(\mu,\sigma^2), we know that \overline{X} \sim N(\mu, \sigma^2/n). However: population distribution unknown \implies distribution of \overline{X} unknown
Bootstrap gives a way to estimate distribution of \overline{X}
Original sample: CEO compensation in 2012 (USA data, in million dollars)
CEO Pay 2012 | 23.5 | 6.4 | 11.1 | 3.8 | 8.9 | 4.8 | 23.8 | 3.0 | 2.9 | 3.2 |
We want to simulate the bootstrap distribution of the sample mean \overline{X}
# Enter original sample and compute size
x <- c(23.5, 6.4, 11.1, 3.8, 8.9, 4.8, 23.8, 3, 2.9, 3.2)
n <- length(x)
# Bootstrap sample mean B = 10,000 times
B <- 10000
xbar.star <- replicate(B, {
# Generate bootstrap sample
x.star <- sample(x, n, replace = TRUE)
# Return mean of bootstrap sample
mean(x.star)
})
The vector xbar.star contains B = 10,000 bootstrap simulations of \overline{X}
We can examine the bootstrap distribution of \overline{X} with a histogram
A plot of the data x shows that it is heavily skewed (not normal)

When the population is normally distributed N(\mu,\sigma^2), we used the t-statistic t = \frac{\overline{x} - \mu}{\mathop{\mathrm{e.s.e.}}} to form a (1-\alpha)100\% confidence interval [a,b] for the population mean \mu: P(\mu \in [a,b]) = 1-\alpha \,, \quad a, b = \overline{x} \mp t^* \times \mathop{\mathrm{e.s.e.}}, \quad t^* = t_{n-1}\left(\frac{\alpha}{2}\right)
If the population is not normal, the interval [a,b] might not be accurate, meaning that P(\mu \in [a,b]) \neq 1 - \alpha
In this case, the bootstrap sample mean can be used to form a CI
Moreover, bootstrap CIs generalize easily to any estimator T
Setting: Given the original sample x_1,\ldots,x_n from unknown population f, and a statistic T that estimates a parameter \theta
Bootstrap CI Algorithm: to simulate (1-\alpha)100\% CI for parameter \theta
1. Bootstrap B times the statistic T, obtaining the bootstrap simulations T^*_1, \ldots , T^*_B
2. Order the values T^*_1, \ldots , T^*_B to obtain T^*_{(1)} \leq \ldots \leq T^*_{(B)}
3. Let \, m = [(\alpha/2)B], where [\cdot] is the ceiling function. The percentile bootstrap CI for \theta is \left[ T^*_{(m)} , T^*_{(B + 1- m)} \right]. Endpoints are the \frac{\alpha}{2}100\% and (1 − \frac{\alpha}{2})100\% percentiles of the sample distribution of T^*_1, \ldots , T^*_B
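A minimal R sketch of the percentile CI, assuming the bootstrap simulations are already stored in a vector T.star:

# Percentile bootstrap CI from simulations T.star (assumed already computed)
alpha <- 0.05
B <- length(T.star)
T.sorted <- sort(T.star)
m <- ceiling((alpha / 2) * B)
boot.CI <- c(T.sorted[m], T.sorted[B + 1 - m])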
Original sample:
A consumer group wishes to see whether the actual mileage of a new SUV matches the advertised 17 miles per gallon
To test the claim, the group fills the SUV’s tank and records the mileage
This is repeated 10 times. The results are below
mpg | 11.4 | 13.1 | 14.7 | 14.7 | 15.0 | 15.5 | 15.6 | 15.9 | 16.0 | 16.8 |
We want to simulate 95\% bootstrap CI for the population mean \mu
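The bootstrap setup mirrors the CEO example (a sketch; the data is entered from the table above):

# Enter original sample (mpg data) and compute size
x <- c(11.4, 13.1, 14.7, 14.7, 15.0, 15.5, 15.6, 15.9, 16.0, 16.8)
n <- length(x)
# Bootstrap sample mean B = 10,000 times
B <- 10000
xbar.star <- replicate(B, mean(sample(x, n, replace = TRUE)))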
The vector xbar.star contains B = 10,000 bootstrap simulations of \overline{X}
The (1-\alpha)100\% bootstrap CI for the population mean \mu is \left[ \overline{X}^*_{(m)} , \overline{X}^*_{(B + 1- m)} \right]\,, \qquad m = [(\alpha/2)B]
We have B = 10,000 and \alpha = 0.05 \quad \implies \quad m = 250
Therefore, the 95\% bootstrap CI for the population mean \mu is \left[ \overline{X}^*_{(250)} , \overline{X}^*_{(9751)} \right]
These percentiles can be automatically computed using the quantile function
# Compute 95% CI from bootstrap samples
alpha <- 0.05
boot.CI <- quantile(xbar.star, probs = c(alpha/2, 1 - alpha/2))
# Compute 95% t-statistic CI
t.stat.CI <- t.test(x)$conf.int
# Print results
cat("Bootstrap Confidence Interval (95%):", boot.CI)
cat("\nt-statistic Confidence Interval (95%):", t.stat.CI)
Bootstrap Confidence Interval (95%): 13.86 15.71
t-statistic Confidence Interval (95%): 13.74545 15.99455
The two CIs almost overlap
This strongly suggests that the original population is approximately normal
The advertised 17 mpg does not fall in either CI. We have reason to doubt the claim
The code can be downloaded here bootstrap_CI.R
Original sample: Compensation data for a sample of CEOs in 2012
(USA data, in million dollars)
CEO Pay 2012 | 23.5 | 6.4 | 11.1 | 3.8 | 8.9 | 4.8 | 23.8 | 3.0 | 2.9 | 3.2 |
We want to simulate 95\% bootstrap CI for the mean compensation \mu
From a plot of the data (ceo12), we see that the data is not normal
# Enter original sample and compute size
x <- c(23.5, 6.4, 11.1, 3.8, 8.9, 4.8, 23.8, 3, 2.9, 3.2)
n <- length(x)
# Bootstrap sample mean B = 10,000 times
B <- 10000
xbar.star <- replicate(B, mean( sample(x, n, replace = TRUE) ))
# Compute 95% CI from bootstrap samples
alpha <- 0.05
boot.CI <- quantile(xbar.star, probs = c(alpha/2, 1-alpha/2))
# Compute 95% t-statistic CI
t.stat.CI <- t.test(x)$conf.int
# Print results
cat("Bootstrap Confidence Interval (95%):", boot.CI,
"\nt-statistic Confidence Interval (95%):", t.stat.CI)
Bootstrap Confidence Interval (95%): 4.84 14.4005
t-statistic Confidence Interval (95%): 3.339166 14.94083
The bootstrap and t-statistic CIs differ considerably, especially at the low end
The bootstrap CI is to be preferred, due to the non-normality of the data
CEO Pay 2013 | 3.2 | 3.8 | 2.6 | 3.5 | 7.0 | 20.4 | 7.5 | 3.4 | 5.0 | 6.0 |
CEO Pay 2012 | 23.5 | 6.4 | 11.1 | 3.8 | 8.9 | 4.8 | 23.8 | 3.0 | 2.9 | 3.2 |
Want to test for a difference in means via the t-statistic t = \frac{\bar{x} -\bar{y}}{s_p\sqrt{\frac{1}{n}+\frac{1}{m}}} \,, \qquad s_p = \sqrt{\frac{s^2_X(n-1)+s^2_Y(m-1)}{n+m-2}}
Want to test for a difference in variance via the F-statistic F=\frac{s_X^2}{s_Y^2}
CEO Pay 2013 | 3.2 | 3.8 | 2.6 | 3.5 | 7.0 | 20.4 | 7.5 | 3.4 | 5.0 | 6.0 |
CEO Pay 2012 | 23.5 | 6.4 | 11.1 | 3.8 | 8.9 | 4.8 | 23.8 | 3.0 | 2.9 | 3.2 |
Under assumptions of normality and independence of the populations, we have t \sim t_{n+m-2} \,, \qquad F \sim F_{n-1,m-1}
This information allows us to compute the exact p-values for the t and F tests p_t = 2 P( t_{n+m-2} > |t| ) \,, \qquad p_F = 2P(F_{n-1,m-1} > F )
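In R, these exact p-values can be computed with pt and pf (a sketch, assuming t.obs, F.obs, n, m have been computed):

# Exact p-values under normality
p.t <- 2 * pt(abs(t.obs), df = n + m - 2, lower.tail = FALSE)
p.F <- 2 * pf(F.obs, df1 = n - 1, df2 = m - 1, lower.tail = FALSE)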
Question: What if we cannot assume normality? How do we compute p-values?
Answer: Bootstrap p-values
Suppose given two samples: x_1, \ldots, x_n from a population with cdf F(x), and y_1, \ldots, y_m from a population with cdf F(x - \Delta)
Note: Population distributions have the same shape
Denoting by \mu_X and \mu_Y the means of F(x) and F(x - \Delta), we have \Delta = \mu_X - \mu_Y
We want to test for difference in means H_0 \colon \mu_X = \mu_Y \,, \quad \qquad H_1 \colon \mu_X \neq \mu_Y , \quad \mu_X < \mu_Y \,, \quad \text{ or } \quad \mu_X > \mu_Y
We are not making assumptions on F \implies distribution of t-statistic is unknown
Assume the null hypothesis is true: \mu_X = \mu_Y \quad \implies \quad \Delta = \mu_X - \mu_Y = 0
Hence, under H_0, the two samples come from the same population F
This means the two samples are actually part of a single sample of size n+m \mathcal{S} = \{x_1, \ldots, x_n , y_1, \ldots, y_m \}
The sample distribution \hat{F} of \mathcal{S} is an approximation of F
(Glivenko-Cantelli Theorem)
We can therefore bootstrap the t-statistic from the combined sample \mathcal{S}
The bootstrap samples are generated as follows:
1. Sample \{x_1^*, \ldots, x_n^*\} from \mathcal{S} with replacement
2. Sample \{y_1^*, \ldots, y_m^*\} from \mathcal{S} with replacement
3. Compute a bootstrap simulation of the t-statistic t^* = \frac{\bar{x}^* -\bar{y}^*}{s_p\sqrt{\frac{1}{n}+\frac{1}{m}}} \,, \qquad s_p = \sqrt{\frac{(s^2_X)^*(n-1)+(s^2_Y)^*(m-1)}{n+m-2}}
4. Repeat Steps 1-3, B times, to obtain t^*_1, \ldots , t^*_B
These values represent the bootstrap distribution of the t-statistic
We now compare the bootstrap t-statistic to the observed statistic t_{\rm obs} = \frac{\bar{x} -\bar{y}}{s_p\sqrt{\frac{1}{n}+\frac{1}{m}}} \,, \qquad s_p = \sqrt{\frac{s^2_X(n-1)+ s^2_Y(m-1)}{n+m-2}}
A bootstrap simulation of the t-statistic t_i^* is extreme if
Alternative | t_i^* extreme |
---|---|
\mu_X \neq \mu_Y | |t_i^*| > |t_{\rm obs}| |
\mu_X < \mu_Y | t_i^* < t_{\rm obs} |
\mu_X > \mu_Y | t_i^* > t_{\rm obs} |
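As with any bootstrap test, the p-value is then p = \frac{\# \text{ of extreme bootstrap simulations}}{B}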
Setting: Given the two samples x_1,\ldots,x_n and y_1, \ldots, y_m
1. Compute the observed t-statistic t_{\rm obs} on the given data
2. Combine the two samples into one sample \mathcal{S} = \{ x_1, \ldots, x_n, y_1, \ldots, y_m\}
3. Sample \{x_1^*, \ldots, x_n^*\} from \mathcal{S} with replacement
4. Sample \{y_1^*, \ldots, y_m^*\} from \mathcal{S} with replacement
5. Compute the t-statistic on the bootstrap samples
6. Repeat Steps 3-5, B times, obtaining B bootstrap simulations of the t-statistic t^*_1, \ldots , t^*_B
Alternative | t_i^* extreme |
---|---|
\mu_X \neq \mu_Y | |t_i^*| > |t_{\rm obs}| |
\mu_X < \mu_Y | t_i^* < t_{\rm obs} |
\mu_X > \mu_Y | t_i^* > t_{\rm obs} |
CEO Pay 2013 | 3.2 | 3.8 | 2.6 | 3.5 | 7.0 | 20.4 | 7.5 | 3.4 | 5.0 | 6.0 |
CEO Pay 2012 | 23.5 | 6.4 | 11.1 | 3.8 | 8.9 | 4.8 | 23.8 | 3.0 | 2.9 | 3.2 |
H_0 \colon \mu_X = \mu_Y \,, \qquad H_1 \colon \mu_X \neq \mu_Y
Does average pay change from 2012 to 2013?
# Enter the Wages data
ceo13 <- c(3.2, 3.8, 2.6, 3.5, 7.0, 20.4, 7.5, 3.4, 5.0, 6.0)
ceo12 <- c(23.5, 6.4, 11.1, 3.8, 8.9, 4.8, 23.8, 3.0, 2.9, 3.2)
# Compute the estimated distributions
d.ceo13 <- density(ceo13)
d.ceo12 <- density(ceo12)
# Plot the estimated distributions
plot(d.ceo13, # Plot d.ceo13
xlim = range(c(d.ceo12$x, d.ceo13$x)), # Set x-axis range
ylim = range(c(d.ceo12$y, d.ceo13$y)), # Set y-axis range
main = "Estimated Distributions of CEOs Pay") # Add title to plot
lines(d.ceo12, # Layer plot of d.ceo12
lty = 2) # Use different line style
legend("topleft", # Add legend at top-right
legend = c("CEOs Pay 2013", # Labels for legend
"CEOs Pay 2012"),
lty = c(1, 2)) # Assign curves to legend
ceo13 <- c(3.2, 3.8, 2.6, 3.5, 7.0, 20.4, 7.5, 3.4, 5.0, 6.0)
ceo12 <- c(23.5, 6.4, 11.1, 3.8, 8.9, 4.8, 23.8, 3.0, 2.9, 3.2)
We compute the observed t-statistic using t.test, with the option var.equal = T (equal variances):

# Compute the observed t-statistic
t.obs <- t.test(ceo13, ceo12, var.equal = T)$statistic
t.obs

t
-0.9491981
# Compute sample size
n <- length(ceo13)
m <- length(ceo12)
# Combine the samples
combined <- c(ceo13, ceo12)
# Single bootstrap sample
ceo13.boot <- sample(combined, n, replace = T)
ceo12.boot <- sample(combined, m, replace = T)
# Compute one bootstrap t-statistic
t.test(ceo13.boot, ceo12.boot, var.equal = T)$statistic

t
-0.02904949
In the previous slide we generated one simulated value of the t-statistic
To generate B = 10,000 bootstrap simulations, we use replicate
# Bootstrap the t-statistic B = 10,000 times
B <- 10000
t.boot <- replicate(B, {
# Single bootstrap sample
ceo13.boot <- sample(combined, n, replace = T)
ceo12.boot <- sample(combined, m, replace = T)
# Return single bootstrap t-statistic
t.test(ceo13.boot, ceo12.boot, var.equal = T)$statistic
})
The vector t.boot contains B = 10,000 bootstrap simulations of the t-statistic

# Count number of extreme statistics for two-sided test
extreme <- sum( abs(t.boot) > abs(t.obs) )
# Compute the p-value
p <- extreme / B
# Print
cat("The bootstrap p-value is:", p)
The bootstrap p-value is: 0.3715
Conclusion: No evidence (p > 0.05) of a difference in pay between 2012 and 2013
The code can be downloaded here bootstrap_t_test.R
The bootstrap p-value just computed is p = 0.3715
We compare it to the two-sample t-test p-value
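A sketch of how the t-test p-value is obtained, using the same t.test call as before:

# Two-sample t-test p-value (equal variances assumed, as before)
t.test(ceo13, ceo12, var.equal = T)$p.value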
The t-test p-value is: 0.3550911
The two p-values are quite similar, even though the data is non-normal
This indicates that the t-test is fairly robust to non-normality
However, since the data is non-normal, the bootstrap t-test is the preferred choice
H_0 \colon \sigma^2_X = \sigma^2_Y \,, \quad \qquad H_1 \colon \sigma^2_X \neq \sigma^2_Y \,, \quad \text{ or } \quad \sigma^2_X > \sigma^2_Y
As usual for F-test, we label the samples so that s_X^2 \geq s_Y^2
We make no assumptions on the population G \implies distribution of the F-statistic is unknown
Assume the null hypothesis is true: \sigma_X^2 = \sigma_Y^2 \quad \implies \quad \Delta = \frac{\sigma^2_X}{\sigma^2_Y} = 1
Hence, under H_0, the two samples X and Y have cdfs G \left( x - \mu_X \right) \quad \text{ and } \quad G \left(y - \mu_Y \right)
By centering, we obtain that X - \mu_X \quad \text{ and } \quad Y - \mu_Y \quad \text{ have both cdf } \quad G \left( z \right)
Thus, the two centered samples are part of a single sample of size n+m \mathcal{S} = \{x_1 - \overline{x}, \ldots, x_n - \overline{x} , y_1 - \overline{y}, \ldots, y_m - \overline{y} \}
The sample distribution \hat{G} of \mathcal{S} is an approximation of G
(Glivenko-Cantelli Theorem)
We bootstrap the F-statistic from the centered combined sample \mathcal{S}
The bootstrap samples are generated as follows:
1. Sample \{x_1^*, \ldots, x_n^*\} from \mathcal{S} with replacement
2. Sample \{y_1^*, \ldots, y_m^*\} from \mathcal{S} with replacement
3. Compute a bootstrap simulation of the F-statistic F^* = \frac{(s_X^2)^*}{(s_Y^2)^*}
4. Repeat Steps 1-3, B times, to obtain F^*_1, \ldots , F^*_B
These values represent the bootstrap distribution of the F-statistic
We now compare the bootstrap F-statistic to the observed statistic F_{\rm obs} = \frac{s^2_X}{s^2_Y}
A bootstrap simulation of the F-statistic F_i^* is extreme if
Alternative | F_i^* extreme |
---|---|
\sigma_X^2 \neq \sigma^2_Y | F_i^* > F_{\rm obs} \, or F_i^* < 1/F_{\rm obs} |
\sigma_X^2 > \sigma^2_Y | F_i^* > F_{\rm obs} |
p = \frac{\# \text{ of extreme bootstrap simulations}}{B}
Setting: Given the two samples x_1,\ldots,x_n and y_1, \ldots, y_m
1. Compute sample variances s_X^2 and s_Y^2. If s_Y^2 > s_X^2, swap the samples
2. Compute the observed F-statistic F_{\rm obs} on the given data
3. Combine the (centered) samples into \mathcal{S} = \{ x_1 - \overline{x}, \ldots, x_n - \overline{x}, y_1 - \overline{y}, \ldots, y_m- \overline{y}\}
4. Sample \{x_1^*, \ldots, x_n^*\} from \mathcal{S} with replacement
5. Sample \{y_1^*, \ldots, y_m^*\} from \mathcal{S} with replacement
6. Compute the F-statistic on the bootstrap samples
7. Repeat Steps 4-6, B times, obtaining B bootstrap simulations of the F-statistic F^*_1, \ldots , F^*_B
Alternative | F_i^* extreme |
---|---|
\sigma_X^2 \neq \sigma^2_Y | F_i^* > F_{\rm obs} \, or \, F_i^* < 1/F_{\rm obs} |
\sigma_X^2 > \sigma^2_Y | F_i^* > F_{\rm obs} |
CEO Pay 2013 | 3.2 | 3.8 | 2.6 | 3.5 | 7.0 | 20.4 | 7.5 | 3.4 | 5.0 | 6.0 |
CEO Pay 2012 | 23.5 | 6.4 | 11.1 | 3.8 | 8.9 | 4.8 | 23.8 | 3.0 | 2.9 | 3.2 |
We want to test for a difference in variances H_0 \colon \sigma^2_X = \sigma^2_Y \,, \qquad H_1 \colon \sigma^2_X \neq \sigma^2_Y
We already know that data is non-normal \implies Bootstrap F-test is appropriate
# Enter the data
ceo13 <- c(3.2, 3.8, 2.6, 3.5, 7.0, 20.4, 7.5, 3.4, 5.0, 6.0)
ceo12 <- c(23.5, 6.4, 11.1, 3.8, 8.9, 4.8, 23.8, 3.0, 2.9, 3.2)
# Compare variances
if (var(ceo12) > var(ceo13)) cat("Swap the samples!")
Swap the samples!
F_{\rm obs} = \frac{s_X^2}{s_Y^2} = \frac{\rm Variance \,\,\, CEO \, Pay \, in \, 2012}{\rm Variance \,\,\, CEO \, Pay \, in \, 2013}

# Compute the observed F-statistic
F.obs <- var(ceo12) / var(ceo13)
F.obs

[1] 2.383577
# Compute sample size (samples are swapped)
n <- length(ceo12)
m <- length(ceo13)
# Combine the centered samples
combined.centered <- c(ceo12 - mean(ceo12), ceo13 - mean(ceo13))
# Single bootstrap sample
ceo12.boot <- sample(combined.centered, n, replace = T)
ceo13.boot <- sample(combined.centered, m, replace = T)
# Compute one bootstrap F-statistic
var(ceo12.boot) / var(ceo13.boot)

[1] 1.589727
In the previous slide we generated one simulated value of the F-statistic
To generate B = 10,000 bootstrap simulations, we use replicate
# Bootstrap the F-statistic B = 10,000 times
B <- 10000
F.boot <- replicate(B, {
# Single bootstrap sample
ceo12.boot <- sample(combined.centered, n, replace = T)
ceo13.boot <- sample(combined.centered, m, replace = T)
# Return single bootstrap F-statistic
var(ceo12.boot) / var(ceo13.boot)
})
The vector F.boot contains B = 10,000 bootstrap simulations of the F-statistic

# Count number of extreme statistics for two-sided test
extreme <- sum( (F.boot > F.obs) | (F.boot < 1/F.obs) )
# Compute the p-value
p <- extreme / B
# Print
cat("The bootstrap p-value is:", p)
The bootstrap p-value is: 0.3506
Conclusion: No evidence (p > 0.05) of a difference in variance between populations
The code can be downloaded here bootstrap_F_test.R
The bootstrap p-value just computed is p = 0.3506
We compare it to the F-test p-value
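A sketch of how the F-test p-value can be obtained, using the built-in var.test (two-sided, matching the alternative above):

# Classical F-test p-value
var.test(ceo12, ceo13)$p.value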
The F-test p-value is: 0.2117675
The two p-values are quite different (although they give same decision)
This is due to the non-normality of data
Since the data is non-normal, the bootstrap F-test is the preferred choice
Vectors can contain only one data type (number, character, boolean)
Lists are data structures that can contain any R object
Lists can be created similarly to vectors, with the command list()
Elements of a list can be retrieved by indexing: my_list[[k]] returns the k-th element of my_list
You can return multiple items of a list via slicing:
- my_list[c(k1, ..., kn)] returns elements in positions k1, ..., kn
- my_list[k1:k2] returns elements k1 to k2
Elements of a list can be named with names(my_list) <- c("name_1", ..., "name_k")
# Create list with 3 elements
my_list <- list(2, c(T,F,T,T), "hello")
# Name each of the 3 elements
names(my_list) <- c("number", "TF_vector", "string")
# Print the named list: the list is printed along with element names
print(my_list)
$number
[1] 2
$TF_vector
[1] TRUE FALSE TRUE TRUE
$string
[1] "hello"
An element of my_list named my_name can be accessed with the dollar operator: my_list$my_name
# Create list with 3 elements and name them
my_list <- list(2, c(T,F,T,T), "hello")
names(my_list) <- c("number", "TF_vector", "string")
# Access 2nd element using dollar operator and store it in variable
second_component <- my_list$TF_vector
# Print 2nd element
print(second_component)
[1] TRUE FALSE TRUE TRUE
Vectors can be named in the same way: names(my_vector) <- c("name_1", ..., "name_k")
A named entry is then retrieved with my_vector["name_k"]
For example, if my_vector has an entry named Green with value 3, then my_vector["Green"] prints the value with its name:

Green
    3

while double-bracket indexing my_vector[["Green"]] drops the name:

[1] 3
Data Frames are the preferred way of presenting a data set in R:
Data frames can contain any R object
Data Frames are similar to Lists, with the difference that all elements must be vectors of the same length
In simpler terms: a data frame is a table in which each column is a variable and each row is a record
Data frames are constructed similarly to lists, using data.frame()
Important: Elements of a data frame must be vectors of the same length
Example: We construct the Family Guy data frame. Variables are:
- person – Name of character
- age – Age of character
- sex – Sex of character
A sketch of the construction follows.
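The values below are taken from the printout shown later:

# Construct the Family Guy data frame
family <- data.frame(person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
                     age = c(42, 40, 17, 14, 1),
                     sex = c("M", "F", "F", "M", "M"))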
Think of a data frame as a matrix: you can access the element in position (m,n) by using my_data[m, n]
Example: the element of family in position (1,1) is returned by family[1, 1]:

[1] "Peter"
To access multiple elements on the same row or column, type:
- my_data[c(k1,...,kn), m] or my_data[k1:k2, m] (several rows, column m)
- my_data[n, c(k1,...,km)] or my_data[n, k1:k2] (row n, several columns)
Example: We want to access Stewie's Age and Sex (row 5, columns 2 and 3) with family[5, 2:3]:

age sex
5 1 M
To access full rows:
- my_data[c(k1,...,kn), ] returns rows k1,...,kn
- my_data[k1:k2, ] returns rows k1 to k2
Example: family[c(1, 3), ] returns rows 1 and 3:

person age sex
1 Peter 42 M
3 Meg 17 F
To access full columns:
- my_data[ , c(k1,...,kn)] returns columns k1,...,kn
- my_data[ , k1:k2] returns columns k1 to k2
Example: family[ , c(2, 1)] returns the columns age and person:

age person
1 42 Peter
2 40 Lois
3 17 Meg
4 14 Chris
5 1 Stewie
Use the dollar operator to access data frame columns: if my_data contains a variable called my_variable, then my_data$my_variable accesses the column my_variable
Note: my_data$my_variable is a vector

Example: To access age in the family data frame, type family$age
cat("Ages of the Family Guy characters are:", family$age)

Ages of the Family Guy characters are: 42 40 17 14 1

Single entries are obtained by indexing, e.g. cat("Meg's age is", family$age[3])

Meg's age is 17
Command | Output
---|---
nrow(my_data) | # of rows
ncol(my_data) | # of columns
dim(my_data) | vector containing # of rows and columns
family_dim <- dim(family) # Stores dimensions of family in a vector
cat("The Family Guy data frame has",
family_dim[1],
"rows and",
family_dim[2],
"columns")
The Family Guy data frame has 5 rows and 3 columns
Assume given a data frame my_data

To add a record (row), use rbind:
- Create a new record new_record
- new_record must match the structure of my_data
- Update my_data with my_data <- rbind(my_data, new_record)

To add a variable (column), use cbind:
- Create a new variable new_variable
- new_variable must have as many components as rows in my_data
- Update my_data with my_data <- cbind(my_data, new_variable)
Example: We add the character Brian to family. First create the new record (a sketch consistent with the printout):

new_record <- data.frame(person = "Brian", age = 7, sex = "M")
print(new_record)

person age sex
1 Brian 7 M

Important: Names have to match the existing data frame. We now add new_record to family:

family <- rbind(family, new_record)
print(family)
person age sex
1 Peter 42 M
2 Lois 40 F
3 Meg 17 F
4 Chris 14 M
5 Stewie 1 M
6 Brian 7 M
We now add a new variable funny to the Family Guy data frame family:
- Create a vector funny with entries matching each character (including Brian)
- Add funny to family with cbind (a sketch consistent with the printout below):

funny <- c("High", "High", "Low", "Med", "High", "Med")
family <- cbind(family, funny)
print(family)
person age sex funny
1 Peter 42 M High
2 Lois 40 F High
3 Meg 17 F Low
4 Chris 14 M Med
5 Stewie 1 M High
6 Brian 7 M Med
Instead of using cbind, we can use the dollar operator:
- Create a vector v containing the values for the new variable new_variable
- v must have as many components as rows in my_data
- Update my_data with my_data$new_variable <- v
Example: We add ages in months to family, multiplying family$age by 12

v <- family$age * 12 # Computes vector of ages in months
family$age.months <- v # Stores vector as new column in family
print(family)
person age sex funny age.months
1 Peter 42 M High 504
2 Lois 40 F High 480
3 Meg 17 F Low 204
4 Chris 14 M Med 168
5 Stewie 1 M High 12
6 Brian 7 M Med 84
We saw how to use logical flag vectors to subset vectors
We can use logical flag vectors to subset data frames as well
Suppose we have a data frame my_data containing a variable my_variable
We want to subset the records in my_data for which my_variable satisfies a condition
Use the commands
flag <- condition(my_data$my_variable)
my_data[flag, ]
Example: We subset the female characters of family. The flag vector is computed with:

flag <- family$sex == "F"
print(flag)

[1] FALSE TRUE TRUE FALSE FALSE FALSE
# Subset data frame "family" and store in data frame "subset"
subset <- family[flag, ]
print(subset)
person age sex funny age.months
2 Lois 40 F High 480
3 Meg 17 F Low 204
R has many functions for reading data from stored files
We will see how to read Table-Format files
Table-Formats are just tables stored in plain-text files
Typical file extensions are:
- .txt for plain-text files
- .csv for comma-separated values

Table-Formats can be read into R with the command read.table()
Table-Formats can be read via read.table():
- Reads .txt or .csv files
- Outputs a data frame

Options of read.table():
- header = T/F (whether the first row of the file contains the variable names)
- na.strings = "string" (the string "string" is read as NA, i.e. missing values)

To read family_guy.txt into R, proceed as follows:
1. Download family_guy.txt and move the file to Desktop
2. Open the R Console and change working directory to Desktop
3. Read family_guy.txt into R and store it in the data frame family; a sketch of the code:

family <- read.table("family_guy.txt", header = TRUE)

The option header = TRUE tells read.table() that family_guy.txt has a header (if the file marks missing values with a symbol such as *, add na.strings = "*")
4. Print family to screen

person age sex funny age.months
1 Peter NA M High 504
2 Lois 40 F <NA> 480
3 Meg 17 F Low 204
4 Chris 14 M Med 168
5 Stewie 1 M High NA
6 Brian NA M Med NA
Data: Analysis of Consumer Confidence Index for the 2008 crisis (seen in Lecture 3)
- In Lecture 3 the data was entered manually with c(); this time we read it from a .txt file instead
- Goal: Perform a t-test on the CCI difference for mean difference \mu = 0, reading the data with read.table()

The CCI dataset can be downloaded here 2008_crisis.txt
The text file looks like this
To perform the t-test on the data 2008_crisis.txt, we proceed as follows:
1. Download dataset 2008_crisis.txt and move the file to Desktop
2. Open the R Console and change working directory to Desktop
3. Read 2008_crisis.txt into R and store it in the data frame scores; a sketch of the code:

scores <- read.table("2008_crisis.txt", header = TRUE)

4. Split scores into 2 vectors:

# CCI from 2007 is stored in 2nd column
score_2007 <- scores[, 2]
# CCI from 2009 is stored in 3rd column
score_2009 <- scores[, 3]
# Compute vector of differences
difference <- score_2007 - score_2009
# Perform t-test on difference with null hypothesis mu = 0
t.test(difference, mu = 0)$p.value
[1] 4.860686e-13
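Since p \approx 4.9 \times 10^{-13} < 0.05, there is very strong evidence of a difference in CCI between 2007 and 2009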
10 patients treated with both Drug A and Drug B
i | x_i | y_i |
---|---|---|
1 | 1.9 | 0.7 |
2 | 0.8 | -1.0 |
3 | 1.1 | -0.2 |
4 | 0.1 | -1.2 |
5 | -0.1 | -0.1 |
6 | 4.4 | 3.4 |
7 | 4.6 | 0.0 |
8 | 1.6 | 0.8 |
9 | 5.5 | 3.7 |
10 | 3.4 | 2.0 |
Goal: Study the relation between the reaction x_i to Drug A and the reaction y_i to Drug B
Plot: Scatter plot of the data points (x_i, y_i)
Linear relation: The scatter plot suggests an approximately linear relation between x_i and y_i
Try to fit a line through the data
Line roughly predicts y_i from x_i
However note the outlier (x_7,y_7) = (4.6, 0) (red point)
How is such a line constructed?
Least Squares Line: a line y = \beta x + \alpha, with predicted values \hat{y}_i = \beta x_i + \alpha at the points x_i
We would like predicted and actual value to be close \hat{y}_i \approx y_i
Hence the vertical difference has to be small y_i - \hat{y}_i \approx 0
We want y_i - \hat{y}_i \approx 0 \,, \qquad \forall \, i
This can be achieved by minimizing the sum of squares \min_{\alpha, \beta} \ \sum_{i} \ (y_i - \hat{y}_i)^2 \,, \qquad \hat{y}_i = \beta x_i + \alpha
The quantity minimized is called the Residual Sum of Squares \mathop{\mathrm{RSS}}:= \sum_{i} (y_i - \hat{y}_i)^2
Note: \mathop{\mathrm{RSS}} can be seen as a function \mathop{\mathrm{RSS}}\colon \mathbb{R}^2 \to \mathbb{R}\qquad \quad \mathop{\mathrm{RSS}}= \mathop{\mathrm{RSS}}(\alpha,\beta)
For a given sample (x_1,y_1), \ldots, (x_n, y_n), define
Sample Means: \overline{x} := \frac{1}{n} \sum_{i=1}^n x_i \qquad \quad \overline{y} := \frac{1}{n} \sum_{i=1}^n y_i
Sums of squares: S_{xx} := \sum_{i=1}^n ( x_i - \overline{x} )^2 \qquad \quad S_{yy} := \sum_{i=1}^n ( y_i - \overline{y} )^2
Sum of cross-products: S_{xy} := \sum_{i=1}^n ( x_i - \overline{x} ) ( y_i - \overline{y} )
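Expanding the squares gives the equivalent expressions (used in the proof below): S_{xx} = \sum_{i=1}^n x_i^2 - n \overline{x}^2 \,, \qquad \quad S_{xy} = \sum_{i=1}^n x_i y_i - n \overline{x} \, \overline{y}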
Given (x_1,y_1), \ldots, (x_n, y_n), consider the minimization problem \begin{equation} \tag{M} \min_{\alpha,\beta } \ \mathop{\mathrm{RSS}}= \min_{\alpha,\beta} \ \sum_{i=1}^n (y_i-\alpha-{\beta}x_i)^2 \end{equation} Then (M) admits the solution \hat\beta = \frac{S_{xy}}{S_{xx}} \,, \qquad \hat\alpha = \overline{y} - \hat\beta \, \overline{x}
To prove the Theorem we need some background results
A symmetric matrix is positive semi-definite if all the eigenvalues \lambda_i satisfy \lambda_i \geq 0
Proposition: A 2 \times 2 symmetric matrix M is positive semi-definite iff \det M \geq 0 \,, \qquad \quad \operatorname{Tr}(M) \geq 0
Let f \colon \mathbb{R}^2 \to \mathbb{R}, f = f (x,y), be twice differentiable, with Hessian \nabla^2 f = \left( \begin{array}{cc} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \\ \end{array} \right)
By the Proposition, \nabla^2 f is positive semi-definite iff \det \nabla^2 f = f_{xx} f_{yy} - f_{xy}^2 \geq 0 \quad \text{ and } \quad f_{xx} + f_{yy} \geq 0
Recall also that \nabla^2 f \, \text{ is positive semi-definite} \qquad \iff \qquad f \, \text{ is convex}
Suppose f \colon \mathbb{R}^2 \to \mathbb{R} has positive semi-definite Hessian. Then the following are equivalent:
The point (\hat{x},\hat{y}) is a minimizer of f, that is, f(\hat{x}, \hat{y}) = \min_{x,y} \ f(x,y)
The point (\hat{x},\hat{y}) satisfies the optimality conditions \nabla f (\hat{x},\hat{y}) = 0
Note: The proof of the above Lemma can be found in [2]
Example: Consider the function f(x,y) = x^2 + y^2
It is clear that \min_{x,y} \ f(x,y) = \min_{x,y} \ x^2 + y^2 = 0 \,, with the only minimizer being (0,0)
However, let us use the Lemma to prove this fact
The gradient is \nabla f = (f_x,f_y) = (2x, 2y) \,, so \nabla f = 0 \qquad \iff \qquad x = y = 0
The Hessian is \nabla^2 f = \left( \begin{array}{cc} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{array} \right) = \left( \begin{array}{cc} 2 & 0 \\ 0 & 2 \end{array} \right)
with \det (\nabla^2 f) = 4 > 0 \qquad \qquad \operatorname{Tr}(\nabla^2 f) = 4 > 0
Hence \nabla^2 f is positive semi-definite. By the Lemma, (0,0) is a minimizer: 0 = f(0,0) = \min_{x,y} \ f(x,y)
We go back to proving the RSS Minimization Theorem
Suppose given data points (x_1,y_1), \ldots, (x_n, y_n)
We want to solve the minimization problem
\begin{equation} \tag{M} \min_{\alpha,\beta } \ \mathop{\mathrm{RSS}}= \min_{\alpha,\beta} \ \sum_{i=1}^n (y_i-\alpha-{\beta}x_i)^2 \end{equation}
To apply the Lemma, we need to compute \nabla \mathop{\mathrm{RSS}}\quad \text{ and } \quad \nabla^2 \mathop{\mathrm{RSS}}
We first compute \nabla \mathop{\mathrm{RSS}} and solve the optimality conditions \nabla \mathop{\mathrm{RSS}}(\alpha,\beta) = 0
To this end, recall that \overline{x} := \frac{\sum_{i=1}^nx_i}{n} \qquad \implies \qquad \sum_{i=1}^n x_i = n \overline{x}
Similarly, we have \sum_{i=1}^n y_i = n \overline{y}
\begin{align*} \mathop{\mathrm{RSS}}_{\alpha} & = -2\sum_{i=1}^n(y_i- \alpha- \beta x_i) \\[10pt] & = - 2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} \\[20pt] \mathop{\mathrm{RSS}}_{\beta} & = -2\sum_{i=1}^n x_i (y_i- \alpha - \beta x_i) \\[10pt] & = - 2 \sum_{i=1}^n x_i y_i + 2 \alpha n \overline{x} + 2 \beta \sum_{i=1}^n x_i^2 \end{align*}
The optimality conditions \nabla \mathop{\mathrm{RSS}}= 0 therefore read \begin{align} - 2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} & = 0 \tag{1} \\[20pt] - 2 \sum_{i=1}^n x_i y_i + 2 \alpha n \overline{x} + 2 \beta \sum_{i=1}^n x_i^2 & = 0 \tag{2} \end{align}
Equation (1) says -2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} = 0, which is equivalent to \alpha = \overline{y}- \beta \overline{x}
Dividing by -2, Equation (2) reads \sum_{i=1}^n x_i y_i - \alpha n \overline{x} - \beta \sum_{i=1}^n x^2_i = 0
Substituting \alpha = \overline{y}- \beta \overline{x} and using the identities S_{xy} = \sum_i x_i y_i - n \overline{x}\, \overline{y} and S_{xx} = \sum_i x_i^2 - n \overline{x}^2, we see that Equation (2) is equivalent to \beta = \frac{S_{xy}}{ S_{xx} }
Also recall that Equation (1) is equivalent to \alpha = \overline{y}- \beta \overline{x}
Therefore (\hat\alpha, \hat\beta) solves the optimality conditions \nabla \mathop{\mathrm{RSS}}= 0 iff \hat\alpha = \overline{y}- \hat\beta \overline{x} \,, \qquad \quad \hat\beta = \frac{S_{xy}}{ S_{xx} }
We need to compute \nabla^2 \mathop{\mathrm{RSS}}
To this end recall that \mathop{\mathrm{RSS}}_{\alpha} = - 2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} \,, \quad \mathop{\mathrm{RSS}}_{\beta} = - 2 \sum_{i=1}^n x_i y_i + 2 \alpha n \overline{x} + 2 \beta \sum_{i=1}^n x_i^2
Therefore we have \begin{align*} \mathop{\mathrm{RSS}}_{\alpha \alpha} & = 2n \qquad & \mathop{\mathrm{RSS}}_{\alpha \beta} & = 2 n \overline{x} \\ \mathop{\mathrm{RSS}}_{\beta \alpha } & = 2 n \overline{x} \qquad & \mathop{\mathrm{RSS}}_{\beta \beta} & = 2 \sum_{i=1}^{n} x_i^2 \end{align*}
\begin{align*} \det (\nabla^2 \mathop{\mathrm{RSS}}) & = \mathop{\mathrm{RSS}}_{\alpha \alpha}\mathop{\mathrm{RSS}}_{\beta \beta} - \mathop{\mathrm{RSS}}_{\alpha \beta}^2 \\[10pt] & = 4n \sum_{i=1}^{n} x_i^2 - 4 n^2 \overline{x}^2 \\[10pt] & = 4n \left( \sum_{i=1}^{n} x_i^2 - n \overline{x}^2 \right) \\[10pt] & = 4n S_{xx} \end{align*}
Since S_{xx} = \sum_{i=1}^n (x_i - \overline{x})^2 \geq 0, we conclude \det (\nabla^2 \mathop{\mathrm{RSS}}) = 4n S_{xx} \geq 0
\operatorname{Tr}(\nabla^2\mathop{\mathrm{RSS}}) = \mathop{\mathrm{RSS}}_{\alpha \alpha} + \mathop{\mathrm{RSS}}_{\beta \beta} = 2n + 2 \sum_{i=1}^{n} x_i^2 \geq 0
Therefore we have proven \det( \nabla^2 \mathop{\mathrm{RSS}}) \geq 0 \,, \qquad \quad \operatorname{Tr}(\nabla^2\mathop{\mathrm{RSS}}) \geq 0
As the Hessian is symmetric, we conclude that \nabla^2 \mathop{\mathrm{RSS}} is positive semi-definite
By the Lemma, we have that all the solutions (\alpha,\beta) to the optimality conditions \nabla \mathop{\mathrm{RSS}}(\alpha,\beta) = 0 are minimizers
Therefore (\hat \alpha,\hat\beta) with \hat\alpha = \overline{y}- \hat\beta \overline{x} \,, \qquad \quad \hat\beta = \frac{S_{xy}}{ S_{xx} } is a minimizer of \mathop{\mathrm{RSS}}, ending the proof
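As a sanity check, the closed-form solution can be verified numerically (a sketch; x and y contain the drug data used below):

# Numerical check: minimize RSS with optim and compare to the closed form
x <- c(1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 4.6, 1.6, 5.5, 3.4)
y <- c(0.7, -1.0, -0.2, -1.2, -0.1, 3.4, 0.0, 0.8, 3.7, 2.0)
rss <- function(p) sum((y - p[1] - p[2] * x)^2)
optim(c(0, 0), rss)$par # close to (alpha.hat, beta.hat) = (-0.786, 0.685)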
The previous Theorem motivates the following definition: the least squares line of the data is the line y = \hat{\alpha} + \hat{\beta} x \,, \qquad \hat{\alpha} = \overline{y} - \hat{\beta} \ \overline{x} \,, \quad \hat{\beta} = \frac{S_{xy}}{S_{xx}}
In R, we want to do the following:
Input the data into vectors x and y
Compute the least-square line coefficients \hat{\alpha} = \overline{y} - \hat{\beta} \ \overline{x} \qquad \qquad \hat{\beta} = \frac{S_{xy}}{S_{xx}}
Plot the data points (x_i,y_i)
Overlay the least squares line
i | x_i | y_i |
---|---|---|
1 | 1.9 | 0.7 |
2 | 0.8 | -1.0 |
3 | 1.1 | -0.2 |
4 | 0.1 | -1.2 |
5 | -0.1 | -0.1 |
6 | 4.4 | 3.4 |
7 | 4.6 | 0.0 |
8 | 1.6 | 0.8 |
9 | 5.5 | 3.7 |
10 | 3.4 | 2.0 |
We give a first solution using elementary R functions
The code to input the data into a data-frame is as follows
# Input blood pressure changes data into data-frame
changes <- data.frame(drug_A = c(1.9, 0.8, 1.1, 0.1, -0.1,
4.4, 4.6, 1.6, 5.5, 3.4),
drug_B = c(0.7, -1.0, -0.2, -1.2, -0.1,
3.4, 0.0, 0.8, 3.7, 2.0)
)
We assign the columns drug_A and drug_B to the vectors x and y:

x <- changes$drug_A
y <- changes$drug_B
# Compute sample means and the sums S_xx, S_xy
xbar <- mean(x); ybar <- mean(y)
S_xx <- sum((x - xbar)^2)
S_xy <- sum((x - xbar) * (y - ybar))
# Compute least squares line coefficients
beta <- S_xy / S_xx
alpha <- ybar - beta * xbar
# Print coefficients
cat("\nCoefficient alpha =", alpha)
cat("\nCoefficient beta =", beta)
Coefficient alpha = -0.7861478
Coefficient beta = 0.685042
# Plot the data
plot(x, y, xlab = "", ylab = "", pch = 16, cex = 2)
# Add labels
mtext("Drug A reaction x_i", side = 1, line = 3, cex = 2.1)
mtext("Drug B reaction y_i", side = 2, line = 2.5, cex = 2.1)
Note: We have added a few cosmetic options
- pch = 16 plots points with black circles
- cex = 2 stands for character expansion – specifies width of points
- mtext is used to fine-tune the axis labels
  - side = 1 stands for x-axis
  - side = 2 stands for y-axis
  - line specifies the distance of the label from the axis
- abline(a, b) adds the line y = a + bx to the current plot
- col specifies the color of the plot
- lwd specifies the line width

The code can be downloaded here least_squares_1.R
Running the code we obtain the plot below
A second solution uses the built-in R function lm
- lm stands for linear model
- We can use lm to fit the least squares line

The syntax is lm(formula, data)
- data expects a data-frame in input
- formula stands for the relation to fit
- In our case formula = y ~ x, where x and y are names of two variables in the data-frame
- y ~ x can be read as: y modelled as function of x

Note: Storing data in a data-frame is optional
- We can just store data in vectors x and y
- Then fit the least squares line with lm(y ~ x)
In our example the data-frame is changes, so the code is

# Fit least squares line
least_squares <- lm(formula = drug_B ~ drug_A, data = changes)
# Print output
print(least_squares)
Call:
lm(formula = drug_B ~ drug_A, data = changes)
Coefficients:
(Intercept) drug_A
-0.7861 0.6850
- The above tells us that the estimators are
\hat \alpha = -0.7861 \,, \qquad \quad \hat \beta = 0.6850
We plot the data again, using changes$drug_A and changes$drug_B as coordinates
The least squares line is currently stored in least_squares
To add the line to the current plot, use abline(least_squares)
The code can be downloaded here least_squares_2.R
Running the code we obtain the plot below