Statistical Models

Lecture 7:
Bootstrap &
Least Squares

Outline of Lecture 7

  1. The Bootstrap
  2. Bootstrap Confidence Intervals
  3. Bootstrap t-test
  4. Bootstrap F-test
  5. Lists
  6. Data Frames
  7. Data Entry
  8. Least squares
  9. Worked Example

Part 1:
The Bootstrap

Motivation

  • So far, our simulations used knowledge of the population distribution, e.g.

    • To simulate \pi, we sampled points from {\rm Uniform}([-1,1] \times [-1,1])
    • To simulate the \chi^2 statistic, we sampled Multinomial counts
    • To simulate confidence intervals, we sampled from a normal population
  • Problem: Sometimes the population distribution f is unknown

  • Bootstrap: Replace the unknown distribution f with a known distribution \hat{f}

The Sample Distribution

  • Assume given a sample x_1,\ldots,x_n from a population f

  • The Sample Distribution is the discrete distribution which puts mass \frac1n at each sample point \hat{f}(x) = \begin{cases} \frac1n & \quad \text{ if } \, x \in \{x_1, \ldots, x_n\} \\ 0 & \quad \text{ otherwise } \end{cases}

Theorem: Glivenko-Cantelli
As the sample size increases \lim_{n \to \infty} \hat{f} = f

(more precisely, the empirical cdf \hat{F} converges to the population cdf F uniformly, almost surely: \hat{F} \to F)

Main Bootstrap idea

Setting: Assume given the original sample x_1,\ldots,x_n from unknown population f

Bootstrap: Regard the sample as the whole population

  • Replace the unknown distribution f with the sample distribution \hat{f}(x) = \begin{cases} \frac1n & \quad \text{ if } \, x \in \{x_1, \ldots, x_n\} \\ 0 & \quad \text{ otherwise } \end{cases}

  • Any sampling will be done from \hat{f} \qquad \quad (motivated by Glivenko-Cantelli Thm)

Note: \hat{f} puts mass \frac1n at each sample point

Drawing an observation from \hat f is equivalent to drawing one point at random from the original sample \{x_1,\ldots,x_n\}
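In R, assuming the original sample is stored in a vector x (an illustrative name), drawing one observation from \hat{f} is simply:

# Draw one observation from the sample distribution f-hat
sample(x, 1)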

The Bootstrap Algorithm

Setting: Given the original sample x_1,\ldots,x_n from unknown population f

Bootstrap Algorithm: to estimate the distribution of a statistic of interest T

  1. Draw sample x_1^*,\ldots,x_n^* from \{x_1, \ldots, x_n\} with replacement

  2. Compute the statistic T on the bootstrap sample T^* = T(x_1^*,\ldots,x_n^*)

  3. Repeat Steps 1 and 2, B times, to get B bootstrap simulations of T: T^*_1, \ldots , T^*_B. These values represent the bootstrap distribution of T

Note: If population distribution f is known, we would just sample from f
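As a sketch, the whole algorithm fits in a few lines of R. Here boot.stat and T.fun are illustrative names, not part of the lecture code: T.fun is any function computing the statistic T

# Generic bootstrap of a statistic T (illustrative helper)
boot.stat <- function(x, T.fun, B) {
  n <- length(x)
  replicate(B, {
    x.star <- sample(x, n, replace = TRUE)   # Step 1: resample with replacement
    T.fun(x.star)                            # Step 2: compute T on the resample
  })
}

# Example usage: bootstrap distribution of the median
# T.star <- boot.stat(x, median, B = 10000)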


  • Original sample \{x_1,\ldots,x_n\} is drawn from the population \qquad (Image from Wikipedia)

  • Resamples \{x_1^*,\ldots,x_n^*\} are generated by drawing from \{x_1,\ldots,x_n\} with replacement

  • Note: Data points x_i can be drawn more than once (shown in red and slightly offset)

  • For each resample, the statistic T^* is calculated

  • Therefore, a histogram of these values can be used to estimate the bootstrap distribution of T

Q & A

References?

  • Simulation and bootstrap are huge topics

  • Good introduction: the book by Efron (inventor of bootstrap) and Tibshirani [1]

Why is the Bootstrap useful?

  • Because the population distribution is often unknown

  • Despite this, the bootstrap allows us to do inference on the statistic T

Why does the Bootstrap work? We are just resampling from the original sample!

  • It (often) works because the original sample encodes population variability

  • By resampling the original sample, we are simulating this variability

Q & A

How good is the bootstrap distribution of T?

  • It can be a good approximation of the true distribution of T when the
    original sample is sufficiently variable
    • Things tend to go well for large samples (n \geq 50)
  • Bootstrap works well when the statistic T is a mean (or something like a mean)
    • for example median, regression coefficient, or standard deviation.
  • The bootstrap has difficulties when the statistic T is influenced by outliers
    • for example if T is the range, that is, T = max value - min value

Q & A

What are the limitations of bootstrap?

  • Computationally expensive
    • For decent simulation of bootstrap distribution of T, need at least
      B = 10,000 resamples (100,000 would be better!)
  • Relies on a sufficiently variable original sample
    • If the original sample is not good, the bootstrap cannot recover from this problem
    • In this case the bootstrap population will not look like the true population
    • The bootstrap resampling will not be useful
  • No way to determine if the original sample is good enough

However, problems are usually mitigated by large enough original sample (n \geq 50)

Example: Bootstrapping the mean

Setting: Given the original sample x_1,\ldots,x_n from unknown population f

Bootstrap Algorithm: to estimate the distribution of the sample mean \overline{X}

  1. Draw sample x_1^*,\ldots,x_n^* from \{x_1, \ldots, x_n\} with replacement

  2. Compute the sample mean \overline{X} on the bootstrap sample \overline{X}^* = \frac{1}{n} \sum_{i=1}^n x_i^*

  3. Repeat Steps 1 and 2, B times, to get B bootstrap simulations of \overline{X}: \overline{X}^*_1, \ldots , \overline{X}^*_B. These values represent the bootstrap distribution of \overline{X}

Implementation in R

  • Store the original sample into a vector and compute length
# Store original sample and compute length

x <- c(x_1, ..., x_n)
n <- length(x)
  • To generate one bootstrap simulation of \overline{X}:
    1. Sample n times from \{x_1,\ldots,x_n\} with replacement
    2. This gives the bootstrap sample \{x_1^*,\ldots,x_n^*\}
    3. Compute sample mean of the bootstrap sample
# Bootstrap sample mean one time
x.star <- sample(x, n, replace = TRUE)     # Sample n times with replacement
xbar.star <- mean(x.star)

  • To generate B = 10,000 bootstrap simulations of \overline{X}, use replicate
# Bootstrap sample mean B = 10,000 times
B <- 10000

xbar.star <- replicate(B, {
                       # Generate bootstrap sample
                       x.star <- sample(x, n, replace = TRUE)
                       # Return mean of bootstrap sample
                       mean(x.star)
                      })
  • The vector xbar.star contains B = 10,000 bootstrap samples of \overline{X}

Why is this useful? When the population is normal N(\mu,\sigma^2), we know that \overline{X} \sim N(\mu, \sigma^2/n). However: population distribution unknown \implies distribution of \overline{X} unknown

Bootstrap gives a way to estimate distribution of \overline{X}
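For instance, assuming xbar.star was computed as in the code above, the bootstrap distribution can be summarized directly:

# Summarize the bootstrap distribution of the sample mean
mean(xbar.star)   # Centre of the bootstrap distribution
sd(xbar.star)     # Bootstrap estimate of the standard error of the sample mean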

Worked Example

Original sample: CEOs compensation in 2012 (USA data - in million dollars)

CEO Pay 2012 23.5 6.4 11.1 3.8 8.9 4.8 23.8 3.0 2.9 3.2

We want to simulate the bootstrap distribution of the sample mean \overline{X}

  • To do so, we generate B = 10,000 bootstrap simulations of sample mean \overline{X}
# Enter original sample and compute size
x <- c(23.5, 6.4, 11.1, 3.8, 8.9, 4.8, 23.8, 3, 2.9, 3.2)
n <- length(x)

# Bootstrap sample mean B = 10,000 times
B <- 10000

xbar.star <- replicate(B, {
                       # Generate bootstrap sample
                       x.star <- sample(x, n, replace = TRUE)
                       # Return mean of bootstrap sample
                       mean(x.star)
                      })

  • The vector xbar.star contains B = 10000 bootstrap samples from \overline{X}

  • We can examine the bootstrap distribution of \overline{X} with a histogram

    • If the population were normal, we would expect \overline{X} to be normal
    • However, the bootstrap distribution of \overline{X} is skewed to the right \implies the population might not be normal
hist(xbar.star)

  • Plotting the density of x shows that the data is heavily skewed (not normal)
plot(density(x))
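As a further check, one can overlay on the density plot a normal curve with matching mean and standard deviation; the mismatch makes the skewness evident. A minimal sketch (inside curve, the symbol x is the plotting variable, not the data vector):

# Overlay a normal density with the same mean and sd as the data
m <- mean(x)
s <- sd(x)
curve(dnorm(x, mean = m, sd = s), add = TRUE, lty = 2)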

Part 2:
Bootstrap CI

Bootstrap Confidence Intervals

  • When the population is normally distributed N(\mu,\sigma^2), we used the t-statistic t = \frac{\overline{x} - \mu}{\mathop{\mathrm{e.s.e.}}} to form a (1-\alpha)100\% confidence interval [a,b] for the population mean \mu P(\mu \in [a,b]) = 1-\alpha \,, \quad a, b = \overline{x} \mp t^* \times \mathop{\mathrm{e.s.e.}}, \quad t^* = t_{n-1}\left(\frac{\alpha}{2}\right)

  • If the population is not normal, the interval [a,b] might not be accurate, meaning that P(\mu \in [a,b]) \neq 1 - \alpha

  • In this case, the bootstrap sample mean can be used to form a CI

  • Moreover, bootstrap CI easily generalize to any estimator T

Algorithm: Bootstrap Confidence Intervals

Setting: Given the original sample x_1,\ldots,x_n from unknown population f, and

  • \theta population parameter \,\, (e.g. \, \theta = mean), \,\, T estimator for \theta \,\, (e.g. \, T = sample mean)

Bootstrap CI Algorithm: to simulate (1-\alpha)100\% CI for parameter \theta

  1. Bootstrap B times the statistic T, obtaining the bootstrap simulations T^*_1, \ldots , T^*_B

  2. Order the values T^*_1, \ldots , T^*_B to obtain T^*_{(1)} \leq \ldots \leq T^*_{(B)}

  3. Let \, m = [(\alpha/2)B], where [\cdot] is the ceiling function. The percentile bootstrap CI for \theta is \left[ T^*_{(m)} , T^*_{(B + 1- m)} \right]. The endpoints are the \frac{\alpha}{2}100\% and (1 - \frac{\alpha}{2})100\% percentiles of the sample distribution of T^*_1, \ldots , T^*_B
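The algorithm translates directly into R. A minimal sketch, using the sample median as the statistic T and assuming the data is stored in a vector x:

# Percentile bootstrap CI for the population median
B <- 10000
alpha <- 0.05
n <- length(x)

# Step 1: bootstrap the statistic T = median
T.star <- replicate(B, median(sample(x, n, replace = TRUE)))

# Steps 2-3: order the simulations and take the percentile endpoints
T.sorted <- sort(T.star)
m <- ceiling((alpha / 2) * B)
boot.CI <- c(T.sorted[m], T.sorted[B + 1 - m])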

Worked Example 1

Original sample:

  • A consumer group wishes to see whether the actual mileage of a new SUV matches the advertised 17 miles per gallon

  • To test the claim, the group fills the SUV’s tank and records the mileage

  • This is repeated 10 times. The results are below

mpg 11.4 13.1 14.7 14.7 15.0 15.5 15.6 15.9 16.0 16.8

We want to simulate 95\% bootstrap CI for the population mean \mu

  • First, we generate B = 10,000 bootstrap simulations of sample mean \overline{X}
# Enter original sample and compute size
x <- c(11.4, 13.1, 14.7, 14.7, 15.0, 15.5, 15.6, 15.9, 16.0, 16.8)
n <- length(x)

# Bootstrap sample mean B = 10,000 times
B <- 10000

xbar.star <- replicate(B, mean( sample(x, n, replace = TRUE) ))

  • The vector xbar.star contains B = 10000 bootstrap samples from \overline{X}

  • The (1-\alpha)100\% bootstrap CI for the population mean \mu is \left[ \overline{X}^*_{(m)} , \overline{X}^*_{(B + 1- m)} \right]\,, \qquad m = [(\alpha/2)B]

  • We have B = 10,000 and \alpha = 0.05 \quad \implies \quad m = 250

  • Therefore, the 95\% bootstrap CI for the population mean \mu is \left[ \overline{X}^*_{(250)} , \overline{X}^*_{(9751)} \right]

  • These percentiles can be automatically computed using the quantile function

# Compute 95% CI from bootstrap samples
alpha <- 0.05
boot.CI <- quantile(xbar.star, probs = c(alpha/2, 1-alpha/2))

  • We compare the bootstrap CI to the usual t-statistic CI
# Compute 95% t-statistic CI
t.stat.CI <- t.test(x)$conf.int

# Print results
cat("Bootstrap Confidence Interval (95%):", boot.CI)
cat("\nt-statistic Confidence Interval (95%):", t.stat.CI)


Bootstrap Confidence Interval (95%): 13.86 15.71

t-statistic Confidence Interval (95%): 13.74545 15.99455


  • The two CIs almost overlap

  • This strongly suggests that the original population is normal

    • Indeed, this is the case: you can verify it by plotting the density of mpg
  • The advertised 17 mpg does not fall in either CI. We have reason to doubt the claim

  • The code can be downloaded here bootstrap_CI.R

Worked Example 2

Original sample: Compensation data for a sample of CEOs in 2012
(USA data, in million dollars)


CEO Pay 2012 23.5 6.4 11.1 3.8 8.9 4.8 23.8 3.0 2.9 3.2


We want to simulate 95\% bootstrap CI for the mean compensation \mu

ceo12 <- c(23.5, 6.4, 11.1, 3.8, 8.9, 4.8, 23.8, 3, 2.9, 3.2)
plot(density(ceo12))
  • Plotting the density of ceo12, we see that the data is not normal
    • Therefore, CI based on t-statistic would not be appropriate
    • Need to compute a Bootstrap CI for \mu


# Enter original sample and compute size
x <- c(23.5, 6.4, 11.1, 3.8, 8.9, 4.8, 23.8, 3, 2.9, 3.2)
n <- length(x)

# Bootstrap sample mean B = 10,000 times
B <- 10000

xbar.star <- replicate(B, mean( sample(x, n, replace = TRUE) ))

# Compute 95% CI from bootstrap samples
alpha <- 0.05
boot.CI <- quantile(xbar.star, probs = c(alpha/2, 1-alpha/2))

# Compute 95% t-statistic CI
t.stat.CI <- t.test(x)$conf.int

# Print results
cat("Bootstrap Confidence Interval (95%):", boot.CI, 
"\nt-statistic Confidence Interval (95%):", t.stat.CI)
Bootstrap Confidence Interval (95%): 4.84 14.4005 
t-statistic Confidence Interval (95%): 3.339166 14.94083


  • The bootstrap and t-statistic CIs differ considerably, especially at the low end

  • The bootstrap CI is to be preferred, due to the non-normality of the data

Part 3:
Bootstrap t-test

Motivation

  • Compensation data for a sample of CEOs in 2012 and 2013
    (USA data, in million dollars)
CEO Pay 2013 3.2 3.8 2.6 3.5 7.0 20.4 7.5 3.4 5.0 6.0
CEO Pay 2012 23.5 6.4 11.1 3.8 8.9 4.8 23.8 3.0 2.9 3.2
  • Want to test for a difference in means via the t-statistic t = \frac{\bar{x} -\bar{y}}{s_p\sqrt{\frac{1}{n}+\frac{1}{m}}} \,, \qquad s_p = \sqrt{\frac{s^2_X(n-1)+s^2_Y(m-1)}{n+m-2}}

  • Want to test for a difference in variance via the F-statistic F=\frac{s_X^2}{s_Y^2}

Motivation

  • Compensation data for a sample of CEOs in 2012 and 2013
    (USA data, in million dollars)
CEO Pay 2013 3.2 3.8 2.6 3.5 7.0 20.4 7.5 3.4 5.0 6.0
CEO Pay 2012 23.5 6.4 11.1 3.8 8.9 4.8 23.8 3.0 2.9 3.2
  • Under assumptions of normality and independence of the populations, we have t \sim t_{n+m-2} \,, \qquad F \sim F_{n-1,m-1}

  • This information allows to compute the exact p-values for the t and F tests p_t = 2 P( t_{n+m-2} > |t| ) \,, \qquad p_F = 2P(F_{n-1,m-1} > F )

  • Question: What if we cannot assume normality? How do we compute p-values?

  • Answer: Bootstrap p-values

    • like simulated p-values, but sampling from the sample, instead of the population

Bootstrap t-test procedure

  • Suppose given two samples:

    • x_1, \ldots, x_n from a population with cdf F(x)
    • y_1, \ldots, y_m from a population with cdf F(x - \Delta)
  • Note: Population distributions have the same shape

    • Same spread, but they are shifted by \Delta \in \mathbb{R}
  • Denoting by \mu_X and \mu_Y the means of F(x) and F(x - \Delta), we have \Delta = \mu_Y - \mu_X

  • We want to test for difference in means H_0 \colon \mu_X = \mu_Y \,, \quad \qquad H_1 \colon \mu_X \neq \mu_Y , \quad \mu_X < \mu_Y \,, \quad \text{ or } \quad \mu_X > \mu_Y

  • We are not making assumptions on F \implies distribution of t-statistic is unknown

    • Need to bootstrap the t-statistic

  • Assume the null hypothesis is true: \mu_X = \mu_Y \quad \implies \quad \Delta = \mu_X - \mu_Y = 0

  • Hence, under H_0, the two samples come from the same population F

  • This means the two samples are actually part of a single sample of size n+m \mathcal{S} = \{x_1, \ldots, x_n , y_1, \ldots, y_m \}

  • The sample distribution \hat{F} of \mathcal{S} is an approximation of F
    (Glivenko-Cantelli Theorem)

  • We can therefore bootstrap the t-statistic from the combined sample \mathcal{S}

  • The bootstrap samples are generated as follows:

    • Sample \{x_1^*, \ldots, x_n^*\} from \mathcal{S} with replacement
    • Sample \{y_1^*, \ldots, y_m^*\} from \mathcal{S} with replacement
  • Compute a bootstrap simulation of the t-statistic t^* = \frac{\bar{x}^* -\bar{y}^*}{s_p\sqrt{\frac{1}{n}+\frac{1}{m}}} \,, \qquad s_p = \sqrt{\frac{(s^2_X)^*(n-1)+(s^2_Y)^*(m-1)}{n+m-2}}

  • Repeat B times to obtain t^*_1, \ldots , t^*_B

  • These values represent the bootstrap distribution of the t-statistic

  • We now compare the bootstrap t-statistic to the observed statistic t_{\rm obs} = \frac{\bar{x} -\bar{y}}{s_p\sqrt{\frac{1}{n}+\frac{1}{m}}} \,, \qquad s_p = \sqrt{\frac{s^2_X(n-1)+ s^2_Y(m-1)}{n+m-2}}

  • A bootstrap simulation of the t-statistic t_i^* is extreme if

Alternative t_i^* extreme
\mu_X \neq \mu_Y |t_i^*| > |t_{\rm obs}|
\mu_X < \mu_Y t_i^* < t_{\rm obs}
\mu_X > \mu_Y t_i^* > t_{\rm obs}
  • The bootstrap p-value is computed by p = \frac{\# \text{ of extreme bootstrap simulations}}{B} Note: Condition |t_i^*| > |t_{\rm obs}| is equivalent to t_i^* > |t_{\rm obs}| \,\text{ or }\, t_i^* < -|t_{\rm obs}|

Algorithm: Bootstrap t-test

Setting: Given the two samples x_1,\ldots,x_n and y_1, \ldots, y_m

  1. Compute the observed t-statistic t_{\rm obs} on the given data

  2. Combine the two samples into one sample \mathcal{S} = \{ x_1, \ldots, x_n, y_1, \ldots, y_m\}

  3. Sample \{x_1^*, \ldots, x_n^*\} from \mathcal{S} with replacement

  4. Sample \{y_1^*, \ldots, y_m^*\} from \mathcal{S} with replacement

  5. Compute the t-statistic on the bootstrap samples

  6. Repeat steps 3-5, B times, obtaining B bootstrap simulations of t-statistic t^*_1, \ldots , t^*_B

  7. Compute the p-value by p = \frac{\# \text{ extreme } t_i^*}{B}
Alternative t_i^* extreme
\mu_X \neq \mu_Y |t_i^*| > |t_{\rm obs}|
\mu_X < \mu_Y t_i^* < t_{\rm obs}
\mu_X > \mu_Y t_i^* > t_{\rm obs}

Worked Example

  • Compensation data for a sample of CEOs in 2012 and 2013
    (USA data, in million dollars)
CEO Pay 2013 3.2 3.8 2.6 3.5 7.0 20.4 7.5 3.4 5.0 6.0
CEO Pay 2012 23.5 6.4 11.1 3.8 8.9 4.8 23.8 3.0 2.9 3.2


  • We want to test for a difference in means

H_0 \colon \mu_X = \mu_Y \,, \qquad H_1 \colon \mu_X \neq \mu_Y

Does average pay change from 2012 to 2013?

Plotting the densities

  • We see that Pay is not normally distributed \implies Bootstrap t-test is appropriate
# Enter the CEO pay data
ceo13 <- c(3.2, 3.8, 2.6, 3.5, 7.0, 20.4, 7.5, 3.4, 5.0, 6.0)
ceo12 <- c(23.5, 6.4, 11.1, 3.8, 8.9, 4.8, 23.8, 3.0, 2.9, 3.2)


# Compute the estimated distributions
d.ceo13 <- density(ceo13)
d.ceo12 <- density(ceo12)


# Plot the estimated distributions

plot(d.ceo13,                                    # Plot d.ceo13
     xlim = range(c(d.ceo12$x, d.ceo13$x)),        # Set x-axis range
     ylim = range(c(d.ceo12$y, d.ceo13$y)),        # Set y-axis range
     main = "Estimated Distributions of CEOs Pay") # Add title to plot
lines(d.ceo12,                                    # Layer plot of d.ceo12
      lty = 2)                                  # Use different line style
         
legend("topleft",                               # Add legend at top-right
       legend = c("CEOs Pay 2013",             # Labels for legend
                  "CEOs Pay 2012"), 
       lty = c(1, 2))                           # Assign curves to legend          

Enter the data and compute t_{\rm obs}

  • Enter the data
ceo13 <- c(3.2, 3.8, 2.6, 3.5, 7.0, 20.4, 7.5, 3.4, 5.0, 6.0)
ceo12 <- c(23.5, 6.4, 11.1, 3.8, 8.9, 4.8, 23.8, 3.0, 2.9, 3.2)
  • Calculate t_{\rm obs}
    • This can be done by using t.test
    • Under the null hypothesis, the two samples come from the same population
    • Therefore, we need to specify the populations have the same variance
    • This is done by including var.equal = T
# Calculate observed t-statistic
t.obs <- t.test(ceo13, ceo12, var.equal = T)$statistic
         t 
-0.9491981 

Simulating a single bootstrap t-statistic

  • Compute sample size, and combine the observations into a single sample
# Compute sample size
n <- length(ceo13)
m <- length(ceo12)

# Combine the samples
combined <- c(ceo13, ceo12)
  • Calculate one bootstrap sample for each year from the combined sample
ceo13.boot <- sample(combined, n, replace = T)
ceo12.boot <- sample(combined, m, replace = T)
  • Calculate the simulated bootstrap t-statistic
t.boot <- t.test(ceo13.boot, ceo12.boot, var.equal = T)$statistic
          t 
-0.02904949 

Bootstrapping the t-statistic

  • In the previous slide we generated one simulated value of the t-statistic

  • To generate B = 10,000 bootstrap simulations, we use replicate

# Bootstrap the t-statistic B = 10,000 times
B <- 10000

t.boot <- replicate(B, {
                    # Single bootstrap sample
                    ceo13.boot <- sample(combined, n, replace = T)
                    ceo12.boot <- sample(combined, m, replace = T)
                    
                    # Return single bootstrap t-statistic
                    t.test(ceo13.boot, ceo12.boot, var.equal = T)$statistic
                    })
  • The vector t.boot contains B=10,000 bootstrap simulations of the t-statistic
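Before computing the p-value, it can help to visualize the bootstrap distribution against the observed statistic. A quick sketch, assuming t.boot and t.obs from the code above:

# Histogram of the bootstrap t-statistics, with the observed value marked
hist(t.boot)
abline(v = c(t.obs, -t.obs), col = "red", lwd = 2)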

Compute the bootstrap p-value

  • We are conducting a two-sided test. Hence, the bootstrap p-value is p = \frac{ \# \text{ extreme } t^*_i}{B} = \frac{ \#\{ i \colon |t^*_i| > |t_{\rm obs}| \} }{B}
# Count number of extreme statistics for two-sided test
extreme <- sum ( abs( t.boot ) > abs (t.obs) )

# Compute the p-value
p <- extreme / B

# Print
cat("The bootstrap p-value is:", p)
The bootstrap p-value is: 0.3715
  • Conclusion: No evidence (p > 0.05) of a difference in pay between 2012 and 2013

  • The code can be downloaded here bootstrap_t_test.R

Comparison with normal family test

  • The bootstrap p-value just computed is p = 0.3715

  • We compare it to the two-sample t-test p-value

p2 <- t.test(ceo13, ceo12, var.equal = T)$p.value

cat("The t-test p-value is:", p2)
The t-test p-value is: 0.3550911


  • The two p-values are quite similar, even though the data is non-normal

  • This indicates that the t-test is fairly robust to non-normality

However, since data is non-normal, the bootstrap t-test is the preferred choice

Part 4:
Bootstrap F-test

Bootstrap F-test procedure

  • Suppose given a cdf G(\cdot), and two samples:
    • x_1, \ldots, x_n from a population X with cdf \, G \left( x - \mu_X \right)
    • y_1, \ldots, y_m from a population Y with cdf \, G \left( \dfrac{y - \mu_Y }{\Delta} \right)
  • Note: Population distributions have similar shape,
    • However means and spreads are (potentially) different
  • Denoting by \sigma^2_X and \sigma^2_Y the variances of X and Y, we have \Delta^2 = \frac{\sigma^2_Y}{\sigma^2_X}

  • We want to test for difference in variances

H_0 \colon \sigma^2_X = \sigma^2_Y \,, \quad \qquad H_1 \colon \sigma^2_X \neq \sigma^2_Y \,, \quad \text{ or } \quad \sigma^2_X > \sigma^2_Y

  • As usual for F-test, we label the samples so that s_X^2 \geq s_Y^2

  • Not making assumptions on population G \implies distribution of F-statistic is unknown

    • Need to bootstrap the F-statistic
  • Assume the null hypothesis is true: \sigma_X^2 = \sigma_Y^2 \quad \implies \quad \Delta = 1

  • Hence, under H_0, the two samples X and Y have cdfs G \left( x - \mu_X \right) \quad \text{ and } \quad G \left(y - \mu_Y \right)

  • By centering, we obtain that X - \mu_X \quad \text{ and } \quad Y - \mu_Y \quad \text{ both have cdf } \quad G \left( z \right)

  • Thus, the two centered samples are part of a single sample of size n+m \mathcal{S} = \{x_1 - \overline{x}, \ldots, x_n - \overline{x} , y_1 - \overline{y}, \ldots, y_m - \overline{y} \}

  • The sample distribution \hat{G} of \mathcal{S} is an approximation of G
    (Glivenko-Cantelli Theorem)

  • We bootstrap the F-statistic from the centered combined sample \mathcal{S}

  • The bootstrap samples are generated as follows:

    • Sample \{x_1^*, \ldots, x_n^*\} from \mathcal{S} with replacement
    • Sample \{y_1^*, \ldots, y_m^*\} from \mathcal{S} with replacement
  • Compute a bootstrap simulation of the F-statistic F^* = \frac{(s_X^2)^*}{(s_Y^2)^*}

  • Repeat B times to obtain F^*_1, \ldots , F^*_B

  • These values represent the bootstrap distribution of the F-statistic

  • We now compare the bootstrap F-statistic to the observed statistic F_{\rm obs} = \frac{s^2_X}{s^2_Y}

  • A bootstrap simulation of the F-statistic F_i^* is extreme if

Alternative F_i^* extreme
\sigma_X^2 \neq \sigma^2_Y F_i^* > F_{\rm obs} \,\text{ or }\, F_i^* < 1/F_{\rm obs}
\sigma_X^2 > \sigma^2_Y F_i^* > F_{\rm obs}
  • The bootstrap p-value is computed by

p = \frac{\# \text{ of extreme bootstrap simulations}}{B}

Algorithm: Bootstrap F-test

Setting: Given the two samples x_1,\ldots,x_n and y_1, \ldots, y_m

  1. Compute sample variances s_X^2 and s_Y^2. If s_Y^2 > s_X^2, swap the samples

  2. Compute the observed F-statistic F_{\rm obs} on the given data

  3. Combine (centered) samples into \mathcal{S} = \{ x_1 - \overline{x}, \ldots, x_n - \overline{x}, y_1 - \overline{y}, \ldots, y_m- \overline{y}\}

  4. Sample \{x_1^*, \ldots, x_n^*\} from \mathcal{S} with replacement

  5. Sample \{y_1^*, \ldots, y_m^*\} from \mathcal{S} with replacement

  6. Compute the F-statistic on the bootstrap samples

  7. Repeat steps 4-6, B times, obtaining B bootstrap simulations of F-statistic F^*_1, \ldots , F^*_B

  8. Compute the p-value by p = \frac{\# \text{ extreme } F_i^*}{B}
Alternative F_i^* extreme
\sigma_X^2 \neq \sigma^2_Y F_i^* > F_{\rm obs} \,\text{ or }\, F_i^* < 1/F_{\rm obs}
\sigma_X^2 > \sigma^2_Y F_i^* > F_{\rm obs}

Worked Example

  • Compensation data for a sample of CEOs in 2012 and 2013
    (USA data, in million dollars)
CEO Pay 2013 3.2 3.8 2.6 3.5 7.0 20.4 7.5 3.4 5.0 6.0
CEO Pay 2012 23.5 6.4 11.1 3.8 8.9 4.8 23.8 3.0 2.9 3.2


  • We want to test for a difference in variances H_0 \colon \sigma^2_X = \sigma^2_Y \,, \qquad H_1 \colon \sigma^2_X \neq \sigma^2_Y

  • We already know that data is non-normal \implies Bootstrap F-test is appropriate

Enter the data

  • Enter the data, and compute sample variances
# Enter the data
ceo13 <- c(3.2, 3.8, 2.6, 3.5, 7.0, 20.4, 7.5, 3.4, 5.0, 6.0)
ceo12 <- c(23.5, 6.4, 11.1, 3.8, 8.9, 4.8, 23.8, 3.0, 2.9, 3.2)

# Compare variances
if (var(ceo12) > var(ceo13)) cat("Swap the samples!")
Swap the samples!


  • We need to swap the labels:
    • X = \, CEO Pay in 2012 \qquad \quad Y = \, CEO Pay in 2013

Compute F_{\rm obs}

  • Calculate F_{\rm obs}

F_{\rm obs} = \frac{s_X^2}{s_Y^2} = \frac{\rm Variance \,\,\, CEO \, Pay \, in \, 2012}{\rm Variance \,\,\, CEO \, Pay \, in \, 2013}

  • Note that the samples are swapped (to ensure s_X^2 \geq s_Y^2)


# Calculate observed F-statistic (samples are swapped)
F.obs <- var(ceo12) / var(ceo13)
[1] 2.383577

Simulating a single bootstrap F-statistic

  • Compute sample size; combine (centered) observations into a single sample
# Compute sample size (samples are swapped)
n <- length(ceo12)
m <- length(ceo13)

# Combine the centered samples
combined.centered <- c(ceo12 - mean(ceo12), ceo13 - mean(ceo13))
  • Calculate one bootstrap sample for each year from the centered combined sample
ceo12.boot <- sample(combined.centered, n, replace = T)
ceo13.boot <- sample(combined.centered, m, replace = T)
  • Calculate the simulated bootstrap F-statistic
F.boot <- var(ceo12.boot) / var(ceo13.boot)
[1] 1.589727

Bootstrapping the F-statistic

  • In the previous slide we generated one simulated value of the F-statistic

  • To generate B = 10,000 bootstrap simulations, we use replicate

# Bootstrap the F-statistic B = 10,000 times
B <- 10000

F.boot <- replicate(B, {
                    # Single bootstrap sample
                    ceo12.boot <- sample(combined.centered, n, replace = T)
                    ceo13.boot <- sample(combined.centered, m, replace = T)
                    
                    # Return single bootstrap F-statistic
                    var(ceo12.boot) / var(ceo13.boot)
})
  • The vector F.boot contains B=10,000 bootstrap simulations of the F-statistic

Compute the bootstrap p-value

  • We are conducting a two-sided test. Hence, the bootstrap p-value is p = \frac{ \# \text{ extreme } F^*_i}{B} = \frac{ \#\{ i \colon F^*_i > F_{\rm obs} \,\text{ or }\, F^*_i < 1/F_{\rm obs} \} }{B}
# Count number of extreme statistics for two-sided test
extreme <- sum ( (F.boot > F.obs) | (F.boot < 1/F.obs) )

# Compute the p-value
p <- extreme / B

# Print
cat("The bootstrap p-value is:", p)
The bootstrap p-value is: 0.3506
  • Conclusion: No evidence (p > 0.05) of a difference in variance between populations

  • The code can be downloaded here bootstrap_F_test.R

Comparison with normal family test

  • The bootstrap p-value just computed is p = 0.3506

  • We compare it to the F-test p-value

p2 <- var.test(ceo12, ceo13)$p.value

cat("The F-test p-value is:", p2)
The F-test p-value is: 0.2117675


  • The two p-values are quite different (although they lead to the same decision)

  • This is due to the non-normality of the data

    • This shows that the F-test is not very robust to non-normality

Since data is non-normal, the bootstrap F-test is the preferred choice

Part 5:
Lists

Lists

  • Vectors can contain only one data type (numeric, character, logical)

  • Lists are data structures that can contain any R object

  • Lists can be created similarly to vectors, with the command list()

# List containing a number, a vector, and a string
my_list <- list(2, c(T,F,T,T), "hello")

# Print the list
print(my_list)
[[1]]
[1] 2

[[2]]
[1]  TRUE FALSE  TRUE  TRUE

[[3]]
[1] "hello"

Retrieving elements

Elements of a list can be retrieved by indexing

  • my_list[[k]] returns k-th element of my_list


# Consider again the same list
my_list <- list(2, c(T,F,T,T), "hello")

# Access 2nd element of my_list and store it in variable
second_element <- my_list[[2]]

# In this case the variable second_element is a vector
print(second_element)
[1]  TRUE FALSE  TRUE  TRUE
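Note the double brackets: single brackets also work on lists, but return a different object. my_list[[2]] returns the element itself, while my_list[2] returns a list of length one containing that element

# Double vs single brackets
class(my_list[[2]])    # "logical" -- the vector itself
class(my_list[2])      # "list"    -- a sublist of length 1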

List slicing

You can return multiple items of a list via slicing

  • my_list[c(k1, ..., kn)] returns elements in positions k1, ..., kn
  • my_list[k1:k2] returns elements k1 to k2
my_list <- list(2, c(T,F), "Cat", "Dog", pi, 42)

# We store 1st, 3rd, 5th entries of my_list in slice
slice <- my_list[c(1, 3, 5)]

print(slice)
[[1]]
[1] 2

[[2]]
[1] "Cat"

[[3]]
[1] 3.141593

List slicing

my_list <- list(2, c(T,F), "Cat", "Dog", pi, 42)

# We store 2nd to 4th entries of my_list in slice
slice <- my_list[2:4]

print(slice)
[[1]]
[1]  TRUE FALSE

[[2]]
[1] "Cat"

[[3]]
[1] "Dog"

Naming

  • Components of lists can be named
    • names(my_list) <- c("name_1", ..., "name_k")
# Create list with 3 elements
my_list <- list(2, c(T,F,T,T), "hello")

# Name each of the 3 elements
names(my_list) <- c("number", "TF_vector", "string")

# Print the named list: the list is printed along with element names 
print(my_list)
$number
[1] 2

$TF_vector
[1]  TRUE FALSE  TRUE  TRUE

$string
[1] "hello"

Accessing named entries

  • A component of my_list named my_name can be accessed with dollar operator
    • my_list$my_name
# Create list with 3 elements and name them
my_list <- list(2, c(T,F,T,T), "hello")
names(my_list) <- c("number", "TF_vector", "string")

# Access 2nd element using dollar operator and store it in variable
second_component <- my_list$TF_vector

# Print 2nd element
print(second_component)
[1]  TRUE FALSE  TRUE  TRUE

Naming vectors

  • Vectors can be named, using the same syntax
    • names(my_vector) <- c("name_1", ..., "name_k")


# Create vector storing RGB levels of purple color
purple <- c(99, 3, 48)
names(purple) <- c("Red", "Green", "Blue")

# Print the vector
print(purple)
  Red Green  Blue 
   99     3    48 

Accessing named entries

  • A named component is accessed via indexing
    • my_vector["name_k"]


# Print 2nd element
print(purple["Green"])
Green 
    3 


# Print only the value stored in "Green"
print(as.numeric(purple["Green"]))
[1] 3

Part 6:
Data Frames

Data Frames

  • Data Frames are the preferred way of presenting a data set in R:

    • Each variable is assigned a collection of recorded observations
  • Data frames can contain any R object

  • Data Frames are similar to Lists, with the difference that:

    • Members of a Data Frame must all be vectors of equal length
  • In simpler terms:

    • Data Frame is a rectangular table of entries

Constructing a Data Frame

  • Data frames are constructed similarly to lists, using

    • data.frame()
  • Important: Elements of data frame must be vectors of the same length

  • Example: We construct the Family Guy data frame. Variables are

    • person – Name of character
    • age – Age of character
    • sex – Sex of character
family <- data.frame(
  person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
  age = c(42, 40, 17, 14, 1),
  sex = c("M", "F" , "F", "M", "M")
)

Printing a Data Frame

  • R prints data frames like matrices
  • First row contains vector names
  • First column contains row names
  • Data are paired: e.g. Peter is 42 and Male
family <- data.frame(
  person = c("Peter", "Lois", "Meg", "Chris", "Stewie"),
  age = c(42, 40, 17, 14, 1),
  sex = c("M", "F" , "F", "M", "M")
)

print(family)
  person age sex
1  Peter  42   M
2   Lois  40   F
3    Meg  17   F
4  Chris  14   M
5 Stewie   1   M

Accessing single entry

  • Think of a data frame as a matrix

  • You can access element in position (m,n) by using

    • my_data[m, n]
  • Example

    • Peter is in 1st row
    • Names are in 1st column
    • Therefore, Peter’s name is in position (1,1)
data <- family[1, 1]

print(data)
[1] "Peter"

Accessing multiple entries

To access multiple elements on the same row or column, type

  • my_data[c(k1,...,kn), m] \quad or \quad my_data[k1:k2, m]
  • my_data[n, c(k1,...,km)] \quad or \quad my_data[n, k1:k2]

Example: We want to access Stewie’s Sex and Age:

  • Stewie is listed in 5th row
  • Age and Sex are in 2nd and 3rd column
stewie_data <- family[5, 2:3]

print(stewie_data)
  age sex
5   1   M

Accessing rows

  • To access rows k1,...,kn
    • my_data[c(k1,...,kn), ]
  • To access rows k1 to kn
    • my_data[k1:k2, ]


data <- family[c(1,3), ]      # Access 1st and 3rd row - Peter & Meg

print(data)
  person age sex
1  Peter  42   M
3    Meg  17   F

Accessing columns

  • To access columns k1,...,kn
    • my_data[ , c(k1,...,kn)]
  • To access columns k1 to kn
    • my_data[ ,k1:k2]


age.name <- family[, c(2,1)]    # Access 2nd and 1st columns: Age and Name

print(age.name)
  age person
1  42  Peter
2  40   Lois
3  17    Meg
4  14  Chris
5   1 Stewie

The dollar operator

Use dollar operator to access data frame columns

  • Suppose data frame my_data contains a variable called my_variable
    • my_data$my_variable accesses column my_variable
    • my_data$my_variable is a vector

Example: To access age in the family data frame type

ages <- family$age        # Stores ages in a vector

cat("Ages of the Family Guy characters are:", ages)
Ages of the Family Guy characters are: 42 40 17 14 1
cat("Meg's age is", ages[3])
Meg's age is 17

Size of a data frame

Command Output
nrow(my_data) # of rows
ncol(my_data) # of columns
dim(my_data) vector containing # of rows and columns


family_dim <- dim(family)    # Stores dimensions of family in a vector

cat("The Family Guy data frame has", 
    family_dim[1], 
    "rows and", 
    family_dim[2], 
    "columns")
The Family Guy data frame has 5 rows and 3 columns

Adding Data

Assume given a data frame my_data

  • To add more records (adding rows with rbind)
    • Create single row data frame new_record
    • new_record must match the structure of my_data
    • Add to my_data with my_data <- rbind(my_data, new_record)
  • To add a set of observations for a new variable (adding columns with cbind)
    • Create a vector new_variable
    • new_variable must have as many components as rows in my_data
    • Add to my_data with my_data <- cbind(my_data, new_variable)

Example: Add new record

  • Consider the usual Family Guy data frame family
  • Suppose we want to add data for Brian
  • Create a new record: a single row data frame with columns
    • person, age, sex
new_record <- data.frame(
  person = "Brian",
  age = 7,
  sex = "M"
)

print(new_record)
  person age sex
1  Brian   7   M

Important: Names have to match existing data frame

Example: Add new record

  • Now, we add new_record to family
  • This is done by appending one row to the existing data frame


family <- rbind(family, new_record)

print(family)
  person age sex
1  Peter  42   M
2   Lois  40   F
3    Meg  17   F
4  Chris  14   M
5 Stewie   1   M
6  Brian   7   M

Example: Add new variable

  • We want to add a new variable to the Family Guy data frame family
  • This variable is called funny
  • It records how funny each character is, with levels
    • Low, Med, High
  • Create a vector funny with entries matching each character (including Brian)
funny <- c("High", "High", "Low", "Med", "High", "Med")

print(funny)
[1] "High" "High" "Low"  "Med"  "High" "Med" 

Example: Add new variable

  • Add funny to the Family Guy data frame family
  • This is done by appending one column to the existing data frame
family <- cbind(family, funny)

print(family)
  person age sex funny
1  Peter  42   M  High
2   Lois  40   F  High
3    Meg  17   F   Low
4  Chris  14   M   Med
5 Stewie   1   M  High
6  Brian   7   M   Med

Adding a new variable

Alternative way

Instead of using cbind we can use dollar operator:

  • Want to add variable called new_variable
  • Create a vector v containing values for the new variable
  • v must have as many components as rows in my_data
  • Add to my_data with
    • my_data$new_variable <- v

Example

  • We add age expressed in months to the Family Guy data frame family
  • Age in months can be computed by multiplying vector family$age by 12
v <- family$age * 12       # Computes vector of ages in months

family$age.months <- v     # Stores vector as new column in family

print(family)
  person age sex funny age.months
1  Peter  42   M  High        504
2   Lois  40   F  High        480
3    Meg  17   F   Low        204
4  Chris  14   M   Med        168
5 Stewie   1   M  High         12
6  Brian   7   M   Med         84

Logical Record Subsets

  • We saw how to use logical flag vectors to subset vectors

  • We can use logical flag vectors to subset data frames as well

  • Suppose we have a data frame my_data containing a variable my_variable

  • Want to subset records in my_data for which my_variable satisfies a condition

  • Use commands

    • flag <- condition(my_data$my_variable)
    • my_data[flag, ]

Example

  • Consider again the Family Guy data frame family
  • We subset Female characters using flag
    • family$sex == "F"
# Create flag vector for female Family Guy characters
flag <- (family$sex == "F")
print(flag)
[1] FALSE  TRUE  TRUE FALSE FALSE FALSE


# Subset data frame "family" and store in data frame "subset"
subset <- family[flag, ]
print(subset)
  person age sex funny age.months
2   Lois  40   F  High        480
3    Meg  17   F   Low        204

Part 7:
Data Entry

Reading data from files

  • R has many functions for reading data from stored files

  • We will see how to read Table-Format files

  • Table-Formats are just tables stored in plain-text files

  • Typical file extensions are:

    • .txt for plain-text files
    • .csv for comma-separated values
  • Table-Formats can be read into R with the command

    • read.table()

Table-Formats

4 key features

  1. Header: Used to name columns
    • Header should be first line
    • Header is optional
    • If header present, need to tell R when importing
    • If not, R cannot tell if first line is header or data
  2. Delimiter: Character used to separate entries in each line
    • Delimiter character cannot be used for anything else in the file
    • Default delimiter is whitespace

Table-Formats

4 key features

  3. Missing value: Character string used exclusively to denote a missing value
    • When reading the file, R will turn these entries into NA
  4. Comments:
    • Table files can include comments
    • Comment lines start with \quad #
    • R ignores such comments

Example

  • Table-Format for Family Guy characters can be downloaded here family_guy.txt
  • The text file looks like this
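Reconstructing its contents from the data frame printed later in this part, the file should look roughly as follows (header line, whitespace-delimited entries, * for missing values):

person age sex funny age.mon
Peter  *   M   High  504
Lois   40  F   *     480
Meg    17  F   Low   204
Chris  14  M   Med   168
Stewie 1   M   High  *
Brian  *   M   Med   *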

  • Remarks:
    • Header is present
    • Delimiter is whitespace
    • Missing values denoted by *

read.table command

Table-Formats can be read via read.table()

  • Reads .txt or .csv files

  • outputs a data frame

Options of read.table()

  • header = T/F
    • Tells R if a header is present
  • na.strings = "string"
    • Tells R that "string" means NA (missing values)
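For example, a comma-separated file (a hypothetical my_data.csv) with a header and * for missing values would be read as:

# Read a CSV file: header present, entries separated by commas
my_data <- read.table(file = "my_data.csv",
                      header = TRUE,
                      sep = ",",
                      na.strings = "*")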

Reading our first Table-Format file

To read family_guy.txt into R proceed as follows:

  1. Download family_guy.txt and move file to Desktop

  2. Open the R Console and change working directory to Desktop

# In MacOS type
setwd("~/Desktop")

# In Windows type
setwd("C:/Users/YourUsername/Desktop")

  3. Read family_guy.txt into R and store it in data frame family with code
family = read.table(file = "family_guy.txt",
                    header = TRUE,
                    na.strings = "*")
  4. Note that we are telling read.table() that
    • family_guy.txt has a header
    • Missing values are denoted by *
  5. Print data frame family to screen
print(family)
  person age sex funny age.mon
1  Peter  NA   M  High     504
2   Lois  40   F  <NA>     480
3    Meg  17   F   Low     204
4  Chris  14   M   Med     168
5 Stewie   1   M  High      NA
6  Brian  NA   M   Med      NA
  • For comparison, see the .txt file contents shown above

Example: t-test

Data: Analysis of Consumer Confidence Index for 2008 crisis (seen in Lecture 3)

  • We imported data into R using c()
  • This is ok for small datasets
  • Suppose the CCI data is stored in a .txt file instead

Goal: Perform t-test on CCI difference for mean difference \mu = 0

  • By reading CCI data into R using read.table()
  • By manipulating CCI data using data frames

  • The CCI dataset can be downloaded here 2008_crisis.txt


To perform the t-test on data 2008_crisis.txt we proceed as follows:

  1. Download dataset 2008_crisis.txt and move file to Desktop

  2. Open the R Console and change working directory to Desktop

# In MacOS type
setwd("~/Desktop")

# In Windows type
setwd("C:/Users/YourUsername/Desktop")
  3. Read 2008_crisis.txt into R and store it in data frame scores with code
scores = read.table(file = "2008_crisis.txt",
                    header = TRUE)

  4. Store 2nd and 3rd columns of scores into 2 vectors
# CCI from 2007 is stored in 2nd column
score_2007 <- scores[, 2]

# CCI from 2009 is stored in 3rd column
score_2009 <- scores[, 3]
  5. From here, the t-test can be performed as done in Lecture 3
# Compute vector of differences
difference <- score_2007 - score_2009

# Perform t-test on difference with null hypothesis mu = 0
t.test(difference, mu = 0)$p.value
[1] 4.860686e-13
  • We obtain the same result as in Lecture 3
    • p-value is p < 0.05 \implies Reject H_0: The mean difference is not 0
  • For convenience, you can download the full code 2008_crisis_code.R

Part 8:
Least squares

Example: Blood Pressure

10 patients treated with both Drug A and Drug B

  • These drugs cause change in blood pressure
  • Patients are given the drugs one at a time
  • Changes in blood pressure are recorded
  • For patient i we denote by
    • x_i the change caused by Drug A
    • y_i the change caused by Drug B
i x_i y_i
1 1.9 0.7
2 0.8 -1.0
3 1.1 -0.2
4 0.1 -1.2
5 -0.1 -0.1
6 4.4 3.4
7 4.6 0.0
8 1.6 0.8
9 5.5 3.7
10 3.4 2.0

Example: Blood Pressure

Goal:

  • Predict reaction to Drug B, knowing reaction to Drug A
  • This means predict y_i from x_i

Plot:

  • To visualize data we can plot pairs (x_i,y_i)
  • Points seem to align
  • It seems there is a linear relation between x_i and y_i

Example: Blood Pressure

Linear relation:

  • Try to fit a line through the data

  • Line roughly predicts y_i from x_i

  • However note the outlier (x_7,y_7) = (4.6, 0) (red point)

  • How is such line constructed?

Example: Blood Pressure

Least Squares Line:

  • A general line has equation y = \beta x + \alpha for some
    • slope \beta
    • intercept \alpha
  • Value predicted by the line for x_i is \hat{y}_i = \beta x_i + \alpha

Example: Blood Pressure

Least Squares Line:

  • We would like predicted and actual value to be close \hat{y}_i \approx y_i

  • Hence the vertical difference has to be small y_i - \hat{y}_i \approx 0

Example: Blood Pressure

Least Squares Line:

  • We want \hat{y}_i - y_i \approx 0 \,, \qquad \forall \, i

  • Can be achieved by minimizing the sum of squares \min_{\alpha, \beta} \ \sum_{i} \ (y_i - \hat{y}_i)^2 \,, \qquad \hat{y}_i = \beta x_i + \alpha

Residual Sum of Squares

Definition
Let (x_1,y_1), \ldots, (x_n, y_n) be a set of n pairs of points. Consider the line y = \beta x + \alpha. The Residual Sum of Squares associated with the line is \mathop{\mathrm{RSS}}(\alpha,\beta) := \sum_{i=1}^n (y_i-\alpha-{\beta}x_i)^2

Note: \mathop{\mathrm{RSS}} can be seen as a function \mathop{\mathrm{RSS}}\colon \mathbb{R}^2 \to \mathbb{R}\qquad \quad \mathop{\mathrm{RSS}}= \mathop{\mathrm{RSS}}(\alpha,\beta)

\mathop{\mathrm{RSS}}(\alpha,\beta) for Blood Pressure data

Summary statistics

For a given sample (x_1,y_1), \ldots, (x_n, y_n), define

  • Sample Means: \overline{x} := \frac{1}{n} \sum_{i=1}^n x_i \qquad \quad \overline{y} := \frac{1}{n} \sum_{i=1}^n y_i

  • Sums of squares: S_{xx} := \sum_{i=1}^n ( x_i - \overline{x} )^2 \qquad \quad S_{yy} := \sum_{i=1}^n ( y_i - \overline{y} )^2

  • Sum of cross-products: S_{xy} := \sum_{i=1}^n ( x_i - \overline{x} ) ( y_i - \overline{y} )

Minimizing the RSS

Theorem

Given (x_1,y_1), \ldots, (x_n, y_n), consider the minimization problem \begin{equation} \tag{M} \min_{\alpha,\beta } \ \mathop{\mathrm{RSS}}= \min_{\alpha,\beta} \ \sum_{i=1}^n (y_i-\alpha-{\beta}x_i)^2 \end{equation} Then

  1. There exists a unique line solving (M)
  2. Such line has the form y = \hat{\beta} x + \hat{\alpha} with \hat{\beta} = \frac{S_{xy}}{S_{xx}} \qquad \qquad \hat{\alpha} = \overline{y} - \hat{\beta} \ \overline{x}
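The closed-form solution can be checked numerically. A sketch using R's general-purpose optimizer optim on the blood pressure data of the running example:

# Numerical check of the Theorem on the blood pressure data
x <- c(1.9, 0.8, 1.1, 0.1, -0.1, 4.4, 4.6, 1.6, 5.5, 3.4)
y <- c(0.7, -1.0, -0.2, -1.2, -0.1, 3.4, 0.0, 0.8, 3.7, 2.0)

# RSS as a function of p = (alpha, beta)
RSS <- function(p) sum((y - p[1] - p[2] * x)^2)

# Minimize RSS numerically, starting from (0, 0)
optim(c(0, 0), RSS)$par    # Approximately (-0.786, 0.685), matching the formulas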

Positive semi-definite matrix

To prove the Theorem we need some background results

  • A symmetric matrix is positive semi-definite if all the eigenvalues \lambda_i satisfy \lambda_i \geq 0

  • Proposition: A 2 \times 2 symmetric matrix M is positive semi-definite iff \det M \geq 0 \quad \text{ and } \quad \operatorname{Tr}(M) \geq 0

Positive semi-definite Hessian

  • Suppose given a smooth function of 2 variables

f \colon \mathbb{R}^2 \to \mathbb{R}\qquad \quad f = f (x,y)

  • The Hessian of f is the matrix

\nabla^2 f = \left( \begin{array}{cc} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \\ \end{array} \right)

Positive semi-definite Hessian

  • In particular the Hessian is positive semi-definite iff

\det \nabla^2 f = f_{xx} f_{yy} - f_{xy}^2 \geq 0 \qquad \text{ and } \qquad f_{xx} + f_{yy} \geq 0

  • Side Note: For C^2 functions it holds that

\nabla^2 f \, \text{ is positive semi-definite} \qquad \iff \qquad f \, \text{ is convex}

Optimality conditions

Lemma

Suppose f \colon \mathbb{R}^2 \to \mathbb{R} has positive semi-definite Hessian. Then the following are equivalent

  1. The point (\hat{x},\hat{y}) is a minimizer of f, that is, f(\hat{x}, \hat{y}) = \min_{x,y} \ f(x,y)

  2. The point (\hat{x},\hat{y}) satisfies the optimality conditions \nabla f (\hat{x},\hat{y}) = 0

Note: The proof of the above Lemma can be found in [2]

Example

  • The main example of a strictly convex function in 2D is

f(x,y) = x^2 + y^2

  • It is clear that \min_{x,y} \ f(x,y) = \min_{x,y} \ x^2 + y^2 = 0 \,, with the only minimizer being (0,0)

  • However, let us use the Lemma to prove this fact

  • The gradient of f = x^2 + y^2 is

\nabla f = (f_x,f_y) = (2x, 2y)

  • Therefore the optimality condition has unique solution

\nabla f = 0 \qquad \iff \qquad x = y = 0

  • The Hessian of f is

\nabla^2 f = \left( \begin{array}{cc} f_{xx} & f_{xy} \\ f_{yx} & f_{yy} \end{array} \right) = \left( \begin{array}{cc} 2 & 0 \\ 0 & 2 \end{array} \right)

  • The Hessian is positive semi-definite since

\det (\nabla^2 f) = 4 > 0 \qquad \qquad \operatorname{Tr}(\nabla^2 f) = 4 > 0

  • By the Lemma, we conclude that (0,0) is the unique minimizer of f, that is,

0 = f(0,0) = \min_{x,y} \ f(x,y)

Minimizing the RSS

Proof of Theorem

  • We go back to proving the RSS Minimization Theorem

  • Suppose given data points (x_1,y_1), \ldots, (x_n, y_n)

  • We want to solve the minimization problem

\begin{equation} \tag{M} \min_{\alpha,\beta } \ \mathop{\mathrm{RSS}}= \min_{\alpha,\beta} \ \sum_{i=1}^n (y_i-\alpha-{\beta}x_i)^2 \end{equation}

  • In order to use the Lemma we need to compute

\nabla \mathop{\mathrm{RSS}}\quad \text{ and } \quad \nabla^2 \mathop{\mathrm{RSS}}

Minimizing the RSS

Proof of Theorem

  • We first compute \nabla \mathop{\mathrm{RSS}} and solve the optimality conditions \nabla \mathop{\mathrm{RSS}}(\alpha,\beta) = 0

  • To this end, recall that \overline{x} := \frac{\sum_{i=1}^nx_i}{n} \qquad \implies \qquad \sum_{i=1}^n x_i = n \overline{x}

  • Similarly, we have \sum_{i=1}^n y_i = n \overline{y}

Minimizing the RSS

Proof of Theorem

  • Therefore we get

\begin{align*} \mathop{\mathrm{RSS}}_{\alpha} & = -2\sum_{i=1}^n(y_i- \alpha- \beta x_i) \\[10pt] & = - 2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} \\[20pt] \mathop{\mathrm{RSS}}_{\beta} & = -2\sum_{i=1}^n x_i (y_i- \alpha - \beta x_i) \\[10pt] & = - 2 \sum_{i=1}^n x_i y_i + 2 \alpha n \overline{x} + 2 \beta \sum_{i=1}^n x_i^2 \end{align*}

Minimizing the RSS

Proof of Theorem

  • Hence the optimality conditions are

\begin{align} - 2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} & = 0 \tag{1} \\[20pt] - 2 \sum_{i=1}^n x_i y_i + 2 \alpha n \overline{x} + 2 \beta \sum_{i=1}^n x_i^2 & = 0 \tag{2} \end{align}

Minimizing the RSS

Proof of Theorem

  • Equation (1) is

-2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} = 0

  • By simplifying and rearranging, we find that (1) is equivalent to

\alpha = \overline{y}- \beta \overline{x}

Minimizing the RSS

Proof of Theorem

  • Equation (2) is equivalent to

\sum_{i=1}^n x_i y_i - \alpha n \overline{x} - \beta \sum_{i=1}^n x^2_i = 0

  • From the previous slide we have \alpha = \overline{y}- \beta \overline{x}

Minimizing the RSS

Proof of Theorem

  • Substituting in Equation (2) we get \begin{align*} 0 & = \sum_{i=1}^n x_i y_i - \alpha n \overline{x} - \beta \sum_{i=1}^n x^2_i \\ & = \sum_{i=1}^n x_i y_i - n \overline{x} \, \overline{y} + \beta n \overline{x}^2 - \beta \sum_{i=1}^n x^2_i \\ & = \sum_{i=1}^n (x_i y_i - \overline{x} \, \overline{y} ) - \beta \left( \sum_{i=1}^n x^2_i - n\overline{x}^2 \right) = S_{xy} - \beta S_{xx} \end{align*} where we used the usual identity S_{xx} = \sum_{i=1}^n ( x_i - \overline{x})^2 = \sum_{i=1}^n x_i^2 - n\overline{x}^2

Minimizing the RSS

Proof of Theorem

  • Hence Equation (2) is equivalent to \beta = \frac{S_{xy}}{ S_{xx} }

  • Also recall that Equation (1) is equivalent to \alpha = \overline{y}- \beta \overline{x}

  • Therefore (\hat\alpha, \hat\beta) solves the optimality conditions \nabla \mathop{\mathrm{RSS}}= 0 iff \hat\alpha = \overline{y}- \hat\beta \overline{x} \,, \qquad \quad \hat\beta = \frac{S_{xy}}{ S_{xx} }

Minimizing the RSS

Proof of Theorem

  • We need to compute \nabla^2 \mathop{\mathrm{RSS}}

  • To this end recall that \mathop{\mathrm{RSS}}_{\alpha} = - 2 n \overline{y} + 2n \alpha + 2 \beta n \overline{x} \,, \quad \mathop{\mathrm{RSS}}_{\beta} = - 2 \sum_{i=1}^n x_i y_i + 2 \alpha n \overline{x} + 2 \beta \sum_{i=1}^n x_i^2

  • Therefore we have \begin{align*} \mathop{\mathrm{RSS}}_{\alpha \alpha} & = 2n \qquad & \mathop{\mathrm{RSS}}_{\alpha \beta} & = 2 n \overline{x} \\ \mathop{\mathrm{RSS}}_{\beta \alpha } & = 2 n \overline{x} \qquad & \mathop{\mathrm{RSS}}_{\beta \beta} & = 2 \sum_{i=1}^{n} x_i^2 \end{align*}

Minimizing the RSS

Proof of Theorem

  • The Hessian determinant is

\begin{align*} \det (\nabla^2 \mathop{\mathrm{RSS}}) & = \mathop{\mathrm{RSS}}_{\alpha \alpha}\mathop{\mathrm{RSS}}_{\beta \beta} - \mathop{\mathrm{RSS}}_{\alpha \beta}^2 \\[10pt] & = 4n \sum_{i=1}^{n} x_i^2 - 4 n^2 \overline{x}^2 \\[10pt] & = 4n \left( \sum_{i=1}^{n} x_i^2 - n \overline{x}^2 \right) \\[10pt] & = 4n S_{xx} \end{align*}

Minimizing the RSS

Proof of Theorem

  • Recall that

S_{xx} = \sum_{i=1}^n (x_i - \overline{x})^2 \geq 0

  • Therefore we have

\det (\nabla^2 \mathop{\mathrm{RSS}}) = 4n S_{xx} \geq 0

Minimizing the RSS

Proof of Theorem

  • We also have

\operatorname{Tr}(\nabla^2\mathop{\mathrm{RSS}}) = \mathop{\mathrm{RSS}}_{\alpha \alpha} + \mathop{\mathrm{RSS}}_{\beta \beta} = 2n + 2 \sum_{i=1}^{n} x_i^2 \geq 0

  • Therefore we have proven \det( \nabla^2 \mathop{\mathrm{RSS}}) \geq 0 \,, \qquad \quad \operatorname{Tr}(\nabla^2\mathop{\mathrm{RSS}}) \geq 0

  • As the Hessian is symmetric, we conclude that \nabla^2 \mathop{\mathrm{RSS}} is positive semi-definite

Minimizing the RSS

Proof of Theorem

  • By the Lemma, we have that all the solutions (\alpha,\beta) to the optimality conditions \nabla \mathop{\mathrm{RSS}}(\alpha,\beta) = 0 are minimizers

  • Therefore (\hat \alpha,\hat\beta) with \hat\alpha = \overline{y}- \hat\beta \overline{x} \,, \qquad \quad \hat\beta = \frac{S_{xy}}{ S_{xx} } is a minimizer of \mathop{\mathrm{RSS}}, ending the proof

Least-squares line

The previous Theorem motivates the following definition

Definition
Given (x_1,y_1), \ldots, (x_n, y_n), the least-squares line is y = \hat\beta x + \hat \alpha where we define \hat{\alpha} = \overline{y} - \hat{\beta} \ \overline{x} \qquad \qquad \hat{\beta} = \frac{S_{xy}}{S_{xx}}

Part 9:
Worked Example

Worked Example: Blood Pressure

Computing the least-squares line in R

In R, we want to do the following:

  • Input the data into vectors x and y

  • Compute the least-square line coefficients \hat{\alpha} = \overline{y} - \hat{\beta} \ \overline{x} \qquad \qquad \hat{\beta} = \frac{S_{xy}}{S_{xx}}

  • Plot the data points (x_i,y_i)

  • Overlay the least squares line

i x_i y_i
1 1.9 0.7
2 0.8 -1.0
3 1.1 -0.2
4 0.1 -1.2
5 -0.1 -0.1
6 4.4 3.4
7 4.6 0.0
8 1.6 0.8
9 5.5 3.7
10 3.4 2.0

First Solution

  • We give a first solution using elementary R functions

  • The code to input the data into a data-frame is as follows

# Input blood pressure changes data into data-frame

changes <- data.frame(drug_A = c(1.9, 0.8, 1.1, 0.1, -0.1, 
                                 4.4, 4.6, 1.6, 5.5, 3.4),
                      drug_B = c(0.7, -1.0, -0.2, -1.2, -0.1, 
                                 3.4, 0.0, 0.8, 3.7, 2.0)
                     )
  • To shorten the code we assign drug_A and drug_B to vectors x and y
# Assign data-frame columns to vectors x and y
x <- changes$drug_A
y <- changes$drug_B

  • Compute the averages \overline{x}, \overline{y} and the quantities S_{xx}, S_{xy}
# Compute averages xbar and ybar
xbar <- mean(x)
ybar <- mean(y)

# var and cov divide by (n - 1), but the factor cancels in the ratio S_xy / S_xx
S_xx <- var(x)
S_xy <- cov(x,y)

  • Compute the least-square line coefficients \hat{\beta} = \frac{S_{xy}}{S_{xx}} \,, \qquad \qquad \hat{\alpha} = \overline{y} - \hat{\beta} \ \overline{x}
# Compute least-square line coefficients
beta <- S_xy / S_xx
alpha <- ybar - beta * xbar

# Print coefficients
cat("\nCoefficient alpha =", alpha)
cat("\nCoefficient beta =", beta)



Coefficient alpha = -0.7861478

Coefficient beta = 0.685042

  • Plot the data pairs (x_i,y_i)
# Plot the data
plot(x, y, xlab = "", ylab = "", pch = 16, cex = 2)

# Add labels
mtext("Drug A reaction x_i", side = 1, line = 3, cex = 2.1)
mtext("Drug B reaction y_i", side = 2, line = 2.5, cex = 2.1)

Note: We have added a few cosmetic options

  • pch = 16 plots points with black circles
  • cex = 2 stands for character expansion – specifies the size of the plotted points
  • mtext is used to fine-tune the axis labels
    • side = 1 stands for x-axis
    • side = 2 stands for y-axis
    • line specifies distance of label from axis

  • To plot a line y = bx + a
    • abline(a, b)
  • Overlay the least squares line y = \hat{\beta}x + \hat{\alpha}
# Overlay least-squares line
abline(a = alpha, b = beta, col = "red", lwd = 3)
  • Note: Cosmetic options
    • col specifies color of the plot
    • lwd specifies line width
  • For clarity, we can add legend to plot
# Add legend
legend("topleft", 
       legend = c("Data", "Least-Squares Line"), 
       col = c("black", "red"), 
       pch = c(16, NA), 
       lty = c(NA, 1), 
       lwd = c(NA, 3))

  • The code can be downloaded here least_squares_1.R

  • Running the code, we obtain the scatter plot of the data with the least-squares line overlaid

Second Solution

  • The second solution uses the R function lm
  • lm stands for linear model
  • First we input the data into a data-frame
# Input blood pressure changes data into data-frame
changes <- data.frame(drug_A = c(1.9, 0.8, 1.1, 0.1, -0.1, 
                                 4.4, 4.6, 1.6, 5.5, 3.4),
                      drug_B = c(0.7, -1.0, -0.2, -1.2, -0.1, 
                                 3.4, 0.0, 0.8, 3.7, 2.0)
                     )

  • Use lm to fit the least-squares line
    • lm(formula, data)
    • data expects a data-frame in input
    • formula stands for the relation to fit
  • In the case of least-squares:
    • formula = y ~ x
    • x and y are names of two variables in the data-frame
    • Symbol y ~ x can be read as: y modelled as function of x

Note: Storing data in data-frame is optional

  • We can just store data in vectors x and y

  • Then fit the least-squares line with lm(y ~ x)

  • The command to fit the least-squares line on changes is
# Fit least squares line
least_squares <- lm(formula = drug_B ~ drug_A, data = changes) 

# Print output
print(least_squares)

Call:
lm(formula = drug_B ~ drug_A, data = changes)

Coefficients:
(Intercept)       drug_A  
    -0.7861       0.6850  


  • The above tells us that the estimators are

\hat \alpha = -0.7861 \,, \qquad \quad \hat \beta = 0.6850
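The coefficients can also be extracted programmatically with coef, which returns a named vector containing \hat\alpha and \hat\beta:

# Extract the fitted coefficients as a named vector
coef(least_squares)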

  • We can now plot the data with the following command
    • 1st coordinate is the vector changes$drug_A
    • 2nd coordinate is the vector changes$drug_B
# Plot data 
plot(changes$drug_A, changes$drug_B, pch = 16, cex = 2)


  • The least-squares line is currently stored in least_squares

  • To add the line to the current plot use abline

# Plot least-squares line
abline(least_squares, col = "red", lwd = 3)

  • The code can be downloaded here least_squares_2.R

  • Running the code, we obtain the scatter plot with the fitted least-squares line overlaid

References

[1]
Efron, Bradley, and Tibshirani, Robert J., An Introduction to the Bootstrap, Chapman & Hall/CRC, 1994.
[2]
Fusco, Nicola, Marcellini, Paolo, and Sbordone, Carlo, Mathematical Analysis: Functions of Several Real Variables and Applications, Springer, 2022.