plot(rnorm(1000))
Lecture 4
Goal: estimate the mean \mu of a normal population N(\mu,\sigma^2). If \mu_0 is guess for \mu H_0 \colon \mu = \mu_0 \qquad H_1 \colon \mu \neq \mu_0
Compute the estimated standard error \mathop{\mathrm{e.s.e.}}= \frac{s}{\sqrt{n}}
Compute the t-statistic t = \frac{\text{estimate } - \text{ hypothesised value}}{\mathop{\mathrm{e.s.e.}}} = \frac{\overline x - \mu_0}{s/\sqrt{n}}
\mu_0 is the value of the null hypothesis H_0
After computing t-statistic we need to compute p-value
p-value is a measure of how strange the data is in relation to the null hypothesis
We have 2 options:
In this course we reject H_0 for p-values p<0.05
For two-sided t-test the p-value is defined as p := 2P(t_{n-1}> |t|) where t_{n-1} is the t-distribution with n-1 degrees of freedom
Therefore the p-value is p = 2P(\text{Observing t }| \, \mu=\mu_0)
p<0.05 means that the test statistic t is extreme: \,\, P(t_{n-1}> |t|)<0.025
t falls in the grey areas in the t_{n-1} plot below: Each grey area measures 0.025
Find Table 13.1 in this file
The critical value t^* = t_{n-1}(0.025) found in the table satisfies P(t_{n-1}>t^*) = 0.025
By definition of p-value for two-sided t-test we have p := 2P(t_{n-1}>|t|)
Therefore, for |t|>t^* \begin{align*} p & := 2P(t_{n-1}>|t|) \\ & < 2P(t_{n-1}>t^*) = 2 \cdot (0.025) = 0.05 \end{align*}
Conclusion: \qquad |t|>t^* \quad \iff \quad p<0.05
Recall that p = 2P ( \text{Observing t } | \mu = \mu_0). We have two possibilities:
Month | J | F | M | A | M | J | J | A | S | O | N | D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CCI 2007 | 86 | 86 | 88 | 90 | 99 | 97 | 97 | 96 | 99 | 97 | 90 | 90 |
CCI 2009 | 24 | 22 | 21 | 21 | 19 | 18 | 17 | 18 | 21 | 23 | 22 | 21 |
Difference | 62 | 64 | 67 | 69 | 80 | 79 | 80 | 78 | 78 | 74 | 68 | 69 |
Month | J | F | M | A | M | J | J | A | S | O | N | D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CCI 2007 | 86 | 86 | 88 | 90 | 99 | 97 | 97 | 96 | 99 | 97 | 90 | 90 |
CCI 2009 | 24 | 22 | 21 | 21 | 19 | 18 | 17 | 18 | 21 | 23 | 22 | 21 |
Difference | 62 | 64 | 67 | 69 | 80 | 79 | 80 | 78 | 78 | 74 | 68 | 69 |
Using the available data, we need to compute:
Sample mean and standard deviation \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i \qquad s = \sqrt{\frac{\sum_{i=1}^n x_i^2 - n \overline{x}^2}{n-1}}
Test statistic t = \frac{\overline x - \mu_0}{s/\sqrt{n}}
CCI | J | F | M | A | M | J | J | A | S | O | N | D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Difference | 62 | 64 | 67 | 69 | 80 | 79 | 80 | 78 | 78 | 74 | 68 | 69 |
\begin{align*} \overline{x} & =\frac{1}{n} \sum_{i=1}^{n} x_i=\frac{1}{12} \left(62+64+67+{\ldots}+68+69\right)=\frac{868}{12}=72.33 \\ \sum_{i=1}^{n} x_i^2 & = 62^2+64^2+67^2+{\ldots}+68^2+69^2 = 63260 \\ s & = \sqrt{ \frac{\sum_{i=1}^n x_i^2 - n \overline{x}^2}{n-1} } = \sqrt{\frac{63260-12\left(\frac{868}{12}\right)^2}{11}} = \sqrt{\frac{474.666}{11}} = 6.5689 \end{align*}
Find Table 13.1 in this file
We have computed:
Therefore |t| = 38.145 > 2.201 = t^*
This implies rejecting the null hypothesis H_0 \colon \mu = 0
t-test implies that mean difference in CCI is \mu \neq 0
The sample mean difference is positive (\bar{x}=72.33)
Conclusions:
Concise Statistics with R
Comprehensive R manual
We have installed R. What now?
Launch the R Console. There are two ways:
>
Enter
to executeExample: Plotting 1000 values randomly generated from normal distribution
Below you can see R code and the corresponding answer
The interactive R Console is OK for short codes
For longer code use R scripts
.txt
or .R
extensionsource("file_name.R")
Examples of text editors
R session has a working directory associated with it
Unless specified, R will use a default working directory
To check the location of the working directory, use the getwd
function
On my MacOS system I get \qquad
File paths are always enclosed in double quotation marks
Note that R uses forward slashes (not backslashes) for paths
You can change the default working directory using the function setwd
In RStudio you can set the working directory from the menu bar:
->
Set Working Directory ->
Choose Directoryhelp(object_name)
help()
can be crypticLet us go back to the example of the command plot(rnorm(1000))
The function rnorm(n)
outputs n randomly generated numbers from N(0,1)
These can then be plotted by concatenating the plot command
Note:
The values plotted (next slide) are, for sure, different from the ones listed above
This is because every time you call rnorm(5)
new values are generated
We need to store the generated values if we want to re-use them
<-
<-
denotes an arrow pointing to the variable to which the value is assigned2
to the variable x
enter x <- 2
x
just type x
x
has the value 2
x
can be used in subsequent operationsx
If you save the following code in a .R
file and run it, you will obtain no output
This is because you need to tell R to print x
to screen
print()
[1] 2
sentence
to screen we can usecat()
cat
can be used to combine strings and variables in a single outputSave to a plain text file named either
my_first_code.R
my_first_code.txt
Move this file to Desktop
Open the R Console and change working directory to Desktop
Code run successfully!
The sum of 1 and 2 is 3
ls()
You can remove variables from workspace by using
rm()
To completely clear the workspace use
rm(list = ls())
save.image("file_name.RData")
file_name.RData
load("file_name.RData")
Recommended: keep all the files related to a project in a single folder
Such folder will have to be set as working directory in R Console
Saving the workspace could be dangerous
Always store your code in R Scripts
To quit the R Console type q()
.RData
file in the working directory.txt
or .R
text files.txt
filesc()
# Constuct two vectors of radius and height of 6 cylinders
radius <- c(6, 7, 5, 9, 9, 7)
height <- c(1.7, 1.8, 1.6, 2, 1, 1.9)
# Compute the volume of each cylinder and store it in "volume"
volume <- pi * radius^2 * height
# Print volume
print(volume)
[1] 192.2655 277.0885 125.6637 508.9380 254.4690 292.4823
a
has 7 components while b
has 2 componentsa + b
is executed as follows:
b
is copied 4 times to match the length of a
a + b
is then obtained by
a + \tilde{b} = (1, 2, 3, 4, 5, 6, 7) + (0, 1, 0, 1, 0, 1, 0) =
(1, 3, 3, 5, 5, 7, 7)
Useful applications of broadcasting are:
Two very useful vector operators are:
sum(x)
which returns the sum of the components of x
length(x)
which returns the length of x
x <- c(1, 2, 3, 4, 5)
sum <- sum(x)
length <- length(x)
cat("Here is the vector x:", x)
cat("The components of vector x sum to", sum)
cat("The length of vector x is", length)
Here is the vector x: ( 1 2 3 4 5 )
The components of vector x sum to 15
The length of vector x is 5
Given a vector \mathbf{x}= (x_1,\ldots,x_n) we want to compute sample mean and variance \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i \,, \qquad s^2 = \frac{\sum_{i=1}^n (x_i - \overline{x})^2 }{n-1}
mean(x)
computes the sample mean of x
sd(x)
computes the sample standard deviation of x
var(x)
computes the sample variance of x
Professional | x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | x_9 | x_{10} |
---|---|---|---|---|---|---|---|---|---|---|
Wage | 36 | 40 | 46 | 54 | 57 | 58 | 59 | 60 | 62 | 63 |
# First store the wage data into a vector
x <- c(36, 40, 46, 54, 57, 58, 59, 60, 62, 63)
# Compute the sample mean using formula
xbar = sum(x) / length(x)
# Compute the sample mean using built in R function
xbar_check = mean(x)
# We now print both results to screen
cat("Sample mean computed with formula is", xbar)
cat("Sample mean computed by R is", xbar_check)
cat("They coincide!")
Sample mean computed with formula is 53.5
Sample mean computed by R is 53.5
They coincide!
# Compute the sample variance using formula
xbar = mean(x)
n = length(x)
s2 = sum( (x - xbar)^2 ) / (n - 1)
# Compute the sample variance using built in R function
s2_check = var(x)
# We now print both results to screen
cat("Sample variance computed with formula is", s2)
cat("Sample variance computed by R is", s2_check)
cat("They coincide!")
Sample variance computed with formula is 90.27778
Sample variance computed by R is 90.27778
They coincide!
Goal of t-test: Estimate mean \mu of normal population N(\mu,\sigma^2). If \mu_0 is guess for \mu H_0 \colon \mu = \mu_0 \qquad H_1 \colon \mu \neq \mu_0
Method: Given sample X_1 ,\ldots,X_n we look at the statistics T = \frac{\overline{X}-\mu_0}{S/\sqrt{n}} \sim t_{n-1}
Given the data x_1,\ldots,x_n, compute the t-statistic t = \frac{\text{estimate } - \text{ hypothesised value}}{\mathop{\mathrm{e.s.e.}}} = \frac{\overline x - \mu_0}{s/\sqrt{n}} with sample mean and sample standard deviation \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i \,, \qquad s = \sqrt{\frac{\sum_{i=1}^n x_i^2 - n \overline{x}^2}{n-1}}
Find the critical value t^* = t_{n-1}(0.025) in Statistical Table 13.1
Given the sample x_1,\ldots,x_n, R can compute the t-statistic t = \frac{\text{estimate } - \text{ hypothesised value}}{\mathop{\mathrm{e.s.e.}}} = \frac{\overline x - \mu_0}{s/\sqrt{n}}
R can compute the precise p-value (no need for Statistical Tables) p = 2P(|t_{n-1}|>t)
Note: The above operations can be done at the same time by command t.test
data_vector <- c(x1, ..., xn)
data_vector
with null hypothesis mu0
using
t.test(data_vector, mu = mu0)
Relevant options of t.test
are
mu = mu0
tells R to test null hypothesis
H_0 \colon \mu = \mu_0
If mu = mu0
is not specified, R assumes \mu_0 = 0
alternative = "greater"
tells R to perform one-sided t-test
H_0 \colon \mu > \mu_0
alternative = "smaller"
tells R to perform one-sided t-test
H_0 \colon \mu < \mu_0
conf.level = n
changes the confidence interval level to n
(default is 0.95)
Let us go back to the 2008 Crisis example
Month | J | F | M | A | M | J | J | A | S | O | N | D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CCI 2007 | 86 | 86 | 88 | 90 | 99 | 97 | 97 | 96 | 99 | 97 | 90 | 90 |
CCI 2009 | 24 | 22 | 21 | 21 | 19 | 18 | 17 | 18 | 21 | 23 | 22 | 21 |
Difference | 62 | 64 | 67 | 69 | 80 | 79 | 80 | 78 | 78 | 74 | 68 | 69 |
We want to test if there was a change in CCI from 2007 to 2009
We interested in the difference in CCI
The null hypothesis is that there was (on average) no change in CCI H_0 \colon \mu = 0
The alternative hypothesis is that there was some change: H_1 \colon \mu \neq 0
# Enter CCI data in 2 vectors using function c()
score_2007 <- c(86, 86, 88, 90, 99, 97, 97, 96, 99, 97, 90, 90)
score_2009 <- c(24, 22, 21, 21, 19, 18, 17, 18, 21, 23, 22, 21)
# Compute vector of differences in CCI
difference <- score_2007 - score_2009
# Perform t-test on difference with null hypothesis mu = 0
# Store answer in "answer"
answer -> t.test(difference, mu = 0)
# Print the answer
print(answer)
One Sample t-test
data: difference
t = 38.144, df = 11, p-value = 4.861e-13
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
68.15960 76.50706
sample estimates:
mean of x
72.33333
One Sample t-test
t.test
has automatically assumed that a one-sample test is desireddata: difference
difference
t = 38.144, df = 11, p-value = 4.861e-13
This is the best part:
Note:
alternative hypothesis: true mean is not equal to 0
R tells us the alternative hypothesis is \mu \neq 0
Hence the Null hypothesis tested is H_0 \colon \mu = 0
Warning:
95 percent confidence interval:
68.15960 76.50706
This is a 95 \% confidence interval for the true mean:
95 percent confidence interval:
68.15960 76.50706
95 percent confidence interval:
68.15960 76.50706
R calculated the quantities \overline{x} \pm t_{n-1}(0.025) \times \mathop{\mathrm{e.s.e.}}
Based on the data, we conclude that the set of (hypothetical) mean values is \mu \in [68.15960, 76.50706]
sample estimates:
mean of x
72.33333
mean(difference)
The key information is:
R has extensive built in graphing functions:
Fancier graphing functions are contained in the library ggplot2
(see link)
However we will be using the basic built in R graphing functions
x
and y
of same lengthplot(x, y)
Example: Suppose to have data of weights and heights of 6 people
plot()
in R Console the plot will appear in a pop-up windowpch = 2
pch
stands for plotting characterplot(x, y)
lines
to linearly interpolate the scatter plot:
lines(x, y)
Let us plot the parabola y = x^2 \,, \qquad x \in [-1,1]
The previous plot was quite rough
This is because we only computed y=x^2 on the grid x = (-1, -0.5, 0, 0.5, 1)
We could refine the grid by hand, but this is not practical
To generate a finer grid we can use the built in R function
seq()
seq(from, to, by, length.out)
generates a vector containing a sequence:
from
– The beginning number of the sequenceto
– The ending number of the sequenceby
– The step-size of the sequence (the increment)length.out
– The total length of the sequenceExample: Generate the vector of even numbers from 2 to 20
Note: The following commands are equivalent:
seq(from = x1, to = x2, by = s)
seq(x1, x2, s)
Example: Generate the vector of odd numbers from 1 to 11
x <- seq(from = 1, to = 11, by = 2)
y <- seq(1, 11, 2)
cat("Vector x is: (", x, ")")
cat("Vector y is: (", y, ")")
cat("They are the same!")
Vector x is: ( 1 3 5 7 9 11 )
Vector y is: ( 1 3 5 7 9 11 )
They are the same!
Let us go back to the example of plotting random normal values
x
with 1000 random normal valuesx
via plot(x)
plot(x)
implicitly assumes that:
x
is the second argument: Values to plot on y-axisseq(1, 1000)
seq(1, 1000)
is the vector of components numbers of x
Comments
#