Statistical Models

Lecture 2:
Random samples

Outline of Lecture 2

  1. Probability revision III
  2. Multivariate random vectors
  3. Random samples
  4. Unbiased estimators
  5. Chi-squared distribution
  6. Sampling from normal distribution
  7. t-distribution

Part 1:
Probability revision III

Probability revision III

  • You are expected to be familiar with the main concepts from Y1 module
    Introduction to Probability & Statistics

  • Self-contained revision material available in Appendix A

Topics to review: Sections 6–7 of Appendix A

  • Independence of random variables
  • Covariance and correlation

Independence of random variables

Definition: Independence
(X,Y) random vector with joint pdf or pmf f_{X,Y} and marginal pdfs or pmfs f_X,f_Y. We say that X and Y are independent random variables if f_{X,Y}(x,y) = f_X(x)f_Y(y) \,, \quad \forall \, (x,y) \in \mathbb{R}^2

Independence of random variables

Conditional distributions and probabilities

If X and Y are independent then X gives no information on Y (and vice-versa):

  • Conditional distribution: the distribution of Y|X is the same as that of Y: f(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{f_X(x)f_Y(y)}{f_X(x)} = f_Y(y)

  • Conditional probabilities: From the above we also obtain \begin{align*} P(Y \in A | x) & = \sum_{y \in A} f(y|x) = \sum_{y \in A} f_Y(y) = P(Y \in A) & \, \text{ discrete rv} \\ P(Y \in A | x) & = \int_{y \in A} f(y|x) \, dy = \int_{y \in A} f_Y(y) \, dy = P(Y \in A) & \, \text{ continuous rv} \end{align*}

Independence of random variables

Characterization of independence - Densities

Theorem

(X,Y) random vector with joint pdf or pmf f_{X,Y}. The following are equivalent:

  • X and Y are independent random variables
  • There exist functions g(x) and h(y) such that f_{X,Y}(x,y) = g(x)h(y) \,, \quad \forall \, (x,y) \in \mathbb{R}^2

Note:

  • g(x) and h(y) are not necessarily the pdfs or pmfs of X and Y
  • However they coincide with f_X and f_Y, up to rescaling by a constant

Exercise

A student leaves for class between 8 AM and 8:30 AM and takes between 40 and 50 minutes to get there

  • Denote by X the time of departure

    • X = 0 corresponds to 8 AM
    • X = 30 corresponds to 8:30 AM
  • Denote by Y the travel time

  • Assume that X and Y are independent and uniformly distributed

Question: Find the probability that the student arrives at class before 9 AM

Solution

  • By assumption X is uniform on (0,30). Therefore f_X(x) = \begin{cases} \frac{1}{30} & \text{ if } \, x \in (0,30) \\ 0 & \text{ otherwise } \end{cases}

  • By assumption Y is uniform on (40,50). Therefore f_Y(y) = \begin{cases} \frac{1}{10} & \text{ if } \, y \in (40,50) \\ 0 & \text{ otherwise } \end{cases} where we used that 50 - 40 = 10

Solution

  • Define the rectangle R = (0,30) \times (40,50)

  • Since X and Y are independent, we get

f_{X,Y}(x,y) = f_X(x)f_Y(y) = \begin{cases} \frac{1}{300} & \text{ if } \, (x,y) \in R \\ 0 & \text{ otherwise } \end{cases}

Solution

  • The arrival time is given by X + Y

  • Therefore, the student arrives at class before 9 AM iff X + Y < 60

  • Notice that, on the support R of f_{X,Y}, the event \{X + Y < 60 \} corresponds to the region \{ (x,y) \in \mathbb{R}^2 \, \colon \, 0 \leq x < 60 - y, \, 40 \leq y < 50 \}

Solution

Therefore, the probability of arriving before 9 AM is

\begin{align*} P(\text{arrives before 9 AM}) & = P(X + Y < 60) \\ & = \int_{\{X+Y < 60\}} f_{X,Y} (x,y) \, dxdy \\ & = \int_{40}^{50} \left( \int_0^{60-y} \frac{1}{300} \, dx \right) \, dy \\ & = \frac{1}{300} \int_{40}^{50} (60 - y) \, dy \\ & = \frac{1}{300} \left( 60y - \frac{y^2}{2} \right) \Bigg|_{y=40}^{y=50} \\ & = \frac{1}{300} \cdot (1750 - 1600) = \frac12 \end{align*}
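The calculation above can be checked numerically. A minimal Monte Carlo sketch (an illustration added here, not part of the original slides, assuming NumPy is available):

```python
# Monte Carlo check of the exercise: X ~ uniform(0, 30) departure time (minutes
# after 8 AM), Y ~ uniform(40, 50) travel time, independent.
import numpy as np

rng = np.random.default_rng(0)
n_sim = 1_000_000
x = rng.uniform(0, 30, n_sim)
y = rng.uniform(40, 50, n_sim)

# Estimate P(X + Y < 60); should be close to the exact value 1/2
print((x + y < 60).mean())
```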

Consequences of independence

Theorem

Suppose X and Y are independent random variables. Then

  • For any A,B \subset \mathbb{R} we have P(X \in A, Y \in B) = P(X \in A) P(Y \in B)

  • Suppose g(x) is a function of (only) x, h(y) is a function of (only) y. Then {\rm I\kern-.3em E}[g(X)h(Y)] = {\rm I\kern-.3em E}[g(X)]{\rm I\kern-.3em E}[h(Y)]

Application: MGF of sums

Theorem
Suppose X and Y are independent random variables and denote by M_X and M_Y their MGFs. Then M_{X + Y} (t) = M_X(t) M_Y(t)

Proof: Follows by previous Theorem \begin{align*} M_{X + Y} (t) & = {\rm I\kern-.3em E}[e^{t(X+Y)}] = {\rm I\kern-.3em E}[e^{tX}e^{tY}] \\ & = {\rm I\kern-.3em E}[e^{tX}] {\rm I\kern-.3em E}[e^{tY}] \\ & = M_X(t) M_Y(t) \end{align*}

Example - Sum of independent normals

  • Suppose X \sim N (\mu_1, \sigma_1^2) and Y \sim N (\mu_2, \sigma_2^2) are independent normal random variables

  • We have seen in Lecture 1 that for normal distributions M_X(t) = \exp \left( \mu_1 t + \frac{t^2 \sigma_1^2}{2} \right) \,, \qquad M_Y(t) = \exp \left( \mu_2 t + \frac{t^2 \sigma_2^2}{2} \right)

  • Since X and Y are independent, from previous Theorem we get \begin{align*} M_{X+Y}(t) & = M_{X}(t) M_{Y}(t) = \exp \left( \mu_1 t + \frac{t^2 \sigma_1^2}{2} \right) \exp \left( \mu_2 t + \frac{t^2 \sigma_2^2}{2} \right) \\ & = \exp \left( (\mu_1 + \mu_2) t + \frac{t^2 (\sigma_1^2 + \sigma_2^2)}{2} \right) \end{align*}

Example - Sum of independent normals

  • Therefore Z := X + Y has moment generating function M_{Z}(t) = M_{X+Y}(t) = \exp \left( (\mu_1 + \mu_2) t + \frac{t^2 (\sigma_1^2 + \sigma_2^2)}{2} \right)

  • The above is the mgf of a normal distribution with \text{mean }\quad \mu_1 + \mu_2 \quad \text{ and variance} \quad \sigma_1^2 + \sigma_2^2

  • By the Theorem in Slide 68 of Lecture 1 we have Z \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)

  • Sum of independent normals is normal
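This conclusion can be illustrated with a quick simulation (a sketch, assuming NumPy and SciPy are available; the means, variances and seed are arbitrary illustrative choices):

```python
# Check that the sum of independent normals X ~ N(1, 2^2) and Y ~ N(-3, 1.5^2)
# behaves like N(-2, 2^2 + 1.5^2) = N(-2, 6.25).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sim = 500_000
z = rng.normal(1.0, 2.0, n_sim) + rng.normal(-3.0, 1.5, n_sim)

print(z.mean(), z.var())  # approximately -2 and 6.25
# Kolmogorov-Smirnov test against N(-2, sqrt(6.25)): large p-value expected
print(stats.kstest(z, 'norm', args=(-2.0, np.sqrt(6.25))))
```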

Covariance & Correlation

Relationship between RV

Given two random variables X and Y we said that

  • X and Y are independent if f_{X,Y}(x,y) = f_X(x) f_Y(y)

  • In this case there is no relationship between X and Y

  • This is reflected in the conditional distributions: X|Y \sim X \qquad \qquad Y|X \sim Y

Covariance & Correlation

Relationship between RV

If X and Y are not independent then there is a relationship between them

Question
How do we measure the strength of such dependence?

Answer: By introducing the notions of

  • Covariance
  • Correlation

Covariance

Definition

Notation: Given two rv X and Y we denote \begin{align*} & \mu_X := {\rm I\kern-.3em E}[X] \qquad & \mu_Y & := {\rm I\kern-.3em E}[Y] \\ & \sigma^2_X := {\rm Var}[X] \qquad & \sigma^2_Y & := {\rm Var}[Y] \end{align*}

Definition
The covariance of X and Y is the number {\rm Cov}(X,Y) := {\rm I\kern-.3em E}[ (X - \mu_X) (Y - \mu_Y) ]

Covariance

Alternative Formula

Theorem
The covariance of X and Y can be computed via {\rm Cov}(X,Y) = {\rm I\kern-.3em E}[XY] - {\rm I\kern-.3em E}[X]{\rm I\kern-.3em E}[Y]

Correlation

Remark:

  • {\rm Cov}(X,Y) encodes only qualitative information about the relationship between X and Y

  • To obtain quantitative information we introduce the correlation

Definition
The correlation of X and Y is the number \rho_{XY} := \frac{{\rm Cov}(X,Y)}{\sigma_X \sigma_Y}

Correlation detects linear relationships

Theorem

For any random variables X and Y we have

  • - 1\leq \rho_{XY} \leq 1
  • |\rho_{XY}|=1 if and only if there exist a,b \in \mathbb{R} such that P(Y = aX + b) = 1
    • If \rho_{XY}=1 then a>0 \qquad \qquad \quad (positive linear correlation)
    • If \rho_{XY}=-1 then a<0 \qquad \qquad (negative linear correlation)

Proof: Omitted, see page 172 of [1]

Correlation & Covariance

Independent random variables

Theorem
If X and Y are independent random variables then {\rm Cov}(X,Y) = 0 \,, \qquad \rho_{XY}=0

Proof:

  • If X and Y are independent then {\rm I\kern-.3em E}[XY]={\rm I\kern-.3em E}[X]{\rm I\kern-.3em E}[Y]
  • Therefore {\rm Cov}(X,Y)= {\rm I\kern-.3em E}[XY]-{\rm I\kern-.3em E}[X]{\rm I\kern-.3em E}[Y] = 0
  • Moreover \rho_{XY}=0 by definition

Formula for Variance

Variance is quadratic

Theorem
For any two random variables X and Y and a,b \in \mathbb{R} {\rm Var}[aX + bY] = a^2 {\rm Var}[X] + b^2 {\rm Var}[Y] + 2 ab \, {\rm Cov}(X,Y) If X and Y are independent then {\rm Var}[aX + bY] = a^2 {\rm Var}[X] + b^2 {\rm Var}[Y]

Proof: Exercise

Example 1

  • Assume X and Z are independent, and X \sim {\rm uniform} \left( 0,1 \right) \,, \qquad Z \sim {\rm uniform} \left( 0, \frac{1}{10} \right)

  • Consider the random variable Y = X + Z

  • Since X and Z are independent, and Z is uniform, we have that Y | X = x \, \sim \, {\rm uniform} \left( x, x + \frac{1}{10} \right) (adding x to Z simply shifts the uniform distribution of Z by x)

  • Question: Is the correlation \rho_{XY} between X and Y high or low?

Example 1

  • As Y | X \, \sim \, {\rm uniform} \left( X, X + \frac{1}{10} \right), the conditional pdf of Y given X = x is f(y|x) = \begin{cases} 10 & \text{ if } \, y \in \left( x , x + \frac{1}{10} \right) \\ 0 & \text{ otherwise} \end{cases}

  • As X \sim {\rm uniform} (0,1), its pdf is f_X(x) = \begin{cases} 1 & \text{ if } \, x \in \left( 0 , 1 \right) \\ 0 & \text{ otherwise} \end{cases}

  • Therefore, the joint distribution of (X,Y) is f_{X,Y}(x,y) = f(y|x)f_X(x) = \begin{cases} 10 & \text{ if } \, x \in (0,1) \, \text{ and } \, y \in \left( x , x + \frac{1}{10} \right) \\ 0 & \text{ otherwise} \end{cases}

Example 1

In gray: the region where f_{X,Y}(x,y)>0

  • When X increases, Y increases linearly (not surprising, since Y = X + Z)
  • We expect the correlation \rho_{XY} to be close to 1

Example 1 – Computing \rho_{XY}

  • For a random variable W \sim {\rm uniform} (a,b), we have {\rm I\kern-.3em E}[W] = \frac{a+b}{2} \,, \qquad {\rm Var}[W] = \frac{(b-a)^2}{12}

  • Since X \sim {\rm uniform} (0,1) and Z \sim {\rm uniform} (0,1/10), we have {\rm I\kern-.3em E}[X] = \frac12 \,, \qquad {\rm Var}[X] = \frac{1}{12} \,, \qquad {\rm I\kern-.3em E}[Z] = \frac{1}{20} \,, \qquad {\rm Var}[Z] = \frac{1}{1200}

  • Since X and Z are independent, we also have {\rm Var}[Y] = {\rm Var}[X + Z] = {\rm Var}[X] + {\rm Var}[Z] = \frac{1}{12} + \frac{1}{1200}

Example 1 – Computing \rho_{XY}

  • Since X and Z are independent, we have {\rm I\kern-.3em E}[XZ] = {\rm I\kern-.3em E}[X]{\rm I\kern-.3em E}[Z]

  • We conclude that \begin{align*} {\rm Cov}(X,Y) & = {\rm I\kern-.3em E}[XY] - {\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[Y] \\ & = {\rm I\kern-.3em E}[X(X + Z)] - {\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[X + Z] \\ & = {\rm I\kern-.3em E}[X^2] - {\rm I\kern-.3em E}[X]^2 + {\rm I\kern-.3em E}[XZ] - {\rm I\kern-.3em E}[X]{\rm I\kern-.3em E}[Z] \\ & = {\rm Var}[X] = \frac{1}{12} \end{align*}

Example 1 – Computing \rho_{XY}

  • The correlation between X and Y is \begin{align*} \rho_{XY} & = \frac{{\rm Cov}(X,Y)}{\sqrt{{\rm Var}[X]}\sqrt{{\rm Var}[Y]}} \\ & = \frac{\frac{1}{12}}{\sqrt{\frac{1}{12}} \sqrt{ \frac{1}{12} + \frac{1}{1200}} } = \sqrt{\frac{100}{101}} \end{align*}

  • As expected, we have very high correlation \rho_{XY} \approx 1

  • This confirms a very strong linear relationship between X and Y
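A quick numerical check of this value (a sketch, assuming NumPy; seed and sample size are arbitrary):

```python
# Example 1: X ~ uniform(0, 1), Z ~ uniform(0, 1/10) independent, Y = X + Z.
# The sample correlation should be close to sqrt(100/101) ≈ 0.995.
import numpy as np

rng = np.random.default_rng(2)
n_sim = 1_000_000
x = rng.uniform(0, 1, n_sim)
y = x + rng.uniform(0, 0.1, n_sim)

print(np.corrcoef(x, y)[0, 1], np.sqrt(100 / 101))
```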

Example 2

  • Assume X and Z are independent, and X \sim {\rm uniform} \left( -1,1 \right) \,, \qquad Z \sim {\rm uniform} \left( 0, \frac{1}{10} \right)

  • Define the random variable Y = X^2 + Z

  • Since X and Z are independent, and Z is uniform, we have that Y | X = x \, \sim \, {\rm uniform} \left( x^2, x^2 + \frac{1}{10} \right) (adding x^2 to Z simply shifts the uniform distribution of Z by x^2)

  • Question: Is the correlation \rho_{XY} between X and Y high or low?

Example 2

  • As Y | X \, \sim \, {\rm uniform} \left( X^2, X^2 + \frac{1}{10} \right), the conditional pdf of Y given X = x is f(y|x) = \begin{cases} 10 & \text{ if } \, y \in \left( x^2 , x^2 + \frac{1}{10} \right) \\ 0 & \text{ otherwise} \end{cases}

  • As X \sim {\rm uniform} (-1,1), its pdf is f_X(x) = \begin{cases} \frac12 & \text{ if } \, x \in \left( -1 , 1 \right) \\ 0 & \text{ otherwise} \end{cases}

  • Therefore, the joint distribution of (X,Y) is f_{X,Y}(x,y) = f(y|x)f_X(x) = \begin{cases} 10 & \text{ if } \, x \in (-1,1) \, \text{ and } \, y \in \left( x^2 , x^2 + \frac{1}{10} \right) \\ 0 & \text{ otherwise} \end{cases}

Example 2

In gray: the region where f_{X,Y}(x,y)>0

  • When X increases, Y increases quadratically (not surprising, as Y = X^2 + Z)
  • There is no linear relationship between X and Y \,\, \implies \,\, we expect \, \rho_{XY} \approx 0

Example 2 – Computing \rho_{XY}

  • Since X \sim {\rm uniform} (-1,1), we can compute that {\rm I\kern-.3em E}[X] = {\rm I\kern-.3em E}[X^3] = 0

  • Since X and Z are independent, we have {\rm I\kern-.3em E}[XZ] = {\rm I\kern-.3em E}[X]{\rm I\kern-.3em E}[Z] = 0

Example 2 – Computing \rho_{XY}

  • Compute the covariance \begin{align*} {\rm Cov}(X,Y) & = {\rm I\kern-.3em E}[XY] - {\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[Y] \\ & = {\rm I\kern-.3em E}[XY] \\ & = {\rm I\kern-.3em E}[X(X^2 + Z)] \\ & = {\rm I\kern-.3em E}[X^3] + {\rm I\kern-.3em E}[XZ] = 0 \end{align*}

  • The correlation between X and Y is \rho_{XY} = \frac{{\rm Cov}(X,Y)}{\sqrt{{\rm Var}[X]}\sqrt{{\rm Var}[Y]}} = 0

  • This confirms there is no linear relationship between X and Y
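The same kind of check for Example 2 (a sketch, assuming NumPy):

```python
# Example 2: X ~ uniform(-1, 1), Z ~ uniform(0, 1/10) independent, Y = X^2 + Z.
# X and Y are dependent, but the relationship is quadratic, so the sample
# correlation should be close to 0.
import numpy as np

rng = np.random.default_rng(3)
n_sim = 1_000_000
x = rng.uniform(-1, 1, n_sim)
y = x**2 + rng.uniform(0, 0.1, n_sim)

print(np.corrcoef(x, y)[0, 1])  # approximately 0
```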

Part 2:
Multivariate random vectors

Multivariate Random Vectors

Recall

  • A Random vector is a function \mathbf{X}\colon \Omega \to \mathbb{R}^n
  • \mathbf{X} is a multivariate random vector if n \geq 3
  • We denote the components of \mathbf{X} by \mathbf{X}= (X_1,\ldots,X_n) \,, \qquad X_i \colon \Omega \to \mathbb{R}
  • We denote the components of a point \mathbf{x}\in \mathbb{R}^n by \mathbf{x}= (x_1,\ldots,x_n)

Discrete and Continuous Multivariate Random Vectors

Everything we defined for bivariate vectors extends to multivariate vectors

Definition

The random vector \mathbf{X}\colon \Omega \to \mathbb{R}^n is:

  • continuous if the components X_i are continuous
  • discrete if the components X_i are discrete

Joint pmf

Definition
The joint pmf of a discrete random vector \mathbf{X} is f_{\mathbf{X}} \colon \mathbb{R}^n \to \mathbb{R} defined by f_{\mathbf{X}} (\mathbf{x}) = f_{\mathbf{X}}(x_1,\ldots,x_n) := P(X_1 = x_1 , \ldots , X_n = x_n ) \,, \qquad \forall \, \mathbf{x}\in \mathbb{R}^n

Note: For all A \subset \mathbb{R}^n it holds P(\mathbf{X}\in A) = \sum_{\mathbf{x}\in A} f_{\mathbf{X}}(\mathbf{x})

Joint pdf

Definition
The joint pdf of a continuous random vector \mathbf{X} is a function f_{\mathbf{X}} \colon \mathbb{R}^n \to \mathbb{R} such that P (\mathbf{X}\in A) := \int_A f_{\mathbf{X}}(x_1 ,\ldots, x_n) \, dx_1 \ldots dx_n = \int_{A} f_{\mathbf{X}}(\mathbf{x}) \, d\mathbf{x}\,, \quad \forall \, A \subset \mathbb{R}^n

Note: \int_A denotes an n-fold integral over all points \mathbf{x}\in A

Expected Value

Definition
\mathbf{X}\colon \Omega \to \mathbb{R}^n random vector and g \colon \mathbb{R}^n \to \mathbb{R} function. The expected value of the random variable g(\mathbf{X}) is \begin{align*} {\rm I\kern-.3em E}[g(\mathbf{X})] & := \sum_{\mathbf{x} \in \mathbb{R}^n} g(\mathbf{x}) f_{\mathbf{X}} (\mathbf{x}) \qquad & (\mathbf{X}\text{ discrete}) \\ {\rm I\kern-.3em E}[g(\mathbf{X})] & := \int_{\mathbb{R}^n} g(\mathbf{x}) f_{\mathbf{X}} (\mathbf{x}) \, d\mathbf{x}\qquad & \qquad (\mathbf{X}\text{ continuous}) \end{align*}

Marginal distributions

  • Marginal pmf or pdf of any subset of the coordinates (X_1,\ldots,X_n) can be computed by integrating or summing the remaining coordinates

  • To ease notation, we only define marginals wrt the first k coordinates

Definition
The marginal pmf or marginal pdf of the random vector \mathbf{X} with respect to the first k coordinates is the function f \colon \mathbb{R}^k \to \mathbb{R} defined by \begin{align*} f(x_1,\ldots,x_k) & := \sum_{ (x_{k+1}, \ldots, x_n) \in \mathbb{R}^{n-k} } f_{\mathbf{X}} (x_1 , \ldots , x_n) \quad & (\mathbf{X}\text{ discrete}) \\ f(x_1,\ldots,x_k) & := \int_{\mathbb{R}^{n-k}}f_{\mathbf{X}} (x_1 , \ldots, x_n ) \, dx_{k+1} \ldots dx_{n} \quad & \quad (\mathbf{X}\text{ continuous}) \end{align*}

Marginal distributions

We use a special notation for marginal pmf or pdf wrt a single coordinate

Definition
The marginal pmf or pdf of the random vector \mathbf{X} with respect to the i-th coordinate is the function f_{X_i} \colon \mathbb{R}\to \mathbb{R} defined by \begin{align*} f_{X_i}(x_i) & := \sum_{ \tilde{x} \in \mathbb{R}^{n-1} } f_{\mathbf{X}} (x_1, \ldots, x_n) \quad & (\mathbf{X}\text{ discrete}) \\ f_{X_i}(x_i) & := \int_{\mathbb{R}^{n-1}}f_{\mathbf{X}} (x_1, \ldots, x_n) \, d\tilde{x} \quad & \quad (\mathbf{X}\text{ continuous}) \end{align*} where \tilde{x} \in \mathbb{R}^{n-1} denotes the vector \mathbf{x} with i-th component removed \tilde{x} := (x_1, \ldots, x_{i-1}, x_{i+1},\ldots, x_n)

Conditional distributions

We now define conditional distributions given the first k coordinates

Definition
Let \mathbf{X} be a random vector and suppose that the marginal pmf or pdf wrt the first k coordinates satisfies f(x_1,\ldots,x_k) > 0 \,, \quad \forall \, (x_1,\ldots,x_k ) \in \mathbb{R}^k The conditional pmf or pdf of (X_{k+1},\ldots,X_n) given X_1 = x_1, \ldots , X_k = x_k is the function of (x_{k+1},\ldots,x_{n}) defined by f(x_{k+1},\ldots,x_n | x_1 , \ldots , x_k) := \frac{f_{\mathbf{X}}(x_1,\ldots,x_n)}{f(x_1,\ldots,x_k)}

Conditional distributions

Similarly, we can define the conditional distribution given the i-th coordinate

Definition
Let \mathbf{X} be a random vector and suppose that for a given x_i \in \mathbb{R} f_{X_i}(x_i) > 0 The conditional pmf or pdf of \tilde{X} given X_i = x_i is the function of \tilde{x} defined by f(\tilde{x} | x_i ) := \frac{f_{\mathbf{X}}(x_1,\ldots,x_n)}{f_{X_i}(x_i)} where we denote \tilde{X} := (X_1, \ldots, X_{i-1}, X_{i+1},\ldots, X_n) \,, \quad \tilde{x} := (x_1, \ldots, x_{i-1}, x_{i+1},\ldots, x_n)

Independence

Definition
\mathbf{X}=(X_1,\ldots,X_n) random vector with joint pmf or pdf f_{\mathbf{X}} and marginals f_{X_i}. We say that the random variables X_1,\ldots,X_n are mutually independent if f_{\mathbf{X}}(x_1,\ldots,x_n) = f_{X_1}(x_1) \cdot \ldots \cdot f_{X_n}(x_n) = \prod_{i=1}^n f_{X_i}(x_i)

Proposition
If X_1,\ldots,X_n are mutually independent then for all A_i \subset \mathbb{R} P(X_1 \in A_1 , \ldots , X_n \in A_n) = \prod_{i=1}^n P(X_i \in A_i)

Independence

Characterization result

Theorem

\mathbf{X}=(X_1,\ldots,X_n) random vector with joint pmf or pdf f_{\mathbf{X}}. The following are equivalent:

  • The random variables X_1,\ldots,X_n are mutually independent
  • There exist functions g_i(x_i) such that f_{\mathbf{X}}(x_1,\ldots,x_n) = \prod_{i=1}^n g_{i}(x_i)

Independence

A very useful theorem

Theorem
Let X_1,\ldots,X_n be mutually independent random variables and g_i(x_i) function only of x_i. Then the random variables g_1(X_1) \,, \ldots \,, g_n(X_n) are mutually independent

Proof: Omitted. See [1] page 184

Example: X_1,\ldots,X_n \, independent \,\, \implies \,\, X_1^2, \ldots, X_n^2 \, independent

Independence

Expectation of product

Theorem
Let X_1,\ldots,X_n be mutually independent random variables and g_i(x_i) functions. Then {\rm I\kern-.3em E}[ g_1(X_1) \cdot \ldots \cdot g_n(X_n) ] = \prod_{i=1}^n {\rm I\kern-.3em E}[g_i(X_i)]

Application: MGF of sums

Theorem
Let X_1,\ldots,X_n be mutually independent random variables, with mgfs M_{X_1}(t),\ldots, M_{X_n}(t). Define the random variable Z := X_1 + \ldots + X_n The mgf of Z satisfies M_Z(t) = \prod_{i=1}^n M_{X_i}(t)

Application: MGF of sums

Proof of Theorem

Follows by the previous Theorem

\begin{align*} M_{Z} (t) & = {\rm I\kern-.3em E}[e^{tZ}] \\ & = {\rm I\kern-.3em E}[\exp( t X_1 + \ldots + tX_n)] \\ & = {\rm I\kern-.3em E}\left[ e^{t X_1} \cdot \ldots \cdot e^{ t X_n} \right] \\ & = \prod_{i=1}^n {\rm I\kern-.3em E}[e^{tX_i}] \\ & = \prod_{i=1}^n M_{X_i}(t) \end{align*}

Example – Sum of independent Normals

Theorem
Let X_1,\ldots,X_n be mutually independent random variables with normal distribution X_i \sim N (\mu_i,\sigma_i^2). Define Z := X_1 + \ldots + X_n and \mu :=\mu_1 + \ldots + \mu_n \,, \quad \sigma^2 := \sigma_1^2 + \ldots + \sigma_n^2 Then Z is normally distributed with Z \sim N(\mu,\sigma^2)

Example – Sum of independent Normals

Proof of Theorem

  • We have seen in Lecture 1 that X_i \sim N(\mu_i,\sigma_i^2) \quad \implies \quad M_{X_i}(t) = \exp \left( \mu_i t + \frac{t^2 \sigma_i^2}{2} \right)

  • As X_1,\ldots,X_n are mutually independent, from the Theorem in Slide 47, we get \begin{align*} M_{Z}(t) & = \prod_{i=1}^n M_{X_i}(t) = \prod_{i=1}^n \exp \left( \mu_i t + \frac{t^2 \sigma_i^2}{2} \right) \\ & = \exp \left( (\mu_1 + \ldots + \mu_n) t + \frac{t^2 (\sigma_1^2 + \ldots +\sigma_n^2)}{2} \right) \\ & = \exp \left( \mu t + \frac{t^2 \sigma^2 }{2} \right) \end{align*}

Example – Sum of independent Normals

Proof of Theorem

  • Therefore Z has moment generating function M_{Z}(t) = \exp \left( \mu t + \frac{t^2 \sigma^2 }{2} \right)

  • The above is the mgf of a normal distribution with \text{mean }\quad \mu \quad \text{ and variance} \quad \sigma^2

  • Since mgfs characterize distributions (see Theorem in Slide 71 of Lecture 1), we conclude Z \sim N(\mu, \sigma^2 )

Example – Sum of independent Gammas

Theorem
Let X_1,\ldots,X_n be mutually independent random variables with Gamma distribution X_i \sim \Gamma (\alpha_i,\beta). Define Z := X_1 + \ldots + X_n and \alpha :=\alpha_1 + \ldots + \alpha_n Then Z has Gamma distribution Z \sim \Gamma(\alpha,\beta)

Example – Sum of independent Gammas

Proof of Theorem

  • We have seen in Lecture 1 that X_i \sim \Gamma(\alpha_i,\beta) \qquad \implies \qquad M_{X_i}(t) = \frac{\beta^{\alpha_i}}{(\beta-t)^{\alpha_i}}

  • As X_1,\ldots,X_n are mutually independent, from the Theorem in Slide 47, we get \begin{align*} M_{Z}(t) & = \prod_{i=1}^n M_{X_i}(t) = \prod_{i=1}^n \frac{\beta^{\alpha_i}}{(\beta-t)^{\alpha_i}} \\ & = \frac{\beta^{(\alpha_1 + \ldots + \alpha_n)}}{(\beta-t)^{(\alpha_1 + \ldots + \alpha_n)}} \\ & = \frac{\beta^{\alpha}}{(\beta-t)^{\alpha}} \end{align*}

Example – Sum of independent Gammas

Proof of Theorem

  • Therefore Z has moment generating function M_{Z}(t) = \frac{\beta^{\alpha}}{(\beta-t)^{\alpha}}

  • The above is the mgf of a Gamma distribution with parameters \alpha and \beta

  • Since mgfs characterize distributions (see Theorem in Slide 71 of Lecture 1), we conclude Z \sim \Gamma(\alpha, \beta )
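A numerical illustration of the Gamma-sum theorem (a sketch, assuming NumPy and SciPy; note that the lecture's \Gamma(\alpha,\beta) uses \beta as a rate, while NumPy and SciPy take a scale parameter, so scale = 1/\beta below):

```python
# Sum of independent Gamma(alpha_i, beta) variables (beta = rate) should be
# Gamma(alpha_1 + ... + alpha_n, beta).
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
alphas, beta = [0.5, 1.0, 2.5], 2.0
n_sim = 500_000

z = sum(rng.gamma(a, 1 / beta, n_sim) for a in alphas)

# Compare with Gamma(shape = sum(alphas), loc = 0, scale = 1/beta)
print(stats.kstest(z, 'gamma', args=(sum(alphas), 0, 1 / beta)))
print(z.mean(), sum(alphas) / beta)   # mean of Gamma(alpha, beta) is alpha/beta
```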

Expectation of sums

Expectation is linear

Theorem
For random variables X_1,\ldots,X_n and scalars a_1,\ldots,a_n we have {\rm I\kern-.3em E}[a_1X_1 + \ldots + a_nX_n] = a_1 {\rm I\kern-.3em E}[X_1] + \ldots + a_n {\rm I\kern-.3em E}[X_n]

Variance of sums

Variance is quadratic

Theorem
For random variables X_1,\ldots,X_n and scalars a_1,\ldots,a_n we have \begin{align*} {\rm Var}[a_1X_1 + \ldots + a_nX_n] = a_1^2 {\rm Var}[X_1] & + \ldots + a^2_n {\rm Var}[X_n] \\ & + 2 \sum_{i < j} a_i a_j \, {\rm Cov}(X_i,X_j) \end{align*} If X_1,\ldots,X_n are mutually independent then {\rm Var}[a_1X_1 + \ldots + a_nX_n] = a_1^2 {\rm Var}[X_1] + \ldots + a^2_n {\rm Var}[X_n]

Part 3:
Random samples

iid random variables

Definition

The random variables X_1,\ldots,X_n are independent identically distributed or iid with pdf or pmf f(x) if

  • X_1,\ldots,X_n are mutually independent
  • The marginal pdf or pmf of each X_i satisfies f_{X_i}(x) = f(x) \,, \quad \forall \, x \in \mathbb{R}

Random sample

  • Suppose the data in an experiment consists of observations on a population
  • Suppose the population has distribution f(x)
  • Each observation is labelled X_i
  • We always assume that the population is infinite
  • Therefore each X_i has distribution f(x)
  • We also assume the observations are independent

Definition
The random variables X_1,\ldots,X_n are a random sample of size n from the population f(x) if X_1,\ldots,X_n are iid with pdf or pmf f(x)

Random sample

Remark: Let X_1,\ldots,X_n be a random sample of size n from the population f(x). The joint distribution of \mathbf{X}= (X_1,\ldots,X_n) is f_{\mathbf{X}}(x_1,\ldots,x_n) = f(x_1) \cdot \ldots \cdot f(x_n) = \prod_{i=1}^n f(x_i) (since the X_is are mutually independent with distribution f)

Definition
We call f_{\mathbf{X}} the joint sample distribution

Random sample

Notation:

  • When the population distribution f(x) depends on a parameter \theta we write f = f(x|\theta)

  • In this case the joint sample distribution is f_{\mathbf{X}}(x_1,\ldots,x_n | \theta) = \prod_{i=1}^n f(x_i | \theta)

Example

  • Suppose a population has \mathop{\mathrm{Exponential}}(\beta) distribution f(x|\beta) = \frac{1}{\beta} e^{-x/\beta} \,, \qquad \text{ if } \,\, x > 0
  • Suppose X_1,\ldots,X_n is a random sample from the population f(x|\beta)
  • The joint sample distribution is then \begin{align*} f_{\mathbf{X}}(x_1,\ldots,x_n | \beta) & = \prod_{i=1}^n f(x_i|\beta) \\ & = \prod_{i=1}^n \frac{1}{\beta} e^{-x_i/\beta} \\ & = \frac{1}{\beta^n} e^{-(x_1 + \ldots + x_n)/\beta} \end{align*}

Example

  • We have P(X_1 > 2) = \int_{2}^\infty f(x|\beta) \, dx = \int_{2}^\infty \frac{1}{\beta} e^{-x/\beta} \, dx = e^{-2/\beta}

  • Thanks to iid assumption we can easily compute \begin{align*} P(X_1 > 2 , \ldots, X_n > 2) & = \prod_{i=1}^n P(X_i > 2) \\ & = \prod_{i=1}^n P(X_1 > 2) \\ & = P(X_1 > 2)^n \\ & = e^{-2n/\beta} \end{align*}
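The computation above can be reproduced by simulation (a sketch, assuming NumPy; \beta and n are arbitrary illustrative values):

```python
# P(X_1 > 2, ..., X_n > 2) for an iid Exponential(beta) sample, where beta is
# the mean (scale) parameter as in the lecture's parametrization.
import numpy as np

rng = np.random.default_rng(5)
beta, n = 3.0, 5
n_sim = 1_000_000

samples = rng.exponential(beta, size=(n_sim, n))   # each row is a random sample
print((samples > 2).all(axis=1).mean())            # Monte Carlo estimate
print(np.exp(-2 * n / beta))                       # exact value e^{-2n/beta}
```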

Part 4:
Unbiased estimators

Point estimation

Usual situation: Suppose a population has distribution f(x|\theta)

  • In general, the parameter \theta is unknown
  • Suppose that knowing \theta is sufficient to characterize f(x|\theta)

Example: A population could be normally distributed f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, \exp\left( -\frac{(x-\mu)^2}{2\sigma^2}\right) \,, \quad x \in \mathbb{R}

  • Here \mu is the mean and \sigma^2 the variance
  • Knowing \mu and \sigma^2 completely characterizes the normal distribution

Point estimation

Goal: We want to make predictions about the population

  • In order to do that, we need to know the population distribution f(x|\theta)

  • It is therefore desirable to determine \theta, with reasonable certainty

Definitions:

  • Point estimation is the procedure of estimating \theta from random sample

  • A point estimator is any function of a random sample W(X_1,\ldots,X_n)

  • Point estimators are also called statistics

Unbiased estimator

Definition

Suppose W is a point estimator of a parameter \theta

  • The bias of W is the quantity \rm{Bias}_{\theta} := {\rm I\kern-.3em E}[W] - \theta

  • W is an unbiased estimator if \rm{Bias}_{\theta} = 0, that is, {\rm I\kern-.3em E}[W] = \theta

Note: A point estimator W = W(X_1, \ldots, X_n) is itself a random variable. Thus {\rm I\kern-.3em E}[W] is the mean of such random variable

Next goal

  • We want to estimate mean and variance of a population

  • Unbiased estimators for such quantities are:

    • Sample mean
    • Sample variance

Estimating the population mean

Problem
Suppose we have a population with distribution f(x|\theta). We want to estimate the population mean \mu := \int_{\mathbb{R}} x f(x|\theta) \, dx

Sample mean

Definition
The sample mean of a random sample X_1,\ldots,X_n is the statistic W(X_1,\ldots,X_n) := \overline{X} := \frac{1}{n} \sum_{i=1}^n X_i

Sample mean

Sample mean is unbiased estimator of mean

Theorem
The sample mean \overline{X} is an unbiased estimator of the population mean \mu, that is, {\rm I\kern-.3em E}[\overline{X}] = \mu

Sample mean

Proof of theorem

  • X_1,\ldots,X_n is a random sample from f(x|\theta)

  • Therefore X_i \sim f(x|\theta) and {\rm I\kern-.3em E}[X_i] = \int_{\mathbb{R}} x f(x|\theta) \, dx = \mu

  • By linearity of expectation we have {\rm I\kern-.3em E}[\overline{X}] = \frac{1}{n} \sum_{i=1}^n {\rm I\kern-.3em E}[X_i] = \frac{1}{n} \sum_{i=1}^n \mu = \mu

  • This shows \overline{X} is an unbiased estimator of \mu

Variance of Sample mean

For reasons that will become clear later, it is useful to compute the variance of the sample mean \overline{X}

Lemma
X_1,\ldots,X_n random sample from population with mean \mu and variance \sigma^2. Then {\rm Var}[\overline{X}] = \frac{\sigma^2}{n}

Variance of Sample mean

Proof of Lemma

  • By assumption, the population has mean \mu and variance \sigma^2

  • Since X_i is sampled from the population, we have {\rm I\kern-.3em E}[X_i] = \mu \,, \quad {\rm Var}[X_i] = \sigma^2

  • Since the variance is quadratic, and the X_is are independent, \begin{align*} {\rm Var}[\overline{X}] & = {\rm Var}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] = \frac{1}{n^2} \sum_{i=1}^n {\rm Var}[X_i] \\ & = \frac{1}{n^2} \cdot n \sigma^2 = \frac{\sigma^2}{n} \end{align*}
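Both results, {\rm I\kern-.3em E}[\overline{X}] = \mu and {\rm Var}[\overline{X}] = \sigma^2/n, are easy to verify by simulation (a sketch, assuming NumPy; the normal population and the constants are arbitrary choices):

```python
# Simulate many samples of size n and look at the distribution of the sample mean.
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n = 1.0, 2.0, 25
n_rep = 200_000

xbars = rng.normal(mu, sigma, size=(n_rep, n)).mean(axis=1)
print(xbars.mean(), mu)              # unbiasedness: E[Xbar] = mu
print(xbars.var(), sigma**2 / n)     # Var[Xbar] = sigma^2 / n
```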

Estimating the population variance

Problem
Suppose we have a population f(x|\theta) with mean \mu and variance \sigma^2. We want to estimate the population variance

Sample variance

Definition
The sample variance of a random sample X_1,\ldots,X_n is the statistic S^2 := \frac{1}{n-1} \sum_{i=1}^n \left( X_i - \overline{X} \right)^2 where \overline{X} is the sample mean \overline{X} := \frac{1}{n} \sum_{i=1}^n X_i

Sample variance

Equivalent formulation

Proposition
It holds that S^2 := \frac{ \sum_{i=1}^n \left( X_i - \overline{X} \right)^2}{n-1} = \frac{ \sum_{i=1}^n X_i^2 - n\overline{X}^2 }{n-1}

Sample variance

Proof of Proposition

  • We have \begin{align*} \sum_{i=1}^n \left( X_i - \overline{X} \right)^2 & = \sum_{i=1}^n \left(X_i^2 + \overline{X}^2 - 2 X_i \overline{X} \right) = \sum_{i=1}^n X_i^2 + n\overline{X}^2 - 2 \overline{X} \sum_{i=1}^n X_i \\ & = \sum_{i=1}^n X_i^2 + n\overline{X}^2 - 2 n \overline{X}^2 = \sum_{i=1}^n X_i^2 -n \overline{X}^2 \end{align*}

  • Dividing by n-1 yields the desired identity S^2 = \frac{ \sum_{i=1}^n X_i^2 -n \overline{X}^2 }{n-1}

Sample variance

Sample variance is unbiased estimator of variance

Theorem
The sample variance S^2 is an unbiased estimator of the population variance \sigma^2, that is, {\rm I\kern-.3em E}[S^2] = \sigma^2

Sample variance

Proof of theorem

  • By linearity of expectation we infer {\rm I\kern-.3em E}[(n-1)S^2] = {\rm I\kern-.3em E}\left[ \sum_{i=1}^n X_i^2 - n\overline{X}^2 \right] = \sum_{i=1}^n {\rm I\kern-.3em E}[X_i^2] - n {\rm I\kern-.3em E}[\overline{X}^2]

  • Since X_i \sim f(x|\theta), we have {\rm I\kern-.3em E}[X_i] = \mu \,, \quad {\rm Var}[X_i] = \sigma^2

  • Therefore, by definition of variance, we infer {\rm I\kern-.3em E}[X_i^2] = {\rm Var}[X_i] + {\rm I\kern-.3em E}[X_i]^2 = \sigma^2 + \mu^2

Sample variance

Proof of theorem

  • Also recall that {\rm I\kern-.3em E}[\overline{X}] = \mu \,, \quad {\rm Var}[\overline{X}] = \frac{\sigma^2}{n}

  • By definition of variance, we get {\rm I\kern-.3em E}[\overline{X}^2] = {\rm Var}[\overline{X}] + {\rm I\kern-.3em E}[\overline{X}]^2 = \frac{\sigma^2}{n} + \mu^2

Sample variance

Proof of theorem

  • Hence \begin{align*} {\rm I\kern-.3em E}[(n-1)S^2] & = \sum_{i=1}^n {\rm I\kern-.3em E}[X_i^2] - n {\rm I\kern-.3em E}[\overline{X}^2] \\ & = \sum_{i=1}^n \left( \mu^2 + \sigma^2 \right) - n \left( \mu^2 + \frac{\sigma^2}{n} \right) \\ & = n\mu^2 + n\sigma^2 - n \mu^2 - \sigma^2 \\ & = (n-1) \sigma^2 \end{align*}

  • Dividing both sides by (n-1) yields the claim {\rm I\kern-.3em E}[S^2] = \sigma^2

Additional note

  • The sample variance is defined by S^2=\frac{\sum_{i=1}^{n} (X_i-\overline{X})^2}{n-1}=\frac{\sum_{i=1}^n X_i^2-n{\overline{X}^2}}{n-1}

  • Where does the n-1 factor in the denominator come from?
    (It would look more natural to divide by n rather than by n-1)

  • The n-1 factor reflects a loss of one degree of freedom:

    • Ideally, the sample variance S^2 would be computed using the population mean \mu
    • Since \mu is not available, we estimate it with the sample mean \overline{X}
    • This leads to the loss of 1 degree of freedom

Additional note

  • General statistical rule: \text{Lose 1 degree of freedom for each parameter estimated}

  • In the case of the sample variance S^2, we have to estimate one parameter (the population mean \mu). Hence \begin{align*} \text{degrees of freedom} & = \text{Sample size}-\text{No. of estimated parameters} \\ & = n-1 \end{align*}

  • This is where the n-1 factor comes from!
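A small simulation makes the effect of the n-1 denominator visible (a sketch, assuming NumPy; dividing by n systematically underestimates \sigma^2):

```python
# Compare the unbiased estimator (divide by n-1) with the biased one (divide by n).
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, n = 0.0, 3.0, 10
n_rep = 200_000

samples = rng.normal(mu, sigma, size=(n_rep, n))
s2_unbiased = samples.var(axis=1, ddof=1)   # denominator n-1
s2_biased = samples.var(axis=1, ddof=0)     # denominator n

print(s2_unbiased.mean())   # approximately sigma^2 = 9
print(s2_biased.mean())     # approximately sigma^2 * (n-1)/n = 8.1
```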

Notation

  • The realization of a random sample X_1,\ldots,X_n is denoted by x_1, \ldots, x_n

  • The realization of the sample mean \overline{X} is denoted \overline{x} := \frac{1}{n} \sum_{i=1}^n x_i

  • The realization of the sample variance S^2 is denoted s^2=\frac{\sum_{i=1}^{n}(x_i-\overline{x})^2}{n-1}=\frac{\sum_{i=1}^n x_i^2-n{\overline{x}^2}}{n-1}

  • Capital letters denote random variables, while lowercase letters denote specific values (realizations) of those variables

Exercise

Wage data on 10 Mathematicians


Mathematician   x_1   x_2   x_3   x_4   x_5   x_6   x_7   x_8   x_9   x_{10}
Wage             36    40    46    54    57    58    59    60    62    63


Question: Estimate population mean and variance

Solution to the Exercise

  • Number of mathematicians in the sample: n=10

  • Sample Mean: \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i = \frac{36+40+46+{\dots}+62+63}{10}=\frac{535}{10}=53.5

  • Sample Variance: \begin{align*} s^2 & = \frac{\sum_{i=1}^n x_{i}^2 - n \overline{x}^2}{n-1} \\ \sum_{i=1}^n x_i^2 & = 36^2+40^2+46^2+{\ldots}+62^2+63^2 = 29435 \\ s^2 & = \frac{29435-10(53.5)^2}{9} = 90.2778 \end{align*}
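The same numbers can be obtained directly in code (a sketch, assuming NumPy):

```python
import numpy as np

wages = np.array([36, 40, 46, 54, 57, 58, 59, 60, 62, 63])
print(wages.mean())          # sample mean: 53.5
print(wages.var(ddof=1))     # sample variance with n-1 denominator: 90.2778
```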

Part 5:
Chi-squared distribution

Overview

Chi-squared distribution:

  • defined in terms of squares of N(0, 1) random variables
  • designed to describe variance estimation
  • used to define other members of the normal family
    • Student t-distribution
    • F-distribution

Why the normal family is important

  • Classical hypothesis testing and regression problems
  • The same maths solves apparently unrelated problems
  • Easy to compute
    • Statistics tables
    • Software
  • Enables the development of approximate methods in more complex (and interesting) problems

Reminder: Normal distribution

  • X has normal distribution with mean \mu and variance \sigma^2 if pdf is f(x) := \frac{1}{\sqrt{2\pi\sigma^2}} \, \exp\left( -\frac{(x-\mu)^2}{2\sigma^2}\right) \,, \quad x \in \mathbb{R}

  • In this case we write X \sim N(\mu,\sigma^2)

  • The standard normal distribution is denoted N(0,1)

Chi-squared distribution

Definition

Definition
Let Z_1,\ldots,Z_r be iid N(0, 1) random variables. The chi-squared distribution with r degrees of freedom is the distribution \chi^2_r \sim Z^2_1 + \ldots + Z^2_r

Chi-squared distribution

Pdf characterization

Theorem
The \chi^2_r distribution is equivalent to a Gamma distribution \chi^2_r \sim \Gamma(r/2, 1/2) Therefore the pdf of \chi^2_r can be written in closed form as f_{\chi^2_r}(x)=\frac{x^{(r/2)-1} \, e^{-x/2}}{\Gamma(r/2) 2^{r/2}} \,, \quad x>0

Chi-squared distribution

Plots of chi-squared pdf for different choices of r

Proof of Theorem – Case r =1

  • We start with the case r=1
  • Need to prove that \chi^2_1 \sim \Gamma(1/2, 1/2)
  • Therefore we need to show that the pdf of \chi^2_1 is f_{\chi^2_1}(x)=\frac{x^{-1/2} \, e^{-x/2}}{\Gamma(1/2) 2^{1/2}} \,, \quad x>0

Proof of Theorem – Case r =1

  • To this end, notice that by definition \chi^2_1 \sim Z^2 \,, \qquad Z \sim N(0,1)
  • Hence, for x>0 we can compute cdf via \begin{align*} F_{\chi^2_1}(x) & = P(\chi^2_1 \leq x) \\ & = P(Z^2 \leq x ) \\ & = P(- \sqrt{x} \leq Z \leq \sqrt{x} ) \\ & = 2 P (0 \leq Z \leq \sqrt{x}) \end{align*} where in the last equality we used symmetry of Z around x=0

Proof of Theorem – Case r =1

  • Recalling the definition of standard normal pdf we get \begin{align*} F_{\chi^2_1}(x) & = 2 P (0 \leq Z \leq \sqrt{x}) \\ & = 2 \frac{1}{\sqrt{2\pi}} \int_0^{\sqrt{x}} e^{-t^2/2} \, dt \\ & = 2 \frac{1}{\sqrt{2\pi}} G( \sqrt{x} ) \end{align*} where we set G(x) := \int_0^{x} e^{-t^2/2} \, dt

Proof of Theorem – Case r =1

  • We can now compute pdf of \chi_1^2 by differentiating the cdf

  • By the Fundamental Theorem of Calculus we have G'(x) = \frac{d}{dx} \left( \int_0^{x} e^{-t^2/2} \, dt \right) = e^{-x^2/2} \quad \implies \quad G'(\sqrt{x}) = e^{-x/2}

  • Chain rule yields \begin{align*} f_{\chi^2_1}(x) & = \frac{d}{dx} F_{\chi^2_1}(x) = \frac{d}{dx} \left( 2 \frac{1}{\sqrt{2\pi}} G( \sqrt{x} ) \right) \\ & = 2 \frac{1}{\sqrt{2\pi}} G'( \sqrt{x} ) \frac{x^{-1/2}}{2} = \frac{x^{-1/2} e^{-x/2}}{2^{1/2} \sqrt{\pi}} \end{align*}

Proof of Theorem – Case r =1

  • It is well known that \Gamma(1/2) = \sqrt{\pi}
  • Hence, we conclude f_{\chi^2_1}(x) = \frac{x^{-1/2} e^{-x/2}}{2^{1/2} \sqrt{\pi}} = \frac{x^{-1/2} e^{-x/2}}{2^{1/2} \Gamma(1/2)}
  • This shows \chi_1^2 \sim \Gamma(1/2,1/2)

Proof of Theorem – Case r \geq 2

  • We need to prove that \chi^2_r \sim \Gamma(r/2, 1/2)

  • By definition \chi^2_r \sim Z^2_1 + \ldots + Z^2_r \,, \qquad Z_i \sim N(0,1) \quad \text{iid}

  • By the Theorem in Slide 46, we have Z_1,\ldots,Z_r \,\,\, \text{iid} \quad \implies \quad Z_1^2,\ldots,Z_r^2 \,\,\, \text{iid}

  • Moreover, by definition, Z_i^2 \sim \chi_1^2

  • Therefore, we have \chi^2_r = \sum_{i=1}^r X_i, \qquad X_i \sim \chi^2_1 \quad \text{iid}

Proof of Theorem – Case r \geq 2

  • We have just proven that \chi_1^2 \sim \Gamma (1/2,1/2)

  • Moreover, the Theorem in Slide 53 guarantees that Y_i \sim \Gamma(\alpha_i, \beta) \quad \text{independent} \quad \implies \quad Y_1 + \ldots + Y_n \sim \Gamma(\alpha,\beta) where \alpha = \alpha_1 + \ldots + \alpha_n

  • Therefore, we conclude that \chi^2_r = \sum_{i=1}^r X_i, \qquad X_i \sim \Gamma(1/2,1/2) \quad \text{iid} \quad \implies \quad \chi^2_r \sim \Gamma(r/2,1/2)
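The theorem can also be checked empirically (a sketch, assuming NumPy and SciPy; r is an arbitrary choice):

```python
# Sum of r squared N(0,1) variables compared against the chi^2_r distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
r, n_sim = 4, 500_000

z = rng.standard_normal(size=(n_sim, r))
chi2_samples = (z**2).sum(axis=1)

print(stats.kstest(chi2_samples, 'chi2', args=(r,)))   # large p-value expected
print(chi2_samples.mean(), chi2_samples.var())         # approximately r and 2r
```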

Part 6:
Sampling from normal distribution

Sampling from Normal distribution

Sample mean and variance: For a random sample X_1,\ldots,X_n they are defined by S^2 := \frac{1}{n-1} \sum_{i=1}^n \left( X_i - \overline{X} \right)^2 \,, \qquad \overline{X} := \frac{1}{n} \sum_{i=1}^n X_i

Question
Assume the sample is normal X_i \sim N(\mu,\sigma^2) \,, \quad \forall \, i = 1 , \ldots, n What are the distributions of \overline{X} and S^2?

Properties of Sample Mean and Variance

Theorem

Let X_1,\ldots,X_n be a random sample from N(\mu,\sigma^2). Then

  • \overline{X} and S^2 are independent random variables
  • \overline{X} and S^2 are distributed as follows \overline{X} \sim N(\mu,\sigma^2/n) \,, \qquad \frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2

Properties of Sample Mean and Variance

Proof of Theorem

  • To prove independence of \overline{X} and S^2 we make use of the following Lemma
  • Proof of this Lemma is technical and omitted
  • For a proof see Lemma 5.3.3 in [1]

Lemma
Let X and Y be normal random variables. Then X \text{ and } Y \text{ independent } \quad \iff \quad {\rm Cov}(X,Y) = 0

Properties of Sample Mean and Variance

Proof of Theorem

  • Note that X_i - \overline{X} and \overline{X} are normally distributed, being linear combinations of the independent normals X_1, \ldots, X_n

  • Therefore, we can apply the Lemma to X_i - \overline X and \overline{X}

  • To this end, recall that {\rm Var}[\overline X] = \sigma^2/n

  • Also note that, by independence of X_1,\ldots,X_n {\rm Cov}(X_i,X_j) = \begin{cases} {\rm Var}[X_i] & \text{ if } \, i = j \\ 0 & \text{ if } \, i \neq j \\ \end{cases}

Properties of Sample Mean and Variance

Proof of Theorem

  • Using bilinearity of covariance (i.e. linearity in both arguments) \begin{align*} {\rm Cov}(X_i - \overline X, \overline X) & = {\rm Cov}(X_i,\overline{X}) - {\rm Cov}(\overline X,\overline{X}) \\ & = \frac{1}{n} \sum_{j=1}^n {\rm Cov}(X_i,X_j) - {\rm Var}[\overline X] \\ & = \frac{1}{n} {\rm Var}[X_i] - {\rm Var}[\overline X] \\ & = \frac{1}{n} \sigma^2 - \frac{\sigma^2}{n} = 0 \end{align*}

  • By the Lemma, we infer independence of X_i - \overline X and \overline X

Properties of Sample Mean and Variance

Proof of Theorem

  • We have shown X_i - \overline X \quad \text{and} \quad \overline X \quad \text{independent}

  • By the Theorem in Slide 46, we hence have (X_i - \overline X)^2 \quad \text{and} \quad \overline X \quad \text{independent}

  • By the same Theorem we also get \sum_{i=1}^n (X_i - \overline X)^2 = (n-1)S^2 \quad \text{and} \quad \overline X \quad \text{independent}

  • Again the same Theorem, finally implies independence of S^2 and \overline X

Properties of Sample Mean and Variance

Proof of Theorem

  • We now want to show that \overline{X} \sim N(\mu,\sigma^2/n)

  • We are assuming that X_1,\ldots,X_n are iid with {\rm I\kern-.3em E}[X_i] = \mu \,, \qquad {\rm Var}[X_i] = \sigma^2

  • We have already seen in Slides 70 and 72 that, in this case, {\rm I\kern-.3em E}[\overline X] = \mu \,, \quad {\rm Var}[\overline{X}] = \frac{\sigma^2}{n}

  • Sum of independent normals is normal (see the Theorem in Slide 50)

  • Therefore \overline{X} is normal, with mean \mu and variance \sigma^2/n

Properties of Sample Mean and Variance

Proof of Theorem

  • We are left to prove that \frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2
    • This is somewhat technical and we don’t actually prove it
    • For a proof see Theorem 5.3.1 in [1]
    • We however want to provide some intuition on why it holds
  • Recall that the chi-squared distribution with r degrees of freedom is \chi_r^2 \sim Z_1^2 + \ldots + Z_r^2 with Z_i iid and N(0,1)

Properties of Sample Mean and Variance

Proof of Theorem

  • By definition of S^2 we have \frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^n \frac{(X_i - \overline X)^2}{\sigma^2}

  • If we replace the sample mean \overline X with the actual mean \mu we get the approximation \frac{(n-1)S^2}{\sigma^2} = \sum_{i=1}^n \frac{(X_i - \overline X)^2}{\sigma^2} \approx \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2}

Properties of Sample Mean and Variance

Proof of Theorem

  • Since X_i \sim N(\mu,\sigma^2), we have that Z_i := \frac{X_i - \mu}{\sigma} \sim N(0,1)

  • Therefore \frac{(n-1)S^2}{\sigma^2} \approx \sum_{i=1}^n \frac{(X_i - \mu)^2}{\sigma^2} = \sum_{i=1}^n Z_i^2 \sim \chi_n^2

  • The above is just an approximation:
    When replacing \mu with \overline X, we lose 1 degree of freedom \frac{(n-1)S^2}{\sigma^2} \sim \chi_{n-1}^2
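All three statements of the theorem can be illustrated by simulation (a sketch, assuming NumPy and SciPy; \mu, \sigma and n are arbitrary choices):

```python
# For normal samples: Xbar ~ N(mu, sigma^2/n), (n-1)S^2/sigma^2 ~ chi^2_{n-1},
# and Xbar is (at least) uncorrelated with S^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
mu, sigma, n = 2.0, 1.5, 8
n_rep = 200_000

samples = rng.normal(mu, sigma, size=(n_rep, n))
xbar = samples.mean(axis=1)
s2 = samples.var(axis=1, ddof=1)

print(stats.kstest(xbar, 'norm', args=(mu, sigma / np.sqrt(n))))
print(stats.kstest((n - 1) * s2 / sigma**2, 'chi2', args=(n - 1,)))
print(np.corrcoef(xbar, s2)[0, 1])   # approximately 0
```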

Part 7:
t-distribution

Estimating the Mean

Problem
Estimate the mean \mu of a normal population

What to do?

  • We can collect normal samples X_1, \ldots, X_n with X_i \sim N(\mu,\sigma^2)

  • We then compute the sample mean \overline X := \frac{1}{n} \sum_{i=1}^n X_i

  • We know that {\rm I\kern-.3em E}[\overline X] = \mu

\overline X approximates \mu

Question
How good is this approximation? How to quantify it?

Answer: We consider the Test Statistic T := \frac{\overline{X}-\mu}{\sigma/\sqrt{n}} \, \sim \,N(0,1)

  • This is because \overline X \sim N(\mu,\sigma^2/n) – see Slide 101

  • If \sigma is known, then the only unknown in T is \mu

T can be used to estimate \mu \quad \implies \quad Hypothesis Testing

Hypothesis testing

  • Suppose that \mu=\mu_0 (this is called the null hypothesis)

  • Using the data collected \mathbf{x}= (x_1,\ldots,x_n), we compute t := \frac{\overline{x}-\mu_0}{\sigma/\sqrt{n}} \,, \qquad \overline{x} = \frac{1}{n} \sum_{i=1}^n x_i

  • When \mu = \mu_0, the number t is a realization of the test statistic (random variable) T = \frac{\overline{X}-\mu_0}{\sigma/\sqrt{n}} \, \sim \,N(0,1)

  • Therefore, we can compute the probability of T being close to t p := P(T \approx t)

Hypothesis testing

Given the value p := P(T \approx t) we have 2 cases:

  • p is small \quad \implies \quad reject the null hypothesis \mu = \mu_0
    • p small means it is unlikely to observe such value of t
    • Recall that t depends only on the data \mathbf{x}, and on our guess \mu_0
    • We conclude that our guess must be wrong \quad \implies \quad \mu \neq \mu_0
  • p is large \quad \implies \quad do not reject the null hypothesis \mu = \mu_0
    • p large means that t occurs with reasonably high probability
    • There is no reason to believe our guess \mu_0 was wrong
    • But we also do not have sufficient reason to believe \mu_0 was correct

Important Remark

  • The key step in Hypothesis Testing is computing p = P(T \approx t)

  • This is only possible if we know the distribution of T = \frac{\overline{X}-\mu}{\sigma/\sqrt{n}}

  • If we assume that the variance \sigma^2 is known, then T \sim N(0,1) and p is easily computed
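For illustration, this is how p could be computed in the known-variance case (a sketch, assuming NumPy and SciPy; the data, \mu_0 and \sigma are hypothetical, and P(T \approx t) is made precise here as the usual two-sided p-value P(|T| \geq |t|)):

```python
# Two-sided p-value for the null hypothesis mu = mu_0 when sigma is known,
# using the fact that T ~ N(0,1) under the null.
import numpy as np
from scipy import stats

x = np.array([5.1, 4.8, 5.4, 5.0, 5.3, 4.9])   # hypothetical observations
mu_0, sigma = 5.0, 0.3                          # hypothesized mean, known sd

t_obs = (x.mean() - mu_0) / (sigma / np.sqrt(len(x)))
p_value = 2 * stats.norm.sf(abs(t_obs))         # P(|T| >= |t_obs|)
print(t_obs, p_value)
```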

Unknown variance

Problem
In general, the population variance \sigma^2 is unknown. What to do?

Idea: We can replace \sigma^2 with the sample variance S^2 = \frac{\sum_{i=1}^n X_i^2 - n \overline{X}^2}{n-1} The new test statistic is hence T := \frac{\overline{X}-\mu}{S/\sqrt{n}}

Distribution of the test statistic

Question
What is the distribution of

T := \frac{\overline{X}-\mu}{S/\sqrt{n}} \qquad ?

Answer: T has t-distribution with n-1 degrees of freedom

  • This is also known as Student’s t-distribution
  • Student was the pen name under which W.S. Gosset published his research
  • He worked as a brewer at Guinness, at the time the largest brewery in the world!
  • He used the t-distribution to study the chemical properties of barley from small samples [2]

t-distribution

Definition
A random variable T has Student’s t-distribution with p degrees of freedom, denoted by T \sim t_p \,, if the pdf of T is f_T(t) = \frac{\Gamma \left( \frac{p+1}{2} \right) }{\Gamma \left( \frac{p}{2} \right)} \, \frac{1}{(p\pi)^{1/2}} \, \frac{ 1 }{ (1 + t^2/p)^{(p+1)/2} } \,, \qquad t \in \mathbb{R}

Characterization of the t-distribution

Theorem
Let U \sim N(0,1) and V \sim \chi_p^2 be independent random variables. Then T := \frac{U}{\sqrt{V/p}} \, \sim \, t_p \,, that is, T has t-distribution with p degrees of freedom.

Proof: Given as exercise in Homework assignments

Distribution of t-statistic

As a consequence of the Theorem in previous slide we obtain:

Theorem
Let X_1,\ldots,X_n be a random sample from N(\mu,\sigma^2). Then the random variable T = \frac{\overline{X}-\mu}{S/\sqrt{n}} has t-distribution with n-1 degrees of freedom, that is, T \sim t_{n-1}

Distribution of t-statistic

Proof of Theorem

  • Since X_1,\ldots,X_n is a random sample from N(\mu,\sigma^2), we have that (see Slide 101) \overline{X} \sim N(\mu, \sigma^2/n)

  • Therefore, we can renormalize and obtain U := \frac{ \overline{X} - \mu }{ \sigma/\sqrt{n} } \sim N(0,1)

Distribution of t-statistic

Proof of Theorem

  • We have also shown that V := \frac{ (n-1) S^2 }{ \sigma^2 } \sim \chi_{n-1}^2

  • Finally, we can rewrite T as T = \frac{\overline{X}-\mu}{S/\sqrt{n}} = \frac{U}{ \sqrt{V/(n-1)} }

  • By the Theorem in Slide 118, we conclude that T \sim t_{n-1}
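A simulation check of this result (a sketch, assuming NumPy and SciPy):

```python
# The statistic T = (Xbar - mu) / (S / sqrt(n)) computed from normal samples
# should follow the t-distribution with n-1 degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(10)
mu, sigma, n = 0.0, 2.0, 6
n_rep = 200_000

samples = rng.normal(mu, sigma, size=(n_rep, n))
t_stats = (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

print(stats.kstest(t_stats, 't', args=(n - 1,)))   # large p-value expected
```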

Properties of t-distribution

Proposition: Expectation and Variance of t-distribution

Suppose that T \sim t_p. We have:

  • If p>1 then {\rm I\kern-.3em E}[T] = 0
  • If p>2 then {\rm Var}[T] = \frac{p}{p-2}

Notes:

  • We have to assume p>1: for p=1 the expectation does not exist (t_1 is the Cauchy distribution)
  • We have to assume p>2: for p \leq 2 the variance is not finite
  • {\rm I\kern-.3em E}[T] = 0 follows trivially from symmetry of the pdf f_T(t) around t=0
  • Computing {\rm Var}[T] is quite involved, and we skip it

t-distribution

Comparison with Standard Normal

The t_p distribution approximates the standard normal N(0,1):

  • t_p is symmetric around zero and bell-shaped, like N(0,1)
  • t_p has heavier tails compared to N(0,1)
  • While the variance of N(0,1) is 1, the variance of t_p is \frac{p}{p-2} > 1 (for p > 2)
  • We have that t_p \to N(0,1) \quad \text{as} \quad p \to \infty
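The heavier tails and the convergence to N(0,1) can be seen numerically (a sketch, assuming SciPy):

```python
# Compare P(|T| > 2) under t_p and N(0,1), and the variance p/(p-2), for growing p.
from scipy import stats

tail_norm = 2 * stats.norm.sf(2)
for p in [3, 5, 10, 30, 100]:
    tail_t = 2 * stats.t.sf(2, df=p)
    print(p, round(tail_t, 4), round(tail_norm, 4), round(p / (p - 2), 3))
```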

Plot: Comparison with Standard Normal

References

[1]
Casella, George, Berger, Roger L., Statistical inference, second edition, Brooks/Cole, 2002.
[2]
Gosset (Student), W.S., The probable error of a mean, Biometrika 6 (1908) 1–25.