Lecture 2
You are expected to know the main concepts from the Y1 module
Introduction to Probability & Statistics
Self-contained revision material available in Appendix A
Topics to review: Sections 4–8 of Appendix A
Recall: a random variable is a measurable function X \colon \Omega \to \mathbb{R}\,, \quad \Omega \,\, \text{ sample space}
A random vector is a measurable function \mathbf{X}\colon \Omega \to \mathbb{R}^n
The components of a random vector \mathbf{X} are denoted by \mathbf{X}= (X_1, \ldots, X_n), with each X_i \colon \Omega \to \mathbb{R} a random variable
We denote a bivariate (two-dimensional) random vector by (X,Y), with X,Y \colon \Omega \to \mathbb{R} random variables
| (X,Y) discrete random vector | (X,Y) continuous random vector |
|---|---|
| X and Y discrete RV | X and Y continuous RV |
| Joint pmf | Joint pdf |
| f_{X,Y}(x,y) := P(X=x,Y=y) | P((X,Y) \in A) = \int_A f_{X,Y}(x,y) \,dx\,dy |
| f_{X,Y} \geq 0 | f_{X,Y} \geq 0 |
| \sum_{(x,y)\in \mathbb{R}^2} f_{X,Y}(x,y)=1 | \int_{\mathbb{R}^2} f_{X,Y}(x,y) \,dx\,dy= 1 |
| Marginal pmfs | Marginal pdfs |
| f_X (x) := P(X=x) | P(a \leq X \leq b) = \int_a^b f_X(x) \,dx |
| f_Y (y) := P(Y=y) | P(a \leq Y \leq b) = \int_a^b f_Y(y) \,dy |
| f_X (x)=\sum_{y \in \mathbb{R}} f_{X,Y}(x,y) | f_X(x) = \int_{\mathbb{R}} f_{X,Y}(x,y) \,dy |
| f_Y (y)=\sum_{x \in \mathbb{R}} f_{X,Y}(x,y) | f_Y(y) = \int_{\mathbb{R}} f_{X,Y}(x,y) \,dx |
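As a concrete illustration of the discrete column above, here is a minimal sketch (assuming numpy is available; the joint pmf values are made up for illustration) that stores a joint pmf as a table and recovers both marginals by summing over the other variable.

```python
import numpy as np

# Hypothetical joint pmf of (X, Y): rows are the values x = 0, 1, 2,
# columns are the values y = 0, 1, and entries are P(X = x, Y = y).
f_XY = np.array([[0.10, 0.20],
                 [0.25, 0.15],
                 [0.20, 0.10]])

assert np.isclose(f_XY.sum(), 1.0)   # total probability equals 1

f_X = f_XY.sum(axis=1)   # marginal pmf of X: sum over y
f_Y = f_XY.sum(axis=0)   # marginal pmf of Y: sum over x

print("f_X =", f_X)      # [0.3  0.4  0.3]
print("f_Y =", f_Y)      # [0.55 0.45]
```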
(X,Y) rv with joint pdf (or pmf) f_{X,Y} and marginal pdfs (or pmfs) f_X, f_Y
The conditional pdf (or pmf) of Y given that X=x is the function f(\cdot | x) f(y|x) := \frac{f_{X,Y}(x,y)}{f_X(x)} \, , \qquad \text{ whenever} \quad f_X(x)>0
The conditional pdf (or pmf) of X given that Y=y is the function f(\cdot | y) f(x|y) := \frac{f_{X,Y}(x,y)}{f_Y(y)}\, , \qquad \text{ whenever} \quad f_Y(y)>0
Notation: We will often write {\rm I\kern-.3em E}[X] = {\rm I\kern-.3em E}[{\rm I\kern-.3em E}[X|Y]]
Note: The above formula contains an abuse of notation: the symbol {\rm I\kern-.3em E} is used with 3 different meanings (expectation of X, conditional expectation given Y, and expectation over Y)
Suppose (X,Y) is continuous
Recall that {\rm I\kern-.3em E}[X|Y] denotes the random variable g(Y) with g(y):= {\rm I\kern-.3em E}[X|y] := \int_{\mathbb{R}} xf(x|y) \, dx
Also recall that by definition f_{X,Y}(x,y)= f(x|y)f_Y(y)
Therefore \begin{align*} {\rm I\kern-.3em E}[{\rm I\kern-.3em E}[X|Y]] & = {\rm I\kern-.3em E}[g(Y)] = \int_{\mathbb{R}} g(y) f_Y(y) \, dy \\ & = \int_{\mathbb{R}} \left( \int_{\mathbb{R}} xf(x|y) \, dx \right) f_Y(y)\, dy = \int_{\mathbb{R}^2} x f(x|y) f_Y(y) \, dx dy \\ & = \int_{\mathbb{R}^2} x f_{X,Y}(x,y) \, dx dy = \int_{\mathbb{R}} x \left( \int_{\mathbb{R}} f_{X,Y}(x,y)\, dy \right) \, dx \\ & = \int_{\mathbb{R}} x f_{X}(x) \, dx = {\rm I\kern-.3em E}[X] \end{align*}
If (X,Y) is discrete, the result follows by replacing integrals with sums
Consider again the continuous random vector (X,Y) with joint pdf f_{X,Y}(x,y) := e^{-y} \,\, \text{ if } \,\, 0 < x < y \,, \quad f_{X,Y}(x,y) :=0 \,\, \text{ otherwise}
We have proven that {\rm I\kern-.3em E}[Y|X] = X + 1
We have also shown that f_X is exponential f_{X}(x) = \begin{cases} e^{-x} & \text{ if } x > 0 \\ 0 & \text{ if } x \leq 0 \end{cases}
From the knowledge of f_X we can compute {\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[X] = \int_0^\infty x e^{-x} \, dx = -(x+1)e^{-x} \bigg|_{x=0}^{x=\infty} = 1
Using the Theorem we can compute {\rm I\kern-.3em E}[Y] without computing f_Y: \begin{align*} {\rm I\kern-.3em E}[Y] & = {\rm I\kern-.3em E}[ {\rm I\kern-.3em E}[Y|X] ] \\ & = {\rm I\kern-.3em E}[X + 1] \\ & = {\rm I\kern-.3em E}[X] + 1 \\ & = 1 + 1 = 2 \end{align*}
In previous example: the conditional distribution of Y given X=x was f(y|x) = \begin{cases} e^{-(y-x)} & \text{ if } y > x \\ 0 & \text{ if } y \leq x \end{cases}
In particular f(y|x) depends on x
This means that knowledge of X gives information on Y
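The example above can be checked by simulation. A minimal sketch, assuming numpy and using the densities stated above: the marginal f_X is Exp(1) and, given X = x, the conditional density of Y is e^{-(y-x)} for y > x, so Y - X given X is Exp(1). The sample means should then be close to {\rm I\kern-.3em E}[X]=1 and {\rm I\kern-.3em E}[Y]=2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10**6

# Simulate from the joint density f(x, y) = e^{-y} for 0 < x < y:
# X ~ Exp(1) (the marginal), and given X = x, Y - x ~ Exp(1)
# (since f(y|x) = e^{-(y-x)} for y > x).
X = rng.exponential(1.0, size=n)
Y = X + rng.exponential(1.0, size=n)

print(X.mean())   # ≈ 1, i.e. E[X]
print(Y.mean())   # ≈ 2, i.e. E[Y] = E[E[Y|X]] = E[X + 1]
```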
When X does not give any information on Y we say that X and Y are independent
If X and Y are independent then X gives no information on Y (and vice-versa):
Conditional distribution: Y|X is same as Y f(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{f_X(x)f_Y(y)}{f_X(x)} = f_Y(y)
Conditional probabilities: From the above we also obtain \begin{align*} P(Y \in A | x) & = \sum_{y \in A} f(y|x) = \sum_{y \in A} f_Y(y) = P(Y \in A) & \, \text{ discrete rv} \\ P(Y \in A | x) & = \int_{y \in A} f(y|x) \, dy = \int_{y \in A} f_Y(y) \, dy = P(Y \in A) & \, \text{ continuous rv} \end{align*}
(X,Y) random vector with joint pdf or pmf f_{X,Y}. The following are equivalent:
Suppose X and Y are independent random variables. Then
For any A,B \subset \mathbb{R} we have P(X \in A, Y \in B) = P(X \in A) P(Y \in B)
Suppose g(x) is a function of (only) x, h(y) is a function of (only) y. Then {\rm I\kern-.3em E}[g(X)h(Y)] = {\rm I\kern-.3em E}[g(X)]{\rm I\kern-.3em E}[h(Y)]
Define the function p(x,y):=g(x)h(y). Then \begin{align*} {\rm I\kern-.3em E}[g(X)h(Y)] & = {\rm I\kern-.3em E}(p(X,Y)) = \int_{\mathbb{R}^2} p(x,y) f_{X,Y}(x,y) \, dxdy \\ & = \int_{\mathbb{R}^2} g(x)h(y) f_X(x) f_Y(y) \, dxdy \\ & = \left( \int_{-\infty}^\infty g(x) f_X(x) \, dx \right) \left( \int_{-\infty}^\infty h(y) f_Y(y) \, dy \right) \\ & = {\rm I\kern-.3em E}[g(X)] {\rm I\kern-.3em E}[h(Y)] \end{align*}
The proof in the discrete case is the same: replace integrals with sums
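A quick Monte Carlo sanity check of this factorisation (a sketch assuming numpy; the distributions of X and Y and the functions g, h are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10**6

# Independent X and Y (arbitrary choices)
X = rng.exponential(1.0, size=n)    # X ~ Exp(1)
Y = rng.normal(0.0, 1.0, size=n)    # Y ~ N(0, 1), independent of X

g = lambda x: x**2
h = lambda y: np.cos(y)

lhs = np.mean(g(X) * h(Y))               # estimate of E[g(X)h(Y)]
rhs = np.mean(g(X)) * np.mean(h(Y))      # estimate of E[g(X)] E[h(Y)]
print(lhs, rhs)                          # agree up to Monte Carlo error
```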
Define the product set A \times B :=\{ (x,y) \in \mathbb{R}^2 \colon x \in A , y \in B\}
Therefore we get \begin{align*} P(X \in A , Y \in B) & = \int_{A \times B} f_{X,Y}(x,y) \, dxdy \\ & = \int_{A \times B} f_X(x) f_Y(y) \, dxdy \\ & = \left(\int_{A} f_X(x) \, dx \right) \left(\int_{B} f_Y(y) \, dy \right) \\ & = P(X \in A) P(Y \in B) \end{align*}
Proof: Follows by previous Theorem \begin{align*} M_{X + Y} (t) & = {\rm I\kern-.3em E}[e^{t(X+Y)}] = {\rm I\kern-.3em E}[e^{tX}e^{tY}] \\ & = {\rm I\kern-.3em E}[e^{tX}] {\rm I\kern-.3em E}[e^{tY}] \\ & = M_X(t) M_Y(t) \end{align*}
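The identity M_{X+Y}(t) = M_X(t)M_Y(t) can also be checked empirically by estimating each mgf with the sample mean of e^{tX}. A minimal sketch, assuming numpy, with the arbitrary choices X \sim Exp(1), Y \sim U(0,1) and t = 0.4 (a value at which all the mgfs involved are finite):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10**6
t = 0.4   # evaluation point of the mgfs

X = rng.exponential(1.0, size=n)    # X ~ Exp(1)
Y = rng.uniform(0.0, 1.0, size=n)   # Y ~ U(0, 1), independent of X

M_X  = np.mean(np.exp(t * X))          # estimate of M_X(t)   (exact value 1/(1-t))
M_Y  = np.mean(np.exp(t * Y))          # estimate of M_Y(t)   (exact value (e^t - 1)/t)
M_XY = np.mean(np.exp(t * (X + Y)))    # estimate of M_{X+Y}(t)

print(M_XY, M_X * M_Y)   # ≈ equal, illustrating M_{X+Y}(t) = M_X(t) M_Y(t)
```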
Suppose X \sim N (\mu_1, \sigma_1^2) and Y \sim N (\mu_2, \sigma_2^2) are independent normal random variables
We have seen in Slide 119 in Lecture 1 that for normal distributions M_X(t) = \exp \left( \mu_1 t + \frac{t^2 \sigma_1^2}{2} \right) \,, \qquad M_Y(t) = \exp \left( \mu_2 t + \frac{t^2 \sigma_2^2}{2} \right)
Since X and Y are independent, from previous Theorem we get \begin{align*} M_{X+Y}(t) & = M_{X}(t) M_{Y}(t) = \exp \left( \mu_1 t + \frac{t^2 \sigma_1^2}{2} \right) \exp \left( \mu_2 t + \frac{t^2 \sigma_2^2}{2} \right) \\ & = \exp \left( (\mu_1 + \mu_2) t + \frac{t^2 (\sigma_1^2 + \sigma_2^2)}{2} \right) \end{align*}
Therefore Z := X + Y has moment generating function M_{Z}(t) = M_{X+Y}(t) = \exp \left( (\mu_1 + \mu_2) t + \frac{t^2 (\sigma_1^2 + \sigma_2^2)}{2} \right)
The above is the mgf of a normal distribution with \text{mean }\quad \mu_1 + \mu_2 \quad \text{ and variance} \quad \sigma_1^2 + \sigma_2^2
By the Theorem in Slide 132 of Lecture 1 we have Z \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)
Sum of independent normals is normal
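A minimal simulation sketch of this fact (assuming numpy and scipy; the parameter values below are arbitrary): sample independent X \sim N(1, 4) and Y \sim N(-3, 1) and compare the empirical distribution of Z = X + Y with N(-2, 5).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 10**5

X = rng.normal(1.0, 2.0, size=n)    # X ~ N(1, 4)   (numpy takes the standard deviation)
Y = rng.normal(-3.0, 1.0, size=n)   # Y ~ N(-3, 1), independent of X
Z = X + Y

print(Z.mean(), Z.var())            # ≈ -2 and ≈ 5, i.e. mu_1 + mu_2 and sigma_1^2 + sigma_2^2

# Kolmogorov-Smirnov test against N(-2, 5): a large p-value is consistent with normality
print(stats.kstest(Z, "norm", args=(-2.0, np.sqrt(5.0))))
```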
Given two random variables X and Y we said that
X and Y are independent if f_{X,Y}(x,y) = f_X(x) f_Y(y)
In this case there is no relationship between X and Y
This is reflected in the conditional distributions: X|Y \sim X \qquad \qquad Y|X \sim Y
If X and Y are not independent then there is a relationship between them
Question: How can we measure this relationship?
Answer: By introducing the notions of covariance and correlation
Notation: Given two rv X and Y we denote \begin{align*} & \mu_X := {\rm I\kern-.3em E}[X] \qquad & \mu_Y & := {\rm I\kern-.3em E}[Y] \\ & \sigma^2_X := {\rm Var}[X] \qquad & \sigma^2_Y & := {\rm Var}[Y] \end{align*}
The sign of {\rm Cov}(X,Y) gives information about the relationship between X and Y:
|  | X small: \, X<\mu_X | X large: \, X>\mu_X |
|---|---|---|
| Y small: \, Y<\mu_Y | (X-\mu_X)(Y-\mu_Y)>0 | (X-\mu_X)(Y-\mu_Y)<0 |
| Y large: \, Y>\mu_Y | (X-\mu_X)(Y-\mu_Y)<0 | (X-\mu_X)(Y-\mu_Y)>0 |
Hence, when X and Y tend to be small together or large together, the product (X-\mu_X)(Y-\mu_Y) is typically positive and the covariance is positive; when one tends to be large while the other is small, the covariance is negative:

|  | X small: \, X<\mu_X | X large: \, X>\mu_X |
|---|---|---|
| Y small: \, Y<\mu_Y | {\rm Cov}(X,Y)>0 | {\rm Cov}(X,Y)<0 |
| Y large: \, Y>\mu_Y | {\rm Cov}(X,Y)<0 | {\rm Cov}(X,Y)>0 |
Using linearity of {\rm I\kern-.3em E} and the fact that {\rm I\kern-.3em E}[c]=c for c \in \mathbb{R}: \begin{align*} {\rm Cov}(X,Y) : & = {\rm I\kern-.3em E}[ \,\, (X - {\rm I\kern-.3em E}[X]) (Y - {\rm I\kern-.3em E}[Y]) \,\, ] \\ & = {\rm I\kern-.3em E}\left[ \,\, XY - X {\rm I\kern-.3em E}[Y] - Y {\rm I\kern-.3em E}[X] + {\rm I\kern-.3em E}[X]{\rm I\kern-.3em E}[Y] \,\, \right] \\ & = {\rm I\kern-.3em E}[XY] - {\rm I\kern-.3em E}[ X {\rm I\kern-.3em E}[Y] ] - {\rm I\kern-.3em E}[ Y {\rm I\kern-.3em E}[X] ] + {\rm I\kern-.3em E}[{\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[Y]] \\ & = {\rm I\kern-.3em E}[XY] - {\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[Y] - {\rm I\kern-.3em E}[Y] {\rm I\kern-.3em E}[X] + {\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[Y] \\ & = {\rm I\kern-.3em E}[XY] - {\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[Y] \end{align*}
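A small numerical check of the identity {\rm Cov}(X,Y) = {\rm I\kern-.3em E}[XY] - {\rm I\kern-.3em E}[X]{\rm I\kern-.3em E}[Y] (a sketch assuming numpy; the dependent pair below is an arbitrary construction with {\rm Cov}(X,Y) = {\rm Var}(X) = 1):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10**6

# Arbitrary dependent pair: Y = X + noise, so Cov(X, Y) = Var(X) = 1
X = rng.normal(0.0, 1.0, size=n)
Y = X + rng.normal(0.0, 0.5, size=n)

cov_definition = np.mean((X - X.mean()) * (Y - Y.mean()))   # E[(X - mu_X)(Y - mu_Y)]
cov_shortcut   = np.mean(X * Y) - X.mean() * Y.mean()       # E[XY] - E[X]E[Y]

print(cov_definition, cov_shortcut)   # both ≈ 1, and equal up to floating point error
```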
Remark:
{\rm Cov}(X,Y) encodes only qualitative information about the relationship between X and Y
To obtain quantitative information we introduce the correlation
Correlation detects linear relationships between X and Y
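To illustrate the last point, here is a minimal sketch (assuming numpy; the examples are arbitrary) contrasting a linear and a nonlinear relationship: a nearly linear relationship gives correlation close to 1, whereas Y = X^2 with X symmetric about 0 is completely determined by X yet has correlation close to 0.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10**6
X = rng.normal(0.0, 1.0, size=n)

Y_lin  = 2.0 * X + 3.0 + rng.normal(0.0, 0.1, size=n)  # almost exactly linear in X
Y_quad = X**2                                          # deterministic, but nonlinear, function of X

print(np.corrcoef(X, Y_lin)[0, 1])    # ≈ 1: correlation detects the linear relationship
print(np.corrcoef(X, Y_quad)[0, 1])   # ≈ 0: strongly dependent, yet uncorrelated
```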
For any random variables X and Y we have
Proof: Omitted, see page 172 of [1]
Proof:
Proof: Exercise
Everything we defined for bivariate vectors extends to multivariate vectors
The random vector \mathbf{X}\colon \Omega \to \mathbb{R}^n is:
Note: For all A \subset \mathbb{R}^n it holds P(\mathbf{X}\in A) = \sum_{\mathbf{x}\in A} f_{\mathbf{X}}(\mathbf{x})
Note: \int_A denotes an n-fold integral over all points \mathbf{x}\in A
The marginal pmf or pdf of any subset of the coordinates of (X_1,\ldots,X_n) can be computed by integrating or summing out the remaining coordinates
To ease notation, we only define marginals with respect to the first k coordinates
We now define conditional distributions given the first k coordinates
Similarly, we can define the conditional distribution given the i-th coordinate
\mathbf{X}=(X_1,\ldots,X_n) random vector with joint pmf or pdf f_{\mathbf{X}}. The following are equivalent:
Proof: Omitted. See [1] page 184
Example: X_1,\ldots,X_n \, independent \,\, \implies \,\, X_1^2, \ldots, X_n^2 \, independent
We have seen in Slide 119 in Lecture 1 that if X_i \sim N(\mu_i,\sigma_i^2) then M_{X_i}(t) = \exp \left( \mu_i t + \frac{t^2 \sigma_i^2}{2} \right)
Since X_1,\ldots,X_n are mutually independent, from previous Theorem we get \begin{align*} M_{Z}(t) & = \prod_{i=1}^n M_{X_i}(t) = \prod_{i=1}^n \exp \left( \mu_i t + \frac{t^2 \sigma_i^2}{2} \right) \\ & = \exp \left( (\mu_1 + \ldots + \mu_n) t + \frac{t^2 (\sigma_1^2 + \ldots +\sigma_n^2)}{2} \right) \\ & = \exp \left( \mu t + \frac{t^2 \sigma^2 }{2} \right) \end{align*} where we set \mu := \mu_1 + \ldots + \mu_n and \sigma^2 := \sigma_1^2 + \ldots + \sigma_n^2
Therefore Z has moment generating function M_{Z}(t) = \exp \left( \mu t + \frac{t^2 \sigma^2 }{2} \right)
The above is the mgf of a normal distribution with \text{mean }\quad \mu \quad \text{ and variance} \quad \sigma^2
Since mgfs characterize distributions (see Theorem in Slide 132 of Lecture 1), we conclude Z \sim N(\mu, \sigma^2 )
We have seen in Slide 126 in Lecture 1 that if X_i \sim \Gamma(\alpha_i,\beta) then M_{X_i}(t) = \frac{\beta^{\alpha_i}}{(\beta-t)^{\alpha_i}}
Since X_1,\ldots,X_n are mutually independent we get \begin{align*} M_{Z}(t) & = \prod_{i=1}^n M_{X_i}(t) = \prod_{i=1}^n \frac{\beta^{\alpha_i}}{(\beta-t)^{\alpha_i}} \\ & = \frac{\beta^{(\alpha_1 + \ldots + \alpha_n)}}{(\beta-t)^{(\alpha_1 + \ldots + \alpha_n)}} \\ & = \frac{\beta^{\alpha}}{(\beta-t)^{\alpha}} \end{align*} where we set \alpha := \alpha_1 + \ldots + \alpha_n
Therefore Z has moment generating function M_{Z}(t) = \frac{\beta^{\alpha}}{(\beta-t)^{\alpha}}
The above is the mgf of a Gamma distribution with \text{shape }\quad \alpha \quad \text{ and rate} \quad \beta
Since mgfs characterize distributions (see Theorem in Slide 132 of Lecture 1), we conclude Z \sim \Gamma(\alpha, \beta )
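A final simulation sketch (assuming numpy and scipy; the shape values and the rate below are arbitrary): sum independent \Gamma(\alpha_i, \beta) samples and compare the result with \Gamma(\alpha_1 + \ldots + \alpha_n, \beta). Note that numpy and scipy parametrise the Gamma distribution by shape and scale = 1/\beta.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n = 10**5

alphas = [0.5, 1.0, 2.5]   # arbitrary shape parameters alpha_i
beta = 2.0                 # common rate parameter
scale = 1.0 / beta         # numpy/scipy use scale = 1/rate

# Z = X_1 + ... + X_n with X_i ~ Gamma(alpha_i, beta) independent
Z = sum(rng.gamma(a, scale, size=n) for a in alphas)

alpha = sum(alphas)        # 4.0
print(Z.mean(), Z.var())   # ≈ alpha/beta = 2 and ≈ alpha/beta^2 = 1

# Kolmogorov-Smirnov test against Gamma(alpha, beta): a large p-value supports the claim
print(stats.kstest(Z, "gamma", args=(alpha, 0.0, scale)))
```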