Statistical Models

Appendix A:
Probability revision

Introduction

  • In this Appendix we review some fundamental notions from Y1 module
    Introduction to Probability & Statistics

  • The mathematical descriptions might look (a bit) different, but the concepts are the same

Topics reviewed:

  • Sample space, Events
  • Probability measure
  • Conditional probability
  • Random Variables
  • Distribution, cdf, pmf, pdf
  • Expected value and Variance
  • Random vectors
  • Joint pdf and pmf, Marginals
  • Conditional distributions and expectation
  • Independence of random variables
  • Covariance and correlation
  • Multivariate random vectors

Outline of Appendix A

  1. Probability space
  2. Random variables
  3. Expected value
  4. Bivariate random vectors
  5. Conditional distributions
  6. Independence
  7. Covariance and correlation
  8. Multivariate random vectors

Part 1:
Probability space

Sample space

Definition: Sample space
A set \Omega of all possible outcomes of some experiment

Examples:

  • Coin toss: results in Heads = H and Tails = T \Omega = \{ H, T \}

  • Student grade for Statistical Models: a number between 0 and 100 \Omega = \{ x \in \mathbb{R} \, \colon \, 0 \leq x \leq 100 \} = [0,100]

Events

Definition: Event
A subset E of the sample space \Omega (including \emptyset and \Omega itself)

Operations with events:

  • Union of two events A and B A \cup B := \{ x \in \Omega \colon x \in A \, \text{ or } \, x \in B \}

  • Intersection of two events A and B A \cap B := \{ x \in \Omega \colon x \in A \, \text{ and } \, x \in B \}

Events

More Operations with events:

  • Complement of an event A A^c := \{ x \in \Omega \colon x \notin A \}

  • Infinite Union of a family of events A_i with i \in I

\bigcup_{i \in I} A_i := \{ x \in \Omega \colon x \in A_i \, \text{ for some } \, i \in I \}

  • Infinite Intersection of a family of events A_i with i \in I

\bigcap_{i \in I} A_i := \{ x \in \Omega \colon x \in A_i \, \text{ for all } \, i \in I \}

Events

Example: Consider the sample space and events \Omega := (0,1] \,, \quad A_i = \left[\frac{1}{i} , 1 \right] \,, \quad i \in \mathbb{N} Then \bigcup_{i \in \mathbb{N}} A_i = (0,1] \,, \quad \bigcap_{i \in \mathbb{N}} A_i = \{ 1 \}

Events

Definition: Disjoint
Two events A and B are disjoint if A \cap B = \emptyset. Events A_1, A_2, \ldots are pairwise disjoint if A_i \cap A_j = \emptyset \,, \quad \forall \, i \neq j

Events

Definition: Partition

The collection of events A_1, A_2, \ldots is a partition of \Omega if

  1. A_1, A_2, \ldots are pairwise disjoint
  2. \Omega = \cup_{i=1}^\infty A_i

What’s a Probability?

  • To each event E \subset \Omega we would like to associate a number P(E) \in [0,1]

  • The number P(E) is called the probability of E

  • The number P(E) models the frequency of occurrence of E:

    • P(E) small means E has low chance of occurring
    • P(E) large means E has high chance of occurring
  • Technical issue:

    • One cannot associate a number P(E) to every subset E of \Omega
    • The probability function P is only defined on a smaller family of events
    • Such a family of events is called a \sigma-algebra

\sigma-algebras

Definition: sigma-algebra

Let \mathcal{B} be a collection of events. We say that \mathcal{B} is a \sigma-algebra if

  1. \emptyset \in \mathcal{B}
  2. If A \in \mathcal{B} then A^c \in \mathcal{B}
  3. If A_1,A_2 , \ldots \in \mathcal{B} then \cup_{i=1}^\infty A_i \in \mathcal{B}

Remarks:

  • Since \emptyset \in \mathcal{B} and \emptyset^c = \Omega, we deduce that \Omega \in \mathcal{B}

  • Thanks to De Morgan’s Law we have that A_1,A_2 , \ldots \in \mathcal{B} \quad \implies \quad \cap_{i=1}^\infty A_i \in \mathcal{B}

\sigma-algebras

Examples

Suppose \Omega is any set:

  • Then \mathcal{B} = \{ \emptyset, \Omega \} is a \sigma-algebra

  • The power set of \Omega \mathcal{B} = \operatorname{Power} (\Omega) := \{ A \colon A \subset \Omega \} is a \sigma-algebra

\sigma-algebras

Examples

  • If \Omega has n elements then \mathcal{B} = \operatorname{Power} (\Omega) contains 2^n sets

  • If \Omega = \{ 1,2,3\} then \begin{align*} \mathcal{B} = \operatorname{Power} (\Omega) = \big\{ & \{1\} , \, \{2\}, \, \{3\} , \\ & \{1,2\} , \, \{2,3\}, \, \{1,3\} , \\ & \emptyset , \, \{1,2,3\} \big\} \end{align*}

  • If \Omega is uncountable then the power set of \Omega is not easy to describe

Lebesgue \sigma-algebra

Question
\mathbb{R} is uncountable. Which \sigma-algebra do we consider?

Definition: Lebesgue sigma-algebra
The Lebesgue \sigma-algebra on \mathbb{R} is the smallest \sigma-algebra \mathcal{L} containing all sets of the form (a,b) \,, \quad (a,b] \,, \quad [a,b) \,, \quad [a,b] for all a,b \in \mathbb{R}

Lebesgue \sigma-algebra

Important

Therefore the events of \mathbb{R} are

  • Intervals
  • Finite unions and intersections of intervals
  • Countable unions and intersections of intervals

Warning

  • I only told you that the Lebesgue \sigma-algebra \mathcal{L} exists
  • Explicitly constructing \mathcal{L} is not easy, see [1]

Probability measure

Suppose we are given:

  • \Omega sample space
  • \mathcal{B} a \sigma-algebra on \Omega
Definition: Probability measure

A probability measure on \Omega is a map P \colon \mathcal{B} \to [0,1] such that the Axioms of Probability hold

  1. P(\Omega) = 1
  2. If A_1, A_2,\ldots are pairwise disjoint then P\left( \bigcup_{i=1}^\infty A_i \right) = \sum_{i=1}^\infty P(A_i)

Properties of Probability

Let A, B \in \mathcal{B}. As a consequence of the Axioms of Probability:

  1. P(\emptyset) = 0
  2. If A and B are disjoint then P(A \cup B) = P(A) + P(B)
  3. P(A^c) = 1 - P(A)
  4. P(A) = P(A \cap B) + P(A \cap B^c)
  5. P(A\cup B) = P(A) + P(B) - P(A \cap B)
  6. If A \subset B then P(A) \leq P(B)

Properties of Probability

  1. Suppose A is an event and B_1,B_2, \ldots a partition of \Omega. Then P(A) = \sum_{i=1}^\infty P(A \cap B_i)
  2. Suppose A_1,A_2, \ldots are events. Then P\left( \bigcup_{i=1}^\infty A_i \right) \leq \sum_{i=1}^\infty P(A_i)

Example: Fair Coin Toss

  • The sample space for coin toss is \Omega = \{ H, T \}

  • We take as \sigma-algebra the power set of \Omega \mathcal{B} = \{ \emptyset , \, \{H\} , \, \{T\} , \, \{H,T\} \}

  • We suppose that the coin is fair

    • This means P \colon \mathcal{B} \to [0,1] satisfies P(\{H\}) = P(\{T\})
    • Assuming the above we get 1 = P(\Omega) = P(\{H\} \cup \{T\}) = P(\{H\}) + P(\{T\}) = 2 P(\{H\})
    • Therefore P(\{H\}) = P(\{T\}) = \frac12

Conditional Probability

Definition: Conditional Probability
Let A,B be events in \Omega with P(B)>0. The conditional probability of A given B is P(A|B) := \frac{P(A \cap B)}{P(B)}

Conditional Probability

Intuition

The conditional probability P(A|B) = \frac{P( A \cap B)}{P(B)} represents the probability of A, knowing that B has happened:

  • If B has happened, then B is the new sample space
  • Therefore A \cap B^c cannot happen, and we are only interested in A \cap B
  • Hence it makes sense to define P(A|B) \propto P(A \cap B)
  • We divide P(A\cap B) by P(B) so that P(A|B) \in [0,1] is still a probability
  • The function A \mapsto P(A|B) is a probability measure on \Omega

Bayes’ Rule

  • For two events A and B with P(A), P(B) > 0 it holds

P(A | B ) = P(B|A) \frac{P(A)}{P(B)}

  • Given a partition A_1, A_2, \ldots of the sample space we have

P(A_i | B ) = \frac{ P(B|A_i) P(A_i)}{\sum_{j=1}^\infty P(B | A_j) P(A_j)}
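A quick numerical sketch of Bayes' rule with a two-event partition (all numbers hypothetical, e.g. a diagnostic test):

```python
# Hypothetical numbers: a test for a condition with 1% prevalence
# Partition: A1 = "has condition", A2 = "does not"; B = "test positive"
P_A1, P_A2 = 0.01, 0.99
P_B_given_A1 = 0.95   # assumed true-positive rate
P_B_given_A2 = 0.05   # assumed false-positive rate

# Denominator: law of total probability over the partition
P_B = P_B_given_A1 * P_A1 + P_B_given_A2 * P_A2

# Bayes' rule
P_A1_given_B = P_B_given_A1 * P_A1 / P_B   # about 0.161
```

Note how a positive test raises the probability of A_1 from 1% to only about 16%, because the partition is so unbalanced.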

Independence

Definition
Two events A and B are independent if P(A \cap B) = P(A)P(B) A collection of events A_1 , \ldots ,A_n is mutually independent if for any subcollection A_{i_1}, \ldots, A_{i_k} it holds P \left( \bigcap_{j=1}^k A_{i_j} \right) = \prod_{j=1}^k P(A_{i_j})

Part 2:
Random variables

Random Variables

Motivation

  • Consider the experiment of flipping a coin 50 times
  • The sample space consists of 2^{50} elements
  • Elements are vectors of 50 entries recording the outcome H or T of each flip
  • This is a very large sample space!

Suppose we are only interested in X = \text{ number of } \, H \, \text{ in } \, 50 \, \text{flips}

  • Then the new sample space is the set of integers \{ 0,1,2,\ldots,50\}
  • This is much smaller!
  • X is called a Random Variable

Random Variables

Assume we are given

  • \Omega sample space
  • \mathcal{B} a \sigma-algebra of events on \Omega
  • P \colon \mathcal{B} \to [0,1] a probability measure

Definition: Random variable
A function X \colon \Omega \to \mathbb{R}

We will abbreviate Random Variable with rv

Random Variables

Technical remark

Definition: Random variable
A measurable function X \colon \Omega \to \mathbb{R}

Technicality: X is a measurable function if \{ X \in I \} := \{ \omega \in \Omega \colon X(\omega) \in I \} \in \mathcal{B} \,, \quad \forall \, I \in \mathcal{L} where

  • \mathcal{L} is the Lebesgue \sigma-algebra on \mathbb{R}
  • \mathcal{B} is the given \sigma-algebra on \Omega

Random Variables

Notation

  • In particular I \in \mathcal{L} can be of the form (a,b) \,, \quad (a,b] \,, \quad [a,b) \,, \quad [a,b] \,, \quad \forall \, a, b \in \mathbb{R}

  • In this case the set \{X \in I\} \in \mathcal{B} is denoted by, respectively: \{ a < X < b \} \,, \quad \{ a < X \leq b \} \,, \quad \{ a \leq X < b \} \,, \quad \{ a \leq X \leq b \}

  • If a=b=x then I=[x,x]=\{x\}. Then we denote \{X \in I\} = \{X = x\}

Distribution

Why do we require measurability?

Answer: Because it allows us to define a new probability measure on \mathbb{R}

Definition: Distribution
The distribution of a random variable X \colon \Omega \to \mathbb{R} is the probability measure on \mathbb{R} P_X \colon \mathcal{L} \to [0,1] \,, \quad P_X (I) := P \left( \{X \in I\} \right) \,, \,\, \forall \, I \in \mathcal{L}

Note:

  • One can show that P_X satisfies the Probability Axioms
  • Thus P_X is a probability measure on \mathbb{R}
  • In the future we will denote P \left( X \in I \right) := P \left( \{X \in I\} \right)

Distribution

Why is the distribution useful?

Answer: Because it allows us to define a random variable X

  • by specifying the distribution values P \left( X \in I \right)
  • rather than defining an explicit function X \colon \Omega \to \mathbb{R}

Important: More often than not

  • We care about the distribution of X
  • We do not care about how X is defined

Example - Three coin tosses

  • Sample space \Omega given by the below values of \omega
\omega HHH HHT HTH THH TTH THT HTT TTT
  • The probability of each outcome is the same P(\omega) = \frac{1}{2} \times \frac{1}{2} \times \frac{1}{2} = \frac{1}{8} \,, \quad \forall \, \omega \in \Omega

  • Define the random variable X \colon \Omega \to \mathbb{R} by X(\omega) := \text{ Number of H in } \omega

\omega HHH HHT HTH THH TTH THT HTT TTT
X(\omega) 3 2 2 2 1 1 1 0

Example - Three coin tosses

  • Recall the definition of X
\omega HHH HHT HTH THH TTH THT HTT TTT
X(\omega) 3 2 2 2 1 1 1 0
  • The range of X is \{0,1,2,3\}

  • Hence the only interesting values of P_X are P(X=0) \,, \quad P(X=1) \,, \quad P(X=2) \,, \quad P(X=3)

Example - Three coin tosses

  • Recall the definition of X
\omega HHH HHT HTH THH TTH THT HTT TTT
X(\omega) 3 2 2 2 1 1 1 0
  • We compute \begin{align*} P(X=0) & = P(TTT) = \frac{1}{8} \\ P(X=1) & = P(TTH) + P(THT) + P(HTT) = \frac{3}{8} \\ P(X=2) & = P(HHT) + P(HTH) + P(THH) = \frac{3}{8} \\ P(X=3) & = P(HHH) = \frac{1}{8} \end{align*}

Example - Three coin tosses

  • Recall the definition of X
\omega HHH HHT HTH THH TTH THT HTT TTT
X(\omega) 3 2 2 2 1 1 1 0
  • The distribution of X is summarized in the table below
x 0 1 2 3
P(X=x) \frac{1}{8} \frac{3}{8} \frac{3}{8} \frac{1}{8}
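The pmf table can be reproduced by brute-force enumeration of \Omega (a short Python sketch):

```python
from itertools import product

# The 8 equally likely outcomes of three tosses
omega = ["".join(t) for t in product("HT", repeat=3)]

# X = number of H; P(X = x) = (# outcomes with x heads) / 8
pmf = {x: sum(w.count("H") == x for w in omega) / 8 for x in range(4)}
# pmf == {0: 0.125, 1: 0.375, 2: 0.375, 3: 0.125}
```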

Cumulative Distribution Function

Recall: The distribution of a rv X \colon \Omega \to \mathbb{R} is the probability measure on \mathbb{R} P_X \colon \mathcal{L} \to [0,1] \,, \quad P_X (I) := P \left( X \in I \right) \,, \,\, \forall \, I \in \mathcal{L}

Definition: cdf
The cumulative distribution function or cdf of a rv X \colon \Omega \to \mathbb{R} is F_X \colon \mathbb{R} \to \mathbb{R} \,, \quad F_X(x) := P (X \leq x)

Cumulative Distribution Function

Intuition

  • F_X is the primitive of P_X:
    • Recall from Analysis: The primitive of a continuous function g \colon \mathbb{R}\to \mathbb{R} is G(x):=\int_{-\infty}^x g(y) \,dy
    • Note that P_X is not a function but a distribution
    • However the definition of cdf as a primitive still makes sense
  • P_X will be the derivative of F_X, in a suitable generalized sense
    • Recall from Analysis: Fundamental Theorem of Calculus says G'(x)=g(x)
    • Since F_X is the primitive of P_X, it will still hold F_X'=P_X in the sense of distributions

Distribution Function

Example

  • Consider again 3 coin tosses and the rv X(\omega) := \text{ Number of H in } \omega

  • We computed that the distribution P_X of X is

x 0 1 2 3
P(X=x) \frac{1}{8} \frac{3}{8} \frac{3}{8} \frac{1}{8}
  • One can compute F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ \frac{1}{8} & \text{if } 0 \leq x < 1 \\ \frac{1}{2} & \text{if } 1 \leq x < 2 \\ \frac{7}{8} & \text{if } 2 \leq x < 3 \\ 1 & \text{if } 3 \leq x \end{cases}
  • For example \begin{align*} F_X(2.1) & = P(X \leq 2.1) \\ & = P(X=0,1 \text{ or } 2) \\ & = P(X=0) + P(X=1) + P(X=2) \\ & = \frac{1}{8} + \frac{3}{8} + \frac{3}{8} = \frac{7}{8} \end{align*}
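The step cdf can be sketched in Python from the pmf table, summing f_X over the values \leq x:

```python
# pmf of X from the table above
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

def F(x):
    # cdf: sum the pmf over all values k <= x
    return sum(p for k, p in pmf.items() if k <= x)

# Matches the piecewise formula, e.g. F(2.1) = 7/8
assert F(-1) == 0 and F(2.1) == 0.875 and F(3) == 1.0
```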

Cumulative Distribution Function

Example

  • Plot of F_X: it is a step function
  • F_X'=0 except at x=0,1,2,3
  • F_X jumps at x=0,1,2,3
  • Size of jump at x is P(X=x)
  • F_X'=P_X in the sense of distributions
    (Advanced analysis concept - not covered)

Discrete Random Variables

In the previous example:

  • The cdf F_X had jumps
  • Hence F_X was discontinuous
  • We take this as definition of discrete rv

Definition
X \colon \Omega \to \mathbb{R} is discrete if F_X has jumps

Probability Mass Function

  • In this slide X is a discrete rv
  • Therefore F_X has jumps

Definition
The Probability Mass Function or pmf of a discrete rv X is f_X \colon \mathbb{R} \to \mathbb{R} \,, \quad f_X(x) := P(X = x)

Probability Mass Function

Proposition

The pmf f_X(x) = P(X=x) can be used to

  • compute probabilities P(a \leq X \leq b) = \sum_{k = a}^b f_X (k) \,, \quad \forall \, a,b \in \mathbb{Z} \,, \,\, a \leq b
  • compute the cdf F_X(x) = P(X \leq x) = \sum_{k=-\infty}^x f_X(k)

Example 1 - Discrete RV

  • Consider again 3 coin tosses and the RV X(\omega) := \text{ Number of H in } \omega

  • The pmf of X is f_X(x):=P(X=x), which we have already computed

x 0 1 2 3
f_X(x)= P(X=x) \frac{1}{8} \frac{3}{8} \frac{3}{8} \frac{1}{8}

Example 2 - Geometric Distribution

  • Suppose p \in (0,1) is a given probability of success
  • Hence 1-p is probability of failure
  • Consider the random variable X = \text{ Number of attempts to obtain first success}
  • Since each trial is independent, the pmf of X is f_X (x) = P(X=x) = (1-p)^{x-1} p \,, \quad \forall \, x \in \mathbb{N}
  • This is called geometric distribution

Example 2 - Geometric Distribution

  • We want to compute the cdf of X: For x \in \mathbb{N} with x > 0 \begin{align*} F_X(x) & = P(X \leq x) = \sum_{k=1}^x P(X=k) = \sum_{k=1}^x f_X(k) \\ & = \sum_{k=1}^x (1-p)^{k-1} p = \frac{1-(1-p)^x}{1-(1-p)} p = 1 - (1-p)^x \end{align*} where we used the formula for the sum of geometric series: \sum_{k=1}^x t^{k-1} = \frac{1-t^x}{1-t} \,, \quad t \neq 1
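The closed form can be sanity-checked against the direct sum (a sketch, with an assumed success probability p = 0.3):

```python
p = 0.3  # assumed success probability for the check

def cdf_sum(x):
    # Direct sum of the pmf f_X(k) = (1-p)^(k-1) p over k = 1..x
    return sum((1 - p) ** (k - 1) * p for k in range(1, x + 1))

def cdf_closed(x):
    # Closed form 1 - (1-p)^x from the geometric series
    return 1 - (1 - p) ** x

assert all(abs(cdf_sum(x) - cdf_closed(x)) < 1e-12 for x in range(1, 30))
```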

Example 2 - Geometric Distribution

  • F_X is flat between two consecutive natural numbers: \begin{align*} F_X(x+k) & = P(X \leq x+k) \\ & = P(X \leq x) \\ & = F_X(x) \end{align*} for all x \in \mathbb{N}, k \in [0,1)

  • Therefore F_X has jumps and X is discrete

Continuous Random Variables

Recall: X is discrete if F_X has jumps

Definition: Continuous Random Variable
X \colon \Omega \to \mathbb{R} is continuous if F_X is continuous

Probability Mass Function?

  • Suppose X is a continuous rv
  • Therefore F_X is continuous

Question
Can we define the Probability Mass Function for X?

Answer:

  • Yes we can, but it would be useless: the pmf carries no information
  • This is because f_X(x) = P(X=x) = 0 \,, \quad \forall \, x \in \mathbb{R}

Probability Mass Function?

  • Indeed, for all \varepsilon>0 we have \{ X = x \} \subset \{ x - \varepsilon < X \leq x \}
  • Therefore by the properties of probabilities we have \begin{align*} P (X = x ) & \leq P( x - \varepsilon < X \leq x ) \\ & = P(X \leq x) - P(X \leq x - \varepsilon) \\ & = F_X(x) - F_X(x-\varepsilon) \end{align*} where we also used the definition of F_X
  • Since F_X is continuous we get 0 \leq P(X = x) \leq \lim_{\varepsilon \to 0} F_X(x) - F_X(x-\varepsilon) = 0
  • Then f_X(x) = P(X=x) = 0 for all x \in \mathbb{R}

Probability Density Function

The pmf carries no information for a continuous RV, so we instead define the pdf

Definition
The Probability Density Function or pdf of a continuous rv X is a function f_X \colon \mathbb{R} \to \mathbb{R} s.t. F_X(x) = \int_{-\infty}^x f_X(t) \, dt \,, \quad \forall \, x \in \mathbb{R}

Technical issue:

  • If X is continuous then pdf does not exist in general
    (absolute continuity is required)
  • Counterexamples are rare, therefore we will assume existence of pdf

Probability Density Function

Properties

Proposition

Suppose X is a continuous rv. The following hold:

  • The cdf F_X is continuous and differentiable (a.e.) with F_X' = f_X

  • Probability can be computed via P(a \leq X \leq b) = \int_{a}^b f_X (t) \, dt \,, \quad \forall \, a,b \in \mathbb{R} \,, \,\, a \leq b

Example - Logistic Distribution

  • The random variable X has logistic distribution if its pdf is f_X(x) = \frac{e^{-x}}{(1+e^{-x})^2}

Example - Logistic Distribution

  • The random variable X has logistic distribution if its pdf is f_X(x) = \frac{e^{-x}}{(1+e^{-x})^2}

  • The cdf can be computed to be F_X(x) = \int_{-\infty}^x f_X(t) \, dt = \frac{1}{1+e^{-x}}

  • The RHS is known as the logistic function

Example - Logistic Distribution

Application: The logistic function models the expected score in chess (see Wikipedia)

  • R_A is the Elo rating of player A, R_B is the Elo rating of player B
  • E_A is expected score of player A: E_A := P(A \text{ wins}) + \frac12 P(A \text{ draws})
  • E_A modelled by logistic function E_A := \frac{1}{1+ 10^{(R_B-R_A)/400} }
  • Example: Beginner is rated 1000, International Master is rated 2400 R_{\rm Begin} = 1000, \quad R_{\rm IM}=2400 , \quad E_{\rm Begin} = \frac{1}{1 + 10^{1400/400}} = 0.00031612779
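The Elo computation above, as a short Python sketch:

```python
def expected_score(r_a, r_b):
    # Logistic model for player A's expected score against player B
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# Beginner (1000) vs International Master (2400)
e_begin = expected_score(1000, 2400)   # about 0.000316

# The two expected scores always sum to 1
assert abs(expected_score(1000, 2400) + expected_score(2400, 1000) - 1) < 1e-12
```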

Characterization of pmf and pdf

Theorem

Let f \colon \mathbb{R} \to \mathbb{R}. Then f is the pmf or pdf of a RV X iff

  1. f(x) \geq 0 for all x \in \mathbb{R}
  2. \sum_{x=-\infty}^\infty f(x) = 1 \,\,\, (pmf) \quad or \quad \int_{-\infty}^\infty f(x) \, dx = 1\,\,\, (pdf)

In the above setting:

  • The RV X has distribution P(X = x) = f(x) \,\,\, \text{ (pmf) } \quad \text{ or } \quad P(a \leq X \leq b) = \int_a^b f(t) \, dt \,\,\, \text{ (pdf)}

  • The symbol X \sim f denotes that X has distribution f

Summary - Random Variables

  • Given probability space (\Omega, \mathcal{B}, P) and a Random Variable X \colon \Omega \to \mathbb{R}

  • Cumulative Distribution Function (cdf): F_X(x) := P(X \leq x)

Discrete RV Continuous RV
F_X has jumps F_X is continuous
Probability Mass Function (pmf) Probability Density Function (pdf)
f_X(x) := P(X=x) f_X(x) := F_X'(x)
f_X \geq 0 f_X \geq 0
\sum_{x=-\infty}^\infty f_X(x) = 1 \int_{-\infty}^\infty f_X(x) \, dx = 1
F_X (x) = \sum_{k=-\infty}^x f_X(k) F_X (x) = \int_{-\infty}^x f_X(t) \, dt
P(a \leq X \leq b) = \sum_{k = a}^{b} f_X(k) P(a \leq X \leq b) = \int_a^b f_X(t) \, dt

Part 3:
Expected value

Functions of Random Variables

  • X \colon \Omega \to \mathbb{R} random variable and g \colon \mathbb{R} \to \mathbb{R} function
  • Then Y:=g(X) \colon \Omega \to \mathbb{R} is a random variable
  • For A \subset \mathbb{R} we define the pre-image g^{-1}(A) := \{ x \in \mathbb{R} \colon g(x) \in A \}
  • For A=\{y\} single element set we denote g^{-1}(\{y\}) = g^{-1}(y) = \{ x \in \mathbb{R} \colon g(x) = y \}
  • The distribution of Y is P(Y \in A) = P(g(X) \in A ) = P(X \in g^{-1}(A))

Functions of Random Variables

Question: What is the relationship between f_X and f_Y?

  • X discrete: Then Y is discrete and f_Y (y) = P(Y = y) = \sum_{x \in g^{-1}(y)} P(X=x) = \sum_{x \in g^{-1}(y)} f_X(x)

  • X and Y continuous: Then \begin{align*} F_Y(y) & = P(Y \leq y) = P(g(X) \leq y) \\ & = P(\{ x \in \mathbb{R} \colon g(x) \leq y \} ) = \int_{\{ x \in \mathbb{R} \colon g(x) \leq y \}} f_X(t) \, dt \end{align*}

Functions of Random Variables

Issue: The below set may be tricky to compute \{ x \in \mathbb{R} \colon g(x) \leq y \}

However it can be easily computed if g is strictly monotone:

  • g strictly increasing: Meaning that x_1 < x_2 \quad \implies \quad g(x_1) < g(x_2)

  • g strictly decreasing: Meaning that x_1 < x_2 \quad \implies \quad g(x_1) > g(x_2)

  • In both cases g is invertible

Functions of Random Variables

Let g be strictly increasing:

  • Then \{ x \in \mathbb{R} \colon g(x) \leq y \} = \{ x \in \mathbb{R} \colon x \leq g^{-1}(y) \}

  • Therefore \begin{align*} F_Y(y) & = \int_{\{ x \in \mathbb{R} \colon g(x) \leq y \}} f_X(t) \, dt = \int_{\{ x \in \mathbb{R} \colon x \leq g^{-1}(y) \}} f_X(t) \, dt \\ & = \int_{-\infty}^{g^{-1}(y)} f_X(t) \, dt = F_X(g^{-1}(y)) \end{align*}

Functions of Random Variables

Let g be strictly decreasing:

  • Then \{ x \in \mathbb{R} \colon g(x) \leq y \} = \{ x \in \mathbb{R} \colon x \geq g^{-1}(y) \}

  • Therefore \begin{align*} F_Y(y) & = \int_{\{ x \in \mathbb{R} \colon g(x) \leq y \}} f_X(t) \, dt = \int_{\{ x \in \mathbb{R} \colon x \geq g^{-1}(y) \}} f_X(t) \, dt \\ & = \int_{g^{-1}(y)}^{\infty} f_X(t) \, dt = 1 - \int_{-\infty}^{g^{-1}(y)}f_X(t) \, dt \\ & = 1 - F_X(g^{-1}(y)) \end{align*}

Summary - Functions of Random Variables

  • X discrete: Then Y is discrete and f_Y (y) = \sum_{x \in g^{-1}(y)} f_X(x)

  • X and Y continuous: Then F_Y(y) = \int_{\{ x \in \mathbb{R} \colon g(x) \leq y \}} f_X(t) \, dt

  • X and Y continuous and

    • g strictly increasing: F_Y(y) = F_X(g^{-1}(y))
    • g strictly decreasing: F_Y(y) = 1 - F_X(g^{-1}(y))
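The increasing case F_Y(y) = F_X(g^{-1}(y)) can be illustrated numerically, taking X logistic (cdf from the earlier example) and g(x) = e^x as an assumed choice:

```python
import math

def F_X(x):
    # cdf of X ~ logistic (from the earlier example)
    return 1 / (1 + math.exp(-x))

def F_Y(y):
    # g(x) = e^x is strictly increasing with g^{-1}(y) = log y,
    # so F_Y(y) = F_X(log y) for y > 0
    return F_X(math.log(y))

# Deterministic check: push a fine grid of quantiles through g
n = 100_000
u = [(i + 0.5) / n for i in range(n)]
x_vals = [math.log(t / (1 - t)) for t in u]   # logistic quantile function
y_vals = [math.exp(x) for x in x_vals]        # apply g

empirical = sum(y <= 2.0 for y in y_vals) / n
assert abs(empirical - F_Y(2.0)) < 1e-3       # F_Y(2) = 2/3
```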

Expected Value

Expected value is the average value of a random variable

Definition

X rv and g \colon \mathbb{R} \to \mathbb{R} function. The expected value or mean of g(X) is {\rm I\kern-.3em E}[g(X)]

  • If X discrete {\rm I\kern-.3em E}[g(X)]:= \sum_{x \in \mathbb{R}} g(x) f_X(x) = \sum_{x \in \mathbb{R}} g(x) P(X = x)

  • If X continuous {\rm I\kern-.3em E}[g(X)]:= \int_{-\infty}^{\infty} g(x) f_X(x) \, dx

Expected Value

Properties

In particular we have

  • If X discrete {\rm I\kern-.3em E}[X] = \sum_{x \in \mathbb{R}} x f_X(x) = \sum_{x \in \mathbb{R}} x P(X = x)

  • If X continuous {\rm I\kern-.3em E}[X] = \int_{-\infty}^{\infty} x f_X(x) \, dx

Expected Value

Properties

Theorem
X rv, g,h \colon \mathbb{R}\to \mathbb{R} functions and a,b,c \in \mathbb{R}. The expected value is linear \begin{equation} \tag{1} {\rm I\kern-.3em E}[a g(X) + b h(X) + c] = a{\rm I\kern-.3em E}[g(X)] + b {\rm I\kern-.3em E}[h(X)] + c \end{equation} In particular \begin{align} \tag{2} {\rm I\kern-.3em E}[aX] & = a {\rm I\kern-.3em E}[X] \\ {\rm I\kern-.3em E}[c] & = c \tag{3} \end{align}

Expected Value

Proof of Theorem

  • Equation (2) follows from (1) by setting g(x)=x and b=c=0

  • Equation (3) follows from (1) by setting a=b=0

  • To show (1), suppose X is continuous and set p(x):=ag(x)+bh(x)+c \begin{align*} {\rm I\kern-.3em E}[ag(X) + & b h(X) + c] = {\rm I\kern-.3em E}[p(X)] = \int_{\mathbb{R}} p(x) f_X(x) \, dx \\ & = \int_{\mathbb{R}} (ag(x) + bh(x) + c) f_X(x) \, dx \\ & = a\int_{\mathbb{R}} g(x) f_X(x) \, dx + b\int_{\mathbb{R}} h(x) f_X(x) \, dx + c\int_{\mathbb{R}} f_X(x) \, dx \\ & = a {\rm I\kern-.3em E}[g(X)] + b {\rm I\kern-.3em E}[h(X)] + c \end{align*}

  • If X is discrete just replace integrals with series in the above argument

Expected Value

Further Properties

Below are further properties of {\rm I\kern-.3em E}, which we do not prove

Theorem

Suppose X and Y are rv. The expected value is:

  • Monotone: X \leq Y \quad \implies \quad {\rm I\kern-.3em E}[X] \leq {\rm I\kern-.3em E}[Y]

  • Non-degenerate: {\rm I\kern-.3em E}[|X|] = 0 \quad \implies \quad X = 0

  • X=Y \quad \implies \quad {\rm I\kern-.3em E}[X]={\rm I\kern-.3em E}[Y]

Variance

Variance measures how much a rv X deviates from {\rm I\kern-.3em E}[X]

Definition: Variance
The variance of a random variable X is {\rm Var}[X]:= {\rm I\kern-.3em E}[(X - {\rm I\kern-.3em E}[X])^2]

Note:

  • {\rm Var}[X] = 0 \quad \implies \quad (X - {\rm I\kern-.3em E}[X])^2 = 0 \quad \implies \quad X = {\rm I\kern-.3em E}[X]
  • If {\rm Var}[X] is small then X is close to {\rm I\kern-.3em E}[X]
  • If {\rm Var}[X] is large then X is very variable

Variance

Equivalent formula

Proposition
{\rm Var}[X] = {\rm I\kern-.3em E}[X^2] - {\rm I\kern-.3em E}[X]^2

Proof: \begin{align*} {\rm Var}[X] & = {\rm I\kern-.3em E}[(X - {\rm I\kern-.3em E}[X])^2] \\ & = {\rm I\kern-.3em E}[X^2 - 2 X {\rm I\kern-.3em E}[X] + {\rm I\kern-.3em E}[X]^2] \\ & = {\rm I\kern-.3em E}[X^2] - {\rm I\kern-.3em E}[2 X {\rm I\kern-.3em E}[X]] + {\rm I\kern-.3em E}[ {\rm I\kern-.3em E}[X]^2] \\ & = {\rm I\kern-.3em E}[X^2] - 2 {\rm I\kern-.3em E}[X]^2 + {\rm I\kern-.3em E}[X]^2 \\ & = {\rm I\kern-.3em E}[X^2] - {\rm I\kern-.3em E}[X]^2 \end{align*}
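A numerical check of the shortcut formula on the three-coin-toss pmf:

```python
# Check Var[X] = E[X^2] - E[X]^2 on the three-coin-toss pmf
pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}

EX  = sum(x * p for x, p in pmf.items())            # E[X] = 1.5
EX2 = sum(x**2 * p for x, p in pmf.items())         # E[X^2] = 3.0
var_direct   = sum((x - EX)**2 * p for x, p in pmf.items())
var_shortcut = EX2 - EX**2

assert abs(var_direct - var_shortcut) < 1e-12       # both give 0.75
```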

Variance

Variance is quadratic

Proposition
X rv and a,b \in \mathbb{R}. Then {\rm Var}[a X + b] = a^2 {\rm Var}[X]

Proof: Using linearity of {\rm I\kern-.3em E} and the fact that {\rm I\kern-.3em E}[c]=c for constants: \begin{align*} {\rm Var}[a X + b] & = {\rm I\kern-.3em E}[ (aX + b)^2 ] - {\rm I\kern-.3em E}[ aX + b ]^2 \\ & = {\rm I\kern-.3em E}[ a^2X^2 + b^2 + 2abX ] - ( a{\rm I\kern-.3em E}[X] + b)^2 \\ & = a^2 {\rm I\kern-.3em E}[ X^2 ] + b^2 + 2ab {\rm I\kern-.3em E}[X] - a^2 {\rm I\kern-.3em E}[X]^2 - b^2 - 2ab {\rm I\kern-.3em E}[X] \\ & = a^2 ( {\rm I\kern-.3em E}[ X^2 ] - {\rm I\kern-.3em E}[ X ]^2 ) = a^2 {\rm Var}[X] \end{align*}

Variance

How to compute the Variance

We have {\rm Var}[X] = {\rm I\kern-.3em E}[X^2] - {\rm I\kern-.3em E}[X]^2

  • X discrete: {\rm I\kern-.3em E}[X] = \sum_{x \in \mathbb{R}} x f_X(x) \,, \qquad {\rm I\kern-.3em E}[X^2] = \sum_{x \in \mathbb{R}} x^2 f_X(x)

  • X continuous: {\rm I\kern-.3em E}[X] = \int_{-\infty}^\infty x f_X(x) \, dx \,, \qquad {\rm I\kern-.3em E}[X^2] = \int_{-\infty}^\infty x^2 f_X(x) \, dx

Example - Gamma distribution

Definition

The Gamma distribution with parameters \alpha,\beta>0 is f(x) := \frac{x^{\alpha-1} e^{-\beta{x}} \beta^{\alpha}}{\Gamma(\alpha)} \,, \quad x > 0 where \Gamma is the Gamma function \Gamma(a) :=\int_0^{\infty} x^{a-1} e^{-x} \, dx

Example - Gamma distribution

Definition

Properties of \Gamma:

  • The Gamma function coincides with the factorial on natural numbers \Gamma(n)=(n-1)! \,, \quad \forall \, n \in \mathbb{N}

  • More generally \Gamma(a)=(a-1)\Gamma(a-1) \,, \quad \forall \, a > 1

  • Definition of \Gamma implies normalization of the Gamma distribution: \int_0^{\infty} f(x) \,dx = \int_0^{\infty} \frac{x^{\alpha-1} e^{-\beta{x}} \beta^{\alpha}}{\Gamma(\alpha)} \, dx = 1

Example - Gamma distribution

Definition

X has Gamma distribution with parameters \alpha,\beta if

  • the pdf of X is f_X(x) = \begin{cases} \dfrac{x^{\alpha-1} e^{-\beta{x}} \beta^{\alpha}}{\Gamma(\alpha)} & \text{ if } x > 0 \\ 0 & \text{ if } x \leq 0 \end{cases}

  • In this case we write X \sim \Gamma(\alpha,\beta)

  • \alpha is shape parameter

  • \beta is rate parameter

Example - Gamma distribution

Plot

Plotting \Gamma(\alpha,\beta) for parameters (2,1) and (3,2)

Example - Gamma distribution

Expected value

Let X \sim \Gamma(\alpha,\beta). We have: \begin{align*} {\rm I\kern-.3em E}[X] & = \int_{-\infty}^\infty x f_X(x) \, dx \\ & = \int_0^\infty x \, \frac{x^{\alpha-1} e^{-\beta{x}} \beta^{\alpha}}{\Gamma(\alpha)} \, dx \\ & = \frac{ \beta^{\alpha} }{ \Gamma(\alpha) } \int_0^\infty x^{\alpha} e^{-\beta{x}} \, dx \end{align*}

Example - Gamma distribution

Expected value

Recall previous calculation: {\rm I\kern-.3em E}[X] = \frac{ \beta^{\alpha} }{ \Gamma(\alpha) } \int_0^\infty x^{\alpha} e^{-\beta{x}} \, dx Change variable y=\beta x and recall definition of \Gamma: \begin{align*} \int_0^\infty x^{\alpha} e^{-\beta{x}} \, dx & = \int_0^\infty \frac{1}{\beta^{\alpha}} (\beta x)^{\alpha} e^{-\beta{x}} \frac{1}{\beta} \, \beta \, dx \\ & = \frac{1}{\beta^{\alpha+1}} \int_0^\infty y^{\alpha} e^{-y} \, dy \\ & = \frac{1}{\beta^{\alpha+1}} \Gamma(\alpha+1) \end{align*}

Example - Gamma distribution

Expected value

Therefore \begin{align*} {\rm I\kern-.3em E}[X] & = \frac{ \beta^{\alpha} }{ \Gamma(\alpha) } \int_0^\infty x^{\alpha} e^{-\beta{x}} \, dx \\ & = \frac{ \beta^{\alpha} }{ \Gamma(\alpha) } \, \frac{1}{\beta^{\alpha+1}} \Gamma(\alpha+1) \\ & = \frac{\Gamma(\alpha+1)}{\beta \Gamma(\alpha)} \end{align*}

Recalling that \Gamma(\alpha+1)=\alpha \Gamma(\alpha): {\rm I\kern-.3em E}[X] = \frac{\Gamma(\alpha+1)}{\beta \Gamma(\alpha)} = \frac{\alpha}{\beta}

Example - Gamma distribution

Variance

We want to compute {\rm Var}[X] = {\rm I\kern-.3em E}[X^2] - {\rm I\kern-.3em E}[X]^2

  • We already have {\rm I\kern-.3em E}[X]
  • Need to compute {\rm I\kern-.3em E}[X^2]

Example - Gamma distribution

Variance

Proceeding similarly we have:

\begin{align*} {\rm I\kern-.3em E}[X^2] & = \int_{-\infty}^{\infty} x^2 f_X(x) \, dx \\ & = \int_{0}^{\infty} x^2 \, \frac{ x^{\alpha-1} \beta^{\alpha} e^{- \beta x} }{ \Gamma(\alpha) } \, dx \\ & = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_{0}^{\infty} x^{\alpha+1} e^{- \beta x} \, dx \end{align*}

Example - Gamma distribution

Variance

Recall previous calculation: {\rm I\kern-.3em E}[X^2] = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \int_{0}^{\infty} x^{\alpha+1} e^{- \beta x} \, dx Change variable y=\beta x and recall definition of \Gamma: \begin{align*} \int_0^\infty x^{\alpha+1} e^{-\beta{x}} \, dx & = \int_0^\infty \frac{1}{\beta^{\alpha+1}} (\beta x)^{\alpha + 1} e^{-\beta{x}} \frac{1}{\beta} \, \beta \, dx \\ & = \frac{1}{\beta^{\alpha+2}} \int_0^\infty y^{\alpha + 1 } e^{-y} \, dy \\ & = \frac{1}{\beta^{\alpha+2}} \Gamma(\alpha+2) \end{align*}

Example - Gamma distribution

Variance

Therefore {\rm I\kern-.3em E}[X^2] = \frac{ \beta^{\alpha} }{ \Gamma(\alpha) } \int_0^\infty x^{\alpha+1} e^{-\beta{x}} \, dx = \frac{ \beta^{\alpha} }{ \Gamma(\alpha) } \, \frac{1}{\beta^{\alpha+2}} \Gamma(\alpha+2) = \frac{\Gamma(\alpha+2)}{\beta^2 \Gamma(\alpha)} Now use following formula twice \Gamma(\alpha+1)=\alpha \Gamma(\alpha): \Gamma(\alpha+2)= (\alpha + 1) \Gamma(\alpha + 1) = (\alpha + 1) \alpha \Gamma(\alpha) Substituting we get {\rm I\kern-.3em E}[X^2] = \frac{\Gamma(\alpha+2)}{\beta^2 \Gamma(\alpha)} = \frac{(\alpha+1) \alpha}{\beta^2}

Example - Gamma distribution

Variance

Therefore {\rm I\kern-.3em E}[X] = \frac{\alpha}{\beta} \quad \qquad {\rm I\kern-.3em E}[X^2] = \frac{(\alpha+1) \alpha}{\beta^2} and the variance is \begin{align*} {\rm Var}[X] & = {\rm I\kern-.3em E}[X^2] - {\rm I\kern-.3em E}[X]^2 \\ & = \frac{(\alpha+1) \alpha}{\beta^2} - \frac{\alpha^2}{\beta^2} \\ & = \frac{\alpha}{\beta^2} \end{align*}
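Both moments can be verified by crude numerical integration of the Gamma pdf (a sketch, with assumed parameters \alpha = 3, \beta = 2):

```python
import math

alpha, beta = 3.0, 2.0   # assumed shape and rate for the check

def pdf(x):
    # Gamma(alpha, beta) density for x > 0
    return x**(alpha - 1) * math.exp(-beta * x) * beta**alpha / math.gamma(alpha)

# Crude Riemann sum on (0, 50]; the tail beyond 50 is negligible here
dx = 1e-4
m1 = m2 = 0.0
for i in range(1, 500_000):
    x = i * dx
    w = pdf(x) * dx
    m1 += x * w          # accumulates E[X]
    m2 += x * x * w      # accumulates E[X^2]

assert abs(m1 - alpha / beta) < 1e-3                # E[X] = 3/2
assert abs((m2 - m1**2) - alpha / beta**2) < 1e-3   # Var[X] = 3/4
```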

Part 4:
Bivariate random vectors

Univariate vs Bivariate vs Multivariate

  • Probability models seen so far only involve 1 random variable
    • These are called univariate models
  • We are also interested in probability models involving multiple variables:
    • Models with 2 random variables are called bivariate
    • Models with more than 2 random variables are called multivariate

Random vectors

Definition

Recall: a random variable is a measurable function X \colon \Omega \to \mathbb{R}\,, \quad \Omega \,\, \text{ sample space}

Definition

A random vector is a measurable function \mathbf{X}\colon \Omega \to \mathbb{R}^n. We say that

  • \mathbf{X} is univariate if n=1
  • \mathbf{X} is bivariate if n=2
  • \mathbf{X} is multivariate if n \geq 3

Random vectors

Notation

  • The components of a random vector \mathbf{X} are denoted by \mathbf{X}= (X_1, \ldots, X_n) with X_i \colon \Omega \to \mathbb{R} random variables

  • We denote a two-dimensional bivariate random vector by (X,Y) with X,Y \colon \Omega \to \mathbb{R} random variables

Discrete bivariate random vectors

Main definitions

Definition
The (bivariate) random vector (X,Y) is discrete if X and Y are discrete random variables

Definition
The joint probability mass function or joint pmf of a discrete random vector (X,Y) is the function f_{X,Y} \colon \mathbb{R}^2 \to \mathbb{R} defined by f_{X,Y}(x,y) := P(X=x, Y=y ) \,, \qquad \forall \, (x,y) \in \mathbb{R}^2

Notation: P(X=x, Y=y ) := P( \{X=x \} \cap \{ Y=y \})

Discrete bivariate random vectors

Computing probabilities

  • The joint pmf can be used to compute the probability of A \subset \mathbb{R}^2 \begin{align*} P((X,Y) \in A) & := P( \{ \omega \in \Omega \colon ( X(\omega), Y(\omega) ) \in A \} ) \\ & = \sum_{(x,y) \in A} f_{X,Y} (x,y) \end{align*}

  • In particular we obtain \sum_{(x,y) \in \mathbb{R}^2} f_{X,Y} (x,y) = 1

Discrete bivariate random vectors

Expected value

  • Suppose (X,Y) \colon \Omega \to \mathbb{R}^2 random vector and g \colon \mathbb{R}^2 \to \mathbb{R} function
  • Then g(X,Y) \colon \Omega \to \mathbb{R} is random variable

Definition
The expected value of the random variable g(X,Y) is {\rm I\kern-.3em E}[g(X,Y)] := \sum_{(x,y) \in \mathbb{R}^2} g(x,y) f_{X,Y}(x,y) = \sum_{(x,y) \in \mathbb{R}^2} g(x,y) P(X=x,Y=y)

Discrete bivariate random vectors

Marginals

Definition
Let (X,Y) be a discrete random vector. The marginal pmfs of X and Y are the functions f_X (x) := P(X = x) \quad \text{ and } \quad f_Y(y) := P(Y = y)

Note: The marginal pmfs of X and Y are just the usual pmfs of X and Y

Discrete bivariate random vectors

Marginals

Marginals of X and Y can be computed from the joint pmf f_{X,Y}

Theorem
Let (X,Y) be a discrete random vector with joint pmf f_{X,Y}. The marginal pmfs of X and Y are given by f_X(x) = \sum_{y \in \mathbb{R}} f_{X,Y}(x,y) \quad \text{ and } \quad f_Y(y) = \sum_{x \in \mathbb{R}} f_{X,Y}(x,y)

Example - Discrete random vector

Setting

  • Consider experiment of tossing two dice. Then sample space is \Omega = \{ (m,n) \colon m,n \in \{1,\ldots,6\} \} with m and n being the outcomes of first and second dice, respectively
  • We assume that the dice are fair and the two throws independent. Therefore P(\{(m,n)\})=1/36 for every (m,n) \in \Omega
  • Define the discrete random variables \begin{align*} X(m,n) & := m + n \quad & \text{ sum of the dice} \\ Y(m,n) & := | m - n| \quad & \text{ absolute difference of the dice} \end{align*}
  • For example X(3,3) = 3 + 3 = 6 \qquad \qquad Y(3,3) = |3 - 3| = 0

Example - Discrete random vector

Computing joint pmf

  • To compute joint pmf one needs to consider all the cases f_{X,Y}(x,y) = P(X=x,Y=y) \,, \quad (x,y) \in \mathbb{R}^2

  • For example X=4 and Y=0 is only obtained for (2,2). Hence f_{X,Y}(4,0) = P(X=4,Y=0) = P(\{(2,2)\}) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}

  • Similarly X=5 and Y=2 is only obtained for (4,1) and (1,4). Thus f_{X,Y}(5,2) = P(X=5,Y=2) = P(\{(4,1)\} \cup \{(1,4)\}) = \frac{1}{36} + \frac{1}{36} = \frac{1}{18}

Example - Discrete random vector

Computing joint pmf

  • f_{X,Y}(x,y)=0 for most of the pairs (x,y). Indeed f_{X,Y}(x,y)=0 if x \notin X(\Omega) \quad \text{ or } \quad y \notin Y(\Omega)

  • We have X(\Omega)=\{2,3,4,5,6,7,8,9,10,11,12\}

  • We have Y(\Omega)=\{0,1,2,3,4,5\}

  • Hence f_{X,Y} only needs to be computed for pairs (x,y) satisfying 2 \leq x \leq 12 \quad \text{ and } \quad 0 \leq y \leq 5

  • Within this range, other values will be zero. For example f_{X,Y}(3,0) = P(X=3,Y=0) = P(\emptyset) = 0

Example - Discrete random vector

Table of values of joint pmf

Below are all the values for f_{X,Y}. Empty entries correspond to f_{X,Y}(x,y) = 0

| f_{X,Y}(x,y) | x=2  | x=3  | x=4  | x=5  | x=6  | x=7  | x=8  | x=9  | x=10 | x=11 | x=12 |
|--------------|------|------|------|------|------|------|------|------|------|------|------|
| y=0          | 1/36 |      | 1/36 |      | 1/36 |      | 1/36 |      | 1/36 |      | 1/36 |
| y=1          |      | 1/18 |      | 1/18 |      | 1/18 |      | 1/18 |      | 1/18 |      |
| y=2          |      |      | 1/18 |      | 1/18 |      | 1/18 |      | 1/18 |      |      |
| y=3          |      |      |      | 1/18 |      | 1/18 |      | 1/18 |      |      |      |
| y=4          |      |      |      |      | 1/18 |      | 1/18 |      |      |      |      |
| y=5          |      |      |      |      |      | 1/18 |      |      |      |      |      |

Example - Discrete random vector

Expected value

  • We want to compute {\rm I\kern-.3em E}[XY]
  • Hence consider the function g(x,y):=xy
  • We obtain \begin{align*} {\rm I\kern-.3em E}[XY] & = {\rm I\kern-.3em E}[g(X,Y)] \\ & = \sum_{(x,y) \in \mathbb{R}^2} g(x,y) f_{X,Y}(x,y)\\ & = \sum_{(x,y) \in \mathbb{R}^2} xy f_{X,Y}(x,y) \end{align*}

Example - Discrete random vector

Expected value

We can use the non-zero entries in the table for f_{X,Y} to compute: \begin{align*} {\rm I\kern-.3em E}[XY] & = 3 \cdot 1 \cdot \frac{1}{18} + 5 \cdot 1 \cdot \frac{1}{18} + 7 \cdot 1 \cdot \frac{1}{18} + 9 \cdot 1 \cdot \frac{1}{18} + 11 \cdot 1 \cdot \frac{1}{18} \\ & + 4 \cdot 2 \cdot \frac{1}{18} + 6 \cdot 2 \cdot \frac{1}{18} + 8 \cdot 2 \cdot \frac{1}{18} + 10\cdot 2 \cdot \frac{1}{18} \\ & + 5 \cdot 3 \cdot \frac{1}{18} + 7 \cdot 3 \cdot \frac{1}{18} + 9 \cdot 3 \cdot \frac{1}{18} \\ & + 6 \cdot 4 \cdot \frac{1}{18} + 8 \cdot 4 \cdot \frac{1}{18} \\ & + 7 \cdot 5 \cdot \frac{1}{18} \\ & = (35 + 56 + 63 + 56 + 35 ) \frac{1}{18} = \frac{245}{18} \end{align*}

Example - Discrete random vector

Marginals

  • We want to compute the marginal of Y via the formula f_Y(y) = \sum_{x \in \mathbb{R}} f_{X,Y}(x,y)

  • Again looking at the table for f_{X,Y}, we get \begin{align*} f_Y(0) & = f_{X,Y}(2,0) + f_{X,Y}(4,0) + f_{X,Y}(6,0) \\ & + f_{X,Y}(8,0) + f_{X,Y}(10,0) + f_{X,Y}(12,0) \\ & = 6 \cdot \frac{1}{36} = \frac{3}{18} \end{align*}

Example - Discrete random vector

Marginals

  • Similarly, we get \begin{align*} f_Y(1) & = f_{X,Y}(3,1) + f_{X,Y}(5,1) + f_{X,Y}(7,1) \\ & + f_{X,Y}(9,1) + f_{X,Y}(11,1) \\ & = 5 \cdot \frac{1}{18} = \frac{5}{18} \end{align*}
  • And the remaining values follow a similar pattern: f_Y(2) = \frac{4}{18} \,, \quad f_Y(3) = \frac{3}{18} \,, \quad f_Y(4) = \frac{2}{18} \,, \quad f_Y(5) = \frac{1}{18}

Example - Discrete random vector

Marginals

Hence the pmf of Y is given by the table below

| y      | 0             | 1             | 2             | 3             | 4             | 5             |
|--------|---------------|---------------|---------------|---------------|---------------|---------------|
| f_Y(y) | \frac{3}{18}  | \frac{5}{18}  | \frac{4}{18}  | \frac{3}{18}  | \frac{2}{18}  | \frac{1}{18}  |

Note that f_Y is indeed a pmf, since \sum_{y \in \mathbb{R}} f_Y(y) = \sum_{y=0}^5 f_Y(y) = 1
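The whole example can be checked by exhaustive enumeration. The sketch below, using exact rational arithmetic, rebuilds the joint pmf from the 36 outcomes and recovers {\rm I\kern-.3em E}[XY] = 245/18 and the marginal of Y:

```python
from fractions import Fraction
from itertools import product

# Dice example: X = sum, Y = absolute difference, 36 equally likely outcomes.
pmf = {}  # joint pmf f_{X,Y} as a dictionary {(x, y): probability}
for m, n in product(range(1, 7), repeat=2):
    key = (m + n, abs(m - n))
    pmf[key] = pmf.get(key, Fraction(0)) + Fraction(1, 36)

# E[XY] = sum over (x, y) of x * y * f_{X,Y}(x, y)
E_XY = sum(x * y * p for (x, y), p in pmf.items())
print(E_XY)  # 245/18

# Marginal of Y: f_Y(y) = sum over x of f_{X,Y}(x, y)
f_Y = {}
for (x, y), p in pmf.items():
    f_Y[y] = f_Y.get(y, Fraction(0)) + p
print(f_Y[0], f_Y[1])  # 1/6 (= 3/18) and 5/18
```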

Continuous bivariate random vectors

Definition
The random vector (X,Y) is continuous if X and Y are continuous rv

Definition
The joint probability density function or joint pdf of a continuous random vector (X,Y) is a function f_{X,Y} \colon \mathbb{R}^2 \to \mathbb{R} s.t. P((X,Y) \in A) = \int_{A} f_{X,Y}(x,y) \, dxdy

  • \int_A is a double integral over A, like the ones you saw in Calculus
  • The joint pdf is defined over the whole \mathbb{R}^2

Continuous bivariate random vectors

Expected value

  • Suppose (X,Y) \colon \Omega \to \mathbb{R}^2 continuous random vector and g \colon \mathbb{R}^2 \to \mathbb{R} function
  • Then g(X,Y) \colon \Omega \to \mathbb{R} is random variable

Definition
The expected value of the random variable g(X,Y) is {\rm I\kern-.3em E}[g(X,Y)] := \int_{\mathbb{R}^2} g(x,y) f_{X,Y}(x,y) \, dxdy

Notation: The symbol \int_{\mathbb{R}^2} denotes the double integral \int_{-\infty}^\infty\int_{-\infty}^\infty

Continuous bivariate random vectors

Marginals

Definition
Let (X,Y) be a continuous random vector. The marginal pdfs of X and Y are functions f_X,f_Y \colon \mathbb{R}\to \mathbb{R} s.t. P(a \leq X \leq b) = \int_{a}^b f_X (x) \,dx \quad \text{ and } \quad P(a \leq Y \leq b) = \int_{a}^b f_Y (y) \,dy

Note: The marginal pdfs of X and Y are just the usual pdfs of X and Y

Continuous bivariate random vectors

Marginals

Marginals of X and Y can be computed from the joint pdf f_{X,Y}

Theorem
Let (X,Y) be a continuous random vector with joint pdf f_{X,Y}. The marginal pdfs of X and Y are given by f_X(x) = \int_{-\infty}^\infty f_{X,Y}(x,y) \,dy \quad \text{ and } \quad f_Y(y) = \int_{-\infty}^\infty f_{X,Y}(x,y) \, dx

Characterization of joint pmf and pdf

Theorem

Let f \colon \mathbb{R}^2 \to \mathbb{R}. Then f is joint pmf or joint pdf of a random vector (X,Y) iff

  1. f(x,y) \geq 0 for all (x,y) \in \mathbb{R}^2
  2. \sum_{(x,y) \in \mathbb{R}^2} f(x,y) = 1 \,\,\, (joint pmf) \quad or \quad \int_{\mathbb{R}^2} f(x,y) \,dxdy = 1 \,\,\, (joint pdf)

In the above setting:

  • The random vector (X,Y) has distribution
    • P(X=x,Y=y ) = f(x,y) \,\,\,\text{ (joint pmf)}
    • P((X,Y) \in A) = \int_A f (x,y) \, dxdy \,\,\, \text{ (joint pdf)}
  • The symbol (X,Y) \sim f denotes that (X,Y) has distribution f

Summary - Random Vectors

| (X,Y) discrete random vector | (X,Y) continuous random vector |
|------------------------------|--------------------------------|
| X and Y discrete | X and Y continuous |
| Joint pmf | Joint pdf |
| f_{X,Y}(x,y) := P(X=x,Y=y) | P((X,Y) \in A) = \int_A f_{X,Y}(x,y) \,dxdy |
| f_{X,Y} \geq 0 | f_{X,Y} \geq 0 |
| \sum_{(x,y)\in \mathbb{R}^2} f_{X,Y}(x,y)=1 | \int_{\mathbb{R}^2} f_{X,Y}(x,y) \, dxdy= 1 |
| Marginal pmfs | Marginal pdfs |
| f_X (x) := P(X=x) | P(a \leq X \leq b) = \int_a^b f_X(x) \,dx |
| f_Y (y) := P(Y=y) | P(a \leq Y \leq b) = \int_a^b f_Y(y) \,dy |
| f_X (x)=\sum_{y \in \mathbb{R}} f_{X,Y}(x,y) | f_X(x) = \int_{\mathbb{R}} f_{X,Y}(x,y) \,dy |
| f_Y (y)=\sum_{x \in \mathbb{R}} f_{X,Y}(x,y) | f_Y(y) = \int_{\mathbb{R}} f_{X,Y}(x,y) \,dx |

Linearity of Expected Value

Theorem
(X,Y) random vector, g,h \colon \mathbb{R}^2 \to \mathbb{R} functions and a,b,c \in \mathbb{R}. The expectation is linear: \begin{equation} \tag{1} {\rm I\kern-.3em E}( a g (X,Y) + b h(X,Y)+ c ) = a {\rm I\kern-.3em E}[g(X,Y)] + b {\rm I\kern-.3em E}[h(X,Y)] + c \end{equation} In particular \begin{equation} \tag{2} {\rm I\kern-.3em E}[a X + b Y] = a{\rm I\kern-.3em E}[X] + b{\rm I\kern-.3em E}[Y] \end{equation}

  • Proof of (1) follows by definition (see also argument in Slide 64)
  • Equation (2) follows from (1) by setting c=0 \,, \quad g(x,y)=x \,, \qquad h(x,y)=y

Part 5:
Conditional distributions

Conditional distributions - Discrete case

  • Suppose given a discrete random vector (X,Y)

  • It might happen that the event \{X=x\} depends on \{Y=y\}

  • If P(Y=y)>0 we can define the conditional probability P(X=x|Y=y) := \frac{P(X=x,Y=y)}{P(Y=y)} = \frac{f_{X,Y}(x,y)}{f_Y(y)} where f_{X,Y} is joint pmf of (X,Y) and f_Y the marginal pmf of Y

Conditional pmf

Definition

(X,Y) discrete random vector with joint pmf f_{X,Y} and marginal pmfs f_X, f_Y

  • For any x such that f_X(x)=P(X=x)>0 the conditional pmf of Y given that X=x is the function f(\cdot | x) defined by f(y|x) := P(Y=y|X=x) = \frac{f_{X,Y}(x,y)}{f_X(x)}

  • For any y such that f_Y(y)=P(Y=y)>0 the conditional pmf of X given that Y=y is the function f(\cdot | y) defined by f(x|y) := P(X=x|Y=y) =\frac{f_{X,Y}(x,y)}{f_Y(y)}

Conditional pmf

  • Conditional pmf f(y|x) is indeed a pmf:

    • f(y|x) \geq 0
    • \sum_{y} f(y|x) = \dfrac{\sum_{y} f_{X,Y}(x,y)}{f_X(x)} = \dfrac{f_X(x)}{f_X(x)} = 1
    • Hence there exists a discrete rv Z whose pmf is f(y|x)
    • This is true by the Theorem in Slide 53
  • Similar reasoning yields that also f(x|y) is a pmf

  • Notation: We will often write

    • X|Y to denote the distribution f(x|y)
    • Y|X to denote the distribution f(y|x)
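A small sketch illustrating the definition on the dice example from Part 4 (X = sum, Y = absolute difference): the conditional pmf f(y|x) at x=4 is obtained by rescaling the joint pmf by f_X(4), and it sums to 1:

```python
from fractions import Fraction
from itertools import product

# Conditional pmf f(y|x) = f_{X,Y}(x,y) / f_X(x), illustrated at x = 4.
pmf = {}  # joint pmf of (X, Y) = (sum, |difference|) for two fair dice
for m, n in product(range(1, 7), repeat=2):
    key = (m + n, abs(m - n))
    pmf[key] = pmf.get(key, Fraction(0)) + Fraction(1, 36)

x0 = 4
f_X_x0 = sum(p for (x, y), p in pmf.items() if x == x0)         # f_X(4) = 3/36
cond = {y: p / f_X_x0 for (x, y), p in pmf.items() if x == x0}  # f(y|4)

print(cond[0], cond[2])    # 1/3 and 2/3: X=4 comes from (2,2), (1,3), (3,1)
print(sum(cond.values()))  # 1, so f(.|4) is a genuine pmf
```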

Conditional distributions - Continuous case

  • In the discrete case we consider the conditional probability P(X=x|Y=y) = \frac{P(X=x,Y=y)}{P(Y=y)}
  • However when Y is continuous random variable we have P(Y=y) = 0 \quad \forall \, y \in \mathbb{R}
  • Question: How do we define conditional distributions in the continuous case?
  • Answer: By replacing pmfs with pdfs

Conditional pdf

Definition

(X,Y) continuous random vector with joint pdf f_{X,Y} and marginal pdfs f_X, f_Y

  • For any x such that f_X(x)>0 the conditional pdf of Y given that X=x is the function f(\cdot | x) defined by f(y|x) := \frac{f_{X,Y}(x,y)}{f_X(x)}

  • For any y such that f_Y(y)>0 the conditional pdf of X given that Y=y is the function f(\cdot | y) defined by f(x|y) := \frac{f_{X,Y}(x,y)}{f_Y(y)}

Conditional pdf

  • Conditional pdf f(y|x) is indeed a pdf:
    • f(y|x) \geq 0
    • \int_{y \in \mathbb{R}} f(y|x) \, dy = \dfrac{\int_{y \in \mathbb{R}} f_{X,Y}(x,y) \, dy}{f_X(x)} = \dfrac{f_X(x)}{f_X(x)} = 1
    • Hence there exists a continuous rv Z whose pdf is f(y|x)
    • This is true by the Theorem in Slide 53
  • Similar reasoning yields that also f(x|y) is a pdf

Conditional expectation

Definition
(X,Y) random vector and g \colon \mathbb{R}\to \mathbb{R} function. The conditional expectation of g(Y) given X=x is \begin{align*} {\rm I\kern-.3em E}[g(Y) | x] & := \sum_{y} g(y) f(y|x) \quad \text{ if } (X,Y) \text{ discrete} \\ {\rm I\kern-.3em E}[g(Y) | x] & := \int_{y \in \mathbb{R}} g(y) f(y|x) \, dy \quad \text{ if } (X,Y) \text{ continuous} \end{align*}

  • {\rm I\kern-.3em E}[g(Y) | x] is a real number for all x \in \mathbb{R}
  • {\rm I\kern-.3em E}[g(Y) | X] denotes the Random Variable h(X) where h(x):={\rm I\kern-.3em E}[g(Y) | x]

Conditional variance

Definition
(X,Y) random vector. The conditional variance of Y given X=x is {\rm Var}[Y | x] := {\rm I\kern-.3em E}[Y^2|x] - {\rm I\kern-.3em E}[Y|x]^2

  • {\rm Var}[Y | x] is a real number for all x \in \mathbb{R}
  • {\rm Var}[Y | X] denotes the Random Variable {\rm Var}[Y | X] := {\rm I\kern-.3em E}[Y^2|X] - {\rm I\kern-.3em E}[Y|X]^2

Example - Conditional distribution

  • Continuous random vector (X,Y) with joint pdf f_{X,Y}(x,y) := e^{-y} \,\, \text{ if } \,\, 0 < x < y \,, \quad f_{X,Y}(x,y) :=0 \,\, \text{ otherwise}

Example - Conditional distribution

  • We compute f_X, the marginal pdf of X:
    • If x \leq 0 then f_{X,Y}(x,y)=0. Therefore f_X(x) = \int_{-\infty}^\infty f_{X,Y}(x,y) \, dy = 0
    • If x > 0 then f_{X,Y}(x,y)=e^{-y} if y>x, and f_{X,Y}(x,y)=0 if y \leq x. Thus \begin{align*} f_X(x) & = \int_{-\infty}^\infty f_{X,Y}(x,y) \, dy = \int_{x}^\infty e^{-y} \, dy \\ & = - e^{-y} \bigg|_{y=x}^{y=\infty} = -e^{-\infty} + e^{-x} = e^{-x} \end{align*}

Example - Conditional distribution

  • The marginal pdf of X has then exponential distribution f_{X}(x) = \begin{cases} e^{-x} & \text{ if } x > 0 \\ 0 & \text{ if } x \leq 0 \end{cases}

Example - Conditional distribution

  • We now compute f(y|x), the conditional pdf of Y given X=x:
    • Note that f_X(x)>0 for all x>0
    • Hence assume fixed some x>0
    • If y>x we have f_{X,Y}(x,y)=e^{-y}. Hence f(y|x) := \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{e^{-y}}{e^{-x}} = e^{-(y-x)}
    • If y \leq x we have f_{X,Y}(x,y)=0. Hence f(y|x) := \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{0}{e^{-x}} = 0

Example - Conditional distribution

  • The conditional distribution Y|X is therefore exponential f(y|x) = \begin{cases} e^{-(y-x)} & \text{ if } y > x \\ 0 & \text{ if } y \leq x \end{cases}

  • The conditional expectation of Y given X=x is \begin{align*} {\rm I\kern-.3em E}[Y|x] & = \int_{-\infty}^\infty y f(y|x) \, dy = \int_{x}^\infty y e^{-(y-x)} \, dy \\ & = -(y+1) e^{-(y-x)} \bigg|_{x}^\infty = x + 1 \end{align*} where we integrated by parts

Example - Conditional distribution

  • Therefore conditional expectation of Y given X=x is {\rm I\kern-.3em E}[Y|x] = x + 1

  • This can also be interpreted as the random variable {\rm I\kern-.3em E}[Y|X] = X + 1

Example - Conditional distribution

  • The conditional second moment of Y given X=x is \begin{align*} {\rm I\kern-.3em E}[Y^2|x] & = \int_{-\infty}^\infty y^2 f(y|x) \, dy = \int_{x}^\infty y^2 e^{-(y-x)} \, dy \\ & = -(y^2+2y+2) e^{-(y-x)} \bigg|_{x}^\infty = x^2 + 2x + 2 \end{align*} where we integrated by parts

  • The conditional variance of Y given X=x is {\rm Var}[Y|x] = {\rm I\kern-.3em E}[Y^2|x] - {\rm I\kern-.3em E}[Y|x]^2 = x^2 + 2x + 2 - (x+1)^2 = 1

  • This can also be interpreted as the random variable {\rm Var}[Y|X] = 1

Conditional Expectation

A useful formula

Theorem
(X,Y) random vector. Then {\rm I\kern-.3em E}[X] = {\rm I\kern-.3em E}[ {\rm I\kern-.3em E}[X|Y] ]

Note: The above formula contains abuse of notation – {\rm I\kern-.3em E} has 3 meanings

  • First {\rm I\kern-.3em E} is with respect to the marginal of X
  • Second {\rm I\kern-.3em E} is with respect to the marginal of Y
  • Third {\rm I\kern-.3em E} is with respect to the conditional distribution X|Y

Conditional Expectation

Proof of Theorem

  • Suppose (X,Y) is continuous

  • Recall that {\rm I\kern-.3em E}[X|Y] denotes the random variable g(Y) with g(y):= {\rm I\kern-.3em E}[X|y] := \int_{\mathbb{R}} xf(x|y) \, dx

  • Also recall that by definition f_{X,Y}(x,y)= f(x|y)f_Y(y)

Conditional Expectation

Proof of Theorem

  • Therefore \begin{align*} {\rm I\kern-.3em E}[{\rm I\kern-.3em E}[X|Y]] & = {\rm I\kern-.3em E}[g(Y)] = \int_{\mathbb{R}} g(y) f_Y(y) \, dy \\ & = \int_{\mathbb{R}} \left( \int_{\mathbb{R}} xf(x|y) \, dx \right) f_Y(y)\, dy = \int_{\mathbb{R}^2} x f(x|y) f_Y(y) \, dx dy \\ & = \int_{\mathbb{R}^2} x f_{X,Y}(x,y) \, dx dy = \int_{\mathbb{R}} x \left( \int_{\mathbb{R}} f_{X,Y}(x,y)\, dy \right) \, dx \\ & = \int_{\mathbb{R}} x f_{X}(x) \, dx = {\rm I\kern-.3em E}[X] \end{align*}

  • If (X,Y) is discrete the result follows by replacing integrals with series

Conditional Expectation

Example - Application of the formula

  • Consider again the continuous random vector (X,Y) with joint pdf f_{X,Y}(x,y) := e^{-y} \,\, \text{ if } \,\, 0 < x < y \,, \quad f_{X,Y}(x,y) :=0 \,\, \text{ otherwise}

  • We have proven that {\rm I\kern-.3em E}[Y|X] = X + 1

  • We have also shown that f_X is exponential f_{X}(x) = \begin{cases} e^{-x} & \text{ if } x > 0 \\ 0 & \text{ if } x \leq 0 \end{cases}

Conditional Expectation

Example - Application of the formula

  • From the knowledge of f_X we can compute {\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[X] = \int_0^\infty x e^{-x} \, dx = -(x+1)e^{-x} \bigg|_{x=0}^{x=\infty} = 1

  • Using the Theorem we can compute {\rm I\kern-.3em E}[Y] without computing f_Y: \begin{align*} {\rm I\kern-.3em E}[Y] & = {\rm I\kern-.3em E}[ {\rm I\kern-.3em E}[Y|X] ] \\ & = {\rm I\kern-.3em E}[X + 1] \\ & = {\rm I\kern-.3em E}[X] + 1 \\ & = 1 + 1 = 2 \end{align*}
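The computation can be verified by simulation. One convenient way to sample from this joint pdf (an observation, not from the slides) is to take X \sim \text{Exp}(1) and Y = X + Z with Z \sim \text{Exp}(1) independent of X, since then the joint density of (X,Y) is e^{-x} e^{-(y-x)} = e^{-y} on 0 < x < y, matching the example:

```python
import numpy as np

# Monte Carlo check of E[Y] = E[E[Y|X]] = 2 for the joint pdf
# f_{X,Y}(x,y) = e^{-y} on 0 < x < y.
# Sampling trick: X ~ Exp(1), Y = X + Z with Z ~ Exp(1) independent.
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.exponential(1.0, n)
y = x + rng.exponential(1.0, n)

print(y.mean())        # close to E[Y] = 2
print((y - x).mean())  # close to 1, consistent with E[Y|X] = X + 1
print((y - x).var())   # close to Var[Y|X] = 1
```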

Part 6:
Independence

Independence of random variables

Intuition

  • In previous example: the conditional distribution of Y given X=x was f(y|x) = \begin{cases} e^{-(y-x)} & \text{ if } y > x \\ 0 & \text{ if } y \leq x \end{cases}

  • In particular f(y|x) depends on x

  • This means that knowledge of X gives information on Y

  • When X does not give any information on Y we say that X and Y are independent

Independence of random variables

Formal definition

Definition
(X,Y) random vector with joint pdf or pmf f_{X,Y} and marginal pdfs or pmfs f_X,f_Y. We say that X and Y are independent random variables if f_{X,Y}(x,y) = f_X(x)f_Y(y) \,, \quad \forall \, (x,y) \in \mathbb{R}^2

Independence of random variables

Conditional distributions and probabilities

If X and Y are independent then X gives no information on Y (and vice-versa):

  • Conditional distribution: Y|X is same as Y f(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{f_X(x)f_Y(y)}{f_X(x)} = f_Y(y)

  • Conditional probabilities: From the above we also obtain \begin{align*} P(Y \in A | x) & = \sum_{y \in A} f(y|x) = \sum_{y \in A} f_Y(y) = P(Y \in A) & \, \text{ discrete rv} \\ P(Y \in A | x) & = \int_{y \in A} f(y|x) \, dy = \int_{y \in A} f_Y(y) \, dy = P(Y \in A) & \, \text{ continuous rv} \end{align*}

Independence of random variables

Characterization of independence - Densities

Theorem

(X,Y) random vector with joint pdf or pmf f_{X,Y}. They are equivalent:

  • X and Y are independent random variables
  • There exist functions g(x) and h(y) such that f_{X,Y}(x,y) = g(x)h(y) \,, \quad \forall \, (x,y) \in \mathbb{R}^2

Independence of random variables

Consequences of independence

Theorem

Suppose X and Y are independent random variables. Then

  • For any A,B \subset \mathbb{R} we have P(X \in A, Y \in B) = P(X \in A) P(Y \in B)

  • Suppose g(x) is a function of (only) x, h(y) is a function of (only) y. Then {\rm I\kern-.3em E}[g(X)h(Y)] = {\rm I\kern-.3em E}[g(X)]{\rm I\kern-.3em E}[h(Y)]
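A sketch verifying the second statement exactly for two independent fair dice, with the (arbitrary) choices g(x) = x^2 and h(y) = y:

```python
from fractions import Fraction
from itertools import product

# Exact check of E[g(X)h(Y)] = E[g(X)] E[h(Y)] for independent X, Y:
# two independent fair dice, g(x) = x^2, h(y) = y.
g = lambda x: x ** 2
h = lambda y: y

# E[g(X)h(Y)] via the joint pmf (uniform on the 36 pairs)
E_gh = sum(Fraction(1, 36) * g(m) * h(n) for m, n in product(range(1, 7), repeat=2))
# E[g(X)] and E[h(Y)] via the marginals (uniform on {1,...,6})
E_g = sum(Fraction(1, 6) * g(m) for m in range(1, 7))
E_h = sum(Fraction(1, 6) * h(n) for n in range(1, 7))

print(E_gh)       # 637/12
print(E_g * E_h)  # (91/6) * (7/2) = 637/12
```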

Independence of random variables

Proof of Second Statement

  • Define the function p(x,y):=g(x)h(y). Then \begin{align*} {\rm I\kern-.3em E}[g(X)h(Y)] & = {\rm I\kern-.3em E}(p(X,Y)) = \int_{\mathbb{R}^2} p(x,y) f_{X,Y}(x,y) \, dxdy \\ & = \int_{\mathbb{R}^2} g(x)h(y) f_X(x) f_Y(y) \, dxdy \\ & = \left( \int_{-\infty}^\infty g(x) f_X(x) \, dx \right) \left( \int_{-\infty}^\infty h(y) f_Y(y) \, dy \right) \\ & = {\rm I\kern-.3em E}[g(X)] {\rm I\kern-.3em E}[h(Y)] \end{align*}

  • Proof in the discrete case is the same: replace integrals with series

Independence of random variables

Proof of First Statement

  • Define the product set A \times B :=\{ (x,y) \in \mathbb{R}^2 \colon x \in A , y \in B\}

  • Therefore we get \begin{align*} P(X \in A , Y \in B) & = \int_{A \times B} f_{X,Y}(x,y) \, dxdy \\ & = \int_{A \times B} f_X(x) f_Y(y) \, dxdy \\ & = \left(\int_{A} f_X(x) \, dx \right) \left(\int_{B} f_Y(y) \, dy \right) \\ & = P(X \in A) P(Y \in B) \end{align*}

Part 7:
Covariance and correlation

Relationship between Random Variables

Given two random variables X and Y we said that

  • X and Y are independent if f_{X,Y}(x,y) = f_X(x) f_Y(y)

  • In this case there is no relationship between X and Y

  • This is reflected in the conditional distributions: X|Y \sim X \qquad \qquad Y|X \sim Y

Relationship between Random Variables

If X and Y are not independent then there is a relationship between them

Question
How do we measure the strength of such dependence?

Answer: By introducing the notions of

  • Covariance
  • Correlation

Covariance

Definition

Notation: Given two rv X and Y we denote \begin{align*} & \mu_X := {\rm I\kern-.3em E}[X] \qquad & \mu_Y & := {\rm I\kern-.3em E}[Y] \\ & \sigma^2_X := {\rm Var}[X] \qquad & \sigma^2_Y & := {\rm Var}[Y] \end{align*}

Definition
The covariance of X and Y is the number {\rm Cov}(X,Y) := {\rm I\kern-.3em E}[ (X - \mu_X) (Y - \mu_Y) ]

Covariance

Remark

The sign of {\rm Cov}(X,Y) gives information about the relationship between X and Y:

  • If X is large, is Y likely to be small or large?
  • If X is small, is Y likely to be small or large?
  • Covariance encodes the above questions

Covariance

Remark

The sign of {\rm Cov}(X,Y) gives information about the relationship between X and Y

|                       | X small: \, X<\mu_X     | X large: \, X>\mu_X     |
|-----------------------|-------------------------|-------------------------|
| Y small: \, Y<\mu_Y   | (X-\mu_X)(Y-\mu_Y)>0    | (X-\mu_X)(Y-\mu_Y)<0    |
| Y large: \, Y>\mu_Y   | (X-\mu_X)(Y-\mu_Y)<0    | (X-\mu_X)(Y-\mu_Y)>0    |

|                       | X small: \, X<\mu_X  | X large: \, X>\mu_X  |
|-----------------------|----------------------|----------------------|
| Y small: \, Y<\mu_Y   | {\rm Cov}(X,Y)>0     | {\rm Cov}(X,Y)<0     |
| Y large: \, Y>\mu_Y   | {\rm Cov}(X,Y)<0     | {\rm Cov}(X,Y)>0     |

Covariance

Alternative Formula

Theorem
The covariance of X and Y can be computed via {\rm Cov}(X,Y) = {\rm I\kern-.3em E}[XY] - {\rm I\kern-.3em E}[X]{\rm I\kern-.3em E}[Y]

Covariance

Proof of Theorem

Using linearity of {\rm I\kern-.3em E} and the fact that {\rm I\kern-.3em E}[c]=c for c \in \mathbb{R}: \begin{align*} {\rm Cov}(X,Y) : & = {\rm I\kern-.3em E}[ \,\, (X - {\rm I\kern-.3em E}[X]) (Y - {\rm I\kern-.3em E}[Y]) \,\, ] \\ & = {\rm I\kern-.3em E}\left[ \,\, XY - X {\rm I\kern-.3em E}[Y] - Y {\rm I\kern-.3em E}[X] + {\rm I\kern-.3em E}[X]{\rm I\kern-.3em E}[Y] \,\, \right] \\ & = {\rm I\kern-.3em E}[XY] - {\rm I\kern-.3em E}[ X {\rm I\kern-.3em E}[Y] ] - {\rm I\kern-.3em E}[ Y {\rm I\kern-.3em E}[X] ] + {\rm I\kern-.3em E}[{\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[Y]] \\ & = {\rm I\kern-.3em E}[XY] - {\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[Y] - {\rm I\kern-.3em E}[Y] {\rm I\kern-.3em E}[X] + {\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[Y] \\ & = {\rm I\kern-.3em E}[XY] - {\rm I\kern-.3em E}[X] {\rm I\kern-.3em E}[Y] \end{align*}

Correlation

Remark:

  • {\rm Cov}(X,Y) encodes only qualitative information about the relationship between X and Y

  • To obtain quantitative information we introduce the correlation

Definition
The correlation of X and Y is the number \rho_{XY} := \frac{{\rm Cov}(X,Y)}{\sigma_X \sigma_Y}

Correlation

Correlation detects linear relationships between X and Y

Theorem

For any random variables X and Y we have

  • - 1\leq \rho_{XY} \leq 1
  • |\rho_{XY}|=1 if and only if there exist a,b \in \mathbb{R}, with a \neq 0, such that Y = aX + b with probability 1
    • If \rho_{XY}=1 then a>0 \qquad \qquad \quad (positive linear correlation)
    • If \rho_{XY}=-1 then a<0 \qquad \qquad (negative linear correlation)

Proof: Omitted, see page 172 of [2]

Correlation & Covariance

Independent random variables

Theorem
If X and Y are independent random variables then {\rm Cov}(X,Y) = 0 \,, \qquad \rho_{XY}=0

Proof:

  • If X and Y are independent then {\rm I\kern-.3em E}[XY]={\rm I\kern-.3em E}[X]{\rm I\kern-.3em E}[Y]
  • Therefore {\rm Cov}(X,Y)= {\rm I\kern-.3em E}[XY]-{\rm I\kern-.3em E}[X]{\rm I\kern-.3em E}[Y] = 0
  • Moreover \rho_{XY}=0 by definition
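Warning: the converse is false, i.e. {\rm Cov}(X,Y) = 0 does not imply independence. The dice example from Part 4 (X = sum, Y = absolute difference) is such a case, as the following exact computation shows:

```python
from fractions import Fraction
from itertools import product

# Cov(X,Y) = 0 does NOT imply independence: dice example,
# X = sum, Y = absolute difference.
pmf = {}
for m, n in product(range(1, 7), repeat=2):
    key = (m + n, abs(m - n))
    pmf[key] = pmf.get(key, Fraction(0)) + Fraction(1, 36)

E_X = sum(x * p for (x, y), p in pmf.items())       # 7
E_Y = sum(y * p for (x, y), p in pmf.items())       # 35/18
E_XY = sum(x * y * p for (x, y), p in pmf.items())  # 245/18

cov = E_XY - E_X * E_Y
print(cov)  # 0, yet X and Y are dependent:

f_X_4 = sum(p for (x, y), p in pmf.items() if x == 4)  # f_X(4) = 3/36
f_Y_0 = sum(p for (x, y), p in pmf.items() if y == 0)  # f_Y(0) = 6/36
print(pmf[(4, 0)], f_X_4 * f_Y_0)  # 1/36 vs 1/72, so f_{X,Y} != f_X f_Y
```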

Formula for Variance

Variance is quadratic

Theorem
For any two random variables X and Y and a,b \in \mathbb{R} {\rm Var}[aX + bY] = a^2 {\rm Var}[X] + b^2 {\rm Var}[Y] + 2 ab \, {\rm Cov}(X,Y) If X and Y are independent then {\rm Var}[aX + bY] = a^2 {\rm Var}[X] + b^2 {\rm Var}[Y]

Proof: Exercise
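A Monte Carlo sanity check of the identity {\rm Var}[aX+bY] = a^2 {\rm Var}[X] + b^2 {\rm Var}[Y] + 2ab \, {\rm Cov}(X,Y), with the (arbitrary) choice Y = X + noise so that X and Y are dependent:

```python
import numpy as np

# Check Var[aX + bY] = a^2 Var[X] + b^2 Var[Y] + 2ab Cov(X,Y)
# on a dependent pair: Y = X + independent noise.
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 0.5, n)
a, b = 2.0, -3.0

lhs = np.var(a * x + b * y)
rhs = a**2 * np.var(x) + b**2 * np.var(y) + 2 * a * b * np.cov(x, y)[0, 1]
print(lhs, rhs)  # both close to Var[2X - 3Y] = Var[-X - 3Z] = 1 + 9/4 = 3.25
```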

Part 8:
Multivariate random vectors

Multivariate Random Vectors

Recall

  • A Random vector is a function \mathbf{X}\colon \Omega \to \mathbb{R}^n
  • \mathbf{X} is a multivariate random vector if n \geq 3
  • We denote the components of \mathbf{X} by \mathbf{X}= (X_1,\ldots,X_n) \,, \qquad X_i \colon \Omega \to \mathbb{R}
  • We denote the components of a point \mathbf{x}\in \mathbb{R}^n by \mathbf{x}= (x_1,\ldots,x_n)

Discrete and Continuous Multivariate Random Vectors

Everything we defined for bivariate vectors extends to multivariate vectors

Definition

The random vector \mathbf{X}\colon \Omega \to \mathbb{R}^n is:

  • continuous if the components X_i are continuous
  • discrete if components X_i are discrete

Joint pmf

Definition
The joint pmf of a discrete random vector \mathbf{X} is f_{\mathbf{X}} \colon \mathbb{R}^n \to \mathbb{R} defined by f_{\mathbf{X}} (\mathbf{x}) = f_{\mathbf{X}}(x_1,\ldots,x_n) := P(X_1 = x_1 , \ldots , X_n = x_n ) \,, \qquad \forall \, \mathbf{x}\in \mathbb{R}^n

Note: For all A \subset \mathbb{R}^n it holds P(\mathbf{X}\in A) = \sum_{\mathbf{x}\in A} f_{\mathbf{X}}(\mathbf{x})

Joint pdf

Definition
The joint pdf of a continuous random vector \mathbf{X} is a function f_{\mathbf{X}} \colon \mathbb{R}^n \to \mathbb{R} such that P (\mathbf{X}\in A) := \int_A f_{\mathbf{X}}(x_1 ,\ldots, x_n) \, dx_1 \ldots dx_n = \int_{A} f_{\mathbf{X}}(\mathbf{x}) \, d\mathbf{x}\,, \quad \forall \, A \subset \mathbb{R}^n

Note: \int_A denotes an n-fold integral over all points \mathbf{x}\in A

Expected Value

Definition
\mathbf{X}\colon \Omega \to \mathbb{R}^n random vector and g \colon \mathbb{R}^n \to \mathbb{R} function. The expected value of the random variable g(\mathbf{X}) is \begin{align*} {\rm I\kern-.3em E}[g(\mathbf{X})] & := \sum_{\mathbf{x} \in \mathbb{R}^n} g(\mathbf{x}) f_{\mathbf{X}} (\mathbf{x}) \qquad & (\mathbf{X}\text{ discrete}) \\ {\rm I\kern-.3em E}[g(\mathbf{X})] & := \int_{\mathbb{R}^n} g(\mathbf{x}) f_{\mathbf{X}} (\mathbf{x}) \, d\mathbf{x}\qquad & \qquad (\mathbf{X}\text{ continuous}) \end{align*}

Marginal distributions

  • Marginal pmf or pdf of any subset of the coordinates (X_1,\ldots,X_n) can be computed by integrating or summing the remaining coordinates

  • To ease notation, we only define marginals wrt the first k coordinates

Definition
The marginal pmf or marginal pdf of the random vector \mathbf{X} with respect to the first k coordinates is the function f \colon \mathbb{R}^k \to \mathbb{R} defined by \begin{align*} f(x_1,\ldots,x_k) & := \sum_{ (x_{k+1}, \ldots, x_n) \in \mathbb{R}^{n-k} } f_{\mathbf{X}} (x_1 , \ldots , x_n) \quad & (\mathbf{X}\text{ discrete}) \\ f(x_1,\ldots,x_k) & := \int_{\mathbb{R}^{n-k}}f_{\mathbf{X}} (x_1 , \ldots, x_n ) \, dx_{k+1} \ldots dx_{n} \quad & \quad (\mathbf{X}\text{ continuous}) \end{align*}

Marginal distributions

  • We use a special notation for marginal pmf or pdf wrt a single coordinate

Definition
The marginal pmf or marginal pdf of the random vector \mathbf{X} with respect to the i-th coordinate is the function f_{X_i} \colon \mathbb{R}\to \mathbb{R} defined by \begin{align*} f_{X_i}(x_i) & := \sum_{ \tilde{x} \in \mathbb{R}^{n-1} } f_{\mathbf{X}} (x_1, \ldots, x_n) \quad & (\mathbf{X}\text{ discrete}) \\ f_{X_i}(x_i) & := \int_{\mathbb{R}^{n-1}}f_{\mathbf{X}} (x_1, \ldots, x_n) \, d\tilde{x} \quad & \quad (\mathbf{X}\text{ continuous}) \end{align*} where \tilde{x} \in \mathbb{R}^{n-1} denotes the vector \mathbf{x} with i-th component removed \tilde{x} := (x_1, \ldots, x_{i-1}, x_{i+1},\ldots, x_n)

Conditional distributions

We now define conditional distributions given the first k coordinates

Definition
Let \mathbf{X} be a random vector and suppose that the marginal pmf or pdf wrt the first k coordinates satisfies f(x_1,\ldots,x_k) > 0 \,, \quad \forall \, (x_1,\ldots,x_k ) \in \mathbb{R}^k The conditional pmf or pdf of (X_{k+1},\ldots,X_n) given X_1 = x_1, \ldots , X_k = x_k is the function of (x_{k+1},\ldots,x_{n}) defined by f(x_{k+1},\ldots,x_n | x_1 , \ldots , x_k) := \frac{f_{\mathbf{X}}(x_1,\ldots,x_n)}{f(x_1,\ldots,x_k)}

Conditional distributions

Similarly, we can define the conditional distribution given the i-th coordinate

Definition
Let \mathbf{X} be a random vector and suppose that for a given x_i \in \mathbb{R} f_{X_i}(x_i) > 0 The conditional pmf or pdf of \tilde{X} given X_i = x_i is the function of \tilde{x} defined by f(\tilde{x} | x_i ) := \frac{f_{\mathbf{X}}(x_1,\ldots,x_n)}{f_{X_i}(x_i)} where we denote \tilde{X} := (X_1, \ldots, X_{i-1}, X_{i+1},\ldots, X_n) \,, \quad \tilde{x} := (x_1, \ldots, x_{i-1}, x_{i+1},\ldots, x_n)

Independence

Definition
\mathbf{X}=(X_1,\ldots,X_n) random vector with joint pmf or pdf f_{\mathbf{X}} and marginals f_{X_i}. We say that the random variables X_1,\ldots,X_n are mutually independent if f_{\mathbf{X}}(x_1,\ldots,x_n) = f_{X_1}(x_1) \cdot \ldots \cdot f_{X_n}(x_n) = \prod_{i=1}^n f_{X_i}(x_i)

Proposition
If X_1,\ldots,X_n are mutually independent then for all A_i \subset \mathbb{R} P(X_1 \in A_1 , \ldots , X_n \in A_n) = \prod_{i=1}^n P(X_i \in A_i)

Independence

Characterization result

Theorem

\mathbf{X}=(X_1,\ldots,X_n) random vector with joint pmf or pdf f_{\mathbf{X}}. They are equivalent:

  • The random variables X_1,\ldots,X_n are mutually independent
  • There exist functions g_i(x_i) such that f_{\mathbf{X}}(x_1,\ldots,x_n) = \prod_{i=1}^n g_{i}(x_i)

Independence

Expectation of product

Theorem
X_1,\ldots,X_n be mutually independent random variables and g_i(x_i) functions. Then {\rm I\kern-.3em E}[ g_1(X_1) \cdot \ldots \cdot g_n(X_n) ] = \prod_{i=1}^n {\rm I\kern-.3em E}[g_i(X_i)]

Independence

A very useful theorem

Theorem
X_1,\ldots,X_n be mutually independent random variables and g_i(x_i) function only of x_i. Then the random variables g_1(X_1) \,, \ldots \,, g_n(X_n) are mutually independent

Proof: Omitted. See [2] page 184

Example: X_1,\ldots,X_n \, independent \,\, \implies \,\, X_1^2, \ldots, X_n^2 \, independent

Expectation of sums

Expectation is linear

Theorem
For random variables X_1,\ldots,X_n and scalars a_1,\ldots,a_n we have {\rm I\kern-.3em E}[a_1X_1 + \ldots + a_nX_n] = a_1 {\rm I\kern-.3em E}[X_1] + \ldots + a_n {\rm I\kern-.3em E}[X_n]

Variance of sums

Variance is quadratic

Theorem
For random variables X_1,\ldots,X_n and scalars a_1,\ldots,a_n we have \begin{align*} {\rm Var}[a_1X_1 + \ldots + a_nX_n] = a_1^2 {\rm Var}[X_1] & + \ldots + a^2_n {\rm Var}[X_n] \\ & + 2 \sum_{i < j} a_i a_j {\rm Cov}(X_i,X_j) \end{align*} If X_1,\ldots,X_n are mutually independent then {\rm Var}[a_1X_1 + \ldots + a_nX_n] = a_1^2 {\rm Var}[X_1] + \ldots + a^2_n {\rm Var}[X_n]
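A sketch checking the independent case exactly for three independent fair dice (each with variance 35/12) and the arbitrary coefficients a = (1, 2, 3):

```python
from fractions import Fraction
from itertools import product

# Exact check of Var[a1 X1 + a2 X2 + a3 X3] = sum_i a_i^2 Var[X_i]
# for three mutually independent fair dice, a = (1, 2, 3).
a = (1, 2, 3)
outcomes = list(product(range(1, 7), repeat=3))
p = Fraction(1, len(outcomes))  # each of the 216 outcomes has probability 1/216

s = [a[0] * x1 + a[1] * x2 + a[2] * x3 for (x1, x2, x3) in outcomes]
E_S = sum(p * v for v in s)
Var_S = sum(p * v**2 for v in s) - E_S**2

print(Var_S)                                      # 245/6
print(sum(ai**2 for ai in a) * Fraction(35, 12))  # (1+4+9) * 35/12 = 245/6
```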

References

[1]
Rosenthal, Jeffrey S., A first look at rigorous probability theory, Second Edition, World Scientific Publishing, 2006.
[2]
Casella, George, Berger, Roger L., Statistical inference, second edition, Brooks/Cole, 2002.