# Probability and Statistics Background

Statistics – A subject which most statisticians find difficult, but in which nearly all physicians are expert. – Stephen S. Senn

## Introduction

For us, we will regard probability theory as a way of logically reasoning about uncertainty. I realize that this is not a precise mathematical definition, but neither is ‘probability theory is the mathematics arising from studying non-negative numbers which add up to 1’, which is at least partially accurate.

Some additional material is covered elsewhere:
* Statistical inference.

To get well-grounded let’s begin with a sequence of definitions.

### First definitions

#### Definition

A probability space is a measure space $D$ with measure $P$ such that $P(D)=1$. The space $D$ is also sometimes called the sample space and the measurable subsets of $D$ are called events1.

#### Remark

The definition of probability space is sufficiently general to include lots of degenerate examples. For example, we can take any set $S$ and make it into a probability space by decreeing that the only measurable subsets are $S$ and $\emptyset$ with $P(S)=1$. Although we will try to make this explicit, we will almost always want that singleton sets, i.e., sets with just a single element, are measurable. When a probability space has this property and every measurable subset is a countable union of singleton sets we will call the probability space discrete.

#### Exercise

Make the positive integers into a discrete probability space where every point has non-zero probability.

#### Definition

The probability of an event $E$ is $P(E)=:\int_E dP$. For discrete probability spaces we can also write $\int_E dP=\sum_{x\in E} P(x)$2.

#### Construction

Given two probability spaces $D_1$ and $D_2$ with respective probability measures $P_1$ and $P_2$. We can define a probability space $D_1\times D_2$ by:

1. The underlying set is the cartesian product $D_1\times D_2$.
2. The measurable subsets are generated under countable unions and complements by the products sets $I_1\times I_2$, where $I_1\subseteq D_1$ and $I_2\subseteq D_2$ are measurable subsets.
3. The probability measure is determined by $P(I_1\times I_2)=P_1(I_1)\cdot P(I_2)$, where $I_1$ and $I_2$ are as in the previous statement.

#### Example

Suppose we have a fair coin that we flip twice. Each of the four possible outcomes $D=\{HH,HT,TH,TT\}$ are equally likely and form a discrete probability space such that $P(x)=1/4$ for all $x\in D$. The probability of the event $E$, where we get precisely one head, is $P(E)=P(HT)+P(TH)=1/2$.

#### Definition

A random variable $X\colon D\to T$ is a measurable function from a probability space $D$ to a measure space $T$.

We can associate to each such $X$ a probability measure $P_X$ on $T$ by assigning to each measurable subset $U\subset T$, $P_X(U)=P(X^{-1}(U))$. Indeed it is clear that $P_X(T)=1$ and that for the measure of a countable disjoint union is $$P_X(\coprod U_i)=P(X^{-1}(\coprod U_i))=P(\coprod(X^{-1}U_i))=\sum P(U_i).$$

#### Remark

There is an unfortunate clash in the language of probability theory and standard english usage. For example, imagine that we have a box with a single button on it and a numerical display. Every time we push the button the screen displays a number between 1 and 10. In common usage we say that these values are random if there is no way to know which number will appear on the screen every time we push the button.

It is important to know that mathematics/probability theory/statistics do not provide any such mechanism. There is no function whose values are “randomly” chosen given a particular input. In particular, mathematics does not provide a method of randomly choosing objects.

One should keep this in mind when talking about random variables. Random variables are not objects with random values; they are functions. The additional data that a random variable $X$ does define are numbers associated to the preimages $X^{-1}(I)$ (for measurable subsets $I$), which we can use to weight the values of $X$.

This can also be used to shed light on statistical mechanics, which uses probability theory to model situations arising in physics. The fact that such models have been extremely successful in the field of quantum mechanics does not necessarily mean there is something random, in the common usage sense, about the universe; we are not claiming that “God plays dice with the universe”. It is just that our best mathematical models for these phenomena are constructed using the language of probability theory.

Finally, we should remark that the closest mathematical object to a random number generator in the sense of english is a pseudorandom number generator. These are deterministic functions which output sequences of numbers which attempt to model our intuition of what a random number generator should produce. Although not truly random, these are heavily used in simulations and Monte Carlo methods.

#### Conventions

If we are regarding $\Bbb R$ as a measure space and do not specify an alternative measure, we will mean that it is equipped with its standard Borel measurable subsets and the Borel measure3 $E\mapsto \int_E dx$. If we regard a discrete finite set $S$ or any interval $[a,b]$ (with $a<b$) as a probability space and do not specify the measure then we will mean that it is equipped with a uniform measure. In other words, $P(s)=1/|S|$ for all $s\in S$ and for all measurable $E\subset I$ we have $P(E)=P_{\Bbb R}(E)/(b-a)$.

#### Remarks4:

If the measure $P_X$ from the previous definition is absolutely continuous with respect to the standard Borel measure (i.e., the preimage of every measure 0 set with respect to the standard Borel measure is of measure 0), then there is a measurable function $dP_X/dx \colon T\to \Bbb R$ such that for all measurable $E\subset T$, $$P_X(E) := \int_{X^{-1} E} dP := \int_E dP_X = \int_{E} \frac{dP_X}{dx} dx.$$ All of these integrals are Lebesgue integrals.

The measurable function $dP_X/dx$ is called a Radon-Nikodym derivative and any two such derivatives disagree on a set of measure 0, i.e., they agree almost everywhere. Without the absolute continuity hypothesis there is only a distribution satisfying this property. Having a measure defined in such a way obvious implies absolute continuity, so the first sentence can be formulated as an if and only if statement. This is the Radon-Nikodym theorem.

#### Definition

For a discrete probability space $D$ the function $p\colon D\to [0,1]$, defined by $d\mapsto p(d):=P(d)$ is called the probability mass function (PMF) of $D$.

Note that the measure $P$ on $D$ is uniquely determined by the associated probability mass function.

#### Definition

Suppose that $\Bbb R$ is equipped with a probability measure $P$ and the cumulative distribution function (cdf) $F(a)=P(x\leq a)$ is a continuously differentiable function of $a$, then $F(x)=\int_{-\infty}^x F'(x) dx$ and $F'(x)$ is called the probability density function (pdf) of $F$ (or $P$).

Note that the probability measure $P$ is determined by $F$ and hence the probability density function $F’$. This can lead to some confusing abuses of language.

#### Example

Let $D$ be the probability space from the first example. Let $X\colon D\to \mathbb{R}$ be the random variable which counts the number of occurrences of heads in a given event. Then the cumulative density function of $P_X$ is $F_X(x)=0$ if $x<0$, $F_X(x)=1/4$ if $0\leq x < 1$, $F_X(x)=3/4$ if $1\leq x < 2$ and $F_X(x)=1$ if $x\geq 2$. This function is discontinuous and hence the probability density function is not defined5.

### Moments of distributions

Typically when we are handed a probability space $D$, we analyze it by constructing a random variable $X\colon D\to T$ where $T$ is either a countable subset of $\Bbb R$ or to $\Bbb R$. Using the procedure of the previous section we obtain a probability measure $P_X$ on $T$ and we now study this probability space. Usually a great deal of information is lost about $D$ during this process, but it allows us to focus our energies and work in the more tractable and explicit space $T\subset\Bbb R$.

So, we know focus on such probability spaces. This is usually decomposed into two cases, when $T$ is discrete (e.g., a subset of $\Bbb N$) and when $T$ is $\Bbb R$ (or some interval in $\Bbb R$). We could study the first case as a special case of the latter and just studying probability measures on $\Bbb R$, but that would require throwing in a lot of Dirac delta distributions at some point and I sense that you may not like that. We will seek a compromise and still use the integral notation to cover both cases although integrals in the discrete case can be expressed as sums.

There are two special properties of this situation that we will end up using:
1. It makes sense to multiply elements of $T$ with real valued functions.
2. There is a natural ordering on $T$ (so we can define a cdf).
3. We can now meaningfully compare random variables with values in $\Bbb R$ which are defined on different probability spaces, by comparing their associated probability measures on $\Bbb R$ (or their cdfs/pdfs when these exist).

For example the first property allows us to make sense of:

#### Definition

1. The expected value or mean of a random variable $X\colon D\to T\subset \Bbb R$, is $$\mu_X:=E(X)= \int_{x\in T} x\cdot dP_X = \int_{d\in D} X(d) dP.$$
2. Let $F_X$ denote the cdf of $X$. The median of $X$ is those $t\in \Bbb R$ such that $F_X(t)=0.5$.
3. Suppose that $X$ admits a pdf $f_X$. The modes of $X$ are those $t\in \Bbb R$ such that $f_X(t)$ is maximal.

#### Example

In our coin flipping example, the expected value of the random variable $X$ which counts the heads is $$\int_D X dP = \sum_{d\in D} X(d)p(d) = 2/4+1/4+1/4+0/4=1,$$
as expected.

The third property lets us make sense of:

#### Definition

Two random variables $X\colon D_1\to T$ and $Y\colon D_2\to T$ are identically distributed if they define the same probability measure on $T$, i.e., $P_X(I)=P_Y(I)$ for all measurable subsets $I\subseteq T$. In this case, we write $X\sim Y$.

#### Definition

We associate to two random variables $X,Y\colon D\to T$ a random variable $X\times Y\colon D \to T^2$ by $X\times Y(x)=(X(x),Y(x))$. This induces a probability measure $P_{X,Y}$ on $T^2.$ When $T=\Bbb R$ we can then define an associated joint cdf, $F_{X,Y}\colon \Bbb R^2\to [0,1]$ defined by $F_{X,Y}(a,b)=P_{X,Y}(x\leq a, y\leq b)$, which when $X\times Y$ is absolutely continuous with respect to the Lebesgue measure admits a joint pdf. Similarly, we can extend this to joint probability distributions of any number of random variables.

#### Definition

Two random variables $X,Y\colon D\to T$ with corresponding probability measures $P_X$ and $P_Y$ on $T$ are independent if the associated joint probability measure $P_{X,Y}$ on $T^2$ satisfies $P_{X,Y}(I_1\times I_2)=P_X(I_1)P_Y(I_2)$ for all measurable subsets $I_1, I_2\subseteq T$. When two variables are both independent and identically distributed then we abbreviate this to iid.

#### Definition

Suppose that $T$ is a probability space whose singleton sets are measurable. A random sample of size $n$ from $T$ will be any point of the associated product probability space $T^n$.

#### Exercise

1. Show that if $X$ and $Y$ are two $\Bbb R$ random variables then they are independent if and only if their joint cdf is the product of their individual cdfs.
2. Suppose that moreover $X$ and $Y$ and the joint distribution admit pdfs $p_X, p_Y,$ and $p_{X,Y}$ respectively, then show that the $f_{X,Y}=f_X f_Y$ if and only if the distributions are independent.

#### Definition

1. The $k$th moment of a random variable $X\colon D \to \Bbb R$ is $E(X^k)$.
2. The variance of a random variable $X\colon D\to \Bbb R$ is
$$\sigma_X^2=E((X-\mu_X)^2)=\int_D (X-\mu_X)^2 dP$$.
3. The standard deviation of $X$ is $\sigma_X=\sqrt{\sigma_X^2}$.
4. The covariance of a pair of random variables $X,Y\colon D\to \Bbb R$ is
$$Cov(X,Y) = E((X-\mu_X)(Y-\mu_Y)).$$
5. The correlation coefficient of a pair of random variables $X,Y\colon D\to \Bbb R$ is $$\frac{Cov(X,Y)}{\sigma_X \sigma_Y}.$$

#### Exercise

Suppose that $X$ and $Y$ are two independent $\Bbb R$-valued random variables with finite means $\mu_X$ and $\mu_Y$ and finite variances $\sigma^2_X$ and $\sigma^2_Y$ respectively.
1. Show that, for $a,b\in \Bbb R$ the mean of $aX+bY$ is $a\mu_X + b\mu_Y$.
1. Show that the variance of $aX+bY$ is $a^2 \sigma^2_X + b^2 \sigma^2_Y$.
1. Show that $E(XY)$ is $\mu_X\cdot \mu_Y$.
1. Show that $E(X^2)=\sigma_X^2+\mu_X^2$.

#### Definition

The characteristic function of a random variable $X\colon D\to \Bbb R$ is the complex function $$\varphi_X(t)=E(e^{itX})=\int_{x\in \Bbb R} e^{itx} dP_X = \int_{d\in D} e^{itX(d)} dP.$$

#### Remarks

1. The characteristic function is always defined (because we are integrating an absolutely bounded function over a finite measure space).
2. When $X$ admits a pdf $p_X$, then up to a reparametrization the characteristic function is the Fourier transform of $p_X$: $F(p_X)(X)=\varphi_X(-2\pi t)$.
3. Two random variables have the same characteristic functions if and only if they are identically distributed6.

## Some Important Results

1. The Law of Large Numbers, which essentially says that the average $S_n$ of a sum of $n$ iif random variables with finite mean $\mu$ “converges” to the common mean.
2. The Central Limit Theorem, which says that under the above hypotheses plus the assumption that the random variables have a finite variance $\sigma^2$, the random variable $\sqrt{n}(S_n-\mu)$ converges in distribution to the normal distribution with mean $0$ and variance $\sigma^2$. This result is the basis behind many normality assumptions and is critical to hypothesis testing which is used throughout the sciences.

## Conditional probability

Suppose we have a probability space $P\colon D\to [0,1]$ and two events $A,B\in D$. Then we write $P(A,B)=P(A\cap B)$. Suppose that $P(B)>0$, then define the conditional probability of $A$ given $B$ as $$P(A|B)=P(A,B)/P(B).$$ A similar definition is also given for the conditional pdf of two random variables $X$ and $Y$: $f_{X,Y}(x|y)=f_{X,Y}(x,y)/f_Y(y)$ where $f_Y(y)=\int_{x \in \Bbb R} f(x,y) dx$ is the marginal density.

#### Bayes Rule

Let $A$ be an event in a probability space $P\colon D\to [0,1]$ and that $\{B_i\}_{i=1}^n$ is a disjoint union of events which cover $D$ and all of these events have non-zero probability. Then


There is also the pdf form:
$$f_{X,Y}(x|y)=\frac{f_{X,Y}(y|x)f_X(x)}{\int f_{X,Y}(y|x) f_X(x) dx}.$$

The usefulness of Bayes rule is that it allows us to write a conditional probability that we do not understand (the dependence of $X$ on $Y$) in terms that we might understand (the dependence of $Y$ on $X$).

1. If you don’t know what a measurable subset is then you don’t know what a measure space is and you should consult the references. If you don’t consult the references and you just believe that all subsets of $D$ are measurable, it will take a long time to find out that you are wrong.
2. Here and elsewhere we will abuse notation and denote measure of a singleton set ${{x}}$ by $P(x)$.
3. For $\Bbb R^n$ and $n>1$ we should probably use the Lebesgue measure, which is a completion of the product Borel measure.
4. This material is quite advanced, so don’t worry if it goes over your head. The notation here is chosen so that in special cases where the [fundamental theorem of calculus] (https://en.wikipedia.org/wiki/Fundamental_theorem_of_calculus) applies the Radon-Nikodym derivative can be chosen to be an actual derivative.
5. Although we could define the “pdf” as a linear combination of Dirac delta distributions, but then it wouldn’t be a function (no matter what a physicist tells you).
6. For a proof see these notes