Supplementary material – Machine Learning for Mathies

Tensor Calculus

Introduction

I will assume that you have seen some calculus, including multivariable calculus. That is you know how to differentiate a differentiable function $f\colon \Bbb R \to \Bbb R$, to obtain a new function $$\frac{\partial f}{\partial x} \colon \Bbb R \to \Bbb R.$$ You also know how to differentiate a multivariable function $f\colon \Bbb R^m \to \Bbb R^n$, to obtain a map $$Jf\colon \Bbb R^m \to \mathrm{Hom}(\Bbb R^m, \Bbb R^n),$$ which takes a point $x\in \Bbb R^m$ to the Jacobian at $x$, which is an $(m,n)$-matrix; alternatively you can think of the Jacobian as a matrix of functions with the $(i,j)$-entry being $\frac{\partial f_i}{\partial x_j}\colon \Bbb R^m \to \Bbb R.$

The Jacobian is a convenient form for packaging the derivative of a vector valued function with respect to a list of variables arranged into a vector. Many of us learn the basic results (the product rule, the sum rule, the chain rule) very well for single variable calculus and, hopefully, pretty well for multivariable calculus. We may have less experience using these compact forms in practice.

While this knowledge is very useful for considering vector valued functions indexed by vectors of variables, it can be a bit tricky to differentiate matrix valued functions with respect to matrices or vectors of variables and get a useful compact form.

Potential objection

A bright student might observe that an element of a matrix is an element of $\Bbb R^m\otimes \Bbb R^n\cong \Bbb R^{mn}$ and hence we can always regard an $(m,n)$-matrix as an $mn$-vector¹. This is true, but we usually organize data into matrices because that is the most natural structure for them to have, e.g., the Jacobian above. Here are some similar naturally occurring structures:

We have a table with 100 entries. On each row is data from some sample, say temperature and pressure. This naturally fits into a $(100,2)$-matrix, whose rows are the samples and whose columns are the respective measurements. So we can recover each datum, by using two numbers: the row index and the column index.
We have a list of 100 grayscale images 640×480 images. To recover a datum (the darkness of a pixel in a sample) in this table, we would use three integers:
- the image number,
- the $x$-coordinate, and
- the $y$-coordinate.
  This is encoded as a tensor of shape $(100,640,480)$, i.e., an element of $\Bbb R^{100}\otimes \Bbb R^{640} \otimes \Bbb R^{480}$. While we could encode this as a $307200000$-vector, it would make it much more difficult to understand and meaningfully manipulate.
We have a list of 100 color (i.e., RGB) images 640×480 images. To recover a datum (the strength of a particular color (red/green/blue) of a pixel in a sample) in this table, we would use four integers:
- the image number,
- the $x$-coordinate,
- the $y$-coordinate, and
- which color we are interested in (e.g., red = 0, green = 1, blue = 2).
  This is encoded as a tensor of shape $(100,640,480,3)$, i.e., an element of $\Bbb R^{100}\otimes \Bbb R^{640} \otimes \Bbb R^{480}\otimes \Bbb R^3$.
We have a list of 100 color (i.e., RGB) videos each consisting of 18000 frames of 640×480 images. To recover a datum (the strength of a particular color (red/green/blue) of a pixel in a sample at a particular frame in the table) in this table, we would use five integers:
- the image number,
- the frame number,
- the $x$-coordinate,
- the $y$-coordinate, and
- which color we are interested in (e.g., red = 0, green = 1, blue = 2).
  This is encoded as a tensor of shape $(100,18000,640,480,3)$, i.e., an element of $\Bbb R^{100}\otimes \Bbb R^{18000}\otimes \Bbb R^{640} \otimes \Bbb R^{480}\otimes \Bbb R^3$. Again, it would be very unnatural to work with this as a single vector.

Definition

Suppose we have a function $F$ taking values in tensors of shape $(n_1,\cdots, n_\ell)$ which depends on variables indexed by a tensor $X$ of shape $(m_1,\cdots, m_k)$. The derivative $\frac{\partial F}{\partial X}$ is a function from $\Bbb R^{m_1}\otimes \cdots \otimes \Bbb R^{m_k}$ to tensors of shape $(n_1,\cdots,n_\ell,m_1,\cdots,m_k)$ whose $(i_1,\cdots, i_\ell, j_1, \cdots, j_k)$-entry is $\frac{\partial F_{i_1,\cdots, i_\ell}}{\partial X_{j_1,\cdots, j_k}}(-)$.

Some scandalous abuses of notation

We will typically abuse notation and let $n$-vectors also denote column vectors, i.e., tensors of shape $(n,1)$.

For convenience, many authors (including this one), will sometimes take tensors of the form $(\cdots, 1, \cdots)$ and regard them as tensors of the form $(\cdots,\cdots)$ (i.e., we omit the 1 term by ‘squeezing the tensor’. This abuse of notation is both convenient and confusing. Avoiding this abuse tends to cause ones to accumulate in the shape of our tensors and these one’s do not add any value to our interpretation. The best way to follow what is going on is to check it for yourself.

A less scandalous abuse of notation is to regularly add transposes to tensors of $(1,1)$ when convenient.

Some results

Since the only way to get used to tensor differentiation is to do some work with it yourself, we leave the following results as exercises.

Exercise

Suppose that $f\colon \Bbb R^m\to \Bbb R^n$ is a smooth function depending on an $m$-vector $X$ of variables. Show that $\frac{\partial f}{\partial X}=Jf$.
Suppose that $f\colon \Bbb R^m\to \Bbb R^n$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=Ax$ for an $(n,m)$-matrix $A$. Show that $\frac{\partial f}{\partial X}=A$ (a constant function).
Suppose that $f\colon \Bbb R^m\to \Bbb R^n$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=x^TA$ for an $(m,n)$-matrix $A$. Show that $\frac{\partial f}{\partial X}$ is a function valued in $(1,n,m)$-tensors whose $(1,i,j)$ entry is $(A^T)_{i,j}$.
Suppose that $f\colon \Bbb R^m\to \Bbb R$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=x^TA x$ for an $(m,m)$-matrix $A$. Show that $\frac{\partial
f}{\partial X}$ is a function valued in $(1,m)$-tensors, whose $(1,i)$-entry is $((A+A^T)x)_i$.
Suppose that $f\colon \Bbb R^m\to \Bbb R$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=(Ax-y)^T(Ax-y)$ for an $(m,m)$-matrix $A$ and a constant $(m,1)$-matrix $y$. Show that $\frac{\partial
f}{\partial X}$ is a function valued in $(1,m)$-tensors, whose entries give $2(Ax-y)^TA.$
Suppose that $f\colon \Bbb R^m\to \Bbb R^m$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=2(Ax-y)^TA$ for an $(m,m)$-matrix $A$ and a constant $(m,1)$-matrix $y$. Here we regard the function as taking values in $(m)$-tensors. Show that $\frac{\partial f}{\partial X}$ is the $(m,m)$-tensor $2A^TA$.
f}{\partial X}$ is a function valued in $(1,m)$-tensors, whose entries give $2(Ax-y)^TA.$

To simplify our notation, $\otimes := \otimes_{\Bbb R}$. ↩

Statistical Inference

All models are wrong, but some are useful. – George Box

Introduction

The general setup for statistical inference is that we are given some data $D$ which we assume arise as the values of a random variable that we assume is distributed according to some parametric model $m(\theta)$. The goal is to estimate $\theta$ in this circumstance and to give some measure of the certainty of our estimate. There are two schools of thought on this problem. One can assume that the true parameter value $\hat{\theta}$ is fixed and that we can evaluate the correctness of an estimate by running repeated trials (i.e., sampling $D$ from $m(\hat{\theta})$), this is the frequentist school. The other, Bayesian, school assumes that the data is given and that $\theta$ is actually a random variable.

For more in depth coverage of these approaches please consult:

Question

Consider the following standard situations:

We are trying to determine the percentage $\theta$ of likely-voters who will vote for a particular candidate in an election. We gather data by polling a sample from the population.
We are trying to determine the average standardized test scores $\theta$ of students taught with a new method. We gather data by teaching a test group of students and calculating the average scores of the students.
We are trying to predict the probability $\theta$ of rain tomorrow based on the meteorological data we have gathered.

Is there a “true” value of $\theta$? Or is the proposed $\theta$ we are trying to estimate subject to many additional factors which are difficult or impossible to specify? Is it possible to run repeated tests and somehow sample from the true population?

Sometimes is is said there is also a third school of thought, the “likelihoodists”. In practice, the likelihoodists try to find an optimal value of the likelihood $p(D|\theta)$ while the Bayesian’s typically try to find an optimal value of the posterior $p(\theta | D)$. These are both related by Bayes rule, but the choice of which expression to simplify and what to average over reflects what the user is allowing to vary¹. One feature of the Bayesian method is that they must specify a prior $p(\theta)$ and while there are guidelines for what should be a good choice, there is still a choice to be made and this makes some people unhappy. Both methods often yield similar results (in fact, when assuming a uniform prior, the Bayesian approach is the likelihood approach), but their interpretations differ because of the differing assumptions.

Bayesian methods in fact provide several different approaches to parameter estimation. The most common of which is the MAP estimate. While one could use any estimation method from either the Bayesian or frequentist approach, how the quality of an estimate is evaluated differs from the two approaches. For example, the frequentist perspective leads to the notion of an unbiased estimator, which is not a well-defined notion in the Bayesian approach. The frequentist and Bayesian schools generate (generally different) intervals which are used to quantify our confidence in a given estimate.

Example

To clarify the difference between the frequentist and Bayesian approaches, we suppose that we have to estimate a parameter $\theta$ of a model $m(\hat{\theta})$ given some data $D$. The frequentists produce an estimate $\theta_f=\delta_f(D)$ and the Bayesians produce an estimate $\theta_B$. We would like some measure of the uncertainty in our estimate. The frequentists will produce a 95% confidence interval: $$I_f(D)=\{l(D)\leq \theta_f=\delta_f(D) \leq u(D)\}.$$

A desirable interpretation of this interval is that the value of the true parameter $\hat{\theta}$ has a 95% chance of lying in this interval, however such a statement is meaningless from the frequentist approach: The true parameter $\hat{\theta}$ is fixed, so it either is or is not in $I_f$. Instead the confidence interval satisfies the following: if we were to keep producing datasets $D_1, D_2, \cdots$ sampled from $m(\hat{\theta})$ then $\hat{\theta}$ will lie in 95% of the confidence intervals $I_f(D_1), I_f(D_2), \cdots$. For a visualization of confidence intervals look here.

Since the Bayesians assume that $\hat{\theta}$ is in fact a random variable, they can make sense of the claim that there is a 95% chance of $\hat{\theta}$ lying in some given interval $I_B(D)$ given the data $D$. The Bayesians can produce such intervals and they are called 95% credible intervals.

The frequentist approach also leads to black swan phenomena. For example, suppose that we want to estimate the bias $\theta$ of a coin. The $\theta$ will correspond to the probability of obtaining a heads on one coin flip. If the observed data is two coin flips which each yield heads, then the maximum likelihood estimate for $\theta$ is 1. If we ask for a 99.999999% confidence interval for this measurement we will just obtain the one element set $\{1\}$. Having never seen a tails, this method constructs a model that assigns the probability of seeing tails to 0! Moreover, the model is extremely confident in this prediction even though a fair coin would produce this outcome 25% of the time. So the frequentist approach can lead to highly overconfident estimates.

On the other end of the spectrum, we consider the following example due to Berger. Suppose that two integers $D=(x_1, x_2)$ are drawn from a parametric distribution of the form:
$$ p(x|\theta) = \begin{cases}
0.5 & x = \theta\\
0.5 & x = \theta+1 \\
0 & x \not\in \{\theta, \theta+1\}
\end{cases}
$$
If $\theta = 5$, then we would expect of the following outcomes with equal probability $(5,5), (5,6), (6,5), (6,6)$. Let $m=\min(x_1,x_2)$ and consider the one element interval $[m,m]$. This confidence interval will contain $\theta$ for each of the previous samples except for $(6,6)$, so this will be a 75% confidence interval. But note that if $D=(5,6)$, then $P(\theta = 5 | D)=1.0$, so the model is underconfident about this estimate.

Frequentists are Bayesian (sort of)

In the Bayesian world view, we are studying a joint probability distribution on a data space $\Delta$ and a parameter space $\tau$. The frequentists object to the Bayesian specification of a prior $P(\theta)$ for two closely related reasons:

By specifying a prior, they are tipping the scales and expressing a bias for which they have no evidence.
Frequentists assume that the data is generated from one true model using a particular $\theta_*\in \tau$, so $\theta$ is not random.

We can point out that the frequentist assumption in the second reason above is equivalent to specifying a prior of a specific form. Namely, the frequentists assume that $P(\theta)$ is 1 when $\theta=\theta_*$ is the, unknown, true quantity and 0 otherwise (i.e., $\tau$ follows a Dirac-$\delta$ distribution). So the frequentist approach does fit into the Bayesian setup, they just use a very specific form for the prior which depends on an unknown and, in practice, unknowable quantity.

Unfortunately, this choice of a prior eliminates any possibility of using Bayesian inference (try to apply Bayes rule when using the frequentist prior and see for yourself). On the other hand, it does mean that $p(D |\theta_*)=p(D)$ and hence all witnessed data is generated from the only possible model, while for the Bayesians the data is pulled from all possible conditional distributions $p(D|\theta)$. The Bayesian response is that they don’t care: They simply generate the best value of $\theta$ that fits whatever data they have seen.

Finally, we should point out that in the presence of more and more data sampled from the type of parametric model we are, the Bayesian posterior $p(\theta | D)$ gradually approximates a Dirac-$\delta$ distribution and converges to the frequentist assumption.

It is worth pointing out at this point, that the expression $p(D|\theta)$ only makes sense if $p(\theta)>0$ (so the likelihoodists at least dip their toes into the water of Bayesian methods when constructing estimates). Similarly, $p(\theta | D)$ only makes sense if $p(D)>0$, in which case the Bayesians are also assuming the given data is generated by some random process. ↩

Additional Sources

Textbooks

Machine Learning: A probabilistic perspective by Kevin Murphy. The material in this book is closest to what we will cover in the course, but is unfortunately not available for free. Written by an academic and a practitioner of machine learning, this text is full of real world examples and applications, while eschewing the ad-hoc approach of other texts.
Introduction to statistical learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. This textbook is available for free. It has minimal prerequisites and as a consequence does not touch on some topics of interest. On the other hand, it has detailed R exercises which are wonderful.
The elements of statistical learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. A more advanced version of the previous text (and also free), covering more material, and assuming more background in statistics.
Deep learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. A new text on this burgeoning field. Written by central figures in the area, this text surveys deep learning, includes many useful citations (a great deal of the literature is still in academic articles), provides good rules of thumb, and is a relatively light read. This book is very much from a practitioner’s perspective with less justification for methodology and less emphasis on mathematical rigor. My limited experience is that other papers in this area have a similar emphasis, which reflects the limited understanding of why these methods work very well in some instances and not so well in others.
Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto. A free textbook that has been recently revised and covers quite a few topics.

Lecture series

Trevor Hastie and Robert Tibshirani have made their lectures and slides for their course on statistical learning available. These are very well done.
Machine Learning lectures from Georgia Tech, these are very well done introductory lectures on machine learning directed toward engineers. I personally love the nerdy jokes interspersed throughout.
Deep learning course lectures by Nando de Freitas. Excellent lecture series on deep learning from Oxford for computer scientists. This lecture series is aimed at those with some more background.
Machine learning course lectures by Andrew Ng. A somewhat older lecture series from Stanford aimed at computer scientists.
Machine Learning 101 by Jason Mayes. A very well done overview of the subject.

Websites

Datacamp: An online learning site with an emphasis on data science. Includes mini-lectures and online programming exercises. A nice introduction on how to take the material we are learning and apply it. However, there is a noticeable bump in difficulty in proceeding to implement these methods from scratch on your own computer.
Andrej Karpathy’s blog: A truly excellent sequence of blog posts on topics related to machine learning.
Kaggle: Hosts machine learning competitions, allows machine learning practitioners to share information, and hosts many large datasets.
Seeing Theory: A beautiful sequence of javascript visualizations for statistical concepts.

Programming references

Introduction to R.
Python 3 standard library.
Keras (a significantly simpler way to apply the Tensorflow (or another backend) library for building neural networks).
Tensorflow API (a rich library for building and using neural networks).

Computer Science Background

If you find that you’re spending almost all your time on theory, start turning some attention to practical things; it will improve your theories. If you find that you’re spending almost all your time on practice, start turning some attention to theoretical things; it will improve your practice. – Donald Knuth

Introduction

This section will contain some very basic computer science background. Since the target audience of this lecture series has little to no training in this area, I will attempt to fill in some gaps (admittedly very poorly).

Numerical analysis

It should come as no surprise that computers store data as sequences of 1’s and 0’s. For example, non-negative integers are represented by their base-2 expansions (e.g. 7 is represented by 111 and 8 by 1000). Since we can only allocate a finite amount of space for each integer (e.g., 16, 32, or 64 bits) we can only hold a finite range of integers (e.g., $-2^{63}\leq i \leq 2^{63}-1$) (see two’s complement). This means that one has to avoid applying operations that will take us out of this range. Arithmetic operations that generate values outside of their allowed range are said to overflow.

Problems with the finite representations of integers are dwarfed by the problems arising from the finite representation of real numbers. The (IEEE) standard way to finitely represent a real number is as a floating point number. The idea is to approximate a real number $x$ by a non-negative integer $\alpha$, an integer $\beta$, and a sign bit $\epsilon$ so that $x\approx (-1)^\epsilon \cdot \alpha \cdot 2^{\beta}$. Since, using a fixed number of bits, we can only represent finitely many integers, there are only finitely many such floating point numbers.

The finiteness of this representation leads to some obvious consequences as well as some non-obvious ones.

Since the exponent $\beta$ is bounded above, there is a bound for the range of real numbers that we can encode. This leads to the same type of overflow problems as for integers.
Since the exponent $\beta$ is bounded below, there is a smallest positive real number we can encode in this representation. This leads to a new set of problems called underflow. For example $e^{-n}=0$ for $n\gg 0$. As a consequence, $e^{-n}\cdot e^{n}\neq e^0=1$ for large $n$. If we instead expanded this expression from the middle then we obtain: $$e\cdot (e\cdot (\cdots(e\cdot e^{-1})\cdot e^{-1})\cdots e^{-1})=1.$$ So multiplication of floating point numbers fails to be associative.
Since $\alpha$ is a bounded positive integer, we can only encode a finite amount of precision. This implies that for certain (large) values of $m$ and $n$, $$\sum_{i=1}^{2^n} 2^{-n} + 2^m = 2^m+1,$$ if we apply the operations from left to right, while the sum is $2^m$ if we apply them from right to left. So addition of floating point numbers fails to be associative.
We can not encode fractions precisely so $1/n \cdot n\neq 1$ in general. On older calculators one can witness this first hand and see that $1/3\cdot 3“=”0.9999999999$.

Remark

I sometimes find it helpful to visualize all of the floating point numbers placed on the real line as a sequence of dots. These dots cluster heavily near 0 and gradually spread farther and farther apart as we move away from the origin until we meet the size bounds. Beyond these points there are no longer any dots.

Since both encodings of real numbers into floating point and floating point operations contribute errors¹, care needs to be taken in the design of floating point algorithms to ensure that these errors do not grow out of control, otherwise things may go seriously wrong).

Remark

In probability theory and, especially, statistical inference, it is very common to take iterated products of probabilities. Since many of these probabilities are very small, there is a high risk of numerical underflow as we calculate the product.

One way to avoid this underflow is to use the simple formula $$ \prod_{i=1}^n x_i = \exp(\sum_{i=1}^n \log x_i). $$ This will help avoid the numerical underflow for the product. To ensure that the sum of quantities of varying magnitude is as accurate as possible, we can first sort the values to be summed and then add them.

Algorithm Analysis

Computer scientists and engineers are very concerned with space and computational efficiency. Making fair comparisons between two implementations of algorithms can be tricky. Implementation details of essentially identical algorithms can result in significant changes in runtime performance and memory costs (e.g., consider a program written in python 2.6 versus one written python 3.0).

An easier comparison can be made about how the memory and runtime requirements of an algorithm change as we increase the size of the input. For example, the standard grade school algorithm for multiplying two $n\times n$ matrices requires $n^3$ multiplications and $n^2*(n-1)$ addition operations². What is important to realize about this algorithm is not the exact count of the operations or how much faster addition is compared to multiplication, it is, for example, that if we replace $n$ with $100n$ then we will require roughly $1 000 000$-times more operations!

At this scale the addition operations and their relative speed in comparison to multiplication do not matter as much as this observation. The dominant feature is that the time to multiply two $n\times n$ matrices grows cubically in $n$.

One way to encode the essential features of the growth of functions is with Big O-notation.

Definition

We say³ $f(n)= O(g(n))$ if there exists an $M>0$ such that for all sufficiently large $n$, $$|f(n)|\leq M|g(n)|.$$

For most big data tasks, any algorithm whose runtime is not $O(n \cdot \log n)$ requires some attention (especially attention to how one can parallelize the computation across a cluster). More importantly, once the data is too large to fit into a single computers memory, this entire approach to comparing algorithms breaks down (as do the standard implementations of the algorithms). In this case, one has to think very carefully about how to break up the task into separate pieces. This is all complicated by the fact that the time move data between computers is much slower than it is to move the data internally in computer memory.

One way to reduce the errors in floating point operations is to use more bits for the representations, do the operation, and then truncate. ↩
It turns out this is not asymptotically optimal. For example see here. ↩
I would like to, on the record, state my disapproval of the standard usage of the equal-sign in the expression $f(n)=O(g(n))$. I can see no reasonable interpretation of the equality symbol here.)https://en.wikiquote.org/wiki/Donald_Knuth ↩

Probability and Statistics Background

Statistics – A subject which most statisticians find difficult, but in which nearly all physicians are expert. – Stephen S. Senn

Introduction

For us, we will regard probability theory as a way of logically reasoning about uncertainty. I realize that this is not a precise mathematical definition, but neither is ‘probability theory is the mathematics arising from studying non-negative numbers which add up to 1’, which is at least partially accurate.

Some additional material is covered elsewhere:
* Statistical inference.

To get well-grounded let’s begin with a sequence of definitions.

First definitions

Definition

A probability space is a measure space $D$ with measure $P$ such that $P(D)=1$. The space $D$ is also sometimes called the sample space and the measurable subsets of $D$ are called events¹.

Remark

The definition of probability space is sufficiently general to include lots of degenerate examples. For example, we can take any set $S$ and make it into a probability space by decreeing that the only measurable subsets are $S$ and $\emptyset$ with $P(S)=1$. Although we will try to make this explicit, we will almost always want that singleton sets, i.e., sets with just a single element, are measurable. When a probability space has this property and every measurable subset is a countable union of singleton sets we will call the probability space discrete.

Exercise

Make the positive integers into a discrete probability space where every point has non-zero probability.

Definition

The probability of an event $E$ is $P(E)=:\int_E dP$. For discrete probability spaces we can also write $\int_E dP=\sum_{x\in E} P(x)$².

Construction

Given two probability spaces $D_1$ and $D_2$ with respective probability measures $P_1$ and $P_2$. We can define a probability space $D_1\times D_2$ by:

The underlying set is the cartesian product $D_1\times D_2$.
The measurable subsets are generated under countable unions and complements by the products sets $I_1\times I_2$, where $I_1\subseteq D_1$ and $I_2\subseteq D_2$ are measurable subsets.
The probability measure is determined by $P(I_1\times I_2)=P_1(I_1)\cdot P(I_2)$, where $I_1$ and $I_2$ are as in the previous statement.

Example

Suppose we have a fair coin that we flip twice. Each of the four possible outcomes $D=\{HH,HT,TH,TT\}$ are equally likely and form a discrete probability space such that $P(x)=1/4$ for all $x\in D$. The probability of the event $E$, where we get precisely one head, is $P(E)=P(HT)+P(TH)=1/2$.

Definition

A random variable $X\colon D\to T$ is a measurable function from a probability space $D$ to a measure space $T$.

We can associate to each such $X$ a probability measure $P_X$ on $T$ by assigning to each measurable subset $U\subset T$, $P_X(U)=P(X^{-1}(U))$. Indeed it is clear that $P_X(T)=1$ and that for the measure of a countable disjoint union is $$P_X(\coprod U_i)=P(X^{-1}(\coprod U_i))=P(\coprod(X^{-1}U_i))=\sum P(U_i).$$

Remark

There is an unfortunate clash in the language of probability theory and standard english usage. For example, imagine that we have a box with a single button on it and a numerical display. Every time we push the button the screen displays a number between 1 and 10. In common usage we say that these values are random if there is no way to know which number will appear on the screen every time we push the button.

It is important to know that mathematics/probability theory/statistics do not provide any such mechanism. There is no function whose values are “randomly” chosen given a particular input. In particular, mathematics does not provide a method of randomly choosing objects.

One should keep this in mind when talking about random variables. Random variables are not objects with random values; they are functions. The additional data that a random variable $X$ does define are numbers associated to the preimages $X^{-1}(I)$ (for measurable subsets $I$), which we can use to weight the values of $X$.

This can also be used to shed light on statistical mechanics, which uses probability theory to model situations arising in physics. The fact that such models have been extremely successful in the field of quantum mechanics does not necessarily mean there is something random, in the common usage sense, about the universe; we are not claiming that “God plays dice with the universe”. It is just that our best mathematical models for these phenomena are constructed using the language of probability theory.

Finally, we should remark that the closest mathematical object to a random number generator in the sense of english is a pseudorandom number generator. These are deterministic functions which output sequences of numbers which attempt to model our intuition of what a random number generator should produce. Although not truly random, these are heavily used in simulations and Monte Carlo methods.

Conventions

If we are regarding $\Bbb R$ as a measure space and do not specify an alternative measure, we will mean that it is equipped with its standard Borel measurable subsets and the Borel measure³ $E\mapsto \int_E dx$. If we regard a discrete finite set $S$ or any interval $[a,b]$ (with $a<b$) as a probability space and do not specify the measure then we will mean that it is equipped with a uniform measure. In other words, $P(s)=1/|S|$ for all $s\in S$ and for all measurable $E\subset I$ we have $P(E)=P_{\Bbb R}(E)/(b-a)$.

Remarks⁴:

If the measure $P_X$ from the previous definition is absolutely continuous with respect to the standard Borel measure (i.e., the preimage of every measure 0 set with respect to the standard Borel measure is of measure 0), then there is a measurable function $dP_X/dx \colon T\to \Bbb R$ such that for all measurable $E\subset T$, $$P_X(E) := \int_{X^{-1} E} dP := \int_E dP_X = \int_{E} \frac{dP_X}{dx} dx.$$ All of these integrals are Lebesgue integrals.

The measurable function $dP_X/dx$ is called a Radon-Nikodym derivative and any two such derivatives disagree on a set of measure 0, i.e., they agree almost everywhere. Without the absolute continuity hypothesis there is only a distribution satisfying this property. Having a measure defined in such a way obvious implies absolute continuity, so the first sentence can be formulated as an if and only if statement. This is the Radon-Nikodym theorem.

Definition

For a discrete probability space $D$ the function $p\colon D\to [0,1]$, defined by $d\mapsto p(d):=P(d)$ is called the probability mass function (PMF) of $D$.

Note that the measure $P$ on $D$ is uniquely determined by the associated probability mass function.

Definition

Suppose that $\Bbb R$ is equipped with a probability measure $P$ and the cumulative distribution function (cdf) $F(a)=P(x\leq a)$ is a continuously differentiable function of $a$, then $F(x)=\int_{-\infty}^x F'(x) dx$ and $F'(x)$ is called the probability density function (pdf) of $F$ (or $P$).

Note that the probability measure $P$ is determined by $F$ and hence the probability density function $F’$. This can lead to some confusing abuses of language.

Example

Let $D$ be the probability space from the first example. Let $X\colon D\to \mathbb{R}$ be the random variable which counts the number of occurrences of heads in a given event. Then the cumulative density function of $P_X$ is $F_X(x)=0$ if $x<0$, $F_X(x)=1/4$ if $0\leq x < 1$, $F_X(x)=3/4$ if $1\leq x < 2$ and $F_X(x)=1$ if $x\geq 2$. This function is discontinuous and hence the probability density function is not defined⁵.

Moments of distributions

Typically when we are handed a probability space $D$, we analyze it by constructing a random variable $X\colon D\to T$ where $T$ is either a countable subset of $\Bbb R$ or to $\Bbb R$. Using the procedure of the previous section we obtain a probability measure $P_X$ on $T$ and we now study this probability space. Usually a great deal of information is lost about $D$ during this process, but it allows us to focus our energies and work in the more tractable and explicit space $T\subset\Bbb R$.

So, we know focus on such probability spaces. This is usually decomposed into two cases, when $T$ is discrete (e.g., a subset of $\Bbb N$) and when $T$ is $\Bbb R$ (or some interval in $\Bbb R$). We could study the first case as a special case of the latter and just studying probability measures on $\Bbb R$, but that would require throwing in a lot of Dirac delta distributions at some point and I sense that you may not like that. We will seek a compromise and still use the integral notation to cover both cases although integrals in the discrete case can be expressed as sums.

There are two special properties of this situation that we will end up using:
1. It makes sense to multiply elements of $T$ with real valued functions.
2. There is a natural ordering on $T$ (so we can define a cdf).
3. We can now meaningfully compare random variables with values in $\Bbb R$ which are defined on different probability spaces, by comparing their associated probability measures on $\Bbb R$ (or their cdfs/pdfs when these exist).

For example the first property allows us to make sense of:

Definition

The expected value or mean of a random variable $X\colon D\to T\subset \Bbb R$, is $$\mu_X:=E(X)= \int_{x\in T} x\cdot dP_X = \int_{d\in D} X(d) dP.$$
Let $F_X$ denote the cdf of $X$. The median of $X$ is those $t\in \Bbb R$ such that $F_X(t)=0.5$.
Suppose that $X$ admits a pdf $f_X$. The modes of $X$ are those $t\in \Bbb R$ such that $f_X(t)$ is maximal.

Example

In our coin flipping example, the expected value of the random variable $X$ which counts the heads is $$\int_D X dP = \sum_{d\in D} X(d)p(d) = 2/4+1/4+1/4+0/4=1,$$
as expected.

The third property lets us make sense of:

Definition

Two random variables $X\colon D_1\to T$ and $Y\colon D_2\to T$ are identically distributed if they define the same probability measure on $T$, i.e., $P_X(I)=P_Y(I)$ for all measurable subsets $I\subseteq T$. In this case, we write $X\sim Y$.

Definition

We associate to two random variables $X,Y\colon D\to T$ a random variable $X\times Y\colon D \to T^2$ by $X\times Y(x)=(X(x),Y(x))$. This induces a probability measure $P_{X,Y}$ on $T^2.$ When $T=\Bbb R$ we can then define an associated joint cdf, $F_{X,Y}\colon \Bbb R^2\to [0,1]$ defined by $F_{X,Y}(a,b)=P_{X,Y}(x\leq a, y\leq b)$, which when $X\times Y$ is absolutely continuous with respect to the Lebesgue measure admits a joint pdf. Similarly, we can extend this to joint probability distributions of any number of random variables.

Definition

Two random variables $X,Y\colon D\to T$ with corresponding probability measures $P_X$ and $P_Y$ on $T$ are independent if the associated joint probability measure $P_{X,Y}$ on $T^2$ satisfies $P_{X,Y}(I_1\times I_2)=P_X(I_1)P_Y(I_2)$ for all measurable subsets $I_1, I_2\subseteq T$. When two variables are both independent and identically distributed then we abbreviate this to iid.

Definition

Suppose that $T$ is a probability space whose singleton sets are measurable. A random sample of size $n$ from $T$ will be any point of the associated product probability space $T^n$.

Exercise

Show that if $X$ and $Y$ are two $\Bbb R$ random variables then they are independent if and only if their joint cdf is the product of their individual cdfs.
Suppose that moreover $X$ and $Y$ and the joint distribution admit pdfs $p_X, p_Y,$ and $p_{X,Y}$ respectively, then show that the $f_{X,Y}=f_X f_Y$ if and only if the distributions are independent.

Definition

The $k$th moment of a random variable $X\colon D \to \Bbb R$ is $E(X^k)$.
The variance of a random variable $X\colon D\to \Bbb R$ is
$$\sigma_X^2=E((X-\mu_X)^2)=\int_D (X-\mu_X)^2 dP$$.
The standard deviation of $X$ is $\sigma_X=\sqrt{\sigma_X^2}$.
The covariance of a pair of random variables $X,Y\colon D\to \Bbb R$ is
$$ Cov(X,Y) = E((X-\mu_X)(Y-\mu_Y)). $$
The correlation coefficient of a pair of random variables $X,Y\colon D\to \Bbb R$ is $$\frac{Cov(X,Y)}{\sigma_X \sigma_Y}.$$

Exercise

Suppose that $X$ and $Y$ are two independent $\Bbb R$-valued random variables with finite means $\mu_X$ and $\mu_Y$ and finite variances $\sigma^2_X$ and $\sigma^2_Y$ respectively.
1. Show that, for $a,b\in \Bbb R$ the mean of $aX+bY$ is $a\mu_X + b\mu_Y$.
1. Show that the variance of $aX+bY$ is $a^2 \sigma^2_X + b^2 \sigma^2_Y$.
1. Show that $E(XY)$ is $\mu_X\cdot \mu_Y$.
1. Show that $E(X^2)=\sigma_X^2+\mu_X^2$.

Definition

The characteristic function of a random variable $X\colon D\to \Bbb R$ is the complex function $$\varphi_X(t)=E(e^{itX})=\int_{x\in \Bbb R} e^{itx} dP_X = \int_{d\in D} e^{itX(d)} dP.$$

Remarks

The characteristic function is always defined (because we are integrating an absolutely bounded function over a finite measure space).
When $X$ admits a pdf $p_X$, then up to a reparametrization the characteristic function is the Fourier transform of $p_X$: $F(p_X)(X)=\varphi_X(-2\pi t)$.
Two random variables have the same characteristic functions if and only if they are identically distributed⁶.

Some Important Results

The Law of Large Numbers, which essentially says that the average $S_n$ of a sum of $n$ iif random variables with finite mean $\mu$ “converges” to the common mean.
The Central Limit Theorem, which says that under the above hypotheses plus the assumption that the random variables have a finite variance $\sigma^2$, the random variable $\sqrt{n}(S_n-\mu)$ converges in distribution to the normal distribution with mean $0$ and variance $\sigma^2$. This result is the basis behind many normality assumptions and is critical to hypothesis testing which is used throughout the sciences.

Conditional probability

Suppose we have a probability space $P\colon D\to [0,1]$ and two events $A,B\in D$. Then we write $P(A,B)=P(A\cap B)$. Suppose that $P(B)>0$, then define the conditional probability of $A$ given $B$ as $$P(A|B)=P(A,B)/P(B).$$ A similar definition is also given for the conditional pdf of two random variables $X$ and $Y$: $f_{X,Y}(x|y)=f_{X,Y}(x,y)/f_Y(y)$ where $f_Y(y)=\int_{x \in \Bbb R} f(x,y) dx$ is the marginal density.

Bayes Rule

Let $A$ be an event in a probability space $P\colon D\to [0,1]$ and that $\{B_i\}_{i=1}^n$ is a disjoint union of events which cover $D$ and all of these events have non-zero probability. Then
$$ $$

There is also the pdf form:
$$ f_{X,Y}(x|y)=\frac{f_{X,Y}(y|x)f_X(x)}{\int f_{X,Y}(y|x) f_X(x) dx}. $$

The usefulness of Bayes rule is that it allows us to write a conditional probability that we do not understand (the dependence of $X$ on $Y$) in terms that we might understand (the dependence of $Y$ on $X$).

If you don’t know what a measurable subset is then you don’t know what a measure space is and you should consult the references. If you don’t consult the references and you just believe that all subsets of $D$ are measurable, it will take a long time to find out that you are wrong. ↩
Here and elsewhere we will abuse notation and denote measure of a singleton set ${{x}}$ by $P(x)$. ↩
For $\Bbb R^n$ and $n>1$ we should probably use the Lebesgue measure, which is a completion of the product Borel measure. ↩
This material is quite advanced, so don’t worry if it goes over your head. The notation here is chosen so that in special cases where the [fundamental theorem of calculus] (https://en.wikipedia.org/wiki/Fundamental_theorem_of_calculus) applies the Radon-Nikodym derivative can be chosen to be an actual derivative. ↩
Although we could define the “pdf” as a linear combination of Dirac delta distributions, but then it wouldn’t be a function (no matter what a physicist tells you). ↩
For a proof see these notes. ↩

Introduction

Potential objection

Definition

Some scandalous abuses of notation

Some results

Exercise

Introduction

Question

Example

Frequentists are Bayesian (sort of)

Textbooks

Lecture series

Websites

Programming references

Introduction

Numerical analysis

Remark

Remark

Algorithm Analysis

Definition

Introduction

First definitions

Definition

Remark

Exercise

Definition

Construction

Example

Definition

Remark

Conventions

Remarks4:

Definition

Definition

Example

Moments of distributions

Definition

Example

Definition

Definition

Definition

Definition

Exercise

Definition

Exercise

Definition

Remarks

Some Important Results

Conditional probability

Bayes Rule

Remarks⁴: