Additional Sources


  1. Machine Learning: A probabilistic perspective by Kevin Murphy. The material in this book is closest to what we will cover in the course, but is unfortunately not available for free. Written by an academic and a practitioner of machine learning, this text is full of real world examples and applications, while eschewing the ad-hoc approach of other texts.
  2. Introduction to statistical learning by Gareth JamesDaniela WittenTrevor Hastie and Robert Tibshirani. This textbook is available for free. It has minimal prerequisites and as a consequence does not touch on some topics of interest.  On the other hand, it has detailed R exercises which are wonderful.
  3. The elements of statistical learning by Trevor Hastie, Robert Tibshirani, and Jerome Friedman. A more advanced version of the previous text (and also free), covering more material, and assuming more background in statistics.
  4. Deep learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. A new text on this burgeoning field. Written by central figures in the area, this text surveys deep learning, includes many useful citations (a great deal of the literature is still in academic articles), provides good rules of thumb, and is a relatively light read. This book is very much from a practitioner’s perspective with less justification for methodology and less emphasis on mathematical rigor. My limited experience is that other papers in this area have a similar emphasis, which reflects the limited understanding of why these methods work very well in some instances and not so well in others.
  5. Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto. A free textbook that has been recently revised and covers quite a few topics.

Lecture series

  1. Trevor Hastie and Robert Tibshirani have made their lectures and slides for their course on statistical learning available. These are very well done.
  2. Machine Learning lectures from Georgia Tech, these are very well done introductory lectures on machine learning directed toward engineers. I personally love the nerdy jokes interspersed throughout.
  3. Deep learning course lectures by Nando de Freitas. Excellent lecture series on deep learning from Oxford for computer scientists. This lecture series is aimed at those with some more background.
  4. Machine learning course lectures by Andrew Ng. A somewhat older lecture series from Stanford aimed at computer scientists.
  5. Machine Learning 101 by Jason Mayes. A very well done overview of the subject.


  1. Datacamp: An online learning site with an emphasis on data science. Includes mini-lectures and online programming exercises. A nice introduction on how to take the material we are learning and apply it. However, there is a noticeable bump in difficulty in proceeding to implement these methods from scratch on your own computer.
  2. Andrej Karpathy’s blog: A truly excellent sequence of blog posts on topics related to machine learning.
  3. Kaggle: Hosts machine learning competitions, allows machine learning practitioners to share information, and hosts many large datasets.
  4. Seeing Theory: A beautiful sequence of javascript visualizations for statistical concepts.

Programming references

  1. Introduction to R.
  2. Python 3 standard library.
  3. Keras (a significantly simpler way to apply the Tensorflow (or another backend) library for building neural networks).
  4. Tensorflow API (a rich library for building and using neural networks).

Computer Science Background

If you find that you’re spending almost all your time on theory, start turning some attention to practical things; it will improve your theories. If you find that you’re spending almost all your time on practice, start turning some attention to theoretical things; it will improve your practice. – Donald Knuth


This section will contain some very basic computer science background. Since the target audience of this lecture series has little to no training in this area, I will attempt to fill in some gaps (admittedly very poorly).

Numerical analysis

It should come as no surprise that computers store data as sequences of 1’s and 0’s. For example, non-negative integers are represented by their base-2 expansions (e.g. 7 is represented by 111 and 8 by 1000). Since we can only allocate a finite amount of space for each integer (e.g., 16, 32, or 64 bits) we can only hold a finite range of integers (e.g., $-2^{63}\leq i \leq 2^{63}-1$) (see two’s complement). This means that one has to avoid applying operations that will take us out of this range. Arithmetic operations that generate values outside of their allowed range are said to overflow.

Problems with the finite representations of integers are dwarfed by the problems arising from the finite representation of real numbers. The (IEEE) standard way to finitely represent a real number is as a floating point number. The idea is to approximate a real number $x$ by a non-negative integer $\alpha$, an integer $\beta$, and a sign bit $\epsilon$ so that $x\approx  (-1)^\epsilon \cdot \alpha \cdot 2^{\beta}$. Since, using a fixed number of bits, we can only represent finitely many integers, there are only finitely many such floating point numbers.

The finiteness of this representation leads to some obvious consequences as well as some non-obvious ones.

  1. Since the exponent $\beta$ is bounded above, there is a bound for the range of real numbers that we can encode. This leads to the same type of overflow problems as for integers.
  2. Since the exponent $\beta$ is bounded below, there is a smallest positive real number we can encode in this representation. This leads to a new set of problems called underflow. For example $e^{-n}=0$ for $n\gg 0$. As a consequence, $e^{-n}\cdot e^{n}\neq e^0=1$ for large $n$. If we instead expanded this expression from the middle then we obtain: $$e\cdot (e\cdot (\cdots(e\cdot e^{-1})\cdot e^{-1})\cdots e^{-1})=1.$$ So multiplication of floating point numbers fails to be associative.
  3. Since $\alpha$ is a bounded positive integer, we can only encode a finite amount of precision. This implies that for certain (large) values of $m$ and $n$, $$\sum_{i=1}^{2^n} 2^{-n} + 2^m = 2^m+1,$$ if we apply the operations from left to right, while the sum is $2^m$ if we apply them from right to left. So addition of floating point numbers fails to be associative.
  4. We can not encode fractions precisely so $1/n \cdot n\neq 1$ in general. On older calculators one can witness this first hand and see that $1/3\cdot 3“=”0.9999999999$.


I sometimes find it helpful to visualize all of the floating point numbers placed on the real line as a sequence of dots. These dots cluster heavily near 0 and gradually spread farther and farther apart as we move away from the origin until we meet the size bounds. Beyond these points there are no longer any dots.

Since both encodings of real numbers into floating point and floating point operations contribute errors1, care needs to be taken in the design of floating point algorithms to ensure that these errors do not grow out of control, otherwise things may go seriously wrong).


In probability theory and, especially, statistical inference, it is very common to take iterated products of probabilities. Since many of these probabilities are very small, there is a high risk of numerical underflow as we calculate the product.

One way to avoid this underflow is to use the simple formula $$ \prod_{i=1}^n x_i = \exp(\sum_{i=1}^n \log x_i). $$ This will help avoid the numerical underflow for the product. To ensure that the sum of quantities of varying magnitude is as accurate as possible, we can first sort the values to be summed and then add them.

Algorithm Analysis

Computer scientists and engineers are very concerned with space and computational efficiency. Making fair comparisons between two implementations of algorithms can be tricky. Implementation details of essentially identical algorithms can result in significant changes in runtime performance and memory costs (e.g., consider a program written in python 2.6 versus one written python 3.0).

An easier comparison can be made about how the memory and runtime requirements of an algorithm change as we increase the size of the input. For example, the standard grade school algorithm for multiplying two $n\times n$ matrices requires $n^3$ multiplications and $n^2*(n-1)$ addition operations2. What is important to realize about this algorithm is not the exact count of the operations or how much faster addition is compared to multiplication, it is, for example, that if we replace $n$ with $100n$ then we will require roughly $1 000 000$-times more operations!

At this scale the addition operations and their relative speed in comparison to multiplication do not matter as much as this observation. The dominant feature is that the time to multiply two $n\times n$ matrices grows cubically in $n$.

One way to encode the essential features of the growth of functions is with Big O-notation.


We say3 $f(n)= O(g(n))$ if there exists an $M>0$ such that for all sufficiently large $n$,  $$|f(n)|\leq M|g(n)|.$$

For most big data tasks, any algorithm whose runtime is not $O(n \cdot \log n)$ requires some attention (especially attention to how one can parallelize the computation across a cluster). More importantly, once the data is too large to fit into a single computers memory, this entire approach to comparing algorithms breaks down (as do the standard implementations of the algorithms). In this case, one has to think very carefully about how to break up the task into separate pieces. This is all complicated by the fact that the time move data between computers is much slower than it is to move the data internally in computer memory.

  1. One way to reduce the errors in floating point operations is to use more bits for the representations, do the operation, and then truncate. 
  2. It turns out this is not asymptotically optimal. For example see here
  3. I would like to, on the record, state my disapproval of the standard usage of the equal-sign in the expression $f(n)=O(g(n))$. I can see no reasonable interpretation of the equality symbol here.) 

Probability and Statistics Background

Statistics – A subject which most statisticians find difficult, but in which nearly all physicians are expert. – Stephen S. Senn


For us, we will regard probability theory as a way of logically reasoning about uncertainty. I realize that this is not a precise mathematical definition, but neither is ‘probability theory is the mathematics arising from studying non-negative numbers which add up to 1’, which is at least partially accurate.

Some additional material is covered elsewhere:
* Statistical inference.

To get well-grounded let’s begin with a sequence of definitions.

First definitions


A probability space is a measure space $D$ with measure $P$ such that $P(D)=1$. The space $D$ is also sometimes called the sample space and the measurable subsets of $D$ are called events1.


The definition of probability space is sufficiently general to include lots of degenerate examples. For example, we can take any set $S$ and make it into a probability space by decreeing that the only measurable subsets are $S$ and $\emptyset$ with $P(S)=1$. Although we will try to make this explicit, we will almost always want that singleton sets, i.e., sets with just a single element, are measurable. When a probability space has this property and every measurable subset is a countable union of singleton sets we will call the probability space discrete.


Make the positive integers into a discrete probability space where every point has non-zero probability.


The probability of an event $E$ is $P(E)=:\int_E dP$. For discrete probability spaces we can also write $\int_E dP=\sum_{x\in E} P(x)$2.


Given two probability spaces $D_1$ and $D_2$ with respective probability measures $P_1$ and $P_2$. We can define a probability space $D_1\times D_2$ by:

  1. The underlying set is the cartesian product $D_1\times D_2$.
  2. The measurable subsets are generated under countable unions and complements by the products sets $I_1\times I_2$, where $I_1\subseteq D_1$ and $I_2\subseteq D_2$ are measurable subsets.
  3. The probability measure is determined by $P(I_1\times I_2)=P_1(I_1)\cdot P(I_2)$, where $I_1$ and $I_2$ are as in the previous statement.


Suppose we have a fair coin that we flip twice. Each of the four possible outcomes $D=\{HH,HT,TH,TT\}$ are equally likely and form a discrete probability space such that $P(x)=1/4$ for all $x\in D$. The probability of the event $E$, where we get precisely one head, is $P(E)=P(HT)+P(TH)=1/2$.


A random variable $X\colon D\to T$ is a measurable function from a probability space $D$ to a measure space $T$.

We can associate to each such $X$ a probability measure $P_X$ on $T$ by assigning to each measurable subset $U\subset T$, $P_X(U)=P(X^{-1}(U))$. Indeed it is clear that $P_X(T)=1$ and that for the measure of a countable disjoint union is $$P_X(\coprod U_i)=P(X^{-1}(\coprod U_i))=P(\coprod(X^{-1}U_i))=\sum P(U_i).$$


There is an unfortunate clash in the language of probability theory and standard english usage. For example, imagine that we have a box with a single button on it and a numerical display. Every time we push the button the screen displays a number between 1 and 10. In common usage we say that these values are random if there is no way to know which number will appear on the screen every time we push the button.

It is important to know that mathematics/probability theory/statistics do not provide any such mechanism. There is no function whose values are “randomly” chosen given a particular input. In particular, mathematics does not provide a method of randomly choosing objects.

One should keep this in mind when talking about random variables. Random variables are not objects with random values; they are functions. The additional data that a random variable $X$ does define are numbers associated to the preimages $X^{-1}(I)$ (for measurable subsets $I$), which we can use to weight the values of $X$.

This can also be used to shed light on statistical mechanics, which uses probability theory to model situations arising in physics. The fact that such models have been extremely successful in the field of quantum mechanics does not necessarily mean there is something random, in the common usage sense, about the universe; we are not claiming that “God plays dice with the universe”. It is just that our best mathematical models for these phenomena are constructed using the language of probability theory.

Finally, we should remark that the closest mathematical object to a random number generator in the sense of english is a pseudorandom number generator. These are deterministic functions which output sequences of numbers which attempt to model our intuition of what a random number generator should produce. Although not truly random, these are heavily used in simulations and Monte Carlo methods.


If we are regarding $\Bbb R$ as a measure space and do not specify an alternative measure, we will mean that it is equipped with its standard Borel measurable subsets and the Borel measure3 $E\mapsto \int_E dx$. If we regard a discrete finite set $S$ or any interval $[a,b]$ (with $a<b$) as a probability space and do not specify the measure then we will mean that it is equipped with a uniform measure. In other words, $P(s)=1/|S|$ for all $s\in S$ and for all measurable $E\subset I$ we have $P(E)=P_{\Bbb R}(E)/(b-a)$.


If the measure $P_X$ from the previous definition is absolutely continuous with respect to the standard Borel measure (i.e., the preimage of every measure 0 set with respect to the standard Borel measure is of measure 0), then there is a measurable function $dP_X/dx \colon T\to \Bbb R$ such that for all measurable $E\subset T$, $$P_X(E) := \int_{X^{-1} E} dP := \int_E dP_X = \int_{E} \frac{dP_X}{dx} dx.$$ All of these integrals are Lebesgue integrals.

The measurable function $dP_X/dx$ is called a Radon-Nikodym derivative and any two such derivatives disagree on a set of measure 0, i.e., they agree almost everywhere. Without the absolute continuity hypothesis there is only a distribution satisfying this property. Having a measure defined in such a way obvious implies absolute continuity, so the first sentence can be formulated as an if and only if statement. This is the Radon-Nikodym theorem.


For a discrete probability space $D$ the function $p\colon D\to [0,1]$, defined by $d\mapsto p(d):=P(d)$ is called the probability mass function (PMF) of $D$.

Note that the measure $P$ on $D$ is uniquely determined by the associated probability mass function.


Suppose that $\Bbb R$ is equipped with a probability measure $P$ and the cumulative distribution function (cdf) $F(a)=P(x\leq a)$ is a continuously differentiable function of $a$, then $F(x)=\int_{-\infty}^x F'(x) dx$ and $F'(x)$ is called the probability density function (pdf) of $F$ (or $P$).

Note that the probability measure $P$ is determined by $F$ and hence the probability density function $F’$. This can lead to some confusing abuses of language.


Let $D$ be the probability space from the first example. Let $X\colon D\to \mathbb{R}$ be the random variable which counts the number of occurrences of heads in a given event. Then the cumulative density function of $P_X$ is $F_X(x)=0$ if $x<0$, $F_X(x)=1/4$ if $0\leq x < 1$, $F_X(x)=3/4$ if $1\leq x < 2$ and $F_X(x)=1$ if $x\geq 2$. This function is discontinuous and hence the probability density function is not defined5.

Moments of distributions

Typically when we are handed a probability space $D$, we analyze it by constructing a random variable $X\colon D\to T$ where $T$ is either a countable subset of $\Bbb R$ or to $\Bbb R$. Using the procedure of the previous section we obtain a probability measure $P_X$ on $T$ and we now study this probability space. Usually a great deal of information is lost about $D$ during this process, but it allows us to focus our energies and work in the more tractable and explicit space $T\subset\Bbb R$.

So, we know focus on such probability spaces. This is usually decomposed into two cases, when $T$ is discrete (e.g., a subset of $\Bbb N$) and when $T$ is $\Bbb R$ (or some interval in $\Bbb R$). We could study the first case as a special case of the latter and just studying probability measures on $\Bbb R$, but that would require throwing in a lot of Dirac delta distributions at some point and I sense that you may not like that. We will seek a compromise and still use the integral notation to cover both cases although integrals in the discrete case can be expressed as sums.

There are two special properties of this situation that we will end up using:
1. It makes sense to multiply elements of $T$ with real valued functions.
2. There is a natural ordering on $T$ (so we can define a cdf).
3. We can now meaningfully compare random variables with values in $\Bbb R$ which are defined on different probability spaces, by comparing their associated probability measures on $\Bbb R$ (or their cdfs/pdfs when these exist).

For example the first property allows us to make sense of:


  1. The expected value or mean of a random variable $X\colon D\to T\subset \Bbb R$, is $$\mu_X:=E(X)= \int_{x\in T} x\cdot dP_X = \int_{d\in D} X(d) dP.$$
  2. Let $F_X$ denote the cdf of $X$. The median of $X$ is those $t\in \Bbb R$ such that $F_X(t)=0.5$.
  3. Suppose that $X$ admits a pdf $f_X$. The modes of $X$ are those $t\in \Bbb R$ such that $f_X(t)$ is maximal.


In our coin flipping example, the expected value of the random variable $X$ which counts the heads is $$\int_D X dP = \sum_{d\in D} X(d)p(d) = 2/4+1/4+1/4+0/4=1,$$
as expected.

The third property lets us make sense of:


Two random variables $X\colon D_1\to T$ and $Y\colon D_2\to T$ are identically distributed if they define the same probability measure on $T$, i.e., $P_X(I)=P_Y(I)$ for all measurable subsets $I\subseteq T$. In this case, we write $X\sim Y$.


We associate to two random variables $X,Y\colon D\to T$ a random variable $X\times Y\colon D \to T^2$ by $X\times Y(x)=(X(x),Y(x))$. This induces a probability measure $P_{X,Y}$ on $T^2.$ When $T=\Bbb R$ we can then define an associated joint cdf, $F_{X,Y}\colon \Bbb R^2\to [0,1]$ defined by $F_{X,Y}(a,b)=P_{X,Y}(x\leq a, y\leq b)$, which when $X\times Y$ is absolutely continuous with respect to the Lebesgue measure admits a joint pdf. Similarly, we can extend this to joint probability distributions of any number of random variables.


Two random variables $X,Y\colon D\to T$ with corresponding probability measures $P_X$ and $P_Y$ on $T$ are independent if the associated joint probability measure $P_{X,Y}$ on $T^2$ satisfies $P_{X,Y}(I_1\times I_2)=P_X(I_1)P_Y(I_2)$ for all measurable subsets $I_1, I_2\subseteq T$. When two variables are both independent and identically distributed then we abbreviate this to iid.


Suppose that $T$ is a probability space whose singleton sets are measurable. A random sample of size $n$ from $T$ will be any point of the associated product probability space $T^n$.


  1. Show that if $X$ and $Y$ are two $\Bbb R$ random variables then they are independent if and only if their joint cdf is the product of their individual cdfs.
  2. Suppose that moreover $X$ and $Y$ and the joint distribution admit pdfs $p_X, p_Y,$ and $p_{X,Y}$ respectively, then show that the $f_{X,Y}=f_X f_Y$ if and only if the distributions are independent.


  1. The $k$th moment of a random variable $X\colon D \to \Bbb R$ is $E(X^k)$.
  2. The variance of a random variable $X\colon D\to \Bbb R$ is
    $$\sigma_X^2=E((X-\mu_X)^2)=\int_D (X-\mu_X)^2 dP$$.
  3. The standard deviation of $X$ is $\sigma_X=\sqrt{\sigma_X^2}$.
  4. The covariance of a pair of random variables $X,Y\colon D\to \Bbb R$ is
    $$ Cov(X,Y) = E((X-\mu_X)(Y-\mu_Y)). $$
  5. The correlation coefficient of a pair of random variables $X,Y\colon D\to \Bbb R$ is $$\frac{Cov(X,Y)}{\sigma_X \sigma_Y}.$$


Suppose that $X$ and $Y$ are two independent $\Bbb R$-valued random variables with finite means $\mu_X$ and $\mu_Y$ and finite variances $\sigma^2_X$ and $\sigma^2_Y$ respectively.
1. Show that, for $a,b\in \Bbb R$ the mean of $aX+bY$ is $a\mu_X + b\mu_Y$.
1. Show that the variance of $aX+bY$ is $a^2 \sigma^2_X + b^2 \sigma^2_Y$.
1. Show that $E(XY)$ is $\mu_X\cdot \mu_Y$.
1. Show that $E(X^2)=\sigma_X^2+\mu_X^2$.


The characteristic function of a random variable $X\colon D\to \Bbb R$ is the complex function $$\varphi_X(t)=E(e^{itX})=\int_{x\in \Bbb R} e^{itx} dP_X = \int_{d\in D} e^{itX(d)} dP.$$


  1. The characteristic function is always defined (because we are integrating an absolutely bounded function over a finite measure space).
  2. When $X$ admits a pdf $p_X$, then up to a reparametrization the characteristic function is the Fourier transform of $p_X$: $F(p_X)(X)=\varphi_X(-2\pi t)$.
  3. Two random variables have the same characteristic functions if and only if they are identically distributed6.

Some Important Results

  1. The Law of Large Numbers, which essentially says that the average $S_n$ of a sum of $n$ iif random variables with finite mean $\mu$ “converges” to the common mean.
  2. The Central Limit Theorem, which says that under the above hypotheses plus the assumption that the random variables have a finite variance $\sigma^2$, the random variable $\sqrt{n}(S_n-\mu)$ converges in distribution to the normal distribution with mean $0$ and variance $\sigma^2$. This result is the basis behind many normality assumptions and is critical to hypothesis testing which is used throughout the sciences.

Conditional probability

Suppose we have a probability space $P\colon D\to [0,1]$ and two events $A,B\in D$. Then we write $P(A,B)=P(A\cap B)$. Suppose that $P(B)>0$, then define the conditional probability of $A$ given $B$ as $$P(A|B)=P(A,B)/P(B).$$ A similar definition is also given for the conditional pdf of two random variables $X$ and $Y$: $f_{X,Y}(x|y)=f_{X,Y}(x,y)/f_Y(y)$ where $f_Y(y)=\int_{x \in \Bbb R} f(x,y) dx$ is the marginal density.

Bayes Rule

Let $A$ be an event in a probability space $P\colon D\to [0,1]$ and that $\{B_i\}_{i=1}^n$ is a disjoint union of events which cover $D$ and all of these events have non-zero probability. Then
$$ $$

There is also the pdf form:
$$ f_{X,Y}(x|y)=\frac{f_{X,Y}(y|x)f_X(x)}{\int f_{X,Y}(y|x) f_X(x) dx}. $$

The usefulness of Bayes rule is that it allows us to write a conditional probability that we do not understand (the dependence of $X$ on $Y$) in terms that we might understand (the dependence of $Y$ on $X$).

  1. If you don’t know what a measurable subset is then you don’t know what a measure space is and you should consult the references. If you don’t consult the references and you just believe that all subsets of $D$ are measurable, it will take a long time to find out that you are wrong. 
  2. Here and elsewhere we will abuse notation and denote measure of a singleton set ${{x}}$ by $P(x)$. 
  3. For $\Bbb R^n$ and $n>1$ we should probably use the Lebesgue measure, which is a completion of the product Borel measure. 
  4. This material is quite advanced, so don’t worry if it goes over your head. The notation here is chosen so that in special cases where the [fundamental theorem of calculus] ( applies the Radon-Nikodym derivative can be chosen to be an actual derivative. 
  5. Although we could define the “pdf” as a linear combination of Dirac delta distributions, but then it wouldn’t be a function (no matter what a physicist tells you). 
  6. For a proof see these notes

Supervised Learning

A big computer, a complex algorithm, and a long time does not equal science. – Robert Gentleman


Before getting into what supervised learning precisely is, let’s look at some examples of supervised learning tasks:

  1. Identifying breast cancer.
  2. Image classification.
    • List of last year’s ILSVRC Winners
  3. Threat assessment
  4. Language Translation
  5. Identifying faces in images


Supervised learning is concerned with the construction of machine learning algorithms $f\colon 2^{D\times T}\times D\to T$. The subsets of $D\times T$ are referred to as subsets of labeled examples.

We often assume that the labeled examples arise from restricting some presumed function $F\colon D\to T$ whose values we know on $E\subset D$. We can then train $f$ on pairs $\{s, F(s) | s\in E\}$. We can then devise a cost function which measures the distance from the learned $f$ and the presumed $F$ (e.g., $L^2$-distance).

Supervised learning has been the most successful of the three branches to real world problems. The existence of labeled examples usually leads to well-defined performance metrics that transform supervised learning tasks into two other tasks finding an appropriate parametrized class of functions to choose $f$ from and an optimization problem of finding the best function in that class.

Typically $D$ will be some subset of $\Bbb R^n$ and we will refer to the components of $\Bbb R^n$ as features, dependent variables, or attributes (many concepts in machine learning have many names). Sometimes $D$ will be discrete, in which case we refer to these as categorical variables. There are a few simple tricks to map categorical variables into $\Bbb R^n$ (such as one-hot encoding), so it usually does not hurt to think of $D$ as a subset of $\Bbb R^n$.

When $T$ is discrete (typically finite), then the supervised learning problem is called a classification problem. When it is continuous (e.g, $\Bbb R$) then it is called a regression problem.


Let us consider the following sequence of supervised learning methods in turn.

  1. Linear Regression (Regression problems).
  2. $k$-nearest Neighbors (Classification or Regression problems).
  3. Logistic Regression (Classification problems1).
  4. Naive Bayes Classifier (Classification problems).
  5. Linear Discriminant Analysis (Classification problems).
  6. Support Vector Machines (Classification problems).
  7. Decision trees (Primarily Classification problems).
  8. Neural networks (Classification or Regression problems).

  1. I know the name is confusing. 

Machine Learning Overview

Science is knowledge which we understand so well that we can teach it to a computer; and if we don’t fully understand something, it is an art to deal with it. Donald Knuth


First Attempt at a Definition

One says that an algorithm learns if its performance improves with additional experience.

This is not a precise definition, so let’s try to make it so.


An algorithm will be a function1 $f\colon D\to T$ and whose individual values can be calculated by a computer (i.e., a Turing machine) in a finite amount of time using a finite amount of memory. We can refer to $D$ as the space of inputs or samples or examples and $T$ as the outputs or labels.

For the following definition we will want to be able to form a sum indexed by the elements of $D$. For this purpose we could suppose that our input is finite or that the domain is a measure space and that the algorithm will be a measurable function. For simplicity and convenience, we will pretend that $D$ is finite. That is if $D$ is infinite, then we have actually implicitly restricted to to a finite subset (e.g., $D=\Bbb R$ should really be interpreted as the finite subset of floating point numbers expressed in some fixed number of bits). This poses no real restriction in applications (all real world problem domains are effectively finite).


A performance measure of an algorithm will be a function $P\colon D\times T \to \Bbb R$. For two algorithms $f_1$ and $f_2$, we will say that $f_1$ performs better than $f_2$ (relative to $P$) if $$\sum_{d\in D} P(d,f_1(d)) > \sum_{d\in D} P(d, f_2(d)).$$ We similarly define ‘performs worse than’ and ‘performs at least as well as’, etc.

Alternatively, we may be presented a cost function $C\colon D\times T\to \Bbb R$ which we want to minimize. One can postcompose any performance measure (resp. cost function) with an order reversing map to obtain a cost function (resp. performance measure). This allows us to pass between the two notions.


A machine learning algorithm is an algorithm $f_-\colon 2^E \times D\to T$, that learns with respect to a specified performance measure $P$ (defined below). Here $2^E$ is the set of all subsets of $E$. We call an element $S\in 2^E$ (i.e., a subset $S\subseteq E$) a set of training data. For such an $S$ and $f_-$, let $f_S=f_-(S,-) \colon D\to T$ denote the resulting algorithm (which we will say is trained on $S$).


Consider the problem of finding a linear function $f\colon D\subset \Bbb R\to \Bbb R$ which closely approximates a fixed function $F\colon D\subset \Bbb R\to \Bbb R$. We can regard the function $F$ as a subset $S$ of $E=D\times \Bbb R$ by $F\mapsto S= \{(s,F(s)) | s\in D\}$2. We can define a cost function $C\colon D\times \Bbb R\to \Bbb R$ by $$C(d,t)=(F(d)-t)^2.$$ We can reformulate problem of minimizing the cost of $C(d,f(d))$ as maximizing the performance function $P(d,t)=\frac{1}{1+C(d,t)}$. The method of Linear Regression will define a machine learning algorithm $f_-\colon 2^E\times D\to \Bbb R$.

One can play with this problem using this demo.

Second Attempt at a Definition

A machine learning algorithm $f\colon 2^E\times D\to T$ learns (relative to a performance measure $P$) if for all proper subset inclusions $S_1\subset S_2\subseteq E$, $f_{S_2}$ performs better than $f_{S_1}$.

This is at least clearly defined, but it does not cover many examples of learning algorithms. For example, the method of linear regression will perform poorly when given a small training sample containing outliers (those $d\in D$ such that $F(d)$ takes on an unlikely value under some assumed probability distribution associated to $F$). So, the act of adjoining outliers to our training data will typically decrease the performance of this machine learning algorithm.

Final Definition

A machine learning algorithm $f\colon 2^E\times D\to T$ learns (relative to a performance measure $P$) if for each pair of natural numbers $m<n$, the average of the performances of $f_{S}$ over the subsets $S\subseteq E$ of cardinality $n$ is better than the average of the performances over the subsets of cardinality $m$.

Crucial remark

Without a specified performance measure, we have no obvious goal in machine learning. We can only evaluate machine learning algorithms relative to a performance measure3 and an algorithm that performs well with respect to one measure may perform poorly with respect to another. So the field of machine learning depends on first finding a good performance measure for the task at hand and then finding a machine learning algorithm that performs well with respect to this measure.


Now that we have come to a definition of a learning algorithm that learns that seems to cover at least one standard example, I will unveil some unfortunate truths:

  • In practice, we often do not have a performance measure that is defined over all of $D\times T$, instead we only have it over a subset.
  • Machine learning algorithms are often presented without proofs that they learn in the sense above.

Taxonomy of machine learning

We can roughly divide up the field of machine learning into 3 branches:

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement learning

Supervised learning

Before getting into what supervised learning precisely is, let’s look at some examples of supervised learning tasks:

  1. Identifying breast cancer.
  2. Image classification.
    • List of last year’s ILSVRC Winners
  3. Threat assessment
  4. Language Translation
  5. Identifying faces in images


Supervised learning is concerned with the construction of machine learning algorithms $f\colon 2^{D\times T}\times D\to T$. The subsets of $D\times T$ are referred to as subsets of labeled examples.

We often assume that the labeled examples arise from restricting some presumed function $F\colon D\to T$ whose values we know on $E\subset D$. We can then train $f$ on pairs $\{s, F(s) | s\in E\}$. We can then devise a cost function which measures the distance from the learned $f$ and the presumed $F$ (e.g., $L^2$-distance).

Supervised learning has been the most successful of the three branches to real world problems. The existence of labeled examples usually leads to well-defined performance metrics that transform supervised learning tasks into two other tasks:

  1. Find an appropriate parametrized class of functions to choose $f$ from.
  2. Solve the optimization problem of finding the best function in that class.


The first example of a supervised learning task is the linear regression problem mentioned above.

Unsupervised learning

First some examples:

  1. Some sample techniques in unsupervised learning can be seen at this beautiful blog post.
  2. Word vector embeddings.
  3. Independent components analysis


Unsupervised learning is concerned with the construction of machine learning algorithms of the form $f\colon 2^D\times D\to T$.

In this case, there is no obvious measure of performance for such an algorithm. The choice of performance measure essentially defines the unsupervised learning task.

Often, one can vaguely state that the goal of an unsupervised learning task is to summarize the data or determine its essential features. In other words, suppose that we have a function $i\colon T\to D$, then we can state the goal is to find an $f\colon D\to T$ such that $i\circ f$ approximates the identity function. That is our unsupervised learning task is to find a compression algorithm! In this case, we have transformed the unsupervised learning task into a supervised learning task (because we have as many labels as we like of the identity function).

The only task remaining (and it is the hard task) is to make sense of the word `approximates’ above by choosing an appropriate measure. For example, how do you quantitatively measure the quality of a video compression algorithm? Somehow we can usually see the difference between a terrible algorithm and a good one, but it can be difficult to turn this intuition into a well-defined measure.


One of the typical tasks in unsupervised learning is to identify ‘clusters’ of data points in a high dimensional space. We may be looking for difficult to identify relationships between data points. Once we have identified a classification that we deem to be ‘meaningful’, then our algorithm should be able to sort new data points into the correct classes. For example, can try to look at demographic data to identify clusters of voters with similar voting patterns and then try to identify the correct approach to appeal to this group.

Reinforcement learning

Reinforcement learning is applied to tasks where one must choose actions whose outcomes eventually lead to various rewards/punishments which are used to encourage or discourage certain choices:

  1. Breakout
  2. One of the more difficult challenges is Montezuma’s revenge where there is very little feedback and one of the implicit motivations is curiousity.
  3. Another truly amazing achievement is AlphaGo, or its successor AlphaGo Zero.
  4. Self-driving cars

To construct a reasonably precise definition, let’s describe a typical reinforcement learning task4. In this case the algorithm is to construct an agent (the name comes from the field of artificial intelligence). The agent learns a policy which is a mapping from perceived states to actions. Paired with the agent is an environment. The environment maintains an environment state and when given an action returns an observation and a reward to the agent, which it then uses to update its policy and perceived state and choose an action. This creates a feedback loop:

Reinforcement learning dynamics

We can break up the agent’s task into two parts:

  1. It must take in an observation and reward and update its perceived state. To update its perceived state it might maintain a model of its environment and use the observation and reward to update this model.
  2. It must take in a sequence of observations and rewards and update its policy. The policy is supposed to choose an allowable action and this decision could be made by defining a value function, which assigns to each possible action given our state a value, and choosing the action that maximizes the value. Since the agent is operating with imperfect information, the value function will only be an estimate, and it may choose to alternatively explore and choose a lower valued action in the hopes that such a state is being undervalued. We can model this exploration possibility, by saying that the policy assigns a probability to each possible action from a fixed perceived state.

Writing the general task out as a learning algorithm will get quite cumbersome. So let’s just hit the main points.

  • The environment will be encoded as a probability space $S_e \times A\times S_e$, which encodes the probability of transitioning from one environment state to another given a fixed action.
  • The policy will be encoded as a probability space $S_p \times A$, which encodes the probability of choosing an action from a given probability.
  • We will also require some type of model, which will be a probability space $S_p\times A\times O\times R\times S_p$ which will encode the probability of moving from one perceived state to another given an action, an observation, and a reward.
  • We will be a given a environmental feedback, which will be a random variable $F\colon S_e\times A\times S_e \to O\times R$.

All of these components can then be combined and we can form the expected reward of all sequences of length $n$ of tuples of the above variables5. The reinforcement learning task is to learn the policy function and the interpreter function so that we can maximize the expected reward over sequences of length $n$ as $n$ goes to $\infty$.

In this generality, the task is extremely difficult. We could simplify things by assuming the environment provides complete feedback about the perceived state, so that the observation is the perceived state (for example, a game with perfect information). But even then this task can be extremely difficult if there are many potential perceived states and many potential actions. For example, in the game Go there are two players (black and white) which take alternating turns placing pieces on a 19×19 board, occasionally removing captured pieces. As an approximate upper bound on the perceived state space we can see that there are no more than $3^{19\cdot 19}\approx 10^{172}$ possible board configurations (each square is either black, white, or blank), from each of these positions we have no more than $19^2=361$ possible actions choose from. Searching through all possible chains of configurations and actions is completely intractable. Even encoding an optimal policy in tabular form is impossible. This is part of the reason that performing well on this game was such a high-hurdle for the AI community.

  1. We could also allow partially defined functions. 
  2. Indeed if you look under the hood, this is how functions are actually defined.
  3. Okay, runtime and memory usage are also important too. 
  4. I must admit this first attempt at finding a definition of reinforcement learning that fits into the above framework is a bit clumsy and accurately reflects my limited knowledge in this area. 
  5. As a non-trivial exercise write this down for sequences of length 1 and 2.