Probability and Statistics Background

Statistics – A subject which most statisticians find difficult, but in which nearly all physicians are expert. – Stephen S. Senn


For us, we will regard probability theory as a way of logically reasoning about uncertainty. I realize that this is not a precise mathematical definition, but neither is ‘probability theory is the mathematics arising from studying non-negative numbers which add up to 1’, which is at least partially accurate.

Some additional material is covered elsewhere:
* Statistical inference.

To get well-grounded let’s begin with a sequence of definitions.

First definitions


A probability space is a measure space $D$ with measure $P$ such that $P(D)=1$. The space $D$ is also sometimes called the sample space and the measurable subsets of $D$ are called events1.


The definition of probability space is sufficiently general to include lots of degenerate examples. For example, we can take any set $S$ and make it into a probability space by decreeing that the only measurable subsets are $S$ and $\emptyset$ with $P(S)=1$. Although we will try to make this explicit, we will almost always want that singleton sets, i.e., sets with just a single element, are measurable. When a probability space has this property and every measurable subset is a countable union of singleton sets we will call the probability space discrete.


Make the positive integers into a discrete probability space where every point has non-zero probability.


The probability of an event $E$ is $P(E)=:\int_E dP$. For discrete probability spaces we can also write $\int_E dP=\sum_{x\in E} P(x)$2.


Given two probability spaces $D_1$ and $D_2$ with respective probability measures $P_1$ and $P_2$. We can define a probability space $D_1\times D_2$ by:

  1. The underlying set is the cartesian product $D_1\times D_2$.
  2. The measurable subsets are generated under countable unions and complements by the products sets $I_1\times I_2$, where $I_1\subseteq D_1$ and $I_2\subseteq D_2$ are measurable subsets.
  3. The probability measure is determined by $P(I_1\times I_2)=P_1(I_1)\cdot P(I_2)$, where $I_1$ and $I_2$ are as in the previous statement.


Suppose we have a fair coin that we flip twice. Each of the four possible outcomes $D=\{HH,HT,TH,TT\}$ are equally likely and form a discrete probability space such that $P(x)=1/4$ for all $x\in D$. The probability of the event $E$, where we get precisely one head, is $P(E)=P(HT)+P(TH)=1/2$.


A random variable $X\colon D\to T$ is a measurable function from a probability space $D$ to a measure space $T$.

We can associate to each such $X$ a probability measure $P_X$ on $T$ by assigning to each measurable subset $U\subset T$, $P_X(U)=P(X^{-1}(U))$. Indeed it is clear that $P_X(T)=1$ and that for the measure of a countable disjoint union is $$P_X(\coprod U_i)=P(X^{-1}(\coprod U_i))=P(\coprod(X^{-1}U_i))=\sum P(U_i).$$


There is an unfortunate clash in the language of probability theory and standard english usage. For example, imagine that we have a box with a single button on it and a numerical display. Every time we push the button the screen displays a number between 1 and 10. In common usage we say that these values are random if there is no way to know which number will appear on the screen every time we push the button.

It is important to know that mathematics/probability theory/statistics do not provide any such mechanism. There is no function whose values are “randomly” chosen given a particular input. In particular, mathematics does not provide a method of randomly choosing objects.

One should keep this in mind when talking about random variables. Random variables are not objects with random values; they are functions. The additional data that a random variable $X$ does define are numbers associated to the preimages $X^{-1}(I)$ (for measurable subsets $I$), which we can use to weight the values of $X$.

This can also be used to shed light on statistical mechanics, which uses probability theory to model situations arising in physics. The fact that such models have been extremely successful in the field of quantum mechanics does not necessarily mean there is something random, in the common usage sense, about the universe; we are not claiming that “God plays dice with the universe”. It is just that our best mathematical models for these phenomena are constructed using the language of probability theory.

Finally, we should remark that the closest mathematical object to a random number generator in the sense of english is a pseudorandom number generator. These are deterministic functions which output sequences of numbers which attempt to model our intuition of what a random number generator should produce. Although not truly random, these are heavily used in simulations and Monte Carlo methods.


If we are regarding $\Bbb R$ as a measure space and do not specify an alternative measure, we will mean that it is equipped with its standard Borel measurable subsets and the Borel measure3 $E\mapsto \int_E dx$. If we regard a discrete finite set $S$ or any interval $[a,b]$ (with $a<b$) as a probability space and do not specify the measure then we will mean that it is equipped with a uniform measure. In other words, $P(s)=1/|S|$ for all $s\in S$ and for all measurable $E\subset I$ we have $P(E)=P_{\Bbb R}(E)/(b-a)$.


If the measure $P_X$ from the previous definition is absolutely continuous with respect to the standard Borel measure (i.e., the preimage of every measure 0 set with respect to the standard Borel measure is of measure 0), then there is a measurable function $dP_X/dx \colon T\to \Bbb R$ such that for all measurable $E\subset T$, $$P_X(E) := \int_{X^{-1} E} dP := \int_E dP_X = \int_{E} \frac{dP_X}{dx} dx.$$ All of these integrals are Lebesgue integrals.

The measurable function $dP_X/dx$ is called a Radon-Nikodym derivative and any two such derivatives disagree on a set of measure 0, i.e., they agree almost everywhere. Without the absolute continuity hypothesis there is only a distribution satisfying this property. Having a measure defined in such a way obvious implies absolute continuity, so the first sentence can be formulated as an if and only if statement. This is the Radon-Nikodym theorem.


For a discrete probability space $D$ the function $p\colon D\to [0,1]$, defined by $d\mapsto p(d):=P(d)$ is called the probability mass function (PMF) of $D$.

Note that the measure $P$ on $D$ is uniquely determined by the associated probability mass function.


Suppose that $\Bbb R$ is equipped with a probability measure $P$ and the cumulative distribution function (cdf) $F(a)=P(x\leq a)$ is a continuously differentiable function of $a$, then $F(x)=\int_{-\infty}^x F'(x) dx$ and $F'(x)$ is called the probability density function (pdf) of $F$ (or $P$).

Note that the probability measure $P$ is determined by $F$ and hence the probability density function $F’$. This can lead to some confusing abuses of language.


Let $D$ be the probability space from the first example. Let $X\colon D\to \mathbb{R}$ be the random variable which counts the number of occurrences of heads in a given event. Then the cumulative density function of $P_X$ is $F_X(x)=0$ if $x<0$, $F_X(x)=1/4$ if $0\leq x < 1$, $F_X(x)=3/4$ if $1\leq x < 2$ and $F_X(x)=1$ if $x\geq 2$. This function is discontinuous and hence the probability density function is not defined5.

Moments of distributions

Typically when we are handed a probability space $D$, we analyze it by constructing a random variable $X\colon D\to T$ where $T$ is either a countable subset of $\Bbb R$ or to $\Bbb R$. Using the procedure of the previous section we obtain a probability measure $P_X$ on $T$ and we now study this probability space. Usually a great deal of information is lost about $D$ during this process, but it allows us to focus our energies and work in the more tractable and explicit space $T\subset\Bbb R$.

So, we know focus on such probability spaces. This is usually decomposed into two cases, when $T$ is discrete (e.g., a subset of $\Bbb N$) and when $T$ is $\Bbb R$ (or some interval in $\Bbb R$). We could study the first case as a special case of the latter and just studying probability measures on $\Bbb R$, but that would require throwing in a lot of Dirac delta distributions at some point and I sense that you may not like that. We will seek a compromise and still use the integral notation to cover both cases although integrals in the discrete case can be expressed as sums.

There are two special properties of this situation that we will end up using:
1. It makes sense to multiply elements of $T$ with real valued functions.
2. There is a natural ordering on $T$ (so we can define a cdf).
3. We can now meaningfully compare random variables with values in $\Bbb R$ which are defined on different probability spaces, by comparing their associated probability measures on $\Bbb R$ (or their cdfs/pdfs when these exist).

For example the first property allows us to make sense of:


  1. The expected value or mean of a random variable $X\colon D\to T\subset \Bbb R$, is $$\mu_X:=E(X)= \int_{x\in T} x\cdot dP_X = \int_{d\in D} X(d) dP.$$
  2. Let $F_X$ denote the cdf of $X$. The median of $X$ is those $t\in \Bbb R$ such that $F_X(t)=0.5$.
  3. Suppose that $X$ admits a pdf $f_X$. The modes of $X$ are those $t\in \Bbb R$ such that $f_X(t)$ is maximal.


In our coin flipping example, the expected value of the random variable $X$ which counts the heads is $$\int_D X dP = \sum_{d\in D} X(d)p(d) = 2/4+1/4+1/4+0/4=1,$$
as expected.

The third property lets us make sense of:


Two random variables $X\colon D_1\to T$ and $Y\colon D_2\to T$ are identically distributed if they define the same probability measure on $T$, i.e., $P_X(I)=P_Y(I)$ for all measurable subsets $I\subseteq T$. In this case, we write $X\sim Y$.


We associate to two random variables $X,Y\colon D\to T$ a random variable $X\times Y\colon D \to T^2$ by $X\times Y(x)=(X(x),Y(x))$. This induces a probability measure $P_{X,Y}$ on $T^2.$ When $T=\Bbb R$ we can then define an associated joint cdf, $F_{X,Y}\colon \Bbb R^2\to [0,1]$ defined by $F_{X,Y}(a,b)=P_{X,Y}(x\leq a, y\leq b)$, which when $X\times Y$ is absolutely continuous with respect to the Lebesgue measure admits a joint pdf. Similarly, we can extend this to joint probability distributions of any number of random variables.


Two random variables $X,Y\colon D\to T$ with corresponding probability measures $P_X$ and $P_Y$ on $T$ are independent if the associated joint probability measure $P_{X,Y}$ on $T^2$ satisfies $P_{X,Y}(I_1\times I_2)=P_X(I_1)P_Y(I_2)$ for all measurable subsets $I_1, I_2\subseteq T$. When two variables are both independent and identically distributed then we abbreviate this to iid.


Suppose that $T$ is a probability space whose singleton sets are measurable. A random sample of size $n$ from $T$ will be any point of the associated product probability space $T^n$.


  1. Show that if $X$ and $Y$ are two $\Bbb R$ random variables then they are independent if and only if their joint cdf is the product of their individual cdfs.
  2. Suppose that moreover $X$ and $Y$ and the joint distribution admit pdfs $p_X, p_Y,$ and $p_{X,Y}$ respectively, then show that the $f_{X,Y}=f_X f_Y$ if and only if the distributions are independent.


  1. The $k$th moment of a random variable $X\colon D \to \Bbb R$ is $E(X^k)$.
  2. The variance of a random variable $X\colon D\to \Bbb R$ is
    $$\sigma_X^2=E((X-\mu_X)^2)=\int_D (X-\mu_X)^2 dP$$.
  3. The standard deviation of $X$ is $\sigma_X=\sqrt{\sigma_X^2}$.
  4. The covariance of a pair of random variables $X,Y\colon D\to \Bbb R$ is
    $$ Cov(X,Y) = E((X-\mu_X)(Y-\mu_Y)). $$
  5. The correlation coefficient of a pair of random variables $X,Y\colon D\to \Bbb R$ is $$\frac{Cov(X,Y)}{\sigma_X \sigma_Y}.$$


Suppose that $X$ and $Y$ are two independent $\Bbb R$-valued random variables with finite means $\mu_X$ and $\mu_Y$ and finite variances $\sigma^2_X$ and $\sigma^2_Y$ respectively.
1. Show that, for $a,b\in \Bbb R$ the mean of $aX+bY$ is $a\mu_X + b\mu_Y$.
1. Show that the variance of $aX+bY$ is $a^2 \sigma^2_X + b^2 \sigma^2_Y$.
1. Show that $E(XY)$ is $\mu_X\cdot \mu_Y$.
1. Show that $E(X^2)=\sigma_X^2+\mu_X^2$.


The characteristic function of a random variable $X\colon D\to \Bbb R$ is the complex function $$\varphi_X(t)=E(e^{itX})=\int_{x\in \Bbb R} e^{itx} dP_X = \int_{d\in D} e^{itX(d)} dP.$$


  1. The characteristic function is always defined (because we are integrating an absolutely bounded function over a finite measure space).
  2. When $X$ admits a pdf $p_X$, then up to a reparametrization the characteristic function is the Fourier transform of $p_X$: $F(p_X)(X)=\varphi_X(-2\pi t)$.
  3. Two random variables have the same characteristic functions if and only if they are identically distributed6.

Some Important Results

  1. The Law of Large Numbers, which essentially says that the average $S_n$ of a sum of $n$ iif random variables with finite mean $\mu$ “converges” to the common mean.
  2. The Central Limit Theorem, which says that under the above hypotheses plus the assumption that the random variables have a finite variance $\sigma^2$, the random variable $\sqrt{n}(S_n-\mu)$ converges in distribution to the normal distribution with mean $0$ and variance $\sigma^2$. This result is the basis behind many normality assumptions and is critical to hypothesis testing which is used throughout the sciences.

Conditional probability

Suppose we have a probability space $P\colon D\to [0,1]$ and two events $A,B\in D$. Then we write $P(A,B)=P(A\cap B)$. Suppose that $P(B)>0$, then define the conditional probability of $A$ given $B$ as $$P(A|B)=P(A,B)/P(B).$$ A similar definition is also given for the conditional pdf of two random variables $X$ and $Y$: $f_{X,Y}(x|y)=f_{X,Y}(x,y)/f_Y(y)$ where $f_Y(y)=\int_{x \in \Bbb R} f(x,y) dx$ is the marginal density.

Bayes Rule

Let $A$ be an event in a probability space $P\colon D\to [0,1]$ and that $\{B_i\}_{i=1}^n$ is a disjoint union of events which cover $D$ and all of these events have non-zero probability. Then
$$ $$

There is also the pdf form:
$$ f_{X,Y}(x|y)=\frac{f_{X,Y}(y|x)f_X(x)}{\int f_{X,Y}(y|x) f_X(x) dx}. $$

The usefulness of Bayes rule is that it allows us to write a conditional probability that we do not understand (the dependence of $X$ on $Y$) in terms that we might understand (the dependence of $Y$ on $X$).

  1. If you don’t know what a measurable subset is then you don’t know what a measure space is and you should consult the references. If you don’t consult the references and you just believe that all subsets of $D$ are measurable, it will take a long time to find out that you are wrong. 
  2. Here and elsewhere we will abuse notation and denote measure of a singleton set ${{x}}$ by $P(x)$. 
  3. For $\Bbb R^n$ and $n>1$ we should probably use the Lebesgue measure, which is a completion of the product Borel measure. 
  4. This material is quite advanced, so don’t worry if it goes over your head. The notation here is chosen so that in special cases where the [fundamental theorem of calculus] ( applies the Radon-Nikodym derivative can be chosen to be an actual derivative. 
  5. Although we could define the “pdf” as a linear combination of Dirac delta distributions, but then it wouldn’t be a function (no matter what a physicist tells you). 
  6. For a proof see these notes

Supervised Learning

A big computer, a complex algorithm, and a long time does not equal science. – Robert Gentleman


Before getting into what supervised learning precisely is, let’s look at some examples of supervised learning tasks:

  1. Identifying breast cancer.
  2. Image classification.
    • List of last year’s ILSVRC Winners
  3. Threat assessment
  4. Language Translation
  5. Identifying faces in images


Supervised learning is concerned with the construction of machine learning algorithms $f\colon 2^{D\times T}\times D\to T$. The subsets of $D\times T$ are referred to as subsets of labeled examples.

We often assume that the labeled examples arise from restricting some presumed function $F\colon D\to T$ whose values we know on $E\subset D$. We can then train $f$ on pairs $\{s, F(s) | s\in E\}$. We can then devise a cost function which measures the distance from the learned $f$ and the presumed $F$ (e.g., $L^2$-distance).

Supervised learning has been the most successful of the three branches to real world problems. The existence of labeled examples usually leads to well-defined performance metrics that transform supervised learning tasks into two other tasks finding an appropriate parametrized class of functions to choose $f$ from and an optimization problem of finding the best function in that class.

Typically $D$ will be some subset of $\Bbb R^n$ and we will refer to the components of $\Bbb R^n$ as features, dependent variables, or attributes (many concepts in machine learning have many names). Sometimes $D$ will be discrete, in which case we refer to these as categorical variables. There are a few simple tricks to map categorical variables into $\Bbb R^n$ (such as one-hot encoding), so it usually does not hurt to think of $D$ as a subset of $\Bbb R^n$.

When $T$ is discrete (typically finite), then the supervised learning problem is called a classification problem. When it is continuous (e.g, $\Bbb R$) then it is called a regression problem.


Let us consider the following sequence of supervised learning methods in turn.

  1. Linear Regression (Regression problems).
  2. $k$-nearest Neighbors (Classification or Regression problems).
  3. Logistic Regression (Classification problems1).
  4. Naive Bayes Classifier (Classification problems).
  5. Linear Discriminant Analysis (Classification problems).
  6. Support Vector Machines (Classification problems).
  7. Decision trees (Primarily Classification problems).
  8. Neural networks (Classification or Regression problems).

  1. I know the name is confusing. 

Machine Learning Overview

Science is knowledge which we understand so well that we can teach it to a computer; and if we don’t fully understand something, it is an art to deal with it. Donald Knuth


First Attempt at a Definition

One says that an algorithm learns if its performance improves with additional experience.

This is not a precise definition, so let’s try to make it so.


An algorithm will be a function1 $f\colon D\to T$ and whose individual values can be calculated by a computer (i.e., a Turing machine) in a finite amount of time using a finite amount of memory. We can refer to $D$ as the space of inputs or samples or examples and $T$ as the outputs or labels.

For the following definition we will want to be able to form a sum indexed by the elements of $D$. For this purpose we could suppose that our input is finite or that the domain is a measure space and that the algorithm will be a measurable function. For simplicity and convenience, we will pretend that $D$ is finite. That is if $D$ is infinite, then we have actually implicitly restricted to to a finite subset (e.g., $D=\Bbb R$ should really be interpreted as the finite subset of floating point numbers expressed in some fixed number of bits). This poses no real restriction in applications (all real world problem domains are effectively finite).


A performance measure of an algorithm will be a function $P\colon D\times T \to \Bbb R$. For two algorithms $f_1$ and $f_2$, we will say that $f_1$ performs better than $f_2$ (relative to $P$) if $$\sum_{d\in D} P(d,f_1(d)) > \sum_{d\in D} P(d, f_2(d)).$$ We similarly define ‘performs worse than’ and ‘performs at least as well as’, etc.

Alternatively, we may be presented a cost function $C\colon D\times T\to \Bbb R$ which we want to minimize. One can postcompose any performance measure (resp. cost function) with an order reversing map to obtain a cost function (resp. performance measure). This allows us to pass between the two notions.


A machine learning algorithm is an algorithm $f_-\colon 2^E \times D\to T$, that learns with respect to a specified performance measure $P$ (defined below). Here $2^E$ is the set of all subsets of $E$. We call an element $S\in 2^E$ (i.e., a subset $S\subseteq E$) a set of training data. For such an $S$ and $f_-$, let $f_S=f_-(S,-) \colon D\to T$ denote the resulting algorithm (which we will say is trained on $S$).


Consider the problem of finding a linear function $f\colon D\subset \Bbb R\to \Bbb R$ which closely approximates a fixed function $F\colon D\subset \Bbb R\to \Bbb R$. We can regard the function $F$ as a subset $S$ of $E=D\times \Bbb R$ by $F\mapsto S= \{(s,F(s)) | s\in D\}$2. We can define a cost function $C\colon D\times \Bbb R\to \Bbb R$ by $$C(d,t)=(F(d)-t)^2.$$ We can reformulate problem of minimizing the cost of $C(d,f(d))$ as maximizing the performance function $P(d,t)=\frac{1}{1+C(d,t)}$. The method of Linear Regression will define a machine learning algorithm $f_-\colon 2^E\times D\to \Bbb R$.

One can play with this problem using this demo.

Second Attempt at a Definition

A machine learning algorithm $f\colon 2^E\times D\to T$ learns (relative to a performance measure $P$) if for all proper subset inclusions $S_1\subset S_2\subseteq E$, $f_{S_2}$ performs better than $f_{S_1}$.

This is at least clearly defined, but it does not cover many examples of learning algorithms. For example, the method of linear regression will perform poorly when given a small training sample containing outliers (those $d\in D$ such that $F(d)$ takes on an unlikely value under some assumed probability distribution associated to $F$). So, the act of adjoining outliers to our training data will typically decrease the performance of this machine learning algorithm.

Final Definition

A machine learning algorithm $f\colon 2^E\times D\to T$ learns (relative to a performance measure $P$) if for each pair of natural numbers $m<n$, the average of the performances of $f_{S}$ over the subsets $S\subseteq E$ of cardinality $n$ is better than the average of the performances over the subsets of cardinality $m$.

Crucial remark

Without a specified performance measure, we have no obvious goal in machine learning. We can only evaluate machine learning algorithms relative to a performance measure3 and an algorithm that performs well with respect to one measure may perform poorly with respect to another. So the field of machine learning depends on first finding a good performance measure for the task at hand and then finding a machine learning algorithm that performs well with respect to this measure.


Now that we have come to a definition of a learning algorithm that learns that seems to cover at least one standard example, I will unveil some unfortunate truths:

  • In practice, we often do not have a performance measure that is defined over all of $D\times T$, instead we only have it over a subset.
  • Machine learning algorithms are often presented without proofs that they learn in the sense above.

Taxonomy of machine learning

We can roughly divide up the field of machine learning into 3 branches:

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement learning

Supervised learning

Before getting into what supervised learning precisely is, let’s look at some examples of supervised learning tasks:

  1. Identifying breast cancer.
  2. Image classification.
    • List of last year’s ILSVRC Winners
  3. Threat assessment
  4. Language Translation
  5. Identifying faces in images


Supervised learning is concerned with the construction of machine learning algorithms $f\colon 2^{D\times T}\times D\to T$. The subsets of $D\times T$ are referred to as subsets of labeled examples.

We often assume that the labeled examples arise from restricting some presumed function $F\colon D\to T$ whose values we know on $E\subset D$. We can then train $f$ on pairs $\{s, F(s) | s\in E\}$. We can then devise a cost function which measures the distance from the learned $f$ and the presumed $F$ (e.g., $L^2$-distance).

Supervised learning has been the most successful of the three branches to real world problems. The existence of labeled examples usually leads to well-defined performance metrics that transform supervised learning tasks into two other tasks:

  1. Find an appropriate parametrized class of functions to choose $f$ from.
  2. Solve the optimization problem of finding the best function in that class.


The first example of a supervised learning task is the linear regression problem mentioned above.

Unsupervised learning

First some examples:

  1. Some sample techniques in unsupervised learning can be seen at this beautiful blog post.
  2. Word vector embeddings.
  3. Independent components analysis


Unsupervised learning is concerned with the construction of machine learning algorithms of the form $f\colon 2^D\times D\to T$.

In this case, there is no obvious measure of performance for such an algorithm. The choice of performance measure essentially defines the unsupervised learning task.

Often, one can vaguely state that the goal of an unsupervised learning task is to summarize the data or determine its essential features. In other words, suppose that we have a function $i\colon T\to D$, then we can state the goal is to find an $f\colon D\to T$ such that $i\circ f$ approximates the identity function. That is our unsupervised learning task is to find a compression algorithm! In this case, we have transformed the unsupervised learning task into a supervised learning task (because we have as many labels as we like of the identity function).

The only task remaining (and it is the hard task) is to make sense of the word `approximates’ above by choosing an appropriate measure. For example, how do you quantitatively measure the quality of a video compression algorithm? Somehow we can usually see the difference between a terrible algorithm and a good one, but it can be difficult to turn this intuition into a well-defined measure.


One of the typical tasks in unsupervised learning is to identify ‘clusters’ of data points in a high dimensional space. We may be looking for difficult to identify relationships between data points. Once we have identified a classification that we deem to be ‘meaningful’, then our algorithm should be able to sort new data points into the correct classes. For example, can try to look at demographic data to identify clusters of voters with similar voting patterns and then try to identify the correct approach to appeal to this group.

Reinforcement learning

Reinforcement learning is applied to tasks where one must choose actions whose outcomes eventually lead to various rewards/punishments which are used to encourage or discourage certain choices:

  1. Breakout
  2. One of the more difficult challenges is Montezuma’s revenge where there is very little feedback and one of the implicit motivations is curiousity.
  3. Another truly amazing achievement is AlphaGo, or its successor AlphaGo Zero.
  4. Self-driving cars

To construct a reasonably precise definition, let’s describe a typical reinforcement learning task4. In this case the algorithm is to construct an agent (the name comes from the field of artificial intelligence). The agent learns a policy which is a mapping from perceived states to actions. Paired with the agent is an environment. The environment maintains an environment state and when given an action returns an observation and a reward to the agent, which it then uses to update its policy and perceived state and choose an action. This creates a feedback loop:

Reinforcement learning dynamics

We can break up the agent’s task into two parts:

  1. It must take in an observation and reward and update its perceived state. To update its perceived state it might maintain a model of its environment and use the observation and reward to update this model.
  2. It must take in a sequence of observations and rewards and update its policy. The policy is supposed to choose an allowable action and this decision could be made by defining a value function, which assigns to each possible action given our state a value, and choosing the action that maximizes the value. Since the agent is operating with imperfect information, the value function will only be an estimate, and it may choose to alternatively explore and choose a lower valued action in the hopes that such a state is being undervalued. We can model this exploration possibility, by saying that the policy assigns a probability to each possible action from a fixed perceived state.

Writing the general task out as a learning algorithm will get quite cumbersome. So let’s just hit the main points.

  • The environment will be encoded as a probability space $S_e \times A\times S_e$, which encodes the probability of transitioning from one environment state to another given a fixed action.
  • The policy will be encoded as a probability space $S_p \times A$, which encodes the probability of choosing an action from a given probability.
  • We will also require some type of model, which will be a probability space $S_p\times A\times O\times R\times S_p$ which will encode the probability of moving from one perceived state to another given an action, an observation, and a reward.
  • We will be a given a environmental feedback, which will be a random variable $F\colon S_e\times A\times S_e \to O\times R$.

All of these components can then be combined and we can form the expected reward of all sequences of length $n$ of tuples of the above variables5. The reinforcement learning task is to learn the policy function and the interpreter function so that we can maximize the expected reward over sequences of length $n$ as $n$ goes to $\infty$.

In this generality, the task is extremely difficult. We could simplify things by assuming the environment provides complete feedback about the perceived state, so that the observation is the perceived state (for example, a game with perfect information). But even then this task can be extremely difficult if there are many potential perceived states and many potential actions. For example, in the game Go there are two players (black and white) which take alternating turns placing pieces on a 19×19 board, occasionally removing captured pieces. As an approximate upper bound on the perceived state space we can see that there are no more than $3^{19\cdot 19}\approx 10^{172}$ possible board configurations (each square is either black, white, or blank), from each of these positions we have no more than $19^2=361$ possible actions choose from. Searching through all possible chains of configurations and actions is completely intractable. Even encoding an optimal policy in tabular form is impossible. This is part of the reason that performing well on this game was such a high-hurdle for the AI community.

  1. We could also allow partially defined functions. 
  2. Indeed if you look under the hood, this is how functions are actually defined.
  3. Okay, runtime and memory usage are also important too. 
  4. I must admit this first attempt at finding a definition of reinforcement learning that fits into the above framework is a bit clumsy and accurately reflects my limited knowledge in this area. 
  5. As a non-trivial exercise write this down for sequences of length 1 and 2.