Machine Learning Overview

Science is knowledge which we understand so well that we can teach it to a computer; and if we don’t fully understand something, it is an art to deal with it. Donald Knuth


First Attempt at a Definition

One says that an algorithm learns if its performance improves with additional experience.

This is not a precise definition, so let’s try to make it so.


An algorithm will be a function1 $f\colon D\to T$ and whose individual values can be calculated by a computer (i.e., a Turing machine) in a finite amount of time using a finite amount of memory. We can refer to $D$ as the space of inputs or samples or examples and $T$ as the outputs or labels.

For the following definition we will want to be able to form a sum indexed by the elements of $D$. For this purpose we could suppose that our input is finite or that the domain is a measure space and that the algorithm will be a measurable function. For simplicity and convenience, we will pretend that $D$ is finite. That is if $D$ is infinite, then we have actually implicitly restricted to to a finite subset (e.g., $D=\Bbb R$ should really be interpreted as the finite subset of floating point numbers expressed in some fixed number of bits). This poses no real restriction in applications (all real world problem domains are effectively finite).


A performance measure of an algorithm will be a function $P\colon D\times T \to \Bbb R$. For two algorithms $f_1$ and $f_2$, we will say that $f_1$ performs better than $f_2$ (relative to $P$) if $$\sum_{d\in D} P(d,f_1(d)) > \sum_{d\in D} P(d, f_2(d)).$$ We similarly define ‘performs worse than’ and ‘performs at least as well as’, etc.

Alternatively, we may be presented a cost function $C\colon D\times T\to \Bbb R$ which we want to minimize. One can postcompose any performance measure (resp. cost function) with an order reversing map to obtain a cost function (resp. performance measure). This allows us to pass between the two notions.


A machine learning algorithm is an algorithm $f_-\colon 2^E \times D\to T$, that learns with respect to a specified performance measure $P$ (defined below). Here $2^E$ is the set of all subsets of $E$. We call an element $S\in 2^E$ (i.e., a subset $S\subseteq E$) a set of training data. For such an $S$ and $f_-$, let $f_S=f_-(S,-) \colon D\to T$ denote the resulting algorithm (which we will say is trained on $S$).


Consider the problem of finding a linear function $f\colon D\subset \Bbb R\to \Bbb R$ which closely approximates a fixed function $F\colon D\subset \Bbb R\to \Bbb R$. We can regard the function $F$ as a subset $S$ of $E=D\times \Bbb R$ by $F\mapsto S= \{(s,F(s)) | s\in D\}$2. We can define a cost function $C\colon D\times \Bbb R\to \Bbb R$ by $$C(d,t)=(F(d)-t)^2.$$ We can reformulate problem of minimizing the cost of $C(d,f(d))$ as maximizing the performance function $P(d,t)=\frac{1}{1+C(d,t)}$. The method of Linear Regression will define a machine learning algorithm $f_-\colon 2^E\times D\to \Bbb R$.

One can play with this problem using this demo.

Second Attempt at a Definition

A machine learning algorithm $f\colon 2^E\times D\to T$ learns (relative to a performance measure $P$) if for all proper subset inclusions $S_1\subset S_2\subseteq E$, $f_{S_2}$ performs better than $f_{S_1}$.

This is at least clearly defined, but it does not cover many examples of learning algorithms. For example, the method of linear regression will perform poorly when given a small training sample containing outliers (those $d\in D$ such that $F(d)$ takes on an unlikely value under some assumed probability distribution associated to $F$). So, the act of adjoining outliers to our training data will typically decrease the performance of this machine learning algorithm.

Final Definition

A machine learning algorithm $f\colon 2^E\times D\to T$ learns (relative to a performance measure $P$) if for each pair of natural numbers $m<n$, the average of the performances of $f_{S}$ over the subsets $S\subseteq E$ of cardinality $n$ is better than the average of the performances over the subsets of cardinality $m$.

Crucial remark

Without a specified performance measure, we have no obvious goal in machine learning. We can only evaluate machine learning algorithms relative to a performance measure3 and an algorithm that performs well with respect to one measure may perform poorly with respect to another. So the field of machine learning depends on first finding a good performance measure for the task at hand and then finding a machine learning algorithm that performs well with respect to this measure.


Now that we have come to a definition of a learning algorithm that learns that seems to cover at least one standard example, I will unveil some unfortunate truths:

  • In practice, we often do not have a performance measure that is defined over all of $D\times T$, instead we only have it over a subset.
  • Machine learning algorithms are often presented without proofs that they learn in the sense above.

Taxonomy of machine learning

We can roughly divide up the field of machine learning into 3 branches:

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement learning

Supervised learning

Before getting into what supervised learning precisely is, let’s look at some examples of supervised learning tasks:

  1. Identifying breast cancer.
  2. Image classification.
    • List of last year’s ILSVRC Winners
  3. Threat assessment
  4. Language Translation
  5. Identifying faces in images


Supervised learning is concerned with the construction of machine learning algorithms $f\colon 2^{D\times T}\times D\to T$. The subsets of $D\times T$ are referred to as subsets of labeled examples.

We often assume that the labeled examples arise from restricting some presumed function $F\colon D\to T$ whose values we know on $E\subset D$. We can then train $f$ on pairs $\{s, F(s) | s\in E\}$. We can then devise a cost function which measures the distance from the learned $f$ and the presumed $F$ (e.g., $L^2$-distance).

Supervised learning has been the most successful of the three branches to real world problems. The existence of labeled examples usually leads to well-defined performance metrics that transform supervised learning tasks into two other tasks:

  1. Find an appropriate parametrized class of functions to choose $f$ from.
  2. Solve the optimization problem of finding the best function in that class.


The first example of a supervised learning task is the linear regression problem mentioned above.

Unsupervised learning

First some examples:

  1. Some sample techniques in unsupervised learning can be seen at this beautiful blog post.
  2. Word vector embeddings.
  3. Independent components analysis


Unsupervised learning is concerned with the construction of machine learning algorithms of the form $f\colon 2^D\times D\to T$.

In this case, there is no obvious measure of performance for such an algorithm. The choice of performance measure essentially defines the unsupervised learning task.

Often, one can vaguely state that the goal of an unsupervised learning task is to summarize the data or determine its essential features. In other words, suppose that we have a function $i\colon T\to D$, then we can state the goal is to find an $f\colon D\to T$ such that $i\circ f$ approximates the identity function. That is our unsupervised learning task is to find a compression algorithm! In this case, we have transformed the unsupervised learning task into a supervised learning task (because we have as many labels as we like of the identity function).

The only task remaining (and it is the hard task) is to make sense of the word `approximates’ above by choosing an appropriate measure. For example, how do you quantitatively measure the quality of a video compression algorithm? Somehow we can usually see the difference between a terrible algorithm and a good one, but it can be difficult to turn this intuition into a well-defined measure.


One of the typical tasks in unsupervised learning is to identify ‘clusters’ of data points in a high dimensional space. We may be looking for difficult to identify relationships between data points. Once we have identified a classification that we deem to be ‘meaningful’, then our algorithm should be able to sort new data points into the correct classes. For example, can try to look at demographic data to identify clusters of voters with similar voting patterns and then try to identify the correct approach to appeal to this group.

Reinforcement learning

Reinforcement learning is applied to tasks where one must choose actions whose outcomes eventually lead to various rewards/punishments which are used to encourage or discourage certain choices:

  1. Breakout
  2. One of the more difficult challenges is Montezuma’s revenge where there is very little feedback and one of the implicit motivations is curiousity.
  3. Another truly amazing achievement is AlphaGo, or its successor AlphaGo Zero.
  4. Self-driving cars

To construct a reasonably precise definition, let’s describe a typical reinforcement learning task4. In this case the algorithm is to construct an agent (the name comes from the field of artificial intelligence). The agent learns a policy which is a mapping from perceived states to actions. Paired with the agent is an environment. The environment maintains an environment state and when given an action returns an observation and a reward to the agent, which it then uses to update its policy and perceived state and choose an action. This creates a feedback loop:

Reinforcement learning dynamics

We can break up the agent’s task into two parts:

  1. It must take in an observation and reward and update its perceived state. To update its perceived state it might maintain a model of its environment and use the observation and reward to update this model.
  2. It must take in a sequence of observations and rewards and update its policy. The policy is supposed to choose an allowable action and this decision could be made by defining a value function, which assigns to each possible action given our state a value, and choosing the action that maximizes the value. Since the agent is operating with imperfect information, the value function will only be an estimate, and it may choose to alternatively explore and choose a lower valued action in the hopes that such a state is being undervalued. We can model this exploration possibility, by saying that the policy assigns a probability to each possible action from a fixed perceived state.

Writing the general task out as a learning algorithm will get quite cumbersome. So let’s just hit the main points.

  • The environment will be encoded as a probability space $S_e \times A\times S_e$, which encodes the probability of transitioning from one environment state to another given a fixed action.
  • The policy will be encoded as a probability space $S_p \times A$, which encodes the probability of choosing an action from a given probability.
  • We will also require some type of model, which will be a probability space $S_p\times A\times O\times R\times S_p$ which will encode the probability of moving from one perceived state to another given an action, an observation, and a reward.
  • We will be a given a environmental feedback, which will be a random variable $F\colon S_e\times A\times S_e \to O\times R$.

All of these components can then be combined and we can form the expected reward of all sequences of length $n$ of tuples of the above variables5. The reinforcement learning task is to learn the policy function and the interpreter function so that we can maximize the expected reward over sequences of length $n$ as $n$ goes to $\infty$.

In this generality, the task is extremely difficult. We could simplify things by assuming the environment provides complete feedback about the perceived state, so that the observation is the perceived state (for example, a game with perfect information). But even then this task can be extremely difficult if there are many potential perceived states and many potential actions. For example, in the game Go there are two players (black and white) which take alternating turns placing pieces on a 19×19 board, occasionally removing captured pieces. As an approximate upper bound on the perceived state space we can see that there are no more than $3^{19\cdot 19}\approx 10^{172}$ possible board configurations (each square is either black, white, or blank), from each of these positions we have no more than $19^2=361$ possible actions choose from. Searching through all possible chains of configurations and actions is completely intractable. Even encoding an optimal policy in tabular form is impossible. This is part of the reason that performing well on this game was such a high-hurdle for the AI community.

  1. We could also allow partially defined functions. 
  2. Indeed if you look under the hood, this is how functions are actually defined.
  3. Okay, runtime and memory usage are also important too. 
  4. I must admit this first attempt at finding a definition of reinforcement learning that fits into the above framework is a bit clumsy and accurately reflects my limited knowledge in this area. 
  5. As a non-trivial exercise write this down for sequences of length 1 and 2. 

Leave a Reply

Your email address will not be published. Required fields are marked *