Linear Regression

Prediction is very difficult, especially about the future. – Niels Bohr

The problem

Suppose we have a list of vectors (which we can think of as samples) $x_1, \cdots, x_m\in \Bbb R^n$ and a corresponding list of output scalars $y_1, \cdots, y_m \in \Bbb R$ (which we can regard as a vector $y\in \Bbb R^m$). We want to find a vector $\beta=(\beta_1,\cdots \beta_n)\in \Bbb R^n$ such that the sum of squared error $$ \sum_{i=1}^m (x_i \cdot \beta – y_i)^2 $$ is as small as possible. Alternatively, if we suppose that our pairs $(x_i, y_i)$ arise from a function $f(x_i)=y_i$, we want to find the closest approximation to $f$ by a linear function of the form $\widehat{f}(a)=\sum_{i=1}^n \beta_i \pi_i(a)$, where $\pi_i \colon \Bbb R^n\to \Bbb R$ is the projection to the $i$th coordinate.

Let’s arrange the input vectors $x_1,\cdots, x_m$ into an $m\times n$-matrix $A$, where $x_i$ appears as the $i$th row. Then we have a matrix equation of the form $ A \beta = \widehat{y}$ and we want to find $\beta$ such that $||\widehat{y}-y||^2$ is as small as possible1.

Let’s try to use calculus. Here we are trying to minimize:
$$ SSE(\beta) = \lVert\widehat{y}-y\rVert^2=(A\beta -y)^T(A\beta -y).$$ Differentiating with respect to $\beta$ (see Tensor Calculus) yields
\frac{dSSE(\beta)}{d\beta} & = 2(A\beta -y)^TA \label{eq:der}.
Differentiating with respect to $\beta$ again yields $2A^TA$ which will be a positive semidefinite matrix. This matrix will be positive definite if and only if $\ker A =0$2.

Now $\ker A \neq 0$ if and only if one of the columns of $A$ can be expressed as a linear combination of the others. In terms of our original problem, this means that one of the components of our input vectors is in fact expressible as a linear combination of the others, so there is a hidden dependency amongst the variables.


Perform the above calculation yourself while being careful about what are vector valued quantities.

Let’s suppose $\ker A\neq 0$ (we can arrange this anyway by dropping dependent columns). So our matrix is positive definite and we see that any critical value of $SSE(\beta)$ in \eqref{eq:der} is a minimum. So by setting \eqref{eq:der} to 0 and solving for $\beta$ (while observing that $A^TA$ is non-singular, given our assumptions) we get:
This value of $\beta$ is uniquely determined and hence yields the global minimum of $SSE(\beta)$.

Note that when $A$ is an invertible square matrix, then \eqref{eq:solv} reduces to $\beta = A^{-1} y$, which will have $SSE(\beta)=0$; just as we should expect from linear algebra.

Implementation concerns

Using a naive implementation of Cramer’s rule to invert an $n\times n$ matrix will take $O(n!)$ time and is totally impractical for problems of interest. While using Gaussian elimination to invert an $n\times n$ matrix takes $O(n^3)$-time (by being clever one can cut the runtime in half for inverting the symmetric matrix $A^TA$). The remaining multiplications require $O(n^2m)$ time and since $m\geq n$ (otherwise $\ker A\neq 0$), the dominant term will be the multiplications so the algorithm runs in $O(n^2m)$-time.

We may also be concerned about errors in the data. Suppose that the “correct” value of $y$ is, in fact $y+e$. For simplicity suppose that $A$ is invertible. We might hope that if $\lVert e\rVert$ is very small relative to $\lVert y\rVert$ then the error between $A^{-1}y$ and $A^{-1}(y+e)$ will be small relative to $A^{-1}y$.
We can calculate
\frac{\lVert A^{-1} e \rVert}{\lVert A^{-1} y \rVert}/\frac{\lVert e\rVert}{\lVert y\rVert} & = \frac{\lVert A^{-1} e\rVert}{\lVert e\rVert }\cdot \frac{\lVert y\rVert }{\lVert A^{-1}y\rVert } \label{eq:cond}
As we try to maximize this expression over all $e$ and $y$ (so we can replace $y$ with $Ay$), we obtain the product of operator norms $\lVert A^{-1}\rVert \cdot \lVert A \rVert$. While we won’t go into operator norms here, we will point out that when $A$ is diagonal, then $\lVert A\rVert$ is the maximum of the absolute values on the diagonal entries, so that \eqref{eq:cond} becomes $|M|/|m|$ where $|M|$ is the maximum of the absolute values of the diagonal entries and $|m|$ is the minimum of the absolute values of these entries. In particular, if $A$ is the 2×2 diagonal matrix with diagonal entries $10$ and $0.1$, inversion can multiply relative errors by a factor of a 100.

Typically the standard thing to do is not to invert the matrix, but rather construct a matrix decomposition from which it is easy to solve the system of equations by back substitution. Not that this would not help the numerical stability problem with the diagonal matrix.

Beyond simple linear functions

At this point we have seen how to find an approximation to a function $f(x)$ of the form $\widehat{f}(x)=\sum_{i=1}^n \beta_i \pi_i(x)$ that minimizes the sum of squared errors.

What happens if our function is just the constant function $f(x)=10$? In this case, our approximation will do rather poorly. In particular, $\widehat{f}(0)=0$, so this will never be a very good approximation about this point.

What we could do in this case is assume that our function instead has the form: $$\widehat{f}(x)=\sum_{i=1}^n \beta_i \pi_i(x)+\beta_{n+1}\cdot 1.$$ We can now solve this new approximation problem by just changing the vectors $x_i=(\pi_1(x_i),\cdots,\pi_n(x_i))$ into new $n+1$-dimensional vectors $(\pi_1(x_i),\cdots,\pi_n(x_i),1)$ and applying linear regression as before. For non-degenerate collections of vectors ${x_i}$, we will end up with the desired function $\widehat{f}(x)=10$. This is sometimes called the bias trick.

Here is an example of such a linear regression being fit to more and more data points. For this example, we used python’s scikit-learn package and rendered the images using matplotlib (see the code).

Fitting to more and more data points
In this case, the mean squared error approaches the Bayes error rate.

More generally if we think our $f(x)$ can be well approximated by a function of the form $\widehat{f}(x)=\sum_{i=1}^N \beta_i f_i(x)$ for any fixed set of $N$ functions $\{f_i\}$, we can just construct a new sample matrix whose $i$th row is the $N$-vector $(f_1(x_i),\cdots,f_N(x))$ and apply linear regression.

For example, if we are looking for a function $\widehat{f}\colon \Bbb R\to \Bbb R$ of the form $\widehat{f}(x)=\beta_1 x^2+\beta_2 x +\beta_3$, we just have to replace the samples $x_1,\cdots, x_m$ with the vectors $(x_1^2, x_1, 1), \cdots, (x_m^2, x_m, 1)$ and apply linear regression as before. This is an example of polynomial regression. For example, we have the following (see the code):

High degree polynomials fit the data better
These examples were constructed as above, but we first transform the data following the above discussion using PolynomialFeatures. Alternatively, it is very easy to just code this transformation yourself.

We could similarly try to find Fourier approximations for a function which we expect has periodic behavior. We could also use wavelets or splines. Clearly, this list goes on and on.


If we choose a suitably expressive class of functions to define our generalized linear regression problem, then there is a good chance that the linear regression problem will fit the data perfectly. For example, if we have any function $f\colon \Bbb R \to \Bbb R$ and we know its values on $n$ distinct points, then we can perfectly approximate this function by a polynomial of degree at least $n-1$ (e.g., 2 points determine a line, 3 points determine a quadratic curve, etc.). While this will perfectly fit our given data, it becomes quite unlikely that it will match any new data points. This is a classic example of overfitting. We can see this phenomenon in the following example:

This estimator has high variance
Here the degree 10 polynomial will fit any 30 points relatively well, but will drastically fail to generalize well to the entire dataset.

A Maximal Likelihood Justification for using SSE

While the above exposition is elementary, one might wonder where the choice to minimize the sum of squared errors comes from.

Let us suppose that that we have a function of the form $f(x)=g(x)+\epsilon$ where $\epsilon$ is an $\Bbb R$ valued random variable3 $\epsilon \sim N(0, \sigma^2)$ and $g(x)$ is a deterministic function. We could mentally justify this by saying that the output variables are not recorded exactly, but there is some normally distributed measurement error or there is some additional non-linear random factors contributing to the value of $f(x)$. Then the conditional density function of $f(x)$ is $$p(y|x)=\frac{e^{-(g(x) -y)^2/2\sigma^2}}{\sqrt{2\pi \sigma^2}}.$$

The maximal likelihood estimate (MLE) (see here for more details) for a vector of values $y=(y_1,\cdots,y_m)$ associated to a sequence of independent samples $x=(x_1,\cdots,x_m)$ is then


So in this case, the MLE for $y$ is the choice of $y$ which minimizes the sum of squared errors.

A frequentist description

If we suppose that our data samples $(x,y)$ are being drawn from a fixed distribution, e.g., they are the values of a vector valued random variable $X$ and a random variable $Y$. Moreover, suppose the joint pdf of $X$ and $Y$ is $p(x,y)$. Then we can regard linear regression as attempting to find


Bayesian approach

Here we will consider a Bayesian approach to the polynomial regression problem. First we consider the case, where we approximate data coming from a degree 3 polynomial using a Gaussian model that expects a degree 3 polynomial and is trained on all 200 samples. In other words, we are assuming a conditional distribution of the form
$$p(y,x|\theta)=1/\sqrt{2\pi \sigma^2}\cdot e^{\frac{(y-\sum_{i=0}^3 \beta_i x^i)^2}{2\sigma^2}},$$ where $\theta$ is the vector of parameters in this model. For our prior, we suppose that the parameters are independent random variables where $\beta_i~N(0, 100)$ and $\sigma$ is given by a half normal distribution with standard deviation 1 (since this can only take non-negative values). To calculate the posterior $p(\theta | D)$ we apply Bayes rule. Unfortunately Bayes rule has this nasty integral over the parameter space in the denominator:
$$ \int_\theta p(D|\theta)p(\theta)d\theta.$$ Rather than try and calculate this integral directly, we will calculate this using Monte Carlo integration, which is sampled using an algorithm which is completely NUTS. All of this is implemented using the package PyMC3. This is overkill for this situation, but it is nice to see how the general algorithm is implemented (see here for the code).

In effect what we are doing is using a special technique to generate samples $S$ of $\theta$ according to the prior distribution $p(\theta)$. For each $\theta\in S$, we calculate $z_\theta =p(D|\theta)$ and see that $$p(\theta|D)\approx \frac{z_\theta}{\sum_{\theta’ \in S} z_{\theta’}}.$$ As the number of samples goes up, the accuracy of this estimate increases.

We can look at the output here:

A degree 3 approximation

By taking posterior samples of the parameters we obtain the curves in grey. The mean of these samples generates the black curve in the middle. Alternatively, we can sample from $p(y|x)$ (that is we average over all of the choices of parameters). Taking the mean values for each $y$ given an $x$ we obtain the red curve. We have marked a band of radius the standard deviation about these samples.

The summary statistics for our parameter values are listed here. Note that the ‘true value’ of our function is $$f(x)=
7 x^3 – 9.1 x^2 + 3.5 x – 0.392,$$ with a standard deviation of 0.3. We note that none of the true coefficients fall into the 95% credibility intervals, which is not very promising.

mean sd mc_error hpd_2.5 hpd_97.5
coeff__0 -0.209607 0.080981 0.001872 -0.370318 -0.056474
coeff__1 1.914993 0.704315 0.020752 0.562936 3.260617
coeff__2 -5.556357 1.651007 0.050628 -8.678463 -2.410120
coeff__3 4.794343 1.099424 0.033513 2.710534 6.909714
sigma 0.291662 0.015164 0.000186 0.263677 0.321718

We can gain further information by studying the pdfs of the resulting variables:

Now we examine how the Bayesian method does with respect to our overfitting problem above. Here we try to fit a degree 10 polynomial to only 30 data points. We then obtain the following:

Again there is a lot of variance in the additional models.

Here we see the samples for the polynomial curves again have wild variance, but our posterior predictions are actually quite reasonable. Looking at the summary statistics in the table below , we note that the model has very low confidence in almost all of the parameter values and huge credibility intervals that mostly contain the true values. By averaging over all of these parameters we reduce the variance in our predictions and obtain a much better model.

mean sd mc_error hpd_2.5 hpd_97.5
coeff__0 -0.053526 0.200482 0.002303 -0.469330 0.317603
coeff__1 0.658498 1.898380 0.022927 -3.246255 4.222251
coeff__2 -2.835666 5.895880 0.084508 -13.723288 9.131853
coeff__3 2.316211 8.093137 0.126718 -13.588722 18.071517
coeff__4 1.342521 8.552999 0.104256 -15.235386 18.123408
coeff__5 -0.630845 8.596230 0.099357 -16.968628 16.577238
coeff__6 -1.251081 8.598991 0.103051 -17.388627 15.975167
coeff__7 -0.712717 8.881522 0.102497 -18.123559 16.398549
coeff__8 0.373434 8.985303 0.104398 -16.930835 18.813871
coeff__9 0.753457 8.428446 0.094453 -16.404371 16.786885
coeff__10 0.835288 7.222892 0.089429 -13.213493 15.239204
sigma 0.276957 0.041169 0.000551 0.205731 0.360664


How did our choice of prior bias the predictions of the coefficients? What would be some other reasonable choices of prior?

  1. Note that minimizing the sum is equivalent to minimizing the average. 
  2. Recall that on $\Bbb R^m$ we have a scalar product $\langle v,w\rangle=v^Tw$ (here we view vectors as column vectors). This formula tells us that $$\langle v,Ax\rangle =v^TAx=x^TA^Tv=\langle x,A^Tv\rangle =\langle A^Tv,x\rangle $$ (note that we used that every 1×1 matrix is symmetric). This implies that $$\langle x, A^TA x\rangle =\langle Ax, Ax \rangle \geq 0$$ for all $x$. Moreover, this quantity is zero if and only if $Ax=0$. 
  3. So our function $f$ is of the form $f\colon \Bbb R^n\times \Omega \to \Bbb R$ where $f(x,-)\colon \Omega \to \Bbb R$ is a measurable function on some probability which is distributed normally with mean $g(x)$ and variance $\sigma^2$.

Supervised Learning

A big computer, a complex algorithm, and a long time does not equal science. – Robert Gentleman


Before getting into what supervised learning precisely is, let’s look at some examples of supervised learning tasks:

  1. Identifying breast cancer.
  2. Image classification.
    • List of last year’s ILSVRC Winners
  3. Threat assessment
  4. Language Translation
  5. Identifying faces in images


Supervised learning is concerned with the construction of machine learning algorithms $f\colon 2^{D\times T}\times D\to T$. The subsets of $D\times T$ are referred to as subsets of labeled examples.

We often assume that the labeled examples arise from restricting some presumed function $F\colon D\to T$ whose values we know on $E\subset D$. We can then train $f$ on pairs $\{s, F(s) | s\in E\}$. We can then devise a cost function which measures the distance from the learned $f$ and the presumed $F$ (e.g., $L^2$-distance).

Supervised learning has been the most successful of the three branches to real world problems. The existence of labeled examples usually leads to well-defined performance metrics that transform supervised learning tasks into two other tasks finding an appropriate parametrized class of functions to choose $f$ from and an optimization problem of finding the best function in that class.

Typically $D$ will be some subset of $\Bbb R^n$ and we will refer to the components of $\Bbb R^n$ as features, dependent variables, or attributes (many concepts in machine learning have many names). Sometimes $D$ will be discrete, in which case we refer to these as categorical variables. There are a few simple tricks to map categorical variables into $\Bbb R^n$ (such as one-hot encoding), so it usually does not hurt to think of $D$ as a subset of $\Bbb R^n$.

When $T$ is discrete (typically finite), then the supervised learning problem is called a classification problem. When it is continuous (e.g, $\Bbb R$) then it is called a regression problem.


Let us consider the following sequence of supervised learning methods in turn.

  1. Linear Regression (Regression problems).
  2. $k$-nearest Neighbors (Classification or Regression problems).
  3. Logistic Regression (Classification problems1).
  4. Naive Bayes Classifier (Classification problems).
  5. Linear Discriminant Analysis (Classification problems).
  6. Support Vector Machines (Classification problems).
  7. Decision trees (Primarily Classification problems).
  8. Neural networks (Classification or Regression problems).

  1. I know the name is confusing.