Prediction is very difficult, especially about the future. – Niels Bohr
The problem
Suppose we have a list of vectors (which we can think of as samples) $x_1, \cdots, x_m\in \Bbb R^n$ and a corresponding list of output scalars $y_1, \cdots, y_m \in \Bbb R$ (which we can regard as a vector $y\in \Bbb R^m$). We want to find a vector $\beta=(\beta_1,\cdots \beta_n)\in \Bbb R^n$ such that the sum of squared error $$ \sum_{i=1}^m (x_i \cdot \beta – y_i)^2 $$ is as small as possible. Alternatively, if we suppose that our pairs $(x_i, y_i)$ arise from a function $f(x_i)=y_i$, we want to find the closest approximation to $f$ by a linear function of the form $\widehat{f}(a)=\sum_{i=1}^n \beta_i \pi_i(a)$, where $\pi_i \colon \Bbb R^n\to \Bbb R$ is the projection to the $i$th coordinate.
Let’s arrange the input vectors $x_1,\cdots, x_m$ into an $m\times n$matrix $A$, where $x_i$ appears as the $i$th row. Then we have a matrix equation of the form $ A \beta = \widehat{y}$ and we want to find $\beta$ such that $\widehat{y}y^2$ is as small as possible^{1}.
Let’s try to use calculus. Here we are trying to minimize:
$$ SSE(\beta) = \lVert\widehat{y}y\rVert^2=(A\beta y)^T(A\beta y).$$ Differentiating with respect to $\beta$ (see Tensor Calculus) yields
\begin{align}
\frac{dSSE(\beta)}{d\beta} & = 2(A\beta y)^TA \label{eq:der}.
\end{align}
Differentiating with respect to $\beta$ again yields $2A^TA$ which will be a positive semidefinite matrix. This matrix will be positive definite if and only if $\ker A =0$^{2}.
Now $\ker A \neq 0$ if and only if one of the columns of $A$ can be expressed as a linear combination of the others. In terms of our original problem, this means that one of the components of our input vectors is in fact expressible as a linear combination of the others, so there is a hidden dependency amongst the variables.
Exercise
Perform the above calculation yourself while being careful about what are vector valued quantities.
Let’s suppose $\ker A\neq 0$ (we can arrange this anyway by dropping dependent columns). So our matrix is positive definite and we see that any critical value of $SSE(\beta)$ in \eqref{eq:der} is a minimum. So by setting \eqref{eq:der} to 0 and solving for $\beta$ (while observing that $A^TA$ is nonsingular, given our assumptions) we get:
\begin{equation}
\beta=(A^TA)^{1}A^Ty.\label{eq:solv}
\end{equation}
This value of $\beta$ is uniquely determined and hence yields the global minimum of $SSE(\beta)$.
Note that when $A$ is an invertible square matrix, then \eqref{eq:solv} reduces to $\beta = A^{1} y$, which will have $SSE(\beta)=0$; just as we should expect from linear algebra.
Implementation concerns
Using a naive implementation of Cramer’s rule to invert an $n\times n$ matrix will take $O(n!)$ time and is totally impractical for problems of interest. While using Gaussian elimination to invert an $n\times n$ matrix takes $O(n^3)$time (by being clever one can cut the runtime in half for inverting the symmetric matrix $A^TA$). The remaining multiplications require $O(n^2m)$ time and since $m\geq n$ (otherwise $\ker A\neq 0$), the dominant term will be the multiplications so the algorithm runs in $O(n^2m)$time.
We may also be concerned about errors in the data. Suppose that the “correct” value of $y$ is, in fact $y+e$. For simplicity suppose that $A$ is invertible. We might hope that if $\lVert e\rVert$ is very small relative to $\lVert y\rVert$ then the error between $A^{1}y$ and $A^{1}(y+e)$ will be small relative to $A^{1}y$.
We can calculate
\begin{align}
\frac{\lVert A^{1} e \rVert}{\lVert A^{1} y \rVert}/\frac{\lVert e\rVert}{\lVert y\rVert} & = \frac{\lVert A^{1} e\rVert}{\lVert e\rVert }\cdot \frac{\lVert y\rVert }{\lVert A^{1}y\rVert } \label{eq:cond}
\end{align}
As we try to maximize this expression over all $e$ and $y$ (so we can replace $y$ with $Ay$), we obtain the product of operator norms $\lVert A^{1}\rVert \cdot \lVert A \rVert$. While we won’t go into operator norms here, we will point out that when $A$ is diagonal, then $\lVert A\rVert$ is the maximum of the absolute values on the diagonal entries, so that \eqref{eq:cond} becomes $M/m$ where $M$ is the maximum of the absolute values of the diagonal entries and $m$ is the minimum of the absolute values of these entries. In particular, if $A$ is the 2×2 diagonal matrix with diagonal entries $10$ and $0.1$, inversion can multiply relative errors by a factor of a 100.
Typically the standard thing to do is not to invert the matrix, but rather construct a matrix decomposition from which it is easy to solve the system of equations by back substitution. Not that this would not help the numerical stability problem with the diagonal matrix.
Beyond simple linear functions
At this point we have seen how to find an approximation to a function $f(x)$ of the form $\widehat{f}(x)=\sum_{i=1}^n \beta_i \pi_i(x)$ that minimizes the sum of squared errors.
What happens if our function is just the constant function $f(x)=10$? In this case, our approximation will do rather poorly. In particular, $\widehat{f}(0)=0$, so this will never be a very good approximation about this point.
What we could do in this case is assume that our function instead has the form: $$\widehat{f}(x)=\sum_{i=1}^n \beta_i \pi_i(x)+\beta_{n+1}\cdot 1.$$ We can now solve this new approximation problem by just changing the vectors $x_i=(\pi_1(x_i),\cdots,\pi_n(x_i))$ into new $n+1$dimensional vectors $(\pi_1(x_i),\cdots,\pi_n(x_i),1)$ and applying linear regression as before. For nondegenerate collections of vectors ${x_i}$, we will end up with the desired function $\widehat{f}(x)=10$. This is sometimes called the bias trick.
Here is an example of such a linear regression being fit to more and more data points. For this example, we used python’s scikitlearn package and rendered the images using matplotlib (see the code).
More generally if we think our $f(x)$ can be well approximated by a function of the form $\widehat{f}(x)=\sum_{i=1}^N \beta_i f_i(x)$ for any fixed set of $N$ functions $\{f_i\}$, we can just construct a new sample matrix whose $i$th row is the $N$vector $(f_1(x_i),\cdots,f_N(x))$ and apply linear regression.
For example, if we are looking for a function $\widehat{f}\colon \Bbb R\to \Bbb R$ of the form $\widehat{f}(x)=\beta_1 x^2+\beta_2 x +\beta_3$, we just have to replace the samples $x_1,\cdots, x_m$ with the vectors $(x_1^2, x_1, 1), \cdots, (x_m^2, x_m, 1)$ and apply linear regression as before. This is an example of polynomial regression. For example, we have the following (see the code):
We could similarly try to find Fourier approximations for a function which we expect has periodic behavior. We could also use wavelets or splines. Clearly, this list goes on and on.
Warning
If we choose a suitably expressive class of functions to define our generalized linear regression problem, then there is a good chance that the linear regression problem will fit the data perfectly. For example, if we have any function $f\colon \Bbb R \to \Bbb R$ and we know its values on $n$ distinct points, then we can perfectly approximate this function by a polynomial of degree at least $n1$ (e.g., 2 points determine a line, 3 points determine a quadratic curve, etc.). While this will perfectly fit our given data, it becomes quite unlikely that it will match any new data points. This is a classic example of overfitting. We can see this phenomenon in the following example:
Here the degree 10 polynomial will fit any 30 points relatively well, but will drastically fail to generalize well to the entire dataset.A Maximal Likelihood Justification for using SSE
While the above exposition is elementary, one might wonder where the choice to minimize the sum of squared errors comes from.
Let us suppose that that we have a function of the form $f(x)=g(x)+\epsilon$ where $\epsilon$ is an $\Bbb R$ valued random variable^{3} $\epsilon \sim N(0, \sigma^2)$ and $g(x)$ is a deterministic function. We could mentally justify this by saying that the output variables are not recorded exactly, but there is some normally distributed measurement error or there is some additional nonlinear random factors contributing to the value of $f(x)$. Then the conditional density function of $f(x)$ is $$p(yx)=\frac{e^{(g(x) y)^2/2\sigma^2}}{\sqrt{2\pi \sigma^2}}.$$
The maximal likelihood estimate (MLE) (see here for more details) for a vector of values $y=(y_1,\cdots,y_m)$ associated to a sequence of independent samples $x=(x_1,\cdots,x_m)$ is then
\begin{align}
\end{align}
So in this case, the MLE for $y$ is the choice of $y$ which minimizes the sum of squared errors.
A frequentist description
If we suppose that our data samples $(x,y)$ are being drawn from a fixed distribution, e.g., they are the values of a vector valued random variable $X$ and a random variable $Y$. Moreover, suppose the joint pdf of $X$ and $Y$ is $p(x,y)$. Then we can regard linear regression as attempting to find
\begin{align}
\end{align}
Bayesian approach
Here we will consider a Bayesian approach to the polynomial regression problem. First we consider the case, where we approximate data coming from a degree 3 polynomial using a Gaussian model that expects a degree 3 polynomial and is trained on all 200 samples. In other words, we are assuming a conditional distribution of the form
$$p(y,x\theta)=1/\sqrt{2\pi \sigma^2}\cdot e^{\frac{(y\sum_{i=0}^3 \beta_i x^i)^2}{2\sigma^2}},$$ where $\theta$ is the vector of parameters in this model. For our prior, we suppose that the parameters are independent random variables where $\beta_i~N(0, 100)$ and $\sigma$ is given by a half normal distribution with standard deviation 1 (since this can only take nonnegative values). To calculate the posterior $p(\theta  D)$ we apply Bayes rule. Unfortunately Bayes rule has this nasty integral over the parameter space in the denominator:
$$ \int_\theta p(D\theta)p(\theta)d\theta.$$ Rather than try and calculate this integral directly, we will calculate this using Monte Carlo integration, which is sampled using an algorithm which is completely NUTS. All of this is implemented using the package PyMC3. This is overkill for this situation, but it is nice to see how the general algorithm is implemented (see here for the code).
In effect what we are doing is using a special technique to generate samples $S$ of $\theta$ according to the prior distribution $p(\theta)$. For each $\theta\in S$, we calculate $z_\theta =p(D\theta)$ and see that $$p(\thetaD)\approx \frac{z_\theta}{\sum_{\theta’ \in S} z_{\theta’}}.$$ As the number of samples goes up, the accuracy of this estimate increases.
We can look at the output here:
By taking posterior samples of the parameters we obtain the curves in grey. The mean of these samples generates the black curve in the middle. Alternatively, we can sample from $p(yx)$ (that is we average over all of the choices of parameters). Taking the mean values for each $y$ given an $x$ we obtain the red curve. We have marked a band of radius the standard deviation about these samples.
The summary statistics for our parameter values are listed here. Note that the ‘true value’ of our function is $$f(x)=
7 x^3 – 9.1 x^2 + 3.5 x – 0.392,$$ with a standard deviation of 0.3. We note that none of the true coefficients fall into the 95% credibility intervals, which is not very promising.
mean  sd  mc_error  hpd_2.5  hpd_97.5  

coeff__0  0.209607  0.080981  0.001872  0.370318  0.056474 
coeff__1  1.914993  0.704315  0.020752  0.562936  3.260617 
coeff__2  5.556357  1.651007  0.050628  8.678463  2.410120 
coeff__3  4.794343  1.099424  0.033513  2.710534  6.909714 
sigma  0.291662  0.015164  0.000186  0.263677  0.321718 
We can gain further information by studying the pdfs of the resulting variables:
Now we examine how the Bayesian method does with respect to our overfitting problem above. Here we try to fit a degree 10 polynomial to only 30 data points. We then obtain the following:
Here we see the samples for the polynomial curves again have wild variance, but our posterior predictions are actually quite reasonable. Looking at the summary statistics in the table below , we note that the model has very low confidence in almost all of the parameter values and huge credibility intervals that mostly contain the true values. By averaging over all of these parameters we reduce the variance in our predictions and obtain a much better model.
mean  sd  mc_error  hpd_2.5  hpd_97.5  

coeff__0  0.053526  0.200482  0.002303  0.469330  0.317603 
coeff__1  0.658498  1.898380  0.022927  3.246255  4.222251 
coeff__2  2.835666  5.895880  0.084508  13.723288  9.131853 
coeff__3  2.316211  8.093137  0.126718  13.588722  18.071517 
coeff__4  1.342521  8.552999  0.104256  15.235386  18.123408 
coeff__5  0.630845  8.596230  0.099357  16.968628  16.577238 
coeff__6  1.251081  8.598991  0.103051  17.388627  15.975167 
coeff__7  0.712717  8.881522  0.102497  18.123559  16.398549 
coeff__8  0.373434  8.985303  0.104398  16.930835  18.813871 
coeff__9  0.753457  8.428446  0.094453  16.404371  16.786885 
coeff__10  0.835288  7.222892  0.089429  13.213493  15.239204 
sigma  0.276957  0.041169  0.000551  0.205731  0.360664 
Exercise
How did our choice of prior bias the predictions of the coefficients? What would be some other reasonable choices of prior?
 Note that minimizing the sum is equivalent to minimizing the average. ↩
 Recall that on $\Bbb R^m$ we have a scalar product $\langle v,w\rangle=v^Tw$ (here we view vectors as column vectors). This formula tells us that $$\langle v,Ax\rangle =v^TAx=x^TA^Tv=\langle x,A^Tv\rangle =\langle A^Tv,x\rangle $$ (note that we used that every 1×1 matrix is symmetric). This implies that $$\langle x, A^TA x\rangle =\langle Ax, Ax \rangle \geq 0$$ for all $x$. Moreover, this quantity is zero if and only if $Ax=0$. ↩

So our function $f$ is of the form $f\colon \Bbb R^n\times \Omega \to \Bbb R$ where $f(x,)\colon \Omega \to \Bbb R$ is a measurable function on some probability which is distributed normally with mean $g(x)$ and variance $\sigma^2$.
↩