Complete the lesson “Introduction to Machine Learning”.

This should have also included “Exploratory Data Analysis”. This has been added to the next week’s assignment.

MLE for the uniform distribution.

(Source: Kaelbling/Murphy) Consider a uniform distribution centered on 0 with width 2a. The density function is given by: $$ p(x) = \frac{\chi_{[-a,a]}}{2a}.$${
a. Given a data set $x_1,\cdots, x_n,$ what is the maximum likelihood estimate $a_{MLE}$ of $a$?
b. What probability would the model assign to a new data point $x_{n+1}$ using $a_{MLE}$?
c. Do you see any problem with the above approach? Briefly suggest (in words) a better approach.

Calculate the expected value and mode of $\theta$ when $\theta \sim \textrm{Beta}(\alpha, \beta)$,

Change of variables:

Let $X\colon S\to T_1$ be a discrete valued random variable with pmf $p_X\colon T_1\to [0,1]$ and let $Y\colon T_1\to T_2$ be a function. Derive the pmf $p_{Y\circ X}\colon T_2\to [0,1]$ in terms of $p_X$ and $Y$.

Let $X^n\colon S^{\times n}\to \{0,1\}^{\times n}$ be the random variable whose values give $n$-independent samples of a Bernoulli random variable $X$ with parameter $\theta$ (i.e., $p_X(1)=\theta$). Show that $$p_{X^n}(v_1,\cdots, v_n)=\theta^{\sum{v_i}}(1-\theta)^{n-\sum v_i}.$$ Now let $\sigma \colon \{0,1\}^{\times n}\to \{0,…n\}$ be defined by taking the sum of the entries. The composite $\sigma\circ X^{n}$ is called a Binomial random variable with parameters $n$ and $\theta.$ Determine $p_{\sigma\circ X^n}(k)$.

Let $X\colon S\to \Bbb R$ be a random variable with piecewise continuous pdf $p_X$ and let $Y\colon \Bbb R\to \Bbb R$ be a differentiable monotonic function. Show that $Y\circ X$ is a random variable and determine $p_{Y\circ X}$.

Uninformative prior for log-odds ratio:

(Source: Murphy) Let $$ \phi = \textrm{logit}(\theta) = \log \frac{\theta}{1-\theta}.$$ Show that if $p(\phi)\propto 1,$ then $p(\theta)\propto \textrm{Beta}(0,0)$.

R Lab:

Construct and apply a Naive Bayes classifier for a specific text classification problem (e.g., spooky author identification) from scratch. In other words, do not use any modeling libraries. Feel free to use any libraries you like to get the data into an acceptable format.

Probabilities are sensitive to the form of the question that was used to generate the answer:

(Source: Minka, Murphy.) My neighbor has two children. Assuming that the gender of a child is like a coin flip, it is most likely, a priori, that my neighbor has one boy and one girl, with probability 1/2. The other possibilities—two boys or two girls—have probabilities 1/4 and 1/4.
a. Suppose I ask him whether he has any boys, and he says yes. What is the probability that one child is a girl?
b. Suppose instead that I happen to see one of his children run by, and it is a boy. What is the probability that the other child is a girl?

Legal reasoning

(Source: Peter Lee, Murphy) Suppose a crime has been committed. Blood is found at the scene for which there is no innocent explanation. It is of a type which is present in 1% of the population.
a. The prosecutor claims: “There is a 1% chance that the defendant would have the crime blood type if he were innocent. Thus there is a 99% chance that he guilty”. This is known as the prosecutor’s fallacy. What is wrong with this argument?
b. The defender claims: “The crime occurred in a city of 800,000 people. The blood type would be found in approximately 8000 people. The evidence has provided a probability of just 1 in 8000 that the defendant is guilty, and thus has no relevance.” This is known as the defender’s fallacy. What is wrong with this argument?

Bayes rule for medical diagnosis

(Source: Koller, Murphy.) After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don’t have the disease). The good news is that this is a rare disease, striking only one in 10,000 people.
a. What are the chances that you actually have the disease? (Show your calculations as well as giving the final result.)

Conditional independence (Source: Koller.)

Let $H\in {1,\cdots, K}$ be a discrete random variable, and let $e_1$ and $e_2$ the observed values of two other random variables $E_1$ and $E_2$. Suppose we wish to calculate the vector
$$P(H|e_1, e_2) = (P(H=1|e_1,e_2),\cdots, P(H=K|e_1, e_2)).$$
a. Which of the following sets of numbers are sufficient for the calculation?
i. $P(e_1, e_2), P(H), P(e_1| H), P(e_2, H)$.
ii. $P(e_1, e_2), P(H), P(e_1, e_2 | H)$.
iii. $P(e_1|H), P(e_2|H), P(H)$.

b. Now suppose we now assume $E_1\perp E_2 | H$ (i.e., $E_1$ and $E_2$ are independent given $H$). Which of the above 3 sets are sufficient now?

R lab

Estimate the value of $\pi$ by taking uniform random samples from the square $[-1,1]\times [1,1]$ and seeing which lie in the disc $x^2+y^2\leq 1$.

A company is trying to determine why their employees leave and why they stay. They have a list of roughly 15000 employee records here.
a. Download this dataset and load it in R (this may require setting up a Kaggle account if you don’t already have one).
b. Examine the dataset and see if you need to transform any of the features=columns (e.g., are there factors that were not recognized as such, is there missing data?).
c. Randomly shuffle the rows and cut the dataset into two pieces with 10000 entries in a data frame called train and the remaining entries in a data frame called valid.
d. Study the train data frame and see if you can find any features that predict whether or not an employee will leave.
e. Make a hypothesis about how you can predict whether an employee will leave by studying the train data.
f. Once you have fixed this hypothesis evaluate how well your criteria work on the valid data frame.
g. Justify your proposal with data and charts. Save at least one of these charts to a pdf file to share with management.

I will assume that you have seen some calculus, including multivariable calculus. That is you know how to differentiate a differentiable function $f\colon \Bbb R \to \Bbb R$, to obtain a new function $$\frac{\partial f}{\partial x} \colon \Bbb R \to \Bbb R.$$ You also know how to differentiate a multivariable function $f\colon \Bbb R^m \to \Bbb R^n$, to obtain a map $$Jf\colon \Bbb R^m \to \mathrm{Hom}(\Bbb R^m, \Bbb R^n),$$ which takes a point $x\in \Bbb R^m$ to the Jacobian at $x$, which is an $(m,n)$-matrix; alternatively you can think of the Jacobian as a matrix of functions with the $(i,j)$-entry being $\frac{\partial f_i}{\partial x_j}\colon \Bbb R^m \to \Bbb R.$

The Jacobian is a convenient form for packaging the derivative of a vector valued function with respect to a list of variables arranged into a vector. Many of us learn the basic results (the product rule, the sum rule, the chain rule) very well for single variable calculus and, hopefully, pretty well for multivariable calculus. We may have less experience using these compact forms in practice.

While this knowledge is very useful for considering vector valued functions indexed by vectors of variables, it can be a bit tricky to differentiate matrix valued functions with respect to matrices or vectors of variables and get a useful compact form.

Potential objection

A bright student might observe that an element of a matrix is an element of $\Bbb R^m\otimes \Bbb R^n\cong \Bbb R^{mn}$ and hence we can always regard an $(m,n)$-matrix as an $mn$-vector^{1}. This is true, but we usually organize data into matrices because that is the most natural structure for them to have, e.g., the Jacobian above. Here are some similar naturally occurring structures:

We have a table with 100 entries. On each row is data from some sample, say temperature and pressure. This naturally fits into a $(100,2)$-matrix, whose rows are the samples and whose columns are the respective measurements. So we can recover each datum, by using two numbers: the row index and the column index.

We have a list of 100 grayscale images 640×480 images. To recover a datum (the darkness of a pixel in a sample) in this table, we would use three integers:

the image number,

the $x$-coordinate, and

the $y$-coordinate.
This is encoded as a tensor of shape $(100,640,480)$, i.e., an element of $\Bbb R^{100}\otimes \Bbb R^{640} \otimes \Bbb R^{480}$. While we could encode this as a $307200000$-vector, it would make it much more difficult to understand and meaningfully manipulate.

We have a list of 100 color (i.e., RGB) images 640×480 images. To recover a datum (the strength of a particular color (red/green/blue) of a pixel in a sample) in this table, we would use four integers:

the image number,

the $x$-coordinate,

the $y$-coordinate, and

which color we are interested in (e.g., red = 0, green = 1, blue = 2).
This is encoded as a tensor of shape $(100,640,480,3)$, i.e., an element of $\Bbb R^{100}\otimes \Bbb R^{640} \otimes \Bbb R^{480}\otimes \Bbb R^3$.

We have a list of 100 color (i.e., RGB) videos each consisting of 18000 frames of 640×480 images. To recover a datum (the strength of a particular color (red/green/blue) of a pixel in a sample at a particular frame in the table) in this table, we would use five integers:

the image number,

the frame number,

the $x$-coordinate,

the $y$-coordinate, and

which color we are interested in (e.g., red = 0, green = 1, blue = 2).
This is encoded as a tensor of shape $(100,18000,640,480,3)$, i.e., an element of $\Bbb R^{100}\otimes \Bbb R^{18000}\otimes \Bbb R^{640} \otimes \Bbb R^{480}\otimes \Bbb R^3$. Again, it would be very unnatural to work with this as a single vector.

Definition

Suppose we have a function $F$ taking values in tensors of shape $(n_1,\cdots, n_\ell)$ which depends on variables indexed by a tensor $X$ of shape $(m_1,\cdots, m_k)$. The derivative $\frac{\partial F}{\partial X}$ is a function from $\Bbb R^{m_1}\otimes \cdots \otimes \Bbb R^{m_k}$ to tensors of shape $(n_1,\cdots,n_\ell,m_1,\cdots,m_k)$ whose $(i_1,\cdots, i_\ell, j_1, \cdots, j_k)$-entry is $\frac{\partial F_{i_1,\cdots, i_\ell}}{\partial X_{j_1,\cdots, j_k}}(-)$.

Some scandalous abuses of notation

We will typically abuse notation and let $n$-vectors also denote column vectors, i.e., tensors of shape $(n,1)$.

For convenience, many authors (including this one), will sometimes take tensors of the form $(\cdots, 1, \cdots)$ and regard them as tensors of the form $(\cdots,\cdots)$ (i.e., we omit the 1 term by ‘squeezing the tensor’. This abuse of notation is both convenient and confusing. Avoiding this abuse tends to cause ones to accumulate in the shape of our tensors and these one’s do not add any value to our interpretation. The best way to follow what is going on is to check it for yourself.

A less scandalous abuse of notation is to regularly add transposes to tensors of $(1,1)$ when convenient.

Some results

Since the only way to get used to tensor differentiation is to do some work with it yourself, we leave the following results as exercises.

Exercise

Suppose that $f\colon \Bbb R^m\to \Bbb R^n$ is a smooth function depending on an $m$-vector $X$ of variables. Show that $\frac{\partial f}{\partial X}=Jf$.

Suppose that $f\colon \Bbb R^m\to \Bbb R^n$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=Ax$ for an $(n,m)$-matrix $A$. Show that $\frac{\partial f}{\partial X}=A$ (a constant function).

Suppose that $f\colon \Bbb R^m\to \Bbb R^n$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=x^TA$ for an $(m,n)$-matrix $A$. Show that $\frac{\partial f}{\partial X}$ is a function valued in $(1,n,m)$-tensors whose $(1,i,j)$ entry is $(A^T)_{i,j}$.

Suppose that $f\colon \Bbb R^m\to \Bbb R$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=x^TA x$ for an $(m,m)$-matrix $A$. Show that $\frac{\partial
f}{\partial X}$ is a function valued in $(1,m)$-tensors, whose $(1,i)$-entry is $((A+A^T)x)_i$.

Suppose that $f\colon \Bbb R^m\to \Bbb R$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=(Ax-y)^T(Ax-y)$ for an $(m,m)$-matrix $A$ and a constant $(m,1)$-matrix $y$. Show that $\frac{\partial
f}{\partial X}$ is a function valued in $(1,m)$-tensors, whose entries give $2(Ax-y)^TA.$

Suppose that $f\colon \Bbb R^m\to \Bbb R^m$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=2(Ax-y)^TA$ for an $(m,m)$-matrix $A$ and a constant $(m,1)$-matrix $y$. Here we regard the function as taking values in $(m)$-tensors. Show that $\frac{\partial f}{\partial X}$ is the $(m,m)$-tensor $2A^TA$.
f}{\partial X}$ is a function valued in $(1,m)$-tensors, whose entries give $2(Ax-y)^TA.$

To simplify our notation, $\otimes := \otimes_{\Bbb R}$. ↩

This is to be completed by November 27th, 2017. (THIS IS A TYPO: This should read October 26th, 2017). It is okay if you finish by this ridiculous first due date.

Forewarning

For several of these exercises, you will be asked to install software and/or setup accounts from outside websites. These websites may try to convince you to purchase some of their products. You do not need to purchase anything to do these exercises. All of the required materials have been made freely available to us.

Exercises

Learn the most important skill for this course

Learn how to use Google to find an answer to almost any technical problem (e.g., ‘How do I install — for Windows?’ or ‘How do I do — in R/Python?’ or ‘I’m getting a weird error message, how do I fix it?’). If you have not learned how to do this yet, I highly recommend it. It will open up your world to a range of new possibilities.

Setup a Datacamp account and join the course group by using the email invitation. Note you should not have to pay any money to do this. I do not believe you will need to provide any bank information for this purpose.

Try out some of the lessons designated for the course.

During the course you should keep up with the assigned lessons.

Follow this tutorial to install RStudio. RStudio is an interactive development environment for the R programming language. Which will make it easier to complete the R exercises. Again, you should not have to pay anything for this.

Install the ISLR package from CRAN. While you’re at it you may as well install caret, dplyr, tidyr, and ggplot2.

Python (Optional, but recommended for those interested in learning Python^{1})

Install Anaconda using Python 3.x (Python 2.7 is outdated, but still often used). As I remember it, this requires you to setup an account, but you will not have to pay anything. They will send you emails which you will have to unsubscribe from. I believe this is a small price to pay for a system that manages your Python installation and includes many of the packages used in data science.

Additional material

Glance at some of the additional source material for the course.

Since we will be using it for R labs, download the ISLR book (or go and buy it if you like it!).

R lab

Complete exercise 8 from Chapter 2 of ISLR. In this exercise you will be exploring data gathered from universities and colleges in the United States.

In my humble opinion, Python is a truly wonderful language with an incredible number of highly useful libraries. Once you know Python, it is a surprisingly short road to solving many typical computer tasks. It is typically much slower than other, or more grown-up, languages such as C/C++ or Java, but the speed-up in development time makes up for it for the kind of tasks we will be considering. Moreover, the syntax is very natural and easy to learn. (Drops the microphone) ↩

…the statistician knows…that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world. – George Box (JASA, 1976, Vol. 71, 791-799)

Parameter estimation

Suppose that we are given some data $D=(x_1,\cdots, x_n)\in T^n$. Our task is to find a parametric probability distribution $m(\hat{\theta})$ on $T$ such that if we were to independently sample from it $n$-times, we would could reasonably^{1} expect to obtain $D$. We first make some choice^{2} of possible distributions $m(-)$, which leaves us with the task of identifying the $\hat{\theta}$.

Likelihood approach

One approach is the maximal likelihood estimate or MLE. For this we set $$\hat{\theta}=\theta_{MLE} = \mathrm{argmax}_\theta p(D | \theta).$$ That is we choose $\theta$ so that the probability density of $D$ from the given model is maximized.

There are a few comments that we should make here:

There may not be a maximum for a general distribution. Indeed the models typically considered for logistic regression do not have a maximum.

If there is a maximum, it may not be unique.

This is an optimization problem and when this problem is not convex, we may not have any method that is guaranteed to identify find a maximum even when it exists.

We do not actually address these problems, instead we just say that optimization techniques will give us a parameter value that is “good enough”.

Example

Suppose that $D=(x_1,\cdots, x_n)\in \Bbb R^n$ and we want to find a normal distribution $N(\mu, \sigma^2)$ which models the data under the assumption that these were independent samples.

First we construct the MLE of $\mu$:

\begin{align}

\end{align}

We differentiate the last expression with respect to $\mu$ and set this to 0 and obtain
\begin{align}
-\sum_{i=1}^n (x_i -\mu)&=0 \\
\mu_{MLE} &= \frac{\sum_{i=1}^n x_i}{n}.
\end{align}
Taking second derivatives shows that this is in fact a minimum. In other words, the MLE of $\mu$ is the sample mean.

Exercise

Show that MLE of $\sigma^2$ is the (biased) sample variance. In other words, $$\sigma_{MLE}^2 = \frac{\sum_{i=1}^n (x_i-\mu_{MLE})^2}{n}.$$

Frequentist evaluation

The frequentist approach gives us a method to evaluate the above estimates of the parameters. Suppose that our data is drawn from a true distribution $m(\hat{\theta})$ and set $\overline{\theta} = E(\theta_{MLE}(D))$, where the expectation is given by integrating over $D$ given the “true” model. Define the bias of $\theta_{MLE}(-)$ by $$\mathrm{bias}(\theta_{MLE})=\overline{\theta}-\hat{\theta}.$$

Let $\mu_S(D)$ be the sample mean of data drawn from some distribution $m$ with finite mean $\widehat{\mu}$. Then we see
\begin{align}
E(\mu_S) & = E\left(\frac{\sum_{i=1}^n X_i}{n}\right) \\
& = \frac{\sum_{i=1}^n E(X_i)}{n} \\
& = n\widehat{\mu}/n = \widehat{\mu}
\end{align}
so this is an unbiased estimate.

For a visualization of an unbiased point estimate from the frequentist point of view look here.

Exercise

Show that the expected value of the sample variance $\sigma^2_S$ of data drawn from some distribution $m$ with finite mean $\widehat{\mu}$ and finite variance $\widehat{\sigma^2}$ is $$ E(\sigma^2_S)= \frac{n \widehat{\sigma^2}}{n-1}.$$
Show that it follows that the bias of $\sigma^2_S$ is $\widehat{\sigma^2}/(n-1)$.

Show that the unbiased sample variance $\sigma^2_U(X_1,\cdots,X_n) = \frac{\sum_{i=1}^n(X_i-\mu_S(X_1,\cdots X_n))^2}{n-1}$ is, in fact, an unbiased estimate.

Construct a symmetric 95% confidence interval (see here for some additional discussion) for our estimate $\mu_S$ of $\widehat{\mu}$ which is centered about $\mu_S$, when $n=10$, $\mu_S=15$ and $\sigma^2_U = 2$. Hint: the random variable $\frac{\mu_S-\widehat{\mu}}{(\sigma_U /\sqrt{n})}$ has the same distribution as the Student’s $t$-distribution whose inverse cdf values can be calculated by statistics packages.

Bias-Variance tradeoff

Suppose that we have an estimator $\theta=\delta(D)$ of $\hat{\theta}$ and the expected value of the estimate is $E(\delta(D))=\overline{\theta}$. We obtain the following expression for our expected squared error
\begin{align}
E((\theta-\hat{\theta})^2) &= E(\left((\theta-\overline{\theta})-(\hat{\theta}-\overline{\theta})\right)^2) \\
&= E((\theta-\overline{\theta})^2)-E(\theta-\overline{\theta})\cdot 2(\hat{\theta}-\overline{\theta}) + (\hat{\theta}-\overline{\theta})^2 \\
&= \mathrm{var}(\theta)+\mathrm{bias}^2(\theta)
\end{align}

This equation indicates that while it may be nice to have an unbiased estimator (i.e., one that will, on average, give precisely the correct parameter), if this comes with the cost of creating an estimate that varies wildly with the data then we will still expect a large amount of error and indicates why we would like estimators with low-variance.

A typical example of this is seen in polynomial regression (here we are viewing the problem as trying to estimate $p(y|x)$). As we vary the degrees of the polynomial being used, we find that very large degree polynomials can fit the data well:

However, the fit for high degree polynomials varies highly depending on the choice of data points:

The MAP Estimate

One Bayesian approach to parameter estimation is called the MAP estimate or maximum a posteriori estimate. Here we are given data $D$, which we want to say is modeled by a distribution $m(\theta)$ and we construct the MAP estimate of $\theta$ as
\begin{equation}

\end{equation}
In other words, we choose the mode of the posterior distribution. Note that if our prior $p(\theta)$ is constant independent of $\theta$, then the only part of the right hand side that depends on $\theta$ is $p(D|\theta)$ and the MAP estimate of $\theta$ is the same as the MLE of $\theta$.

Example

Suppose that we have a coin whose bias $\theta$ we would like to determine (so $\theta$ is the probability of the coin coming up heads). We flip the coin twice and obtain some data $D=(H,H)$. Under the assumption that the coin flips are independent we are trying to model a Bernoulli distribution. Now we calculate the conditional probability:
\begin{equation}
p(D|\theta) = \theta^2(1-\theta)^0
\end{equation}
and quickly see the MLE for $\theta$ (given that $\theta \in [0,1]$) is what we would expect: $\theta = 1,$ as mentioned in the hypothesis testing example of an overconfident estimator.

As already mentioned, the MLE and MAP estimate agree in the case of the uniform prior, but the Bayesian approach allows us to calculate $p(\theta | D)$, which gives us a measure of our confidence in this estimate. In this case we see:
\begin{align}
p(\theta | D) &= \frac{p(D|\theta)p(\theta)}{\int_{\theta’} p(D|\theta’)p(\theta’)d\theta’} \\
&= \frac{\theta^2}{\int_{\theta’} (\theta’)^2d\theta’} \\
&= \frac{\theta^2}{1/3} \\
&= 3\theta^2.
\end{align}

If we want to find a 99.9% credible interval for the estimate $\theta_*\in [a,1]$, then we just integrate
\begin{align}
P(\theta \in [a,1] | D) &=\int_{\theta =a}^1 p(\theta | D)d\theta \\
&=\int_{\theta =a}^1 3\theta^2 d\theta \\
&=1-a^3.
\end{align}
Setting this to be equal to $1-10^{-3}$ we see that $a=0.1$. This interval is much more conservative than the confidence interval obtained by frequentist methods.

Now what if we do not have a uniform prior. Instead we suppose we have some prior information^{3} that makes us believe that the most likely value of $\theta$ is actually 0.25 and that it has variance 0.05. How do we incorporate this information into a prior? Well the mode and the variance are not sufficient to determine a distribution over the unit interval, so let’s assume that $p(\theta)$ has some convenient form that fits these values.

It would simplify matters if $p(\theta)$ had the same form as $p(D|\theta)=\theta^{|H|}(1-\theta)^{|T|}$. So, supposing $p(\theta)=C\theta^a(1-\theta)^b$ for some constant $C$ (i.e., $p(\theta)\propto \theta^a (1-\theta)^b$), we would see that
\begin{align}
p(\theta | D) &\propto p(D|\theta)p(\theta) \\
&\propto \theta^{|H|}(1-\theta)^{|T|} \theta^a(1-\theta)^b\\
& = \theta^{|H|+a}(1-\theta^{|T|+b}).
\end{align}
This has the same form as the prior! Such a prior is called a conjugate prior to the given likelihood function. Having such a prior is super convenient for computation.

In this case, the conjugate prior to the Bernoulli distribution is the Beta distribution which has the form $p(\theta | \alpha, \beta)\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$. One can calculate the mode and variance of this distribution (or just look it up) and get
\begin{align}
\textrm{mode} &= \frac{\alpha -1}{\alpha+\beta-2} \\
\sigma^2 &= \frac{\alpha \beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
\end{align}
Plugging this into Wolfram Alpha gives us some approximate parameter values: $\alpha \approx 1.4$ and $\beta \approx 2.3$.

Using this prior, the new posterior is
$$ p(\theta | {HH})\propto \theta^{2.4}(1-\theta)^{1.3}.$$ We can form the MAP estimate by taking the $\log$ of this expression (which is fine away from the points $\theta = 0, 1$, which we know can not yield the maximum), differentiating with respect to $\theta$, setting this to 0, and finally checking that the second derivative at this point is positive.

Exercise

Show the MAP estimate for this problem is $2.4/3.7\approx 0.65$ by calculating the mode of the $\beta$ density function.

We see that our estimate of $\theta$ shifts drastically as we obtain (even just a little bit) more information. Now lets see how the posterior changes as we see even longer strings of heads:

From the image we see that as posterior gradually becomes highly confident that the the true value of $\theta$ is very large and assigns very small probability densities to small values of $\theta.$

To see how much our choice of prior affects the posterior in the presence of the same data, we can look at the analogous chart starting from a uniform prior, i.e., when the $\beta$ parameters are $\alpha=\beta=1$.

Or we can look at a side by side comparison:

From looking at the graphs, we can see that while the two priors are very different (e.g., the non-uniform prior assigns very small densities to large values of $\theta$), the posteriors rapidly become close to each other. How the posterior depends on the prior[/caption]Since they never agree, our methods still depend on our choice of prior, but we can also see that the both methods are approaching the same estimate.

To finally drive this point home, let’s consider a severely biased prior $\beta(1.01,10.0)$, whose mode puts the probability of heads at approximately 0 and with near 0 variance in the binomial model. Then we can see how the posterior changes with additional evidence:

The severely biased prior is much slower to approach the same consensus as the other priors.

Exercise

The severely biased prior is very close to what we would obtain as a posterior from a uniform prior after witnessing 9 tails in a row. If we were to witness 9 tails in a row followed by 20 heads in a row, how would you change your modeling assumptions to better reflect the data?

Further point estimates

As discussed above, one way to estimate a parameter is the MAP estimate which is just the mode of the posterior distribution $p(\theta | D)$. This is a very common approach to parameter estimation because it transforms the task into an optimization problem and we have a number of tools for such tasks.

However, the mode of a distribution can be far from a typical point. The mode of a distribution is also not invariant under reparametrization. One alternative is to select the mean or expected value of $\theta$ from the posterior distribution. Another alternative is to take the median of the posterior distribution.

I know this is not a very precise statement, but I want to be flexible at this point. ↩

How we make this choice is the model selection problem which will be discussed later. ↩

Acceptance without proof is the fundamental characteristic of Western religion, rejection without proof is the fundamental characteristic of Western science. – Gary Zukav, “The Dancing Wu Li Masters”

Hypothesis Testing

Now we consider Hypothesis Testing in an example. While Bayesians also have a form of hypothesis testing, the term is almost always used for the frequentist approach. We will describe this first in an example here.

(Frequentist) Hypothesis Testing

Suppose we produce a cancer treatment that we believe will increase the chance of a patient living for another year compared to standard treatment. We run a controlled experiment where we divide up a group of patients into two groups, the control group and the test group (one has to be extremely careful about how this is done in order to get meaningful results). The control group receives standard treatment and the test group receives the new treatment. We follow the patients for a year and record the outcomes as $D$. We would like to show our hypothesis $H_1$, that our new treatment changes outcomes, holds as opposed to the null hypothesis $H_0$ that the treatment has no effect. Associated to $H_0$ is a statistical model from which we could provide predictions about patient outcomes. The hypothesis $H_1$ is less specific, it really hypothesizes that the data is given by any other model besides $H_0$.

Clearly, what we want to show^{1} is that
\begin{equation}
p(H_1 | D) > p(H_0 | D), \label{eq:b-hyp}
\end{equation}

that is given the data from our experiment, it is more likely that our treatment improves outcomes than it does nothing. However, \eqref{eq:b-hyp} is meaningless from the frequentist perspective: in this approach there is only one true model giving rise to the data, so either $H_1$ holds or it doesn’t (in which case $H_0$ holds). Since $p(H_i)$ is either 0 or 1 and we can’t know which ahead of time, the frequentist instead considers $p(D | H_0)$ which, since $H_0$ defines a specific model, can be calculated. So the experimenter, calculates the probability $P(D\in S| H_0)$, where $S$ is some set of (often extremal) values of data containing the observed data $D$. If this probability is less than some specific small $\alpha\in (0,1)$, then they conclude that the null hypothesis is unlikely to hold and reject the null hypothesis at the significance level $\alpha$ and conclude that $H_1$ holds.

This approach is essentially a probabilistic proof by contradiction. The conditional probability $P(D\in S | H_0)$ is only defined if $P(H_0)>0$, which, in the frequentist perspective, is equivalent to supposing $P(H_0)=1$. A hypothesis test that rejects the null hypothesis shows that the this assumption leads to an outcome which we deem to be so unlikely that we are justified in excluding it. If instead of testing the null hypothesis we tested the alternative hypothesis and somehow calculated $P(D\in S | H_1)$, then we would be implicitly assuming the desired conclusion that $P(H_1)=1$, which makes the argument circular. This makes it impossible in this framework to directly establish that any one specific model is the correct hypothesis, we can only argue “by contradiction” and eliminate individual hypotheses.

In many areas of social science and science, the acceptable $\alpha$ for a publishable result is $\alpha = 0.05$. Let’s suppose that $P(D\in S|H_0)=0.049999$. The claim is that if the treatment had no effect, then we would only witness patient outcomes like the above about once in every 20 experiments, while if the treatment had any effect whatsoever we would see outcomes like the above about 19 times out of every 20 experiments.

Allow me to rant a little bit about what I dislike about this approach:

This approach says nothing about whether or not the treatment improves outcomes, just how it compares to not doing the treatment.

It does not say anything about the size of the effect (one could argue that if we include the power of a test, then we can quantify the effect size; I will discuss this below). For example, smoking, some headache medicines, and stress have been shown to increase the risk of birth defects for pregnant women and hence doctors typically advise pregnant women to avoid all of these things^{2}. This advice is not weighted according to the degree of risk, partly because these methods say nothing about the degree of risk. So, if a pregnant woman is stressed because she can’t have a cigarette and she has a headache what should she do?

Since the academic culture is almost exclusively interested in positive results, essentially only the studies that reject the null hypothesis are published. This leads to some additional problems:

Since many experiments do not lead to positive outcomes, there is a good chance that different academic groups are repeating the same (or similar) experiments. Even if the null hypothesis is true, the probability that one of these groups will generate an unlikely dataset that leads to a statistically significant result increases with the number of experiments^{3}. This could be avoided by using a much smaller value of $\alpha$. Particle physicists, for example, have used $\alpha \approx 0.0000003$. Of course, particle physicists deal with an enormous amount of data, so reaching such significance levels is possible for them.

More likely, one experiment leads to data that can be sliced in many ways and one of those slices will lead to a statically significant result; this is called $p$-hacking. Shrinking the acceptable $\alpha$ only makes the $p$-hacking more difficult, but not impossible. One can find a thorough discussion here.

Rejecting the null hypothesis when it is true is called a Type I error. Accepting the null hypothesis when the alternative hypothesis holds is called a Type II error. These names are rather unhelpful and I remember them using the mnemonic “Type I errors are the number one most common type of error in practice”, which I mentally justify using reasons above.

Recall that we reject the null hypothesis at significance level $\alpha$ if the observed data does not lie in an interval $I_\alpha$. Here $I_\alpha$ is determined by $P(D\in I_{\alpha} | H_0) =1-\alpha$. So the probability of incorrectly accepting the null hypothesis, when the alternative hypothesis holds, is
\begin{equation}
P(D\in I_{\alpha} | H_1).\label{eq:power}
\end{equation}
The probability $P(D\not\in I_{\alpha} | H_1)$ is called the power of the test.

As far as I can tell, the alternative hypothesis must be the negation of the null hypothesis for hypothesis testing to be meaningful. So the alternative hypothesis needs to be more vague than the null hypothesis and I can not see any way to calculate \eqref{eq:power} in practice. For example, how do we calculate the probability of seeing data in a given interval if we assume the data is produced by any other model besides that using the null hypothesis?

However, we can calculate \eqref{eq:power}, when $H_1$ is instead some $H(\theta)$. Here $H(\theta)$ is a specific hypothesis other than the null hypothesis. While this information could be helpful (the resulting probability function of $\theta$ in \eqref{eq:power} is called the power function), it is not the probability of committing a type II error and, because of frequentist assumptions, only makes sense when $H(\theta)$ is the “true” hypothesis generating the data^{4}. Looking around the internet, there seems to be a great amount of confusion about this.

Since, from a frequentist perspective, $P(D \in I_{\alpha} | H(\theta))$ is only defined when $P(H(\theta))=1$, for all of the possible values of $\theta$ this expression is only well-defined for one specific value of $\theta.$ Since frequentists often consider this expression for the value of $\theta$ estimated from the data, they are making an implicit assumption that their estimate was a perfect match to the one true parameter. A Bayesian (or perhaps anyone) would suggest that this is a rather extreme assumption.

To help understand my skepticism of the use of the power function, let me consider a mathematical example. One of the great open problems is whether or not two objects $P$ and $NP$ are equal. I won’t get into the details of what this means, but I will say that the encryption schemes that we implicitly use for computer security rely on the fact that these two classes are not equal. If they were, there could be relatively efficient algorithms for breaking encryption algorithms. The world, whether they know it or not, has essentially been operating under the assumption that $P\neq NP$.

Now, suppose that someone writes a paper and shows that if $P=NP$, then something wonderful, which we will call $W$, happens. While this would be interesting, we can conclude nothing about whether or not $W$ holds because we do not know that $P=NP$ (in fact, we expect that this assumption is false). Moreover, this paper would open up the possibility that $P=NP$ if and only if $W$ happens. In this case, many would view this result as significant evidence against $W$ happening. To connect to the power function, the calculation of the power function is dependent on an assumption that many would not find very credible and hence they would not have a high degree of confidence that the power function is well-defined under frequentist assumptions^{5}.

A beautiful visualization of inference hypothesis testing is available here.

Bayesian hypothesis testing

From the Bayesian perspective, \eqref{eq:b-hyp} is a meaningful statement that can be checked. Moreover, Bayesians are not forced to only consider the alternative hypothesis $H_1$ that anything other than the null hypothesis holds. They can identify specific hypotheses to explain the data and check the probability of them holding. As described above, the only way to make sense of this is to assign non-zero prior probabilities $p(H)$ for each possible hypothesis being considered. Bayesians do not have qualms about this and make a principled choice for these prior possibilities. This leads to significantly more refined statements (which are necessarily accompanied by clear assumptions) which define models for the data distribution. These models provide estimates of the size of an effect and can be tested for accuracy on future data.

The length of the list of things that pregnant women are supposed to avoid is truly incredible. ↩

When one group keeps repeating similar experiments to obtain the desired results, I would classify this as academic misconduct. However, private companies typically have a vested interest in establishing one outcome and can fund many different groups to do similar studies and only release the ones that support the private company’s preferred outcome. This is a very good reason to look at privately funded studies with skepticism. ↩

An extremely poor practice is calculating the power function for the hypothesis that best fits the observed data and stating that this is the probability of type II error for the given test. This is clearly absurd from first principles. ↩

Granted, this conclusion is based on a Bayesian perspective, but I am simultaneously arguing that many of us implicitly have this perspective. Even mathematicians working under the assumption that a conjecture is either true or false, will often assign a degree of belief to one statement or the other. ↩

All models are wrong, but some are useful. – George Box

Introduction

The general setup for statistical inference is that we are given some data $D$ which we assume arise as the values of a random variable that we assume is distributed according to some parametric model $m(\theta)$. The goal is to estimate $\theta$ in this circumstance and to give some measure of the certainty of our estimate. There are two schools of thought on this problem. One can assume that the true parameter value $\hat{\theta}$ is fixed and that we can evaluate the correctness of an estimate by running repeated trials (i.e., sampling $D$ from $m(\hat{\theta})$), this is the frequentist school. The other, Bayesian, school assumes that the data is given and that $\theta$ is actually a random variable.

For more in depth coverage of these approaches please consult:

We are trying to determine the percentage $\theta$ of likely-voters who will vote for a particular candidate in an election. We gather data by polling a sample from the population.

We are trying to determine the average standardized test scores $\theta$ of students taught with a new method. We gather data by teaching a test group of students and calculating the average scores of the students.

We are trying to predict the probability $\theta$ of rain tomorrow based on the meteorological data we have gathered.

Is there a “true” value of $\theta$? Or is the proposed $\theta$ we are trying to estimate subject to many additional factors which are difficult or impossible to specify? Is it possible to run repeated tests and somehow sample from the true population?

Sometimes is is said there is also a third school of thought, the “likelihoodists”. In practice, the likelihoodists try to find an optimal value of the likelihood $p(D|\theta)$ while the Bayesian’s typically try to find an optimal value of the posterior $p(\theta | D)$. These are both related by Bayes rule, but the choice of which expression to simplify and what to average over reflects what the user is allowing to vary^{1}. One feature of the Bayesian method is that they must specify a prior $p(\theta)$ and while there are guidelines for what should be a good choice, there is still a choice to be made and this makes some people unhappy. Both methods often yield similar results (in fact, when assuming a uniform prior, the Bayesian approach is the likelihood approach), but their interpretations differ because of the differing assumptions.

Bayesian methods in fact provide several different approaches to parameter estimation. The most common of which is the MAP estimate. While one could use any estimation method from either the Bayesian or frequentist approach, how the quality of an estimate is evaluated differs from the two approaches. For example, the frequentist perspective leads to the notion of an unbiased estimator, which is not a well-defined notion in the Bayesian approach. The frequentist and Bayesian schools generate (generally different) intervals which are used to quantify our confidence in a given estimate.

Example

To clarify the difference between the frequentist and Bayesian approaches, we suppose that we have to estimate a parameter $\theta$ of a model $m(\hat{\theta})$ given some data $D$. The frequentists produce an estimate $\theta_f=\delta_f(D)$ and the Bayesians produce an estimate $\theta_B$. We would like some measure of the uncertainty in our estimate. The frequentists will produce a 95% confidence interval: $$I_f(D)=\{l(D)\leq \theta_f=\delta_f(D) \leq u(D)\}.$$

A desirable interpretation of this interval is that the value of the true parameter $\hat{\theta}$ has a 95% chance of lying in this interval, however such a statement is meaningless from the frequentist approach: The true parameter $\hat{\theta}$ is fixed, so it either is or is not in $I_f$. Instead the confidence interval satisfies the following: if we were to keep producing datasets $D_1, D_2, \cdots$ sampled from $m(\hat{\theta})$ then $\hat{\theta}$ will lie in 95% of the confidence intervals $I_f(D_1), I_f(D_2), \cdots$. For a visualization of confidence intervals look here.

Since the Bayesians assume that $\hat{\theta}$ is in fact a random variable, they can make sense of the claim that there is a 95% chance of $\hat{\theta}$ lying in some given interval $I_B(D)$ given the data $D$. The Bayesians can produce such intervals and they are called 95% credible intervals.

The frequentist approach also leads to black swan phenomena. For example, suppose that we want to estimate the bias $\theta$ of a coin. The $\theta$ will correspond to the probability of obtaining a heads on one coin flip. If the observed data is two coin flips which each yield heads, then the maximum likelihood estimate for $\theta$ is 1. If we ask for a 99.999999% confidence interval for this measurement we will just obtain the one element set $\{1\}$. Having never seen a tails, this method constructs a model that assigns the probability of seeing tails to 0! Moreover, the model is extremely confident in this prediction even though a fair coin would produce this outcome 25% of the time. So the frequentist approach can lead to highly overconfident estimates.

On the other end of the spectrum, we consider the following example due to Berger. Suppose that two integers $D=(x_1, x_2)$ are drawn from a parametric distribution of the form:
$$ p(x|\theta) = \begin{cases}
0.5 & x = \theta\\
0.5 & x = \theta+1 \\
0 & x \not\in \{\theta, \theta+1\}
\end{cases}
$$
If $\theta = 5$, then we would expect of the following outcomes with equal probability $(5,5), (5,6), (6,5), (6,6)$. Let $m=\min(x_1,x_2)$ and consider the one element interval $[m,m]$. This confidence interval will contain $\theta$ for each of the previous samples except for $(6,6)$, so this will be a 75% confidence interval. But note that if $D=(5,6)$, then $P(\theta = 5 | D)=1.0$, so the model is underconfident about this estimate.

Frequentists are Bayesian (sort of)

In the Bayesian world view, we are studying a joint probability distribution on a data space $\Delta$ and a parameter space $\tau$. The frequentists object to the Bayesian specification of a prior $P(\theta)$ for two closely related reasons:

By specifying a prior, they are tipping the scales and expressing a bias for which they have no evidence.

Frequentists assume that the data is generated from one true model using a particular $\theta_*\in \tau$, so $\theta$ is not random.

We can point out that the frequentist assumption in the second reason above is equivalent to specifying a prior of a specific form. Namely, the frequentists assume that $P(\theta)$ is 1 when $\theta=\theta_*$ is the, unknown, true quantity and 0 otherwise (i.e., $\tau$ follows a Dirac-$\delta$ distribution). So the frequentist approach does fit into the Bayesian setup, they just use a very specific form for the prior which depends on an unknown and, in practice, unknowable quantity.

Unfortunately, this choice of a prior eliminates any possibility of using Bayesian inference (try to apply Bayes rule when using the frequentist prior and see for yourself). On the other hand, it does mean that $p(D |\theta_*)=p(D)$ and hence all witnessed data is generated from the only possible model, while for the Bayesians the data is pulled from all possible conditional distributions $p(D|\theta)$. The Bayesian response is that they don’t care: They simply generate the best value of $\theta$ that fits whatever data they have seen.

Finally, we should point out that in the presence of more and more data sampled from the type of parametric model we are, the Bayesian posterior $p(\theta | D)$ gradually approximates a Dirac-$\delta$ distribution and converges to the frequentist assumption.

It is worth pointing out at this point, that the expression $p(D|\theta)$ only makes sense if $p(\theta)>0$ (so the likelihoodists at least dip their toes into the water of Bayesian methods when constructing estimates). Similarly, $p(\theta | D)$ only makes sense if $p(D)>0$, in which case the Bayesians are also assuming the given data is generated by some random process. ↩

Prediction is very difficult, especially about the future. – Niels Bohr

The problem

Suppose we have a list of vectors (which we can think of as samples) $x_1, \cdots, x_m\in \Bbb R^n$ and a corresponding list of output scalars $y_1, \cdots, y_m \in \Bbb R$ (which we can regard as a vector $y\in \Bbb R^m$). We want to find a vector $\beta=(\beta_1,\cdots \beta_n)\in \Bbb R^n$ such that the sum of squared error $$ \sum_{i=1}^m (x_i \cdot \beta – y_i)^2 $$ is as small as possible. Alternatively, if we suppose that our pairs $(x_i, y_i)$ arise from a function $f(x_i)=y_i$, we want to find the closest approximation to $f$ by a linear function of the form $\widehat{f}(a)=\sum_{i=1}^n \beta_i \pi_i(a)$, where $\pi_i \colon \Bbb R^n\to \Bbb R$ is the projection to the $i$th coordinate.

Let’s arrange the input vectors $x_1,\cdots, x_m$ into an $m\times n$-matrix $A$, where $x_i$ appears as the $i$th row. Then we have a matrix equation of the form $ A \beta = \widehat{y}$ and we want to find $\beta$ such that $||\widehat{y}-y||^2$ is as small as possible^{1}.

Let’s try to use calculus. Here we are trying to minimize:
$$ SSE(\beta) = \lVert\widehat{y}-y\rVert^2=(A\beta -y)^T(A\beta -y).$$ Differentiating with respect to $\beta$ (see Tensor Calculus) yields
\begin{align}
\frac{dSSE(\beta)}{d\beta} & = 2(A\beta -y)^TA \label{eq:der}.
\end{align}
Differentiating with respect to $\beta$ again yields $2A^TA$ which will be a positive semidefinite matrix. This matrix will be positive definite if and only if $\ker A =0$^{2}.

Now $\ker A \neq 0$ if and only if one of the columns of $A$ can be expressed as a linear combination of the others. In terms of our original problem, this means that one of the components of our input vectors is in fact expressible as a linear combination of the others, so there is a hidden dependency amongst the variables.

Exercise

Perform the above calculation yourself while being careful about what are vector valued quantities.

Let’s suppose $\ker A\neq 0$ (we can arrange this anyway by dropping dependent columns). So our matrix is positive definite and we see that any critical value of $SSE(\beta)$ in \eqref{eq:der} is a minimum. So by setting \eqref{eq:der} to 0 and solving for $\beta$ (while observing that $A^TA$ is non-singular, given our assumptions) we get:
\begin{equation}
\beta=(A^TA)^{-1}A^Ty.\label{eq:solv}
\end{equation}
This value of $\beta$ is uniquely determined and hence yields the global minimum of $SSE(\beta)$.

Note that when $A$ is an invertible square matrix, then \eqref{eq:solv} reduces to $\beta = A^{-1} y$, which will have $SSE(\beta)=0$; just as we should expect from linear algebra.

Implementation concerns

Using a naive implementation of Cramer’s rule to invert an $n\times n$ matrix will take $O(n!)$ time and is totally impractical for problems of interest. While using Gaussian elimination to invert an $n\times n$ matrix takes $O(n^3)$-time (by being clever one can cut the runtime in half for inverting the symmetric matrix $A^TA$). The remaining multiplications require $O(n^2m)$ time and since $m\geq n$ (otherwise $\ker A\neq 0$), the dominant term will be the multiplications so the algorithm runs in $O(n^2m)$-time.

We may also be concerned about errors in the data. Suppose that the “correct” value of $y$ is, in fact $y+e$. For simplicity suppose that $A$ is invertible. We might hope that if $\lVert e\rVert$ is very small relative to $\lVert y\rVert$ then the error between $A^{-1}y$ and $A^{-1}(y+e)$ will be small relative to $A^{-1}y$.
We can calculate
\begin{align}
\frac{\lVert A^{-1} e \rVert}{\lVert A^{-1} y \rVert}/\frac{\lVert e\rVert}{\lVert y\rVert} & = \frac{\lVert A^{-1} e\rVert}{\lVert e\rVert }\cdot \frac{\lVert y\rVert }{\lVert A^{-1}y\rVert } \label{eq:cond}
\end{align}
As we try to maximize this expression over all $e$ and $y$ (so we can replace $y$ with $Ay$), we obtain the product of operator norms $\lVert A^{-1}\rVert \cdot \lVert A \rVert$. While we won’t go into operator norms here, we will point out that when $A$ is diagonal, then $\lVert A\rVert$ is the maximum of the absolute values on the diagonal entries, so that \eqref{eq:cond} becomes $|M|/|m|$ where $|M|$ is the maximum of the absolute values of the diagonal entries and $|m|$ is the minimum of the absolute values of these entries. In particular, if $A$ is the 2×2 diagonal matrix with diagonal entries $10$ and $0.1$, inversion can multiply relative errors by a factor of a 100.

Typically the standard thing to do is not to invert the matrix, but rather construct a matrix decomposition from which it is easy to solve the system of equations by back substitution. Not that this would not help the numerical stability problem with the diagonal matrix.

Beyond simple linear functions

At this point we have seen how to find an approximation to a function $f(x)$ of the form $\widehat{f}(x)=\sum_{i=1}^n \beta_i \pi_i(x)$ that minimizes the sum of squared errors.

What happens if our function is just the constant function $f(x)=10$? In this case, our approximation will do rather poorly. In particular, $\widehat{f}(0)=0$, so this will never be a very good approximation about this point.

What we could do in this case is assume that our function instead has the form: $$\widehat{f}(x)=\sum_{i=1}^n \beta_i \pi_i(x)+\beta_{n+1}\cdot 1.$$ We can now solve this new approximation problem by just changing the vectors $x_i=(\pi_1(x_i),\cdots,\pi_n(x_i))$ into new $n+1$-dimensional vectors $(\pi_1(x_i),\cdots,\pi_n(x_i),1)$ and applying linear regression as before. For non-degenerate collections of vectors ${x_i}$, we will end up with the desired function $\widehat{f}(x)=10$. This is sometimes called the bias trick.

Here is an example of such a linear regression being fit to more and more data points. For this example, we used python’s scikit-learn package and rendered the images using matplotlib (see the code).

In this case, the mean squared error approaches the Bayes error rate.

More generally if we think our $f(x)$ can be well approximated by a function of the form $\widehat{f}(x)=\sum_{i=1}^N \beta_i f_i(x)$ for any fixed set of $N$ functions $\{f_i\}$, we can just construct a new sample matrix whose $i$th row is the $N$-vector $(f_1(x_i),\cdots,f_N(x))$ and apply linear regression.

For example, if we are looking for a function $\widehat{f}\colon \Bbb R\to \Bbb R$ of the form $\widehat{f}(x)=\beta_1 x^2+\beta_2 x +\beta_3$, we just have to replace the samples $x_1,\cdots, x_m$ with the vectors $(x_1^2, x_1, 1), \cdots, (x_m^2, x_m, 1)$ and apply linear regression as before. This is an example of polynomial regression. For example, we have the following (see the code):

These examples were constructed as above, but we first transform the data following the above discussion using PolynomialFeatures. Alternatively, it is very easy to just code this transformation yourself.

We could similarly try to find Fourier approximations for a function which we expect has periodic behavior. We could also use wavelets or splines. Clearly, this list goes on and on.

Warning

If we choose a suitably expressive class of functions to define our generalized linear regression problem, then there is a good chance that the linear regression problem will fit the data perfectly. For example, if we have any function $f\colon \Bbb R \to \Bbb R$ and we know its values on $n$ distinct points, then we can perfectly approximate this function by a polynomial of degree at least $n-1$ (e.g., 2 points determine a line, 3 points determine a quadratic curve, etc.). While this will perfectly fit our given data, it becomes quite unlikely that it will match any new data points. This is a classic example of overfitting. We can see this phenomenon in the following example:

Here the degree 10 polynomial will fit any 30 points relatively well, but will drastically fail to generalize well to the entire dataset.

A Maximal Likelihood Justification for using SSE

While the above exposition is elementary, one might wonder where the choice to minimize the sum of squared errors comes from.

Let us suppose that that we have a function of the form $f(x)=g(x)+\epsilon$ where $\epsilon$ is an $\Bbb R$ valued random variable^{3} $\epsilon \sim N(0, \sigma^2)$ and $g(x)$ is a deterministic function. We could mentally justify this by saying that the output variables are not recorded exactly, but there is some normally distributed measurement error or there is some additional non-linear random factors contributing to the value of $f(x)$. Then the conditional density function of $f(x)$ is $$p(y|x)=\frac{e^{-(g(x) -y)^2/2\sigma^2}}{\sqrt{2\pi \sigma^2}}.$$

The maximal likelihood estimate (MLE) (see here for more details) for a vector of values $y=(y_1,\cdots,y_m)$ associated to a sequence of independent samples $x=(x_1,\cdots,x_m)$ is then
\begin{align}

\end{align}

So in this case, the MLE for $y$ is the choice of $y$ which minimizes the sum of squared errors.

A frequentist description

If we suppose that our data samples $(x,y)$ are being drawn from a fixed distribution, e.g., they are the values of a vector valued random variable $X$ and a random variable $Y$. Moreover, suppose the joint pdf of $X$ and $Y$ is $p(x,y)$. Then we can regard linear regression as attempting to find
\begin{align}

\end{align}

Bayesian approach

Here we will consider a Bayesian approach to the polynomial regression problem. First we consider the case, where we approximate data coming from a degree 3 polynomial using a Gaussian model that expects a degree 3 polynomial and is trained on all 200 samples. In other words, we are assuming a conditional distribution of the form
$$p(y,x|\theta)=1/\sqrt{2\pi \sigma^2}\cdot e^{\frac{(y-\sum_{i=0}^3 \beta_i x^i)^2}{2\sigma^2}},$$ where $\theta$ is the vector of parameters in this model. For our prior, we suppose that the parameters are independent random variables where $\beta_i~N(0, 100)$ and $\sigma$ is given by a half normal distribution with standard deviation 1 (since this can only take non-negative values). To calculate the posterior $p(\theta | D)$ we apply Bayes rule. Unfortunately Bayes rule has this nasty integral over the parameter space in the denominator:
$$ \int_\theta p(D|\theta)p(\theta)d\theta.$$ Rather than try and calculate this integral directly, we will calculate this using Monte Carlo integration, which is sampled using an algorithm which is completely NUTS. All of this is implemented using the package PyMC3. This is overkill for this situation, but it is nice to see how the general algorithm is implemented (see here for the code).

In effect what we are doing is using a special technique to generate samples $S$ of $\theta$ according to the prior distribution $p(\theta)$. For each $\theta\in S$, we calculate $z_\theta =p(D|\theta)$ and see that $$p(\theta|D)\approx \frac{z_\theta}{\sum_{\theta’ \in S} z_{\theta’}}.$$ As the number of samples goes up, the accuracy of this estimate increases.

We can look at the output here:

By taking posterior samples of the parameters we obtain the curves in grey. The mean of these samples generates the black curve in the middle. Alternatively, we can sample from $p(y|x)$ (that is we average over all of the choices of parameters). Taking the mean values for each $y$ given an $x$ we obtain the red curve. We have marked a band of radius the standard deviation about these samples.

The summary statistics for our parameter values are listed here. Note that the ‘true value’ of our function is $$f(x)=
7 x^3 – 9.1 x^2 + 3.5 x – 0.392,$$ with a standard deviation of 0.3. We note that none of the true coefficients fall into the 95% credibility intervals, which is not very promising.

mean

sd

mc_error

hpd_2.5

hpd_97.5

coeff__0

-0.209607

0.080981

0.001872

-0.370318

-0.056474

coeff__1

1.914993

0.704315

0.020752

0.562936

3.260617

coeff__2

-5.556357

1.651007

0.050628

-8.678463

-2.410120

coeff__3

4.794343

1.099424

0.033513

2.710534

6.909714

sigma

0.291662

0.015164

0.000186

0.263677

0.321718

We can gain further information by studying the pdfs of the resulting variables:

Now we examine how the Bayesian method does with respect to our overfitting problem above. Here we try to fit a degree 10 polynomial to only 30 data points. We then obtain the following:

Here we see the samples for the polynomial curves again have wild variance, but our posterior predictions are actually quite reasonable. Looking at the summary statistics in the table below , we note that the model has very low confidence in almost all of the parameter values and huge credibility intervals that mostly contain the true values. By averaging over all of these parameters we reduce the variance in our predictions and obtain a much better model.

mean

sd

mc_error

hpd_2.5

hpd_97.5

coeff__0

-0.053526

0.200482

0.002303

-0.469330

0.317603

coeff__1

0.658498

1.898380

0.022927

-3.246255

4.222251

coeff__2

-2.835666

5.895880

0.084508

-13.723288

9.131853

coeff__3

2.316211

8.093137

0.126718

-13.588722

18.071517

coeff__4

1.342521

8.552999

0.104256

-15.235386

18.123408

coeff__5

-0.630845

8.596230

0.099357

-16.968628

16.577238

coeff__6

-1.251081

8.598991

0.103051

-17.388627

15.975167

coeff__7

-0.712717

8.881522

0.102497

-18.123559

16.398549

coeff__8

0.373434

8.985303

0.104398

-16.930835

18.813871

coeff__9

0.753457

8.428446

0.094453

-16.404371

16.786885

coeff__10

0.835288

7.222892

0.089429

-13.213493

15.239204

sigma

0.276957

0.041169

0.000551

0.205731

0.360664

Exercise

How did our choice of prior bias the predictions of the coefficients? What would be some other reasonable choices of prior?

Note that minimizing the sum is equivalent to minimizing the average. ↩

Recall that on $\Bbb R^m$ we have a scalar product $\langle v,w\rangle=v^Tw$ (here we view vectors as column vectors). This formula tells us that $$\langle v,Ax\rangle =v^TAx=x^TA^Tv=\langle x,A^Tv\rangle =\langle A^Tv,x\rangle $$ (note that we used that every 1×1 matrix is symmetric). This implies that $$\langle x, A^TA x\rangle =\langle Ax, Ax \rangle \geq 0$$ for all $x$. Moreover, this quantity is zero if and only if $Ax=0$. ↩

So our function $f$ is of the form $f\colon \Bbb R^n\times \Omega \to \Bbb R$ where $f(x,-)\colon \Omega \to \Bbb R$ is a measurable function on some probability which is distributed normally with mean $g(x)$ and variance $\sigma^2$. ↩

Machine Learning: A probabilistic perspective by Kevin Murphy. The material in this book is closest to what we will cover in the course, but is unfortunately not available for free. Written by an academic and a practitioner of machine learning, this text is full of real world examples and applications, while eschewing the ad-hoc approach of other texts.

Deep learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. A new text on this burgeoning field. Written by central figures in the area, this text surveys deep learning, includes many useful citations (a great deal of the literature is still in academic articles), provides good rules of thumb, and is a relatively light read. This book is very much from a practitioner’s perspective with less justification for methodology and less emphasis on mathematical rigor. My limited experience is that other papers in this area have a similar emphasis, which reflects the limited understanding of why these methods work very well in some instances and not so well in others.

Trevor Hastie and Robert Tibshirani have made their lectures and slides for their course on statistical learning available. These are very well done.

Machine Learning lectures from Georgia Tech, these are very well done introductory lectures on machine learning directed toward engineers. I personally love the nerdy jokes interspersed throughout.

Deep learning course lectures by Nando de Freitas. Excellent lecture series on deep learning from Oxford for computer scientists. This lecture series is aimed at those with some more background.

Datacamp: An online learning site with an emphasis on data science. Includes mini-lectures and online programming exercises. A nice introduction on how to take the material we are learning and apply it. However, there is a noticeable bump in difficulty in proceeding to implement these methods from scratch on your own computer.

Andrej Karpathy’s blog: A truly excellent sequence of blog posts on topics related to machine learning.

Kaggle: Hosts machine learning competitions, allows machine learning practitioners to share information, and hosts many large datasets.

Seeing Theory: A beautiful sequence of javascript visualizations for statistical concepts.

If you find that you’re spending almost all your time on theory, start turning some attention to practical things; it will improve your theories. If you find that you’re spending almost all your time on practice, start turning some attention to theoretical things; it will improve your practice. – Donald Knuth

Introduction

This section will contain some very basic computer science background. Since the target audience of this lecture series has little to no training in this area, I will attempt to fill in some gaps (admittedly very poorly).

Numerical analysis

It should come as no surprise that computers store data as sequences of 1’s and 0’s. For example, non-negative integers are represented by their base-2 expansions (e.g. 7 is represented by 111 and 8 by 1000). Since we can only allocate a finite amount of space for each integer (e.g., 16, 32, or 64 bits) we can only hold a finite range of integers (e.g., $-2^{63}\leq i \leq 2^{63}-1$) (see two’s complement). This means that one has to avoid applying operations that will take us out of this range. Arithmetic operations that generate values outside of their allowed range are said to overflow.

Problems with the finite representations of integers are dwarfed by the problems arising from the finite representation of real numbers. The (IEEE) standard way to finitely represent a real number is as a floating point number. The idea is to approximate a real number $x$ by a non-negative integer $\alpha$, an integer $\beta$, and a sign bit $\epsilon$ so that $x\approx (-1)^\epsilon \cdot \alpha \cdot 2^{\beta}$. Since, using a fixed number of bits, we can only represent finitely many integers, there are only finitely many such floating point numbers.

The finiteness of this representation leads to some obvious consequences as well as some non-obvious ones.

Since the exponent $\beta$ is bounded above, there is a bound for the range of real numbers that we can encode. This leads to the same type of overflow problems as for integers.

Since the exponent $\beta$ is bounded below, there is a smallest positive real number we can encode in this representation. This leads to a new set of problems called underflow. For example $e^{-n}=0$ for $n\gg 0$. As a consequence, $e^{-n}\cdot e^{n}\neq e^0=1$ for large $n$. If we instead expanded this expression from the middle then we obtain: $$e\cdot (e\cdot (\cdots(e\cdot e^{-1})\cdot e^{-1})\cdots e^{-1})=1.$$ So multiplication of floating point numbers fails to be associative.

Since $\alpha$ is a bounded positive integer, we can only encode a finite amount of precision. This implies that for certain (large) values of $m$ and $n$, $$\sum_{i=1}^{2^n} 2^{-n} + 2^m = 2^m+1,$$ if we apply the operations from left to right, while the sum is $2^m$ if we apply them from right to left. So addition of floating point numbers fails to be associative.

We can not encode fractions precisely so $1/n \cdot n\neq 1$ in general. On older calculators one can witness this first hand and see that $1/3\cdot 3“=”0.9999999999$.

Remark

I sometimes find it helpful to visualize all of the floating point numbers placed on the real line as a sequence of dots. These dots cluster heavily near 0 and gradually spread farther and farther apart as we move away from the origin until we meet the size bounds. Beyond these points there are no longer any dots.

Since both encodings of real numbers into floating point and floating point operations contribute errors^{1}, care needs to be taken in the design of floating point algorithms to ensure that these errors do not grow out of control, otherwise things may go seriously wrong).

Remark

In probability theory and, especially, statistical inference, it is very common to take iterated products of probabilities. Since many of these probabilities are very small, there is a high risk of numerical underflow as we calculate the product.

One way to avoid this underflow is to use the simple formula $$ \prod_{i=1}^n x_i = \exp(\sum_{i=1}^n \log x_i). $$ This will help avoid the numerical underflow for the product. To ensure that the sum of quantities of varying magnitude is as accurate as possible, we can first sort the values to be summed and then add them.

Algorithm Analysis

Computer scientists and engineers are very concerned with space and computational efficiency. Making fair comparisons between two implementations of algorithms can be tricky. Implementation details of essentially identical algorithms can result in significant changes in runtime performance and memory costs (e.g., consider a program written in python 2.6 versus one written python 3.0).

An easier comparison can be made about how the memory and runtime requirements of an algorithm change as we increase the size of the input. For example, the standard grade school algorithm for multiplying two $n\times n$ matrices requires $n^3$ multiplications and $n^2*(n-1)$ addition operations^{2}. What is important to realize about this algorithm is not the exact count of the operations or how much faster addition is compared to multiplication, it is, for example, that if we replace $n$ with $100n$ then we will require roughly $1 000 000$-times more operations!

At this scale the addition operations and their relative speed in comparison to multiplication do not matter as much as this observation. The dominant feature is that the time to multiply two $n\times n$ matrices grows cubically in $n$.

One way to encode the essential features of the growth of functions is with Big O-notation.

Definition

We say^{3} $f(n)= O(g(n))$ if there exists an $M>0$ such that for all sufficiently large $n$, $$|f(n)|\leq M|g(n)|.$$

For most big data tasks, any algorithm whose runtime is not $O(n \cdot \log n)$ requires some attention (especially attention to how one can parallelize the computation across a cluster). More importantly, once the data is too large to fit into a single computers memory, this entire approach to comparing algorithms breaks down (as do the standard implementations of the algorithms). In this case, one has to think very carefully about how to break up the task into separate pieces. This is all complicated by the fact that the time move data between computers is much slower than it is to move the data internally in computer memory.

One way to reduce the errors in floating point operations is to use more bits for the representations, do the operation, and then truncate. ↩

It turns out this is not asymptotically optimal. For example see here. ↩

I would like to, on the record, state my disapproval of the standard usage of the equal-sign in the expression $f(n)=O(g(n))$. I can see no reasonable interpretation of the equality symbol here.)https://en.wikiquote.org/wiki/Donald_Knuth ↩