General – Page 2 – Machine Learning for Mathies

Problem Set 3

This is to be completed by November 9th, 2017.

Exercises

[Datacamp](https://www.datacamp.com/home
- Complete the lesson “Introduction to Machine Learning”.
- This should have also included “Exploratory Data Analysis”. This has been added to the next week’s assignment.
MLE for the uniform distribution.
- (Source: Kaelbling/Murphy) Consider a uniform distribution centered on 0 with width 2a. The density function is given by: $$ p(x) = \frac{\chi_{[-a,a]}}{2a}.$${
  a. Given a data set $x_1,\cdots, x_n,$ what is the maximum likelihood estimate $a_{MLE}$ of $a$?
  b. What probability would the model assign to a new data point $x_{n+1}$ using $a_{MLE}$?
  c. Do you see any problem with the above approach? Briefly suggest (in words) a better approach.
Calculate the expected value and mode of $\theta$ when $\theta \sim \textrm{Beta}(\alpha, \beta)$,
Change of variables:
- Let $X\colon S\to T_1$ be a discrete valued random variable with pmf $p_X\colon T_1\to [0,1]$ and let $Y\colon T_1\to T_2$ be a function. Derive the pmf $p_{Y\circ X}\colon T_2\to [0,1]$ in terms of $p_X$ and $Y$.
- Let $X^n\colon S^{\times n}\to \{0,1\}^{\times n}$ be the random variable whose values give $n$-independent samples of a Bernoulli random variable $X$ with parameter $\theta$ (i.e., $p_X(1)=\theta$). Show that $$p_{X^n}(v_1,\cdots, v_n)=\theta^{\sum{v_i}}(1-\theta)^{n-\sum v_i}.$$ Now let $\sigma \colon \{0,1\}^{\times n}\to \{0,…n\}$ be defined by taking the sum of the entries. The composite $\sigma\circ X^{n}$ is called a Binomial random variable with parameters $n$ and $\theta.$ Determine $p_{\sigma\circ X^n}(k)$.
- Let $X\colon S\to \Bbb R$ be a random variable with piecewise continuous pdf $p_X$ and let $Y\colon \Bbb R\to \Bbb R$ be a differentiable monotonic function. Show that $Y\circ X$ is a random variable and determine $p_{Y\circ X}$.
Uninformative prior for log-odds ratio:
- (Source: Murphy) Let $$ \phi = \textrm{logit}(\theta) = \log \frac{\theta}{1-\theta}.$$ Show that if $p(\phi)\propto 1,$ then $p(\theta)\propto \textrm{Beta}(0,0)$.
R Lab:
- Construct and apply a Naive Bayes classifier for a specific text classification problem (e.g., spooky author identification) from scratch. In other words, do not use any modeling libraries. Feel free to use any libraries you like to get the data into an acceptable format.

Problem Set 2

This is to be completed by November 2nd, 2017.

Exercises

Datacamp
- Complete the lesson “Data Visualization in R”.
Probabilities are sensitive to the form of the question that was used to generate the answer:
- (Source: Minka, Murphy.) My neighbor has two children. Assuming that the gender of a child is like a coin flip, it is most likely, a priori, that my neighbor has one boy and one girl, with probability 1/2. The other possibilities—two boys or two girls—have probabilities 1/4 and 1/4.
  a. Suppose I ask him whether he has any boys, and he says yes. What is the probability that one child is a girl?
  b. Suppose instead that I happen to see one of his children run by, and it is a boy. What is the probability that the other child is a girl?
Legal reasoning
- (Source: Peter Lee, Murphy) Suppose a crime has been committed. Blood is found at the scene for which there is no innocent explanation. It is of a type which is present in 1% of the population.
  a. The prosecutor claims: “There is a 1% chance that the defendant would have the crime blood type if he were innocent. Thus there is a 99% chance that he guilty”. This is known as the prosecutor’s fallacy. What is wrong with this argument?
  b. The defender claims: “The crime occurred in a city of 800,000 people. The blood type would be found in approximately 8000 people. The evidence has provided a probability of just 1 in 8000 that the defendant is guilty, and thus has no relevance.” This is known as the defender’s fallacy. What is wrong with this argument?
Bayes rule for medical diagnosis
- (Source: Koller, Murphy.) After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don’t have the disease). The good news is that this is a rare disease, striking only one in 10,000 people.
  a. What are the chances that you actually have the disease? (Show your calculations as well as giving the final result.)
Conditional independence (Source: Koller.)
- Let $H\in {1,\cdots, K}$ be a discrete random variable, and let $e_1$ and $e_2$ the observed values of two other random variables $E_1$ and $E_2$. Suppose we wish to calculate the vector
  $$P(H|e_1, e_2) = (P(H=1|e_1,e_2),\cdots, P(H=K|e_1, e_2)).$$
  a. Which of the following sets of numbers are sufficient for the calculation?
  i. $P(e_1, e_2), P(H), P(e_1| H), P(e_2, H)$.
  ii. $P(e_1, e_2), P(H), P(e_1, e_2 | H)$.
  iii. $P(e_1|H), P(e_2|H), P(H)$.
b. Now suppose we now assume $E_1\perp E_2 | H$ (i.e., $E_1$ and $E_2$ are independent given $H$). Which of the above 3 sets are sufficient now?
R lab
- Estimate the value of $\pi$ by taking uniform random samples from the square $[-1,1]\times [1,1]$ and seeing which lie in the disc $x^2+y^2\leq 1$.
- A company is trying to determine why their employees leave and why they stay. They have a list of roughly 15000 employee records here.
  a. Download this dataset and load it in R (this may require setting up a Kaggle account if you don’t already have one).
  b. Examine the dataset and see if you need to transform any of the features=columns (e.g., are there factors that were not recognized as such, is there missing data?).
  c. Randomly shuffle the rows and cut the dataset into two pieces with 10000 entries in a data frame called train and the remaining entries in a data frame called valid.
  d. Study the train data frame and see if you can find any features that predict whether or not an employee will leave.
  e. Make a hypothesis about how you can predict whether an employee will leave by studying the train data.
  f. Once you have fixed this hypothesis evaluate how well your criteria work on the valid data frame.
  g. Justify your proposal with data and charts. Save at least one of these charts to a pdf file to share with management.

Problem Set 1

This is to be completed by November 27th, 2017. (THIS IS A TYPO: This should read October 26th, 2017). It is okay if you finish by this ridiculous first due date.

Forewarning

For several of these exercises, you will be asked to install software and/or setup accounts from outside websites. These websites may try to convince you to purchase some of their products. You do not need to purchase anything to do these exercises. All of the required materials have been made freely available to us.

Exercises

Learn the most important skill for this course
- Learn how to use Google to find an answer to almost any technical problem (e.g., ‘How do I install — for Windows?’ or ‘How do I do — in R/Python?’ or ‘I’m getting a weird error message, how do I fix it?’). If you have not learned how to do this yet, I highly recommend it. It will open up your world to a range of new possibilities.
Datacamp
- Setup a Datacamp account and join the course group by using the email invitation. Note you should not have to pay any money to do this. I do not believe you will need to provide any bank information for this purpose.
- Try out some of the lessons designated for the course.
- During the course you should keep up with the assigned lessons.
RStudio
- Follow this tutorial to install RStudio. RStudio is an interactive development environment for the R programming language. Which will make it easier to complete the R exercises. Again, you should not have to pay anything for this.
- Install the ISLR package from CRAN. While you’re at it you may as well install caret, dplyr, tidyr, and ggplot2.
- Try out the commands from this lab.
Python (Optional, but recommended for those interested in learning Python¹)
- Install Anaconda using Python 3.x (Python 2.7 is outdated, but still often used). As I remember it, this requires you to setup an account, but you will not have to pay anything. They will send you emails which you will have to unsubscribe from. I believe this is a small price to pay for a system that manages your Python installation and includes many of the packages used in data science.
Additional material
- Glance at some of the additional source material for the course.
- Since we will be using it for R labs, download the ISLR book (or go and buy it if you like it!).
R lab
- Complete exercise 8 from Chapter 2 of ISLR. In this exercise you will be exploring data gathered from universities and colleges in the United States.

In my humble opinion, Python is a truly wonderful language with an incredible number of highly useful libraries. Once you know Python, it is a surprisingly short road to solving many typical computer tasks. It is typically much slower than other, or more grown-up, languages such as C/C++ or Java, but the speed-up in development time makes up for it for the kind of tasks we will be considering. Moreover, the syntax is very natural and easy to learn. (Drops the microphone) ↩

Parameter Estimation

…the statistician knows…that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world. – George Box (JASA, 1976, Vol. 71, 791-799)

Parameter estimation

Suppose that we are given some data $D=(x_1,\cdots, x_n)\in T^n$. Our task is to find a parametric probability distribution $m(\hat{\theta})$ on $T$ such that if we were to independently sample from it $n$-times, we would could reasonably¹ expect to obtain $D$. We first make some choice² of possible distributions $m(-)$, which leaves us with the task of identifying the $\hat{\theta}$.

Likelihood approach

One approach is the maximal likelihood estimate or MLE. For this we set $$\hat{\theta}=\theta_{MLE} = \mathrm{argmax}_\theta p(D | \theta).$$ That is we choose $\theta$ so that the probability density of $D$ from the given model is maximized.

There are a few comments that we should make here:

There may not be a maximum for a general distribution. Indeed the models typically considered for logistic regression do not have a maximum.
If there is a maximum, it may not be unique.
This is an optimization problem and when this problem is not convex, we may not have any method that is guaranteed to identify find a maximum even when it exists.

We do not actually address these problems, instead we just say that optimization techniques will give us a parameter value that is “good enough”.

Example

Suppose that $D=(x_1,\cdots, x_n)\in \Bbb R^n$ and we want to find a normal distribution $N(\mu, \sigma^2)$ which models the data under the assumption that these were independent samples.

First we construct the MLE of $\mu$:

\begin{align}

\end{align}

We differentiate the last expression with respect to $\mu$ and set this to 0 and obtain
\begin{align}
-\sum_{i=1}^n (x_i -\mu)&=0 \\
\mu_{MLE} &= \frac{\sum_{i=1}^n x_i}{n}.
\end{align}
Taking second derivatives shows that this is in fact a minimum. In other words, the MLE of $\mu$ is the sample mean.

Exercise

Show that MLE of $\sigma^2$ is the (biased) sample variance. In other words, $$\sigma_{MLE}^2 = \frac{\sum_{i=1}^n (x_i-\mu_{MLE})^2}{n}.$$

Frequentist evaluation

The frequentist approach gives us a method to evaluate the above estimates of the parameters. Suppose that our data is drawn from a true distribution $m(\hat{\theta})$ and set $\overline{\theta} = E(\theta_{MLE}(D))$, where the expectation is given by integrating over $D$ given the “true” model. Define the bias of $\theta_{MLE}(-)$ by $$\mathrm{bias}(\theta_{MLE})=\overline{\theta}-\hat{\theta}.$$

Let $\mu_S(D)$ be the sample mean of data drawn from some distribution $m$ with finite mean $\widehat{\mu}$. Then we see
\begin{align}
E(\mu_S) & = E\left(\frac{\sum_{i=1}^n X_i}{n}\right) \\
& = \frac{\sum_{i=1}^n E(X_i)}{n} \\
& = n\widehat{\mu}/n = \widehat{\mu}
\end{align}
so this is an unbiased estimate.

For a visualization of an unbiased point estimate from the frequentist point of view look here.

Exercise

Show that the expected value of the sample variance $\sigma^2_S$ of data drawn from some distribution $m$ with finite mean $\widehat{\mu}$ and finite variance $\widehat{\sigma^2}$ is $$ E(\sigma^2_S)= \frac{n \widehat{\sigma^2}}{n-1}.$$
Show that it follows that the bias of $\sigma^2_S$ is $\widehat{\sigma^2}/(n-1)$.
Show that the unbiased sample variance $\sigma^2_U(X_1,\cdots,X_n) = \frac{\sum_{i=1}^n(X_i-\mu_S(X_1,\cdots X_n))^2}{n-1}$ is, in fact, an unbiased estimate.
Construct a symmetric 95% confidence interval (see here for some additional discussion) for our estimate $\mu_S$ of $\widehat{\mu}$ which is centered about $\mu_S$, when $n=10$, $\mu_S=15$ and $\sigma^2_U = 2$. Hint: the random variable $\frac{\mu_S-\widehat{\mu}}{(\sigma_U /\sqrt{n})}$ has the same distribution as the Student’s $t$-distribution whose inverse cdf values can be calculated by statistics packages.

Bias-Variance tradeoff

Suppose that we have an estimator $\theta=\delta(D)$ of $\hat{\theta}$ and the expected value of the estimate is $E(\delta(D))=\overline{\theta}$. We obtain the following expression for our expected squared error
\begin{align}
E((\theta-\hat{\theta})^2) &= E(\left((\theta-\overline{\theta})-(\hat{\theta}-\overline{\theta})\right)^2) \\
&= E((\theta-\overline{\theta})^2)-E(\theta-\overline{\theta})\cdot 2(\hat{\theta}-\overline{\theta}) + (\hat{\theta}-\overline{\theta})^2 \\
&= \mathrm{var}(\theta)+\mathrm{bias}^2(\theta)
\end{align}

This equation indicates that while it may be nice to have an unbiased estimator (i.e., one that will, on average, give precisely the correct parameter), if this comes with the cost of creating an estimate that varies wildly with the data then we will still expect a large amount of error and indicates why we would like estimators with low-variance.

A typical example of this is seen in polynomial regression (here we are viewing the problem as trying to estimate $p(y|x)$). As we vary the degrees of the polynomial being used, we find that very large degree polynomials can fit the data well:

High degree polynomials fit the data better

However, the fit for high degree polynomials varies highly depending on the choice of data points:

The MAP Estimate

One Bayesian approach to parameter estimation is called the MAP estimate or maximum a posteriori estimate. Here we are given data $D$, which we want to say is modeled by a distribution $m(\theta)$ and we construct the MAP estimate of $\theta$ as
\begin{equation}

\end{equation}
In other words, we choose the mode of the posterior distribution. Note that if our prior $p(\theta)$ is constant independent of $\theta$, then the only part of the right hand side that depends on $\theta$ is $p(D|\theta)$ and the MAP estimate of $\theta$ is the same as the MLE of $\theta$.

Example

Suppose that we have a coin whose bias $\theta$ we would like to determine (so $\theta$ is the probability of the coin coming up heads). We flip the coin twice and obtain some data $D=(H,H)$. Under the assumption that the coin flips are independent we are trying to model a Bernoulli distribution. Now we calculate the conditional probability:
\begin{equation}
p(D|\theta) = \theta^2(1-\theta)^0
\end{equation}
and quickly see the MLE for $\theta$ (given that $\theta \in [0,1]$) is what we would expect: $\theta = 1,$ as mentioned in the hypothesis testing example of an overconfident estimator.

As already mentioned, the MLE and MAP estimate agree in the case of the uniform prior, but the Bayesian approach allows us to calculate $p(\theta | D)$, which gives us a measure of our confidence in this estimate. In this case we see:
\begin{align}
p(\theta | D) &= \frac{p(D|\theta)p(\theta)}{\int_{\theta’} p(D|\theta’)p(\theta’)d\theta’} \\
&= \frac{\theta^2}{\int_{\theta’} (\theta’)^2d\theta’} \\
&= \frac{\theta^2}{1/3} \\
&= 3\theta^2.
\end{align}

If we want to find a 99.9% credible interval for the estimate $\theta_*\in [a,1]$, then we just integrate
\begin{align}
P(\theta \in [a,1] | D) &=\int_{\theta =a}^1 p(\theta | D)d\theta \\
&=\int_{\theta =a}^1 3\theta^2 d\theta \\
&=1-a^3.
\end{align}
Setting this to be equal to $1-10^{-3}$ we see that $a=0.1$. This interval is much more conservative than the confidence interval obtained by frequentist methods.

Now what if we do not have a uniform prior. Instead we suppose we have some prior information³ that makes us believe that the most likely value of $\theta$ is actually 0.25 and that it has variance 0.05. How do we incorporate this information into a prior? Well the mode and the variance are not sufficient to determine a distribution over the unit interval, so let’s assume that $p(\theta)$ has some convenient form that fits these values.

It would simplify matters if $p(\theta)$ had the same form as $p(D|\theta)=\theta^{|H|}(1-\theta)^{|T|}$. So, supposing $p(\theta)=C\theta^a(1-\theta)^b$ for some constant $C$ (i.e., $p(\theta)\propto \theta^a (1-\theta)^b$), we would see that
\begin{align}
p(\theta | D) &\propto p(D|\theta)p(\theta) \\
&\propto \theta^{|H|}(1-\theta)^{|T|} \theta^a(1-\theta)^b\\
& = \theta^{|H|+a}(1-\theta^{|T|+b}).
\end{align}
This has the same form as the prior! Such a prior is called a conjugate prior to the given likelihood function. Having such a prior is super convenient for computation.

In this case, the conjugate prior to the Bernoulli distribution is the Beta distribution which has the form $p(\theta | \alpha, \beta)\propto \theta^{\alpha-1}(1-\theta)^{\beta-1}$. One can calculate the mode and variance of this distribution (or just look it up) and get
\begin{align}
\textrm{mode} &= \frac{\alpha -1}{\alpha+\beta-2} \\
\sigma^2 &= \frac{\alpha \beta}{(\alpha+\beta)^2(\alpha+\beta+1)}
\end{align}
Plugging this into Wolfram Alpha gives us some approximate parameter values: $\alpha \approx 1.4$ and $\beta \approx 2.3$.

Using this prior, the new posterior is
$$ p(\theta | {HH})\propto \theta^{2.4}(1-\theta)^{1.3}.$$ We can form the MAP estimate by taking the $\log$ of this expression (which is fine away from the points $\theta = 0, 1$, which we know can not yield the maximum), differentiating with respect to $\theta$, setting this to 0, and finally checking that the second derivative at this point is positive.

Exercise

Show the MAP estimate for this problem is $2.4/3.7\approx 0.65$ by calculating the mode of the $\beta$ density function.

We see that our estimate of $\theta$ shifts drastically as we obtain (even just a little bit) more information. Now lets see how the posterior changes as we see even longer strings of heads:

From the image we see that as posterior gradually becomes highly confident that the the true value of $\theta$ is very large and assigns very small probability densities to small values of $\theta.$

To see how much our choice of prior affects the posterior in the presence of the same data, we can look at the analogous chart starting from a uniform prior, i.e., when the $\beta$ parameters are $\alpha=\beta=1$.

Or we can look at a side by side comparison:

From looking at the graphs, we can see that while the two priors are very different (e.g., the non-uniform prior assigns very small densities to large values of $\theta$), the posteriors rapidly become close to each other. How the posterior depends on the prior[/caption]Since they never agree, our methods still depend on our choice of prior, but we can also see that the both methods are approaching the same estimate.

To finally drive this point home, let’s consider a severely biased prior $\beta(1.01,10.0)$, whose mode puts the probability of heads at approximately 0 and with near 0 variance in the binomial model. Then we can see how the posterior changes with additional evidence:

The severely biased prior is much slower to approach the same consensus as the other priors.

Exercise

The severely biased prior is very close to what we would obtain as a posterior from a uniform prior after witnessing 9 tails in a row. If we were to witness 9 tails in a row followed by 20 heads in a row, how would you change your modeling assumptions to better reflect the data?

Further point estimates

As discussed above, one way to estimate a parameter is the MAP estimate which is just the mode of the posterior distribution $p(\theta | D)$. This is a very common approach to parameter estimation because it transforms the task into an optimization problem and we have a number of tools for such tasks.

However, the mode of a distribution can be far from a typical point. The mode of a distribution is also not invariant under reparametrization. One alternative is to select the mean or expected value of $\theta$ from the posterior distribution. Another alternative is to take the median of the posterior distribution.

I know this is not a very precise statement, but I want to be flexible at this point. ↩
How we make this choice is the model selection problem which will be discussed later. ↩
I apologize that this example is so contrived. ↩

Hypothesis Testing

Acceptance without proof is the fundamental characteristic of Western religion, rejection without proof is the fundamental characteristic of Western science. – Gary Zukav, “The Dancing Wu Li Masters”

Hypothesis Testing

Now we consider Hypothesis Testing in an example. While Bayesians also have a form of hypothesis testing, the term is almost always used for the frequentist approach. We will describe this first in an example here.

(Frequentist) Hypothesis Testing

Suppose we produce a cancer treatment that we believe will increase the chance of a patient living for another year compared to standard treatment. We run a controlled experiment where we divide up a group of patients into two groups, the control group and the test group (one has to be extremely careful about how this is done in order to get meaningful results). The control group receives standard treatment and the test group receives the new treatment. We follow the patients for a year and record the outcomes as $D$. We would like to show our hypothesis $H_1$, that our new treatment changes outcomes, holds as opposed to the null hypothesis $H_0$ that the treatment has no effect. Associated to $H_0$ is a statistical model from which we could provide predictions about patient outcomes. The hypothesis $H_1$ is less specific, it really hypothesizes that the data is given by any other model besides $H_0$.

Clearly, what we want to show¹ is that
\begin{equation}
p(H_1 | D) > p(H_0 | D), \label{eq:b-hyp}
\end{equation}

that is given the data from our experiment, it is more likely that our treatment improves outcomes than it does nothing. However, \eqref{eq:b-hyp} is meaningless from the frequentist perspective: in this approach there is only one true model giving rise to the data, so either $H_1$ holds or it doesn’t (in which case $H_0$ holds). Since $p(H_i)$ is either 0 or 1 and we can’t know which ahead of time, the frequentist instead considers $p(D | H_0)$ which, since $H_0$ defines a specific model, can be calculated. So the experimenter, calculates the probability $P(D\in S| H_0)$, where $S$ is some set of (often extremal) values of data containing the observed data $D$. If this probability is less than some specific small $\alpha\in (0,1)$, then they conclude that the null hypothesis is unlikely to hold and reject the null hypothesis at the significance level $\alpha$ and conclude that $H_1$ holds.

This approach is essentially a probabilistic proof by contradiction. The conditional probability $P(D\in S | H_0)$ is only defined if $P(H_0)>0$, which, in the frequentist perspective, is equivalent to supposing $P(H_0)=1$. A hypothesis test that rejects the null hypothesis shows that the this assumption leads to an outcome which we deem to be so unlikely that we are justified in excluding it. If instead of testing the null hypothesis we tested the alternative hypothesis and somehow calculated $P(D\in S | H_1)$, then we would be implicitly assuming the desired conclusion that $P(H_1)=1$, which makes the argument circular. This makes it impossible in this framework to directly establish that any one specific model is the correct hypothesis, we can only argue “by contradiction” and eliminate individual hypotheses.

In many areas of social science and science, the acceptable $\alpha$ for a publishable result is $\alpha = 0.05$. Let’s suppose that $P(D\in S|H_0)=0.049999$. The claim is that if the treatment had no effect, then we would only witness patient outcomes like the above about once in every 20 experiments, while if the treatment had any effect whatsoever we would see outcomes like the above about 19 times out of every 20 experiments.

Allow me to rant a little bit about what I dislike about this approach:

This approach says nothing about whether or not the treatment improves outcomes, just how it compares to not doing the treatment.
It does not say anything about the size of the effect (one could argue that if we include the power of a test, then we can quantify the effect size; I will discuss this below). For example, smoking, some headache medicines, and stress have been shown to increase the risk of birth defects for pregnant women and hence doctors typically advise pregnant women to avoid all of these things². This advice is not weighted according to the degree of risk, partly because these methods say nothing about the degree of risk. So, if a pregnant woman is stressed because she can’t have a cigarette and she has a headache what should she do?
Since the academic culture is almost exclusively interested in positive results, essentially only the studies that reject the null hypothesis are published. This leads to some additional problems:
- Since many experiments do not lead to positive outcomes, there is a good chance that different academic groups are repeating the same (or similar) experiments. Even if the null hypothesis is true, the probability that one of these groups will generate an unlikely dataset that leads to a statistically significant result increases with the number of experiments³. This could be avoided by using a much smaller value of $\alpha$. Particle physicists, for example, have used $\alpha \approx 0.0000003$. Of course, particle physicists deal with an enormous amount of data, so reaching such significance levels is possible for them.
- More likely, one experiment leads to data that can be sliced in many ways and one of those slices will lead to a statically significant result; this is called $p$-hacking. Shrinking the acceptable $\alpha$ only makes the $p$-hacking more difficult, but not impossible. One can find a thorough discussion here.

Rejecting the null hypothesis when it is true is called a Type I error. Accepting the null hypothesis when the alternative hypothesis holds is called a Type II error. These names are rather unhelpful and I remember them using the mnemonic “Type I errors are the number one most common type of error in practice”, which I mentally justify using reasons above.

Recall that we reject the null hypothesis at significance level $\alpha$ if the observed data does not lie in an interval $I_\alpha$. Here $I_\alpha$ is determined by $P(D\in I_{\alpha} | H_0) =1-\alpha$. So the probability of incorrectly accepting the null hypothesis, when the alternative hypothesis holds, is
\begin{equation}
P(D\in I_{\alpha} | H_1).\label{eq:power}
\end{equation}
The probability $P(D\not\in I_{\alpha} | H_1)$ is called the power of the test.

As far as I can tell, the alternative hypothesis must be the negation of the null hypothesis for hypothesis testing to be meaningful. So the alternative hypothesis needs to be more vague than the null hypothesis and I can not see any way to calculate \eqref{eq:power} in practice. For example, how do we calculate the probability of seeing data in a given interval if we assume the data is produced by any other model besides that using the null hypothesis?

However, we can calculate \eqref{eq:power}, when $H_1$ is instead some $H(\theta)$. Here $H(\theta)$ is a specific hypothesis other than the null hypothesis. While this information could be helpful (the resulting probability function of $\theta$ in \eqref{eq:power} is called the power function), it is not the probability of committing a type II error and, because of frequentist assumptions, only makes sense when $H(\theta)$ is the “true” hypothesis generating the data⁴. Looking around the internet, there seems to be a great amount of confusion about this.

Since, from a frequentist perspective, $P(D \in I_{\alpha} | H(\theta))$ is only defined when $P(H(\theta))=1$, for all of the possible values of $\theta$ this expression is only well-defined for one specific value of $\theta.$ Since frequentists often consider this expression for the value of $\theta$ estimated from the data, they are making an implicit assumption that their estimate was a perfect match to the one true parameter. A Bayesian (or perhaps anyone) would suggest that this is a rather extreme assumption.

To help understand my skepticism of the use of the power function, let me consider a mathematical example. One of the great open problems is whether or not two objects $P$ and $NP$ are equal. I won’t get into the details of what this means, but I will say that the encryption schemes that we implicitly use for computer security rely on the fact that these two classes are not equal. If they were, there could be relatively efficient algorithms for breaking encryption algorithms. The world, whether they know it or not, has essentially been operating under the assumption that $P\neq NP$.

Now, suppose that someone writes a paper and shows that if $P=NP$, then something wonderful, which we will call $W$, happens. While this would be interesting, we can conclude nothing about whether or not $W$ holds because we do not know that $P=NP$ (in fact, we expect that this assumption is false). Moreover, this paper would open up the possibility that $P=NP$ if and only if $W$ happens. In this case, many would view this result as significant evidence against $W$ happening. To connect to the power function, the calculation of the power function is dependent on an assumption that many would not find very credible and hence they would not have a high degree of confidence that the power function is well-defined under frequentist assumptions⁵.

A beautiful visualization of inference hypothesis testing is available here.

Bayesian hypothesis testing

From the Bayesian perspective, \eqref{eq:b-hyp} is a meaningful statement that can be checked. Moreover, Bayesians are not forced to only consider the alternative hypothesis $H_1$ that anything other than the null hypothesis holds. They can identify specific hypotheses to explain the data and check the probability of them holding. As described above, the only way to make sense of this is to assign non-zero prior probabilities $p(H)$ for each possible hypothesis being considered. Bayesians do not have qualms about this and make a principled choice for these prior possibilities. This leads to significantly more refined statements (which are necessarily accompanied by clear assumptions) which define models for the data distribution. These models provide estimates of the size of an effect and can be tested for accuracy on future data.

At least this is what I would like to show. ↩
The length of the list of things that pregnant women are supposed to avoid is truly incredible. ↩
When one group keeps repeating similar experiments to obtain the desired results, I would classify this as academic misconduct. However, private companies typically have a vested interest in establishing one outcome and can fund many different groups to do similar studies and only release the ones that support the private company’s preferred outcome. This is a very good reason to look at privately funded studies with skepticism. ↩
An extremely poor practice is calculating the power function for the hypothesis that best fits the observed data and stating that this is the probability of type II error for the given test. This is clearly absurd from first principles. ↩
Granted, this conclusion is based on a Bayesian perspective, but I am simultaneously arguing that many of us implicitly have this perspective. Even mathematicians working under the assumption that a conjecture is either true or false, will often assign a degree of belief to one statement or the other. ↩

Statistical Inference

All models are wrong, but some are useful. – George Box

Introduction

The general setup for statistical inference is that we are given some data $D$ which we assume arise as the values of a random variable that we assume is distributed according to some parametric model $m(\theta)$. The goal is to estimate $\theta$ in this circumstance and to give some measure of the certainty of our estimate. There are two schools of thought on this problem. One can assume that the true parameter value $\hat{\theta}$ is fixed and that we can evaluate the correctness of an estimate by running repeated trials (i.e., sampling $D$ from $m(\hat{\theta})$), this is the frequentist school. The other, Bayesian, school assumes that the data is given and that $\theta$ is actually a random variable.

For more in depth coverage of these approaches please consult:

Question

Consider the following standard situations:

We are trying to determine the percentage $\theta$ of likely-voters who will vote for a particular candidate in an election. We gather data by polling a sample from the population.
We are trying to determine the average standardized test scores $\theta$ of students taught with a new method. We gather data by teaching a test group of students and calculating the average scores of the students.
We are trying to predict the probability $\theta$ of rain tomorrow based on the meteorological data we have gathered.

Is there a “true” value of $\theta$? Or is the proposed $\theta$ we are trying to estimate subject to many additional factors which are difficult or impossible to specify? Is it possible to run repeated tests and somehow sample from the true population?

Sometimes is is said there is also a third school of thought, the “likelihoodists”. In practice, the likelihoodists try to find an optimal value of the likelihood $p(D|\theta)$ while the Bayesian’s typically try to find an optimal value of the posterior $p(\theta | D)$. These are both related by Bayes rule, but the choice of which expression to simplify and what to average over reflects what the user is allowing to vary¹. One feature of the Bayesian method is that they must specify a prior $p(\theta)$ and while there are guidelines for what should be a good choice, there is still a choice to be made and this makes some people unhappy. Both methods often yield similar results (in fact, when assuming a uniform prior, the Bayesian approach is the likelihood approach), but their interpretations differ because of the differing assumptions.

Bayesian methods in fact provide several different approaches to parameter estimation. The most common of which is the MAP estimate. While one could use any estimation method from either the Bayesian or frequentist approach, how the quality of an estimate is evaluated differs from the two approaches. For example, the frequentist perspective leads to the notion of an unbiased estimator, which is not a well-defined notion in the Bayesian approach. The frequentist and Bayesian schools generate (generally different) intervals which are used to quantify our confidence in a given estimate.

Example

To clarify the difference between the frequentist and Bayesian approaches, we suppose that we have to estimate a parameter $\theta$ of a model $m(\hat{\theta})$ given some data $D$. The frequentists produce an estimate $\theta_f=\delta_f(D)$ and the Bayesians produce an estimate $\theta_B$. We would like some measure of the uncertainty in our estimate. The frequentists will produce a 95% confidence interval: $$I_f(D)=\{l(D)\leq \theta_f=\delta_f(D) \leq u(D)\}.$$

A desirable interpretation of this interval is that the value of the true parameter $\hat{\theta}$ has a 95% chance of lying in this interval, however such a statement is meaningless from the frequentist approach: The true parameter $\hat{\theta}$ is fixed, so it either is or is not in $I_f$. Instead the confidence interval satisfies the following: if we were to keep producing datasets $D_1, D_2, \cdots$ sampled from $m(\hat{\theta})$ then $\hat{\theta}$ will lie in 95% of the confidence intervals $I_f(D_1), I_f(D_2), \cdots$. For a visualization of confidence intervals look here.

Since the Bayesians assume that $\hat{\theta}$ is in fact a random variable, they can make sense of the claim that there is a 95% chance of $\hat{\theta}$ lying in some given interval $I_B(D)$ given the data $D$. The Bayesians can produce such intervals and they are called 95% credible intervals.

The frequentist approach also leads to black swan phenomena. For example, suppose that we want to estimate the bias $\theta$ of a coin. The $\theta$ will correspond to the probability of obtaining a heads on one coin flip. If the observed data is two coin flips which each yield heads, then the maximum likelihood estimate for $\theta$ is 1. If we ask for a 99.999999% confidence interval for this measurement we will just obtain the one element set $\{1\}$. Having never seen a tails, this method constructs a model that assigns the probability of seeing tails to 0! Moreover, the model is extremely confident in this prediction even though a fair coin would produce this outcome 25% of the time. So the frequentist approach can lead to highly overconfident estimates.

On the other end of the spectrum, we consider the following example due to Berger. Suppose that two integers $D=(x_1, x_2)$ are drawn from a parametric distribution of the form:
$$ p(x|\theta) = \begin{cases}
0.5 & x = \theta\\
0.5 & x = \theta+1 \\
0 & x \not\in \{\theta, \theta+1\}
\end{cases}
$$
If $\theta = 5$, then we would expect of the following outcomes with equal probability $(5,5), (5,6), (6,5), (6,6)$. Let $m=\min(x_1,x_2)$ and consider the one element interval $[m,m]$. This confidence interval will contain $\theta$ for each of the previous samples except for $(6,6)$, so this will be a 75% confidence interval. But note that if $D=(5,6)$, then $P(\theta = 5 | D)=1.0$, so the model is underconfident about this estimate.

Frequentists are Bayesian (sort of)

In the Bayesian world view, we are studying a joint probability distribution on a data space $\Delta$ and a parameter space $\tau$. The frequentists object to the Bayesian specification of a prior $P(\theta)$ for two closely related reasons:

By specifying a prior, they are tipping the scales and expressing a bias for which they have no evidence.
Frequentists assume that the data is generated from one true model using a particular $\theta_*\in \tau$, so $\theta$ is not random.

We can point out that the frequentist assumption in the second reason above is equivalent to specifying a prior of a specific form. Namely, the frequentists assume that $P(\theta)$ is 1 when $\theta=\theta_*$ is the, unknown, true quantity and 0 otherwise (i.e., $\tau$ follows a Dirac-$\delta$ distribution). So the frequentist approach does fit into the Bayesian setup, they just use a very specific form for the prior which depends on an unknown and, in practice, unknowable quantity.

Unfortunately, this choice of a prior eliminates any possibility of using Bayesian inference (try to apply Bayes rule when using the frequentist prior and see for yourself). On the other hand, it does mean that $p(D |\theta_*)=p(D)$ and hence all witnessed data is generated from the only possible model, while for the Bayesians the data is pulled from all possible conditional distributions $p(D|\theta)$. The Bayesian response is that they don’t care: They simply generate the best value of $\theta$ that fits whatever data they have seen.

Finally, we should point out that in the presence of more and more data sampled from the type of parametric model we are, the Bayesian posterior $p(\theta | D)$ gradually approximates a Dirac-$\delta$ distribution and converges to the frequentist assumption.

It is worth pointing out at this point, that the expression $p(D|\theta)$ only makes sense if $p(\theta)>0$ (so the likelihoodists at least dip their toes into the water of Bayesian methods when constructing estimates). Similarly, $p(\theta | D)$ only makes sense if $p(D)>0$, in which case the Bayesians are also assuming the given data is generated by some random process. ↩

Machine Learning Overview

Science is knowledge which we understand so well that we can teach it to a computer; and if we don’t fully understand something, it is an art to deal with it. Donald Knuth

Introduction

First Attempt at a Definition

One says that an algorithm learns if its performance improves with additional experience.

This is not a precise definition, so let’s try to make it so.

Definition

An algorithm will be a function¹ $f\colon D\to T$ and whose individual values can be calculated by a computer (i.e., a Turing machine) in a finite amount of time using a finite amount of memory. We can refer to $D$ as the space of inputs or samples or examples and $T$ as the outputs or labels.

For the following definition we will want to be able to form a sum indexed by the elements of $D$. For this purpose we could suppose that our input is finite or that the domain is a measure space and that the algorithm will be a measurable function. For simplicity and convenience, we will pretend that $D$ is finite. That is if $D$ is infinite, then we have actually implicitly restricted to to a finite subset (e.g., $D=\Bbb R$ should really be interpreted as the finite subset of floating point numbers expressed in some fixed number of bits). This poses no real restriction in applications (all real world problem domains are effectively finite).

Definition

A performance measure of an algorithm will be a function $P\colon D\times T \to \Bbb R$. For two algorithms $f_1$ and $f_2$, we will say that $f_1$ performs better than $f_2$ (relative to $P$) if $$\sum_{d\in D} P(d,f_1(d)) > \sum_{d\in D} P(d, f_2(d)).$$ We similarly define ‘performs worse than’ and ‘performs at least as well as’, etc.

Alternatively, we may be presented a cost function $C\colon D\times T\to \Bbb R$ which we want to minimize. One can postcompose any performance measure (resp. cost function) with an order reversing map to obtain a cost function (resp. performance measure). This allows us to pass between the two notions.

Definition

A machine learning algorithm is an algorithm $f_-\colon 2^E \times D\to T$, that learns with respect to a specified performance measure $P$ (defined below). Here $2^E$ is the set of all subsets of $E$. We call an element $S\in 2^E$ (i.e., a subset $S\subseteq E$) a set of training data. For such an $S$ and $f_-$, let $f_S=f_-(S,-) \colon D\to T$ denote the resulting algorithm (which we will say is trained on $S$).

Example

Consider the problem of finding a linear function $f\colon D\subset \Bbb R\to \Bbb R$ which closely approximates a fixed function $F\colon D\subset \Bbb R\to \Bbb R$. We can regard the function $F$ as a subset $S$ of $E=D\times \Bbb R$ by $F\mapsto S= \{(s,F(s)) | s\in D\}$². We can define a cost function $C\colon D\times \Bbb R\to \Bbb R$ by $$C(d,t)=(F(d)-t)^2.$$ We can reformulate problem of minimizing the cost of $C(d,f(d))$ as maximizing the performance function $P(d,t)=\frac{1}{1+C(d,t)}$. The method of Linear Regression will define a machine learning algorithm $f_-\colon 2^E\times D\to \Bbb R$.

One can play with this problem using this demo.

Second Attempt at a Definition

A machine learning algorithm $f\colon 2^E\times D\to T$ learns (relative to a performance measure $P$) if for all proper subset inclusions $S_1\subset S_2\subseteq E$, $f_{S_2}$ performs better than $f_{S_1}$.

This is at least clearly defined, but it does not cover many examples of learning algorithms. For example, the method of linear regression will perform poorly when given a small training sample containing outliers (those $d\in D$ such that $F(d)$ takes on an unlikely value under some assumed probability distribution associated to $F$). So, the act of adjoining outliers to our training data will typically decrease the performance of this machine learning algorithm.

Final Definition

A machine learning algorithm $f\colon 2^E\times D\to T$ learns (relative to a performance measure $P$) if for each pair of natural numbers $m<n$, the average of the performances of $f_{S}$ over the subsets $S\subseteq E$ of cardinality $n$ is better than the average of the performances over the subsets of cardinality $m$.

Crucial remark

Without a specified performance measure, we have no obvious goal in machine learning. We can only evaluate machine learning algorithms relative to a performance measure³ and an algorithm that performs well with respect to one measure may perform poorly with respect to another. So the field of machine learning depends on first finding a good performance measure for the task at hand and then finding a machine learning algorithm that performs well with respect to this measure.

Remarks

Now that we have come to a definition of a learning algorithm that learns that seems to cover at least one standard example, I will unveil some unfortunate truths:

In practice, we often do not have a performance measure that is defined over all of $D\times T$, instead we only have it over a subset.
Machine learning algorithms are often presented without proofs that they learn in the sense above.

Taxonomy of machine learning

We can roughly divide up the field of machine learning into 3 branches:

Supervised Learning
Unsupervised Learning
Reinforcement learning

Supervised learning

Before getting into what supervised learning precisely is, let’s look at some examples of supervised learning tasks:

Identifying breast cancer.
- A sample study.
Image classification.
- List of last year’s ILSVRC Winners
Threat assessment
Language Translation
- NYTimes: Google’s neural machine translation
Identifying faces in images

Definition

Supervised learning is concerned with the construction of machine learning algorithms $f\colon 2^{D\times T}\times D\to T$. The subsets of $D\times T$ are referred to as subsets of labeled examples.

We often assume that the labeled examples arise from restricting some presumed function $F\colon D\to T$ whose values we know on $E\subset D$. We can then train $f$ on pairs $\{s, F(s) | s\in E\}$. We can then devise a cost function which measures the distance from the learned $f$ and the presumed $F$ (e.g., $L^2$-distance).

Supervised learning has been the most successful of the three branches to real world problems. The existence of labeled examples usually leads to well-defined performance metrics that transform supervised learning tasks into two other tasks:

Find an appropriate parametrized class of functions to choose $f$ from.
Solve the optimization problem of finding the best function in that class.

Example

The first example of a supervised learning task is the linear regression problem mentioned above.

Unsupervised learning

First some examples:

Some sample techniques in unsupervised learning can be seen at this beautiful blog post.
Word vector embeddings.
Independent components analysis

Definition

Unsupervised learning is concerned with the construction of machine learning algorithms of the form $f\colon 2^D\times D\to T$.

In this case, there is no obvious measure of performance for such an algorithm. The choice of performance measure essentially defines the unsupervised learning task.

Often, one can vaguely state that the goal of an unsupervised learning task is to summarize the data or determine its essential features. In other words, suppose that we have a function $i\colon T\to D$, then we can state the goal is to find an $f\colon D\to T$ such that $i\circ f$ approximates the identity function. That is our unsupervised learning task is to find a compression algorithm! In this case, we have transformed the unsupervised learning task into a supervised learning task (because we have as many labels as we like of the identity function).

The only task remaining (and it is the hard task) is to make sense of the word `approximates’ above by choosing an appropriate measure. For example, how do you quantitatively measure the quality of a video compression algorithm? Somehow we can usually see the difference between a terrible algorithm and a good one, but it can be difficult to turn this intuition into a well-defined measure.

Example

One of the typical tasks in unsupervised learning is to identify ‘clusters’ of data points in a high dimensional space. We may be looking for difficult to identify relationships between data points. Once we have identified a classification that we deem to be ‘meaningful’, then our algorithm should be able to sort new data points into the correct classes. For example, can try to look at demographic data to identify clusters of voters with similar voting patterns and then try to identify the correct approach to appeal to this group.

Reinforcement learning

Reinforcement learning is applied to tasks where one must choose actions whose outcomes eventually lead to various rewards/punishments which are used to encourage or discourage certain choices:

Breakout
One of the more difficult challenges is Montezuma’s revenge where there is very little feedback and one of the implicit motivations is curiousity.
Another truly amazing achievement is AlphaGo, or its successor AlphaGo Zero.
Self-driving cars

To construct a reasonably precise definition, let’s describe a typical reinforcement learning task⁴. In this case the algorithm is to construct an agent (the name comes from the field of artificial intelligence). The agent learns a policy which is a mapping from perceived states to actions. Paired with the agent is an environment. The environment maintains an environment state and when given an action returns an observation and a reward to the agent, which it then uses to update its policy and perceived state and choose an action. This creates a feedback loop:

We can break up the agent’s task into two parts:

It must take in an observation and reward and update its perceived state. To update its perceived state it might maintain a model of its environment and use the observation and reward to update this model.
It must take in a sequence of observations and rewards and update its policy. The policy is supposed to choose an allowable action and this decision could be made by defining a value function, which assigns to each possible action given our state a value, and choosing the action that maximizes the value. Since the agent is operating with imperfect information, the value function will only be an estimate, and it may choose to alternatively explore and choose a lower valued action in the hopes that such a state is being undervalued. We can model this exploration possibility, by saying that the policy assigns a probability to each possible action from a fixed perceived state.

Writing the general task out as a learning algorithm will get quite cumbersome. So let’s just hit the main points.

The environment will be encoded as a probability space $S_e \times A\times S_e$, which encodes the probability of transitioning from one environment state to another given a fixed action.
The policy will be encoded as a probability space $S_p \times A$, which encodes the probability of choosing an action from a given probability.
We will also require some type of model, which will be a probability space $S_p\times A\times O\times R\times S_p$ which will encode the probability of moving from one perceived state to another given an action, an observation, and a reward.
We will be a given a environmental feedback, which will be a random variable $F\colon S_e\times A\times S_e \to O\times R$.

All of these components can then be combined and we can form the expected reward of all sequences of length $n$ of tuples of the above variables⁵. The reinforcement learning task is to learn the policy function and the interpreter function so that we can maximize the expected reward over sequences of length $n$ as $n$ goes to $\infty$.

In this generality, the task is extremely difficult. We could simplify things by assuming the environment provides complete feedback about the perceived state, so that the observation is the perceived state (for example, a game with perfect information). But even then this task can be extremely difficult if there are many potential perceived states and many potential actions. For example, in the game Go there are two players (black and white) which take alternating turns placing pieces on a 19×19 board, occasionally removing captured pieces. As an approximate upper bound on the perceived state space we can see that there are no more than $3^{19\cdot 19}\approx 10^{172}$ possible board configurations (each square is either black, white, or blank), from each of these positions we have no more than $19^2=361$ possible actions choose from. Searching through all possible chains of configurations and actions is completely intractable. Even encoding an optimal policy in tabular form is impossible. This is part of the reason that performing well on this game was such a high-hurdle for the AI community.

We could also allow partially defined functions. ↩
Indeed if you look under the hood, this is how functions are actually defined.
↩
Okay, runtime and memory usage are also important too. ↩
I must admit this first attempt at finding a definition of reinforcement learning that fits into the above framework is a bit clumsy and accurately reflects my limited knowledge in this area. ↩
As a non-trivial exercise write this down for sequences of length 1 and 2. ↩