Hypothesis Testing

Acceptance without proof is the fundamental characteristic of Western religion, rejection without proof is the fundamental characteristic of Western science. – Gary Zukav, “The Dancing Wu Li Masters”

Hypothesis Testing

Now we consider Hypothesis Testing in an example. While Bayesians also have a form of hypothesis testing, the term is almost always used for the frequentist approach. We will describe this first in an example here.

(Frequentist) Hypothesis Testing

Suppose we produce a cancer treatment that we believe will increase the chance of a patient living for another year compared to standard treatment. We run a controlled experiment where we divide up a group of patients into two groups, the control group and the test group (one has to be extremely careful about how this is done in order to get meaningful results). The control group receives standard treatment and the test group receives the new treatment. We follow the patients for a year and record the outcomes as $D$. We would like to show our hypothesis $H_1$, that our new treatment changes outcomes, holds as opposed to the null hypothesis $H_0$ that the treatment has no effect. Associated to $H_0$ is a statistical model from which we could provide predictions about patient outcomes. The hypothesis $H_1$ is less specific, it really hypothesizes that the data is given by any other model besides $H_0$.

Clearly, what we want to show¹ is that
\begin{equation}
p(H_1 | D) > p(H_0 | D), \label{eq:b-hyp}
\end{equation}

that is given the data from our experiment, it is more likely that our treatment improves outcomes than it does nothing. However, \eqref{eq:b-hyp} is meaningless from the frequentist perspective: in this approach there is only one true model giving rise to the data, so either $H_1$ holds or it doesn’t (in which case $H_0$ holds). Since $p(H_i)$ is either 0 or 1 and we can’t know which ahead of time, the frequentist instead considers $p(D | H_0)$ which, since $H_0$ defines a specific model, can be calculated. So the experimenter, calculates the probability $P(D\in S| H_0)$, where $S$ is some set of (often extremal) values of data containing the observed data $D$. If this probability is less than some specific small $\alpha\in (0,1)$, then they conclude that the null hypothesis is unlikely to hold and reject the null hypothesis at the significance level $\alpha$ and conclude that $H_1$ holds.

This approach is essentially a probabilistic proof by contradiction. The conditional probability $P(D\in S | H_0)$ is only defined if $P(H_0)>0$, which, in the frequentist perspective, is equivalent to supposing $P(H_0)=1$. A hypothesis test that rejects the null hypothesis shows that the this assumption leads to an outcome which we deem to be so unlikely that we are justified in excluding it. If instead of testing the null hypothesis we tested the alternative hypothesis and somehow calculated $P(D\in S | H_1)$, then we would be implicitly assuming the desired conclusion that $P(H_1)=1$, which makes the argument circular. This makes it impossible in this framework to directly establish that any one specific model is the correct hypothesis, we can only argue “by contradiction” and eliminate individual hypotheses.

In many areas of social science and science, the acceptable $\alpha$ for a publishable result is $\alpha = 0.05$. Let’s suppose that $P(D\in S|H_0)=0.049999$. The claim is that if the treatment had no effect, then we would only witness patient outcomes like the above about once in every 20 experiments, while if the treatment had any effect whatsoever we would see outcomes like the above about 19 times out of every 20 experiments.

Allow me to rant a little bit about what I dislike about this approach:

This approach says nothing about whether or not the treatment improves outcomes, just how it compares to not doing the treatment.
It does not say anything about the size of the effect (one could argue that if we include the power of a test, then we can quantify the effect size; I will discuss this below). For example, smoking, some headache medicines, and stress have been shown to increase the risk of birth defects for pregnant women and hence doctors typically advise pregnant women to avoid all of these things². This advice is not weighted according to the degree of risk, partly because these methods say nothing about the degree of risk. So, if a pregnant woman is stressed because she can’t have a cigarette and she has a headache what should she do?
Since the academic culture is almost exclusively interested in positive results, essentially only the studies that reject the null hypothesis are published. This leads to some additional problems:
- Since many experiments do not lead to positive outcomes, there is a good chance that different academic groups are repeating the same (or similar) experiments. Even if the null hypothesis is true, the probability that one of these groups will generate an unlikely dataset that leads to a statistically significant result increases with the number of experiments³. This could be avoided by using a much smaller value of $\alpha$. Particle physicists, for example, have used $\alpha \approx 0.0000003$. Of course, particle physicists deal with an enormous amount of data, so reaching such significance levels is possible for them.
- More likely, one experiment leads to data that can be sliced in many ways and one of those slices will lead to a statically significant result; this is called $p$-hacking. Shrinking the acceptable $\alpha$ only makes the $p$-hacking more difficult, but not impossible. One can find a thorough discussion here.

Rejecting the null hypothesis when it is true is called a Type I error. Accepting the null hypothesis when the alternative hypothesis holds is called a Type II error. These names are rather unhelpful and I remember them using the mnemonic “Type I errors are the number one most common type of error in practice”, which I mentally justify using reasons above.

Recall that we reject the null hypothesis at significance level $\alpha$ if the observed data does not lie in an interval $I_\alpha$. Here $I_\alpha$ is determined by $P(D\in I_{\alpha} | H_0) =1-\alpha$. So the probability of incorrectly accepting the null hypothesis, when the alternative hypothesis holds, is
\begin{equation}
P(D\in I_{\alpha} | H_1).\label{eq:power}
\end{equation}
The probability $P(D\not\in I_{\alpha} | H_1)$ is called the power of the test.

As far as I can tell, the alternative hypothesis must be the negation of the null hypothesis for hypothesis testing to be meaningful. So the alternative hypothesis needs to be more vague than the null hypothesis and I can not see any way to calculate \eqref{eq:power} in practice. For example, how do we calculate the probability of seeing data in a given interval if we assume the data is produced by any other model besides that using the null hypothesis?

However, we can calculate \eqref{eq:power}, when $H_1$ is instead some $H(\theta)$. Here $H(\theta)$ is a specific hypothesis other than the null hypothesis. While this information could be helpful (the resulting probability function of $\theta$ in \eqref{eq:power} is called the power function), it is not the probability of committing a type II error and, because of frequentist assumptions, only makes sense when $H(\theta)$ is the “true” hypothesis generating the data⁴. Looking around the internet, there seems to be a great amount of confusion about this.

Since, from a frequentist perspective, $P(D \in I_{\alpha} | H(\theta))$ is only defined when $P(H(\theta))=1$, for all of the possible values of $\theta$ this expression is only well-defined for one specific value of $\theta.$ Since frequentists often consider this expression for the value of $\theta$ estimated from the data, they are making an implicit assumption that their estimate was a perfect match to the one true parameter. A Bayesian (or perhaps anyone) would suggest that this is a rather extreme assumption.

To help understand my skepticism of the use of the power function, let me consider a mathematical example. One of the great open problems is whether or not two objects $P$ and $NP$ are equal. I won’t get into the details of what this means, but I will say that the encryption schemes that we implicitly use for computer security rely on the fact that these two classes are not equal. If they were, there could be relatively efficient algorithms for breaking encryption algorithms. The world, whether they know it or not, has essentially been operating under the assumption that $P\neq NP$.

Now, suppose that someone writes a paper and shows that if $P=NP$, then something wonderful, which we will call $W$, happens. While this would be interesting, we can conclude nothing about whether or not $W$ holds because we do not know that $P=NP$ (in fact, we expect that this assumption is false). Moreover, this paper would open up the possibility that $P=NP$ if and only if $W$ happens. In this case, many would view this result as significant evidence against $W$ happening. To connect to the power function, the calculation of the power function is dependent on an assumption that many would not find very credible and hence they would not have a high degree of confidence that the power function is well-defined under frequentist assumptions⁵.

A beautiful visualization of inference hypothesis testing is available here.

Bayesian hypothesis testing

From the Bayesian perspective, \eqref{eq:b-hyp} is a meaningful statement that can be checked. Moreover, Bayesians are not forced to only consider the alternative hypothesis $H_1$ that anything other than the null hypothesis holds. They can identify specific hypotheses to explain the data and check the probability of them holding. As described above, the only way to make sense of this is to assign non-zero prior probabilities $p(H)$ for each possible hypothesis being considered. Bayesians do not have qualms about this and make a principled choice for these prior possibilities. This leads to significantly more refined statements (which are necessarily accompanied by clear assumptions) which define models for the data distribution. These models provide estimates of the size of an effect and can be tested for accuracy on future data.

At least this is what I would like to show. ↩
The length of the list of things that pregnant women are supposed to avoid is truly incredible. ↩
When one group keeps repeating similar experiments to obtain the desired results, I would classify this as academic misconduct. However, private companies typically have a vested interest in establishing one outcome and can fund many different groups to do similar studies and only release the ones that support the private company’s preferred outcome. This is a very good reason to look at privately funded studies with skepticism. ↩
An extremely poor practice is calculating the power function for the hypothesis that best fits the observed data and stating that this is the probability of type II error for the given test. This is clearly absurd from first principles. ↩
Granted, this conclusion is based on a Bayesian perspective, but I am simultaneously arguing that many of us implicitly have this perspective. Even mathematicians working under the assumption that a conjecture is either true or false, will often assign a degree of belief to one statement or the other. ↩

Hypothesis Testing