# Statistical Inference

All models are wrong, but some are useful. – George Box

## Introduction

The general setup for statistical inference is that we are given some data $D$ which we assume arise as the values of a random variable that we assume is distributed according to some parametric model $m(\theta)$. The goal is to estimate $\theta$ in this circumstance and to give some measure of the certainty of our estimate. There are two schools of thought on this problem. One can assume that the true parameter value $\hat{\theta}$ is fixed and that we can evaluate the correctness of an estimate by running repeated trials (i.e., sampling $D$ from $m(\hat{\theta})$), this is the frequentist school. The other, Bayesian, school assumes that the data is given and that $\theta$ is actually a random variable.

For more in depth coverage of these approaches please consult:

#### Question

Consider the following standard situations:

1. We are trying to determine the percentage $\theta$ of likely-voters who will vote for a particular candidate in an election. We gather data by polling a sample from the population.
2. We are trying to determine the average standardized test scores $\theta$ of students taught with a new method. We gather data by teaching a test group of students and calculating the average scores of the students.
3. We are trying to predict the probability $\theta$ of rain tomorrow based on the meteorological data we have gathered.

Is there a “true” value of $\theta$? Or is the proposed $\theta$ we are trying to estimate subject to many additional factors which are difficult or impossible to specify? Is it possible to run repeated tests and somehow sample from the true population?

Sometimes is is said there is also a third school of thought, the “likelihoodists”. In practice, the likelihoodists try to find an optimal value of the likelihood $p(D|\theta)$ while the Bayesian’s typically try to find an optimal value of the posterior $p(\theta | D)$. These are both related by Bayes rule, but the choice of which expression to simplify and what to average over reflects what the user is allowing to vary1. One feature of the Bayesian method is that they must specify a prior $p(\theta)$ and while there are guidelines for what should be a good choice, there is still a choice to be made and this makes some people unhappy. Both methods often yield similar results (in fact, when assuming a uniform prior, the Bayesian approach is the likelihood approach), but their interpretations differ because of the differing assumptions.

Bayesian methods in fact provide several different approaches to parameter estimation. The most common of which is the MAP estimate. While one could use any estimation method from either the Bayesian or frequentist approach, how the quality of an estimate is evaluated differs from the two approaches. For example, the frequentist perspective leads to the notion of an unbiased estimator, which is not a well-defined notion in the Bayesian approach. The frequentist and Bayesian schools generate (generally different) intervals which are used to quantify our confidence in a given estimate.

#### Example

To clarify the difference between the frequentist and Bayesian approaches, we suppose that we have to estimate a parameter $\theta$ of a model $m(\hat{\theta})$ given some data $D$. The frequentists produce an estimate $\theta_f=\delta_f(D)$ and the Bayesians produce an estimate $\theta_B$. We would like some measure of the uncertainty in our estimate. The frequentists will produce a 95% confidence interval: $$I_f(D)=\{l(D)\leq \theta_f=\delta_f(D) \leq u(D)\}.$$

A desirable interpretation of this interval is that the value of the true parameter $\hat{\theta}$ has a 95% chance of lying in this interval, however such a statement is meaningless from the frequentist approach: The true parameter $\hat{\theta}$ is fixed, so it either is or is not in $I_f$. Instead the confidence interval satisfies the following: if we were to keep producing datasets $D_1, D_2, \cdots$ sampled from $m(\hat{\theta})$ then $\hat{\theta}$ will lie in 95% of the confidence intervals $I_f(D_1), I_f(D_2), \cdots$. For a visualization of confidence intervals look here.

Since the Bayesians assume that $\hat{\theta}$ is in fact a random variable, they can make sense of the claim that there is a 95% chance of $\hat{\theta}$ lying in some given interval $I_B(D)$ given the data $D$. The Bayesians can produce such intervals and they are called 95% credible intervals.

The frequentist approach also leads to black swan phenomena. For example, suppose that we want to estimate the bias $\theta$ of a coin. The $\theta$ will correspond to the probability of obtaining a heads on one coin flip. If the observed data is two coin flips which each yield heads, then the maximum likelihood estimate for $\theta$ is 1. If we ask for a 99.999999% confidence interval for this measurement we will just obtain the one element set $\{1\}$. Having never seen a tails, this method constructs a model that assigns the probability of seeing tails to 0! Moreover, the model is extremely confident in this prediction even though a fair coin would produce this outcome 25% of the time. So the frequentist approach can lead to highly overconfident estimates.

On the other end of the spectrum, we consider the following example due to Berger. Suppose that two integers $D=(x_1, x_2)$ are drawn from a parametric distribution of the form:
$$p(x|\theta) = \begin{cases} 0.5 & x = \theta\\ 0.5 & x = \theta+1 \\ 0 & x \not\in \{\theta, \theta+1\} \end{cases}$$
If $\theta = 5$, then we would expect of the following outcomes with equal probability $(5,5), (5,6), (6,5), (6,6)$. Let $m=\min(x_1,x_2)$ and consider the one element interval $[m,m]$. This confidence interval will contain $\theta$ for each of the previous samples except for $(6,6)$, so this will be a 75% confidence interval. But note that if $D=(5,6)$, then $P(\theta = 5 | D)=1.0$, so the model is underconfident about this estimate.

### Frequentists are Bayesian (sort of)

In the Bayesian world view, we are studying a joint probability distribution on a data space $\Delta$ and a parameter space $\tau$. The frequentists object to the Bayesian specification of a prior $P(\theta)$ for two closely related reasons:

1. By specifying a prior, they are tipping the scales and expressing a bias for which they have no evidence.
2. Frequentists assume that the data is generated from one true model using a particular $\theta_*\in \tau$, so $\theta$ is not random.

We can point out that the frequentist assumption in the second reason above is equivalent to specifying a prior of a specific form. Namely, the frequentists assume that $P(\theta)$ is 1 when $\theta=\theta_*$ is the, unknown, true quantity and 0 otherwise (i.e., $\tau$ follows a Dirac-$\delta$ distribution). So the frequentist approach does fit into the Bayesian setup, they just use a very specific form for the prior which depends on an unknown and, in practice, unknowable quantity.

Unfortunately, this choice of a prior eliminates any possibility of using Bayesian inference (try to apply Bayes rule when using the frequentist prior and see for yourself). On the other hand, it does mean that $p(D |\theta_*)=p(D)$ and hence all witnessed data is generated from the only possible model, while for the Bayesians the data is pulled from all possible conditional distributions $p(D|\theta)$. The Bayesian response is that they don’t care: They simply generate the best value of $\theta$ that fits whatever data they have seen.

Finally, we should point out that in the presence of more and more data sampled from the type of parametric model we are, the Bayesian posterior $p(\theta | D)$ gradually approximates a Dirac-$\delta$ distribution and converges to the frequentist assumption.

1. It is worth pointing out at this point, that the expression $p(D|\theta)$ only makes sense if $p(\theta)>0$ (so the likelihoodists at least dip their toes into the water of Bayesian methods when constructing estimates). Similarly, $p(\theta | D)$ only makes sense if $p(D)>0$, in which case the Bayesians are also assuming the given data is generated by some random process.