{"id":344,"date":"2017-10-10T08:59:07","date_gmt":"2017-10-10T08:59:07","guid":{"rendered":"http:\/\/www.nullplug.org\/ML-Blog\/?p=344"},"modified":"2017-11-02T08:10:36","modified_gmt":"2017-11-02T08:10:36","slug":"parameter-estimation","status":"publish","type":"post","link":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/10\/parameter-estimation\/","title":{"rendered":"Parameter Estimation"},"content":{"rendered":"<blockquote><p>\n  \u2026the statistician knows\u2026that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world. &#8211; George Box (JASA, 1976, Vol. 71, 791-799)\n<\/p><\/blockquote>\n<h2>Parameter estimation<\/h2>\n<p>Suppose that we are given some data $D=(x_1,\\cdots, x_n)\\in T^n$. Our task is to find a parametric probability distribution $m(\\hat{\\theta})$ on $T$ such that if we were to independently sample from it $n$ times, we could reasonably<sup id=\"fnref-344-4\"><a href=\"#fn-344-4\" class=\"jetpack-footnote\">1<\/a><\/sup> expect to obtain $D$. We first make some choice<sup id=\"fnref-344-5\"><a href=\"#fn-344-5\" class=\"jetpack-footnote\">2<\/a><\/sup> of possible distributions $m(-)$, which leaves us with the task of identifying the $\\hat{\\theta}$.<\/p>\n<h3>Likelihood approach <a name=\"MLE\"><\/a><\/h3>\n<p>One approach is the <em>maximum likelihood estimate<\/em> or MLE. For this we set  $$\\hat{\\theta}=\\theta_{MLE} = \\mathrm{argmax}_\\theta p(D | \\theta).$$  That is, we choose $\\theta$ so that the probability density of $D$ under the given model is maximized.<\/p>\n<p>There are a few comments that we should make here:<\/p>\n<ol>\n<li>There may not be a maximum for a general distribution. Indeed, the models typically considered for logistic regression do not have a maximum.<\/li>\n<li>If there is a maximum, it may not be unique. 
<\/li>\n<li>This is an optimization problem and when this problem is not <a href=\"https:\/\/en.wikipedia.org\/wiki\/Convex_optimization\">convex<\/a>, we may not have any method that is guaranteed to find a maximum even when one exists. <\/li>\n<\/ol>\n<p>We do not actually address these problems; instead we just say that optimization techniques will give us a parameter value that is &#8220;good enough&#8221;.<\/p>\n<h4>Example<\/h4>\n<p>Suppose that $D=(x_1,\\cdots, x_n)\\in \\Bbb R^n$ and we want to find a normal distribution $N(\\mu, \\sigma^2)$ which models the data under the assumption that these were independent samples.<\/p>\n<p>First we construct the MLE of $\\mu$:<\/p>\n<p>\\begin{align}<br \/>\n \\mu_{MLE} &amp;= \\mathrm{argmax}_{\\mu} p(D | N(\\mu, \\sigma^2)) &#92;&#92;<br \/>\n &amp;= \\mathrm{argmax}_{\\mu} \\prod_{i=1}^n p(x_i | N(\\mu, \\sigma^2)) &#92;&#92;<br \/>\n &amp;= \\mathrm{argmax}_{\\mu} (2\\pi \\sigma^2)^{-n\/2} \\prod_{i=1}^n e^{\\frac{-(x_i-\\mu)^2}{2\\sigma^2}} &#92;&#92;<br \/>\n &amp;= \\mathrm{argmax}_{\\mu} (2\\pi \\sigma^2)^{-n\/2}  e^{\\frac{-\\sum_{i=1}^n(x_i-\\mu)^2}{2\\sigma^2}} &#92;&#92;<br \/>\n &amp;= \\mathrm{argmax}_{\\mu} \\left(\\log\\left((2\\pi \\sigma^2)^{-n\/2}\\right) - \\frac{\\sum_{i=1}^n(x_i-\\mu)^2}{2\\sigma^2}\\right) &#92;&#92;<br \/>\n &amp;= \\mathrm{argmin}_{\\mu} \\sum_{i=1}^n(x_i-\\mu)^2<br \/>\n\\end{align}<\/p>\n<p>We differentiate the last expression with respect to $\\mu$, set this to 0, and obtain<br \/>\n\\begin{align}<br \/>\n -\\sum_{i=1}^n (x_i -\\mu)&amp;=0 &#92;&#92;<br \/>\n  \\mu_{MLE} &amp;= \\frac{\\sum_{i=1}^n x_i}{n}.<br \/>\n\\end{align}<br \/>\nTaking second derivatives shows that this is in fact a minimum of the sum of squares, and hence a maximum of the likelihood. In other words, the MLE of $\\mu$ is the <em>sample mean<\/em>.<\/p>\n<h4>Exercise<\/h4>\n<p>Show that the MLE of $\\sigma^2$ is the (biased) <em>sample variance<\/em>. 
In other words, $$\\sigma_{MLE}^2 = \\frac{\\sum_{i=1}^n (x_i-\\mu_{MLE})^2}{n}.$$<\/p>\n<h3>Frequentist evaluation <a name=\"freq-eval\"><\/a><\/h3>\n<p>The frequentist approach gives us a method to evaluate the above estimates of the parameters. Suppose that our data is drawn from a true distribution $m(\\hat{\\theta})$ and set $\\overline{\\theta} = E(\\theta_{MLE}(D))$, where the expectation is given by integrating over $D$ given the &#8220;true&#8221; model. Define the <em>bias<\/em> of $\\theta_{MLE}(-)$ by $$\\mathrm{bias}(\\theta_{MLE})=\\overline{\\theta}-\\hat{\\theta}.$$<\/p>\n<p>Let $\\mu_S(D)$ be the sample mean of data drawn from some distribution $m$ with finite mean $\\widehat{\\mu}$. Then we see<br \/>\n\\begin{align}<br \/>\n  E(\\mu_S) &amp; = E\\left(\\frac{\\sum_{i=1}^n X_i}{n}\\right) &#92;&#92;<br \/>\n  &amp; = \\frac{\\sum_{i=1}^n E(X_i)}{n} &#92;&#92;<br \/>\n  &amp; = n\\widehat{\\mu}\/n = \\widehat{\\mu}<br \/>\n\\end{align}<br \/>\nso this is an unbiased estimate.<\/p>\n<p>For a visualization of an unbiased point estimate from the frequentist point of view look <a href=\"https:\/\/students.brown.edu\/seeing-theory\/statistical-inference\/index.html#third\">here<\/a>.<\/p>\n<h4>Exercise<\/h4>\n<ol>\n<li>Show that the expected value of the sample variance $\\sigma^2_S$ of data drawn from some distribution $m$ with finite mean $\\widehat{\\mu}$ and finite variance $\\widehat{\\sigma^2}$ is $$ E(\\sigma^2_S)= \\frac{(n-1) \\widehat{\\sigma^2}}{n}.$$<br \/>\nShow that it follows that the bias of $\\sigma^2_S$ is $-\\widehat{\\sigma^2}\/n$. 
<\/li>\n<li>Show that the <em>unbiased sample variance<\/em> $\\sigma^2_U(X_1,\\cdots,X_n) = \\frac{\\sum_{i=1}^n(X_i-\\mu_S(X_1,\\cdots X_n))^2}{n-1}$ is, in fact, an unbiased estimate.<\/li>\n<li>Construct a symmetric 95% <em>confidence interval<\/em> (see <a href=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/05\/statistical-inference-2\/\">here<\/a> for some additional discussion) for our estimate $\\mu_S$ of $\\widehat{\\mu}$ which is centered about $\\mu_S$, when $n=10$, $\\mu_S=15$ and $\\sigma^2_U = 2$. Hint: for normally distributed data, the random variable $\\frac{\\mu_S-\\widehat{\\mu}}{(\\sigma_U \/\\sqrt{n})}$ has the same distribution as the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Student%27s_t-distribution\">Student&#8217;s $t$-distribution<\/a> with $n-1$ degrees of freedom, whose inverse cdf values can be calculated by statistics packages. <\/li>\n<\/ol>\n<h3>Bias-Variance tradeoff<\/h3>\n<p>Suppose that we have an estimator $\\theta=\\delta(D)$ of $\\hat{\\theta}$ and the expected value of the estimate is $E(\\delta(D))=\\overline{\\theta}$. 
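<p>As a quick sanity check of the bias computations in the exercises above, we can simulate: draw many data sets, compute both variance estimators on each, and average. The following sketch is illustrative code (not part of the original post) and assumes <code>numpy<\/code> is available.<\/p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000
true_var = 4.0

# Draw `trials` independent data sets of size n from N(0, true_var).
data = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))

# Biased (MLE) sample variance: divide by n.
var_mle = ((data - data.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
# Unbiased sample variance: divide by n - 1.
var_unbiased = var_mle * n / (n - 1)

print(var_mle.mean())       # close to (n - 1) * true_var / n = 3.6
print(var_unbiased.mean())  # close to true_var = 4.0
```

<p>Averaged over many data sets, the $n$-denominator estimator comes out near $3.6$ rather than $4.0$, i.e., its bias is $-\\widehat{\\sigma^2}\/n$, while the $n-1$ version is unbiased.<\/p>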
We obtain the following expression for our expected squared error<br \/>\n\\begin{align}<br \/>\n E((\\theta-\\hat{\\theta})^2) &amp;= E(\\left((\\theta-\\overline{\\theta})-(\\hat{\\theta}-\\overline{\\theta})\\right)^2) &#92;&#92;<br \/>\n &amp;= E((\\theta-\\overline{\\theta})^2)-2(\\hat{\\theta}-\\overline{\\theta})E(\\theta-\\overline{\\theta}) + (\\hat{\\theta}-\\overline{\\theta})^2 &#92;&#92;<br \/>\n &amp;= \\mathrm{var}(\\theta)+\\mathrm{bias}^2(\\theta),<br \/>\n\\end{align}<br \/>\nwhere the middle term vanishes because $E(\\theta-\\overline{\\theta})=0$.<\/p>\n<p>This equation indicates that while it may be nice to have an unbiased estimator (i.e., one that will, on average, give precisely the correct parameter), if this comes at the cost of an estimate that varies wildly with the data, then we will still expect a large amount of error. This is why we would also like estimators with low variance.<\/p>\n<p>A typical example of this is seen in polynomial regression (here we are viewing the problem as trying to estimate $p(y|x)$). As we vary the degree of the polynomial being used, we find that very high degree polynomials can fit the data well:<figure id=\"attachment_365\" aria-describedby=\"caption-attachment-365\" style=\"width: 640px\" class=\"wp-caption alignnone\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"365\" data-permalink=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/10\/parameter-estimation\/compressed_polyreg_normal\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?fit=640%2C480\" data-orig-size=\"640,480\" data-comments-opened=\"1\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Polynomial regression\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;High degree polynomials fit the data better&lt;\/p&gt;\n\" data-medium-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?fit=300%2C225\" data-large-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?fit=640%2C480\" src=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?resize=640%2C480\" alt=\"\" width=\"640\" height=\"480\" class=\"size-full wp-image-365\" \/><figcaption id=\"caption-attachment-365\" class=\"wp-caption-text\">High degree polynomials fit the data better<\/figcaption><\/figure><\/p>\n<p>However, the fit for high degree polynomials varies highly depending on the choice of data points:<figure id=\"attachment_361\" aria-describedby=\"caption-attachment-361\" style=\"width: 640px\" class=\"wp-caption alignnone\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"361\" data-permalink=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/10\/parameter-estimation\/polyreg_var_normal\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/polyreg_var_normal.gif?fit=640%2C480\" data-orig-size=\"640,480\" data-comments-opened=\"1\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Degree 10 polynomial regression\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;This estimator has high variance&lt;\/p&gt;\n\" data-medium-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/polyreg_var_normal.gif?fit=300%2C225\" data-large-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/polyreg_var_normal.gif?fit=640%2C480\" src=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/polyreg_var_normal.gif?resize=640%2C480\" alt=\"\" width=\"640\" height=\"480\" class=\"size-full wp-image-361\" \/><figcaption id=\"caption-attachment-361\" class=\"wp-caption-text\">This estimator has high variance<\/figcaption><\/figure><\/p>\n<h3>The MAP Estimate <a name=\"MAP\"><\/a><\/h3>\n<p>One Bayesian approach to parameter estimation is called the <em>MAP estimate<\/em> or <em>maximum a posteriori estimate<\/em>. Here we are given data $D$, which we want to say is modeled by a distribution $m(\\theta)$ and we construct the MAP estimate of $\\theta$ as<br \/>\n\\begin{equation}<br \/>\n\\theta_{MAP}=\\mathrm{argmax}_{\\theta} p(\\theta | D) = \\mathrm{argmax}_{\\theta} \\frac{p(D|\\theta)p(\\theta)}{\\int_{\\theta'} p(D|\\theta')p(\\theta')d\\theta'}.<br \/>\n\\end{equation}<br \/>\nIn other words, we choose the mode of the posterior distribution. 
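<p>Concretely, even when no closed form is available, $\\theta_{MAP}$ can be approximated by maximizing the unnormalized posterior $p(D|\\theta)p(\\theta)$ on a grid. The following sketch is illustrative code (not from the original post; it assumes <code>numpy<\/code>) for a Bernoulli likelihood with a Beta prior:<\/p>

```python
import numpy as np

def map_estimate(heads, tails, alpha=1.0, beta=1.0, grid_size=100_001):
    """Grid-search MAP estimate of a coin's bias under a Beta(alpha, beta) prior.

    Maximizes the log of the unnormalized posterior p(D|theta) * p(theta);
    alpha = beta = 1 is the uniform prior, for which this recovers the MLE.
    """
    theta = np.linspace(1e-6, 1.0 - 1e-6, grid_size)
    log_post = ((heads + alpha - 1) * np.log(theta)
                + (tails + beta - 1) * np.log(1.0 - theta))
    return theta[np.argmax(log_post)]

print(map_estimate(2, 0))            # uniform prior: essentially the MLE, 1.0
print(map_estimate(2, 0, 1.4, 2.3))  # informative prior: about 2.4/3.7
```

<p>Note that with the uniform prior the maximum sits at the boundary $\\theta=1$, so the grid endpoint is returned.<\/p>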
Note that if our prior $p(\\theta)$ is constant independent of $\\theta$, then the only part of the right hand side that depends on $\\theta$ is $p(D|\\theta)$ and the MAP estimate of $\\theta$ is the same as the MLE of $\\theta$.<\/p>\n<h4>Example<\/h4>\n<p>Suppose that we have a coin whose bias $\\theta$ we would like to determine (so $\\theta$ is the probability of the coin coming up heads). We flip the coin twice and obtain some data $D=(H,H)$. Under the assumption that the coin flips are independent we are trying to model a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Bernoulli_distribution\">Bernoulli distribution<\/a>. Now we calculate the conditional probability:<br \/>\n\\begin{equation}<br \/>\n p(D|\\theta) = \\theta^2(1-\\theta)^0<br \/>\n\\end{equation}<br \/>\nand quickly see the MLE for $\\theta$ (given that $\\theta \\in [0,1]$) is what we would expect: $\\theta = 1,$ as mentioned in the hypothesis testing <a href=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/10\/hypothesis-testing\/\">example<\/a> of an overconfident estimator.<\/p>\n<p>As already mentioned, the MLE and MAP estimate agree in the case of the uniform prior, but the Bayesian approach allows us to calculate $p(\\theta | D)$, which gives us a measure of our confidence in this estimate. 
In this case we see:<br \/>\n\\begin{align}<br \/>\n p(\\theta | D) &amp;= \\frac{p(D|\\theta)p(\\theta)}{\\int_{\\theta'} p(D|\\theta')p(\\theta')d\\theta'} &#92;&#92;<br \/>\n&amp;= \\frac{\\theta^2}{\\int_{\\theta'} (\\theta')^2d\\theta'} &#92;&#92;<br \/>\n&amp;= \\frac{\\theta^2}{1\/3} &#92;&#92;<br \/>\n&amp;= 3\\theta^2.<br \/>\n\\end{align}<\/p>\n<p>If we want to find a 99.9% credible interval $[a,1]$ for $\\theta$, then we just integrate<br \/>\n\\begin{align}<br \/>\nP(\\theta \\in [a,1] | D) &amp;=\\int_{\\theta =a}^1 p(\\theta | D)d\\theta &#92;&#92;<br \/>\n &amp;=\\int_{\\theta =a}^1 3\\theta^2 d\\theta &#92;&#92;<br \/>\n &amp;=1-a^3.<br \/>\n\\end{align}<br \/>\nSetting this equal to $1-10^{-3}$, we see that $a=0.1$. This interval is much more conservative than the confidence interval obtained by frequentist methods.<\/p>\n<p>Now what if we do not have a uniform prior? Suppose instead that we have some prior information<sup id=\"fnref-344-6\"><a href=\"#fn-344-6\" class=\"jetpack-footnote\">3<\/a><\/sup> that makes us believe that the most likely value of $\\theta$ is actually 0.25 and that it has variance 0.05. How do we incorporate this information into a prior? Well, the mode and the variance are not sufficient to determine a distribution over the unit interval, so let&#8217;s assume that $p(\\theta)$ has some convenient form that fits these values.<\/p>\n<p>It would simplify matters if $p(\\theta)$ had the same form as $p(D|\\theta)=\\theta^{|H|}(1-\\theta)^{|T|}$. 
So, supposing $p(\\theta)=C\\theta^a(1-\\theta)^b$ for some constant $C$  (i.e., $p(\\theta)\\propto \\theta^a (1-\\theta)^b$), we would see that<br \/>\n\\begin{align}<br \/>\n p(\\theta | D) &amp;\\propto p(D|\\theta)p(\\theta) &#92;&#92;<br \/>\n &amp;\\propto \\theta^{|H|}(1-\\theta)^{|T|} \\theta^a(1-\\theta)^b&#92;&#92;<br \/>\n&amp; = \\theta^{|H|+a}(1-\\theta)^{|T|+b}.<br \/>\n\\end{align}<br \/>\nThis has the same form as the prior! Such a prior is called a <em>conjugate prior<\/em> to the given likelihood function. Having such a prior is super convenient for computation.<\/p>\n<p>In this case, the conjugate prior to the Bernoulli distribution is the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Beta_distribution\">Beta distribution<\/a> which has the form $p(\\theta | \\alpha, \\beta)\\propto \\theta^{\\alpha-1}(1-\\theta)^{\\beta-1}$. One can calculate the mode and variance of this distribution (or just look it up) and get<br \/>\n\\begin{align}<br \/>\n\\textrm{mode} &amp;= \\frac{\\alpha -1}{\\alpha+\\beta-2} &#92;&#92;<br \/>\n\\sigma^2 &amp;= \\frac{\\alpha \\beta}{(\\alpha+\\beta)^2(\\alpha+\\beta+1)}<br \/>\n\\end{align}<br \/>\nPlugging this into <a href=\"https:\/\/www.wolframalpha.com\/input\/?i=Solve(+(a-1)%2F(a%2Bb-2)%3D0.25,+(a*b)%2F((a%2Bb)%5E2(a%2Bb%2B1))%3D0.05+for+a,b)\">Wolfram Alpha<\/a> gives us some approximate parameter values: $\\alpha \\approx 1.4$ and $\\beta \\approx 2.3$.<\/p>\n<p>Using this prior, the new posterior is<br \/>\n$$ p(\\theta | &#123;HH&#125;)\\propto \\theta^{2.4}(1-\\theta)^{1.3}.$$ We can form the MAP estimate by taking the $\\log$ of this expression (which is fine away from the points $\\theta = 0, 1$, which we know cannot yield the maximum), differentiating with respect to $\\theta$, setting this to 0, and finally checking that the second derivative at this point is negative.<\/p>\n<h4>Exercise<\/h4>\n<p>Show that the MAP estimate for this problem is $2.4\/3.7\\approx 0.65$ by calculating the mode of the $\\beta$ density 
function.<\/p>\n<p>We see that our estimate of $\\theta$ shifts drastically as we obtain (even just a little bit) more information. Now let&#8217;s see how the posterior changes as we see even longer strings of heads:<br \/>\n<img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"312\" data-permalink=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/05\/statistical-inference-2\/beta-prior\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-prior.png?fit=640%2C480\" data-orig-size=\"640,480\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"beta-prior\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-prior.png?fit=300%2C225\" data-large-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-prior.png?fit=640%2C480\" src=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-prior.png?resize=640%2C480\" alt=\"\" width=\"640\" height=\"480\" class=\"alignnone size-full wp-image-312\" srcset=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-prior.png?w=640 640w, https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-prior.png?resize=300%2C225 300w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px\" \/><br \/>\nFrom the image we see that the posterior gradually becomes highly confident that 
the true value of $\\theta$ is very large and assigns very small probability densities to small values of $\\theta$.<\/p>\n<p>To see how much our choice of prior affects the posterior in the presence of the same data, we can look at the analogous chart starting from a uniform prior, i.e., when the $\\beta$ parameters are $\\alpha=\\beta=1$.<br \/>\n<img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"314\" data-permalink=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/05\/statistical-inference-2\/beta-prior-uniform-2\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-prior-uniform-1.png?fit=640%2C480\" data-orig-size=\"640,480\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"beta-prior-uniform\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-prior-uniform-1.png?fit=300%2C225\" data-large-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-prior-uniform-1.png?fit=640%2C480\" src=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-prior-uniform-1.png?resize=640%2C480\" alt=\"\" width=\"640\" height=\"480\" class=\"alignnone size-full wp-image-314\" srcset=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-prior-uniform-1.png?w=640 640w, https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-prior-uniform-1.png?resize=300%2C225 300w\" 
sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px\" \/><br \/>\nOr we can look at a side by side comparison:<br \/>\n<img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"318\" data-permalink=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/05\/statistical-inference-2\/betas-4\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/betas-3.png?fit=640%2C480\" data-orig-size=\"640,480\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"betas\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/betas-3.png?fit=300%2C225\" data-large-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/betas-3.png?fit=640%2C480\" src=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/betas-3.png?resize=640%2C480\" alt=\"\" width=\"640\" height=\"480\" class=\"alignnone size-full wp-image-318\" srcset=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/betas-3.png?w=640 640w, https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/betas-3.png?resize=300%2C225 300w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px\" \/><br \/>\nFrom looking at the graphs, we can see that while the two priors are very different (e.g., the non-uniform prior assigns very small 
densities to large values of $\\theta$), the posteriors rapidly become close to each other. Since they never agree exactly, our estimates still depend on our choice of prior, but we can also see that both methods are approaching the same estimate.<\/p>\n<p>To finally drive this point home, let&#8217;s consider a severely biased prior $\\beta(1.01,10.0)$, whose mode puts the probability of heads at approximately 0 and which has nearly 0 variance in the binomial model. Then we can see how the posterior changes with additional evidence:<br \/>\n<figure id=\"attachment_324\" aria-describedby=\"caption-attachment-324\" style=\"width: 640px\" class=\"wp-caption alignnone\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"324\" data-permalink=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/05\/statistical-inference-2\/beta-posterior-2\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-posterior-1.gif?fit=640%2C480\" data-orig-size=\"640,480\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"beta-posterior\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;How the posterior depends on the prior&lt;\/p&gt;\n\" data-medium-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-posterior-1.gif?fit=300%2C225\" data-large-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-posterior-1.gif?fit=640%2C480\" 
src=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/beta-posterior-1.gif?resize=640%2C480\" alt=\"\" width=\"640\" height=\"480\" class=\"size-full wp-image-324\" \/><figcaption id=\"caption-attachment-324\" class=\"wp-caption-text\">How the posterior depends on the prior<\/figcaption><\/figure><br \/>\nThe severely biased prior is much slower to approach the same consensus as the other priors.<\/p>\n<h4>Exercise<\/h4>\n<p>The severely biased prior is very close to what we would obtain as a posterior from a uniform prior after witnessing 9 <em>tails<\/em> in a row. If we were to witness 9 tails in a row followed by 20 heads in a row, how would you change your modeling assumptions to better reflect the data?<\/p>\n<h4>Further point estimates<\/h4>\n<p>As discussed above, one way to estimate a parameter is the MAP estimate which is just the mode of the posterior distribution $p(\\theta | D)$. This is a very common approach to parameter estimation because it transforms the task into an optimization problem and we have a number of tools for such tasks.<\/p>\n<p>However, the mode of a distribution can be far from a typical point. The mode of a distribution is also not invariant under reparametrization. One alternative is to select the <em>mean<\/em> or expected value of $\\theta$ from the posterior distribution. 
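<p>For the posterior $p(\\theta | HH)\\propto \\theta^{2.4}(1-\\theta)^{1.3}$ from the example above (a Beta distribution with parameters $3.4$ and $2.3$), the mode and the mean differ noticeably. The following sketch is illustrative code (not from the original post) assuming <code>scipy<\/code> is available:<\/p>

```python
from scipy.stats import beta

# Posterior from the coin example: p(theta | HH) proportional to
# theta^2.4 * (1 - theta)^1.3, i.e. a Beta distribution with a = 3.4, b = 2.3.
a, b = 3.4, 2.3

map_est = (a - 1) / (a + b - 2)   # mode of the posterior: 2.4 / 3.7, about 0.649
post_mean = beta.mean(a, b)       # posterior mean: a / (a + b), about 0.596

print(map_est, post_mean)
```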
Another alternative is to take the <em>median<\/em> of the posterior distribution.<\/p>\n<div class=\"footnotes\">\n<hr \/>\n<ol>\n<li id=\"fn-344-4\">\nI know this is not a very precise statement, but I want to be flexible at this point.&#160;<a href=\"#fnref-344-4\">&#8617;<\/a>\n<\/li>\n<li id=\"fn-344-5\">\nHow we make this choice is the <em>model selection problem<\/em> which will be discussed later.&#160;<a href=\"#fnref-344-5\">&#8617;<\/a>\n<\/li>\n<li id=\"fn-344-6\">\nI apologize that this example is so contrived.&#160;<a href=\"#fnref-344-6\">&#8617;<\/a>\n<\/li>\n<\/ol>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>\u2026the statistician knows\u2026that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world. &#8211; George Box (JASA, 1976, Vol. 71, 791-799) Parameter estimation Suppose &hellip; <a href=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/10\/parameter-estimation\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Parameter 
Estimation&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-344","post","type-post","status-publish","format-standard","hentry","category-general"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p9dIpN-5y","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":486,"url":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/11\/03\/problem-set-3\/","url_meta":{"origin":344,"position":0},"title":"Problem Set 3","author":"Justin Noel","date":"November 3, 2017","format":false,"excerpt":"Problem Set 3 This is to be completed by November 9th, 2017. Exercises [Datacamp](https:\/\/www.datacamp.com\/home Complete the lesson \"Introduction to Machine Learning\". This should have also included \"Exploratory Data Analysis\". This has been added to the next week's assignment. MLE for the uniform distribution. 
(Source: Kaelbling\/Murphy) Consider a uniform distribution centered\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"http:\/\/www.nullplug.org\/ML-Blog\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":286,"url":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/05\/statistical-inference-2\/","url_meta":{"origin":344,"position":1},"title":"Statistical Inference","author":"Justin Noel","date":"October 5, 2017","format":false,"excerpt":"All models are wrong, but some are useful. - George Box Introduction The general setup for statistical inference is that we are given some data $D$ which we assume arise as the values of a random variable that we assume is distributed according to some parametric model $m(\\theta)$. The goal\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"http:\/\/www.nullplug.org\/ML-Blog\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":214,"url":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/04\/linear-regression\/","url_meta":{"origin":344,"position":2},"title":"Linear Regression","author":"Justin Noel","date":"October 4, 2017","format":false,"excerpt":"Prediction is very difficult, especially about the future. 
- Niels Bohr The problem Suppose we have a list of vectors (which we can think of as samples) $x_1, \\cdots, x_m\\in \\Bbb R^n$ and a corresponding list of output scalars $y_1, \\cdots, y_m \\in \\Bbb R$ (which we can regard as\u2026","rel":"","context":"In &quot;Regression&quot;","block_context":{"text":"Regression","link":"http:\/\/www.nullplug.org\/ML-Blog\/category\/regression\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/trace.png?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/trace.png?resize=350%2C200 1x, https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/trace.png?resize=525%2C300 1.5x"},"classes":[]},{"id":508,"url":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/11\/09\/problem-set-4\/","url_meta":{"origin":344,"position":3},"title":"Problem Set 4","author":"Justin Noel","date":"November 9, 2017","format":false,"excerpt":"Problem Set 4 This is to be completed by November 16th, 2017. Exercises Datacamp Complete the lessons: a. Supervised Learning in R: Regression b. Supervised Learning in R: Classification c. Exploratory Data Analysis (If you did not already do so) Let $\\lambda\\geq 0$, $X\\in \\Bbb R^n\\otimes \\Bbb R^m$, $Y\\in \\Bbb\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"http:\/\/www.nullplug.org\/ML-Blog\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":33,"url":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/09\/26\/machine-learning-overview\/","url_meta":{"origin":344,"position":4},"title":"Machine Learning Overview","author":"Justin Noel","date":"September 26, 2017","format":false,"excerpt":"Science is knowledge which we understand so well that we can teach it to a computer; and if we don't fully understand something, it is an art to deal with it. 
Donald Knuth Introduction First Attempt at a Definition One says that an algorithm learns if its performance improves with\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"http:\/\/www.nullplug.org\/ML-Blog\/category\/general\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/web.stanford.edu\/class\/cs234\/images\/header2.png?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/web.stanford.edu\/class\/cs234\/images\/header2.png?resize=350%2C200 1x, https:\/\/i0.wp.com\/web.stanford.edu\/class\/cs234\/images\/header2.png?resize=525%2C300 1.5x, https:\/\/i0.wp.com\/web.stanford.edu\/class\/cs234\/images\/header2.png?resize=700%2C400 2x"},"classes":[]},{"id":61,"url":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/09\/26\/probability-and-statistics-background\/","url_meta":{"origin":344,"position":5},"title":"Probability and Statistics Background","author":"Justin Noel","date":"September 26, 2017","format":false,"excerpt":"Statistics - A subject which most statisticians find difficult, but in which nearly all physicians are expert. - Stephen S. Senn Introduction For us, we will regard probability theory as a way of logically reasoning about uncertainty. 
I realize that this is not a precise mathematical definition, but neither is\u2026","rel":"","context":"In &quot;Supplementary material&quot;","block_context":{"text":"Supplementary material","link":"http:\/\/www.nullplug.org\/ML-Blog\/category\/supplementary-material\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/posts\/344","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/comments?post=344"}],"version-history":[{"count":10,"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/posts\/344\/revisions"}],"predecessor-version":[{"id":347,"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/posts\/344\/revisions\/347"}],"wp:attachment":[{"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/media?parent=344"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/categories?post=344"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/tags?post=344"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}