{"id":214,"date":"2017-10-04T10:46:49","date_gmt":"2017-10-04T10:46:49","guid":{"rendered":"http:\/\/www.nullplug.org\/ML-Blog\/?p=214"},"modified":"2017-11-02T07:37:22","modified_gmt":"2017-11-02T07:37:22","slug":"linear-regression","status":"publish","type":"post","link":"https:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/04\/linear-regression\/","title":{"rendered":"Linear Regression"},"content":{"rendered":"<blockquote><p>\n  Prediction is very difficult, especially about the future. &#8211; <a href=\"http:\/\/www.quotationspage.com\/quote\/26159.html\">Niels Bohr<\/a>\n<\/p><\/blockquote>\n<h2>The problem<\/h2>\n<p>Suppose we have a list of vectors (which we can think of as samples) $x_1, \\cdots, x_m\\in \\Bbb R^n$ and a corresponding list of output scalars $y_1, \\cdots, y_m \\in \\Bbb R$ (which we can regard as a vector $y\\in \\Bbb R^m$). We want to find a vector $\\beta=(\\beta_1,\\cdots \\beta_n)\\in \\Bbb R^n$ such that the <em>sum of squared error<\/em> $$ \\sum_{i=1}^m (x_i \\cdot \\beta &#8211; y_i)^2 $$ is as small as possible. Alternatively, if we suppose that our pairs $(x_i, y_i)$ arise from a function $f(x_i)=y_i$, we want to find the closest approximation to $f$ by a linear function of the form $\\widehat{f}(a)=\\sum_{i=1}^n \\beta_i \\pi_i(a)$, where $\\pi_i \\colon \\Bbb R^n\\to \\Bbb R$ is the projection to the $i$th coordinate.<\/p>\n<p>Let&#8217;s arrange the input vectors $x_1,\\cdots, x_m$ into an $m\\times n$-matrix $A$, where $x_i$ appears as the $i$th row. Then we have a matrix equation of the form $ A \\beta = \\widehat{y}$ and we want to find $\\beta$ such that $||\\widehat{y}-y||^2$ is as small as possible<sup id=\"fnref-214-1\"><a href=\"#fn-214-1\" class=\"jetpack-footnote\">1<\/a><\/sup>.<\/p>\n<p><!--- Now if $m=n$, then we can hope that it $A$ is invertible, in which case $\\beta = A^{-1}y$ would give us a solution with 0 error. 
If $A$ is not invertible then some entry $x_i$ can be written as a linear combination of the other entries and we could choose to omit it. Then we end up with an $(n-1)\\times n$-matrix and we end up trying to solve the more general problem anyway.  --><\/p>\n<p>Let&#8217;s try to use calculus. Here we are trying to minimize:<br \/>\n$$ SSE(\\beta) = \\lVert\\widehat{y}-y\\rVert^2=(A\\beta -y)^T(A\\beta -y).$$ Differentiating with respect to $\\beta$ (see <a href=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/18\/tensor-calculus\/\">Tensor Calculus<\/a>) yields<br \/>\n\\begin{align}<br \/>\n \\frac{dSSE(\\beta)}{d\\beta} &amp; = 2(A\\beta -y)^TA \\label{eq:der}.<br \/>\n\\end{align}<br \/>\nDifferentiating with respect to $\\beta$ again yields $2A^TA$, which will be a positive semidefinite matrix. This matrix will be positive definite if and only if $\\ker A =0$<sup id=\"fnref-214-4\"><a href=\"#fn-214-4\" class=\"jetpack-footnote\">2<\/a><\/sup>.<\/p>\n<p>Now $\\ker A \\neq 0$ if and only if one of the columns of $A$ can be expressed as a linear combination of the others. In terms of our original problem, this means that one of the components of our input vectors is in fact expressible as a linear combination of the others, so there is a hidden dependency amongst the variables.<\/p>\n<h4>Exercise<\/h4>\n<p>Perform the above calculation yourself while being careful about which quantities are vector valued.<\/p>\n<p>Let&#8217;s suppose $\\ker A = 0$ (we can arrange this anyway by dropping dependent columns). So our matrix is positive definite and we see that any critical point of $SSE(\\beta)$ in \\eqref{eq:der} is a minimum. 
So by setting \\eqref{eq:der} to 0 and solving for $\\beta$ (while observing that $A^TA$ is non-singular, given our assumptions) we get:<br \/>\n\\begin{equation}<br \/>\n\\beta=(A^TA)^{-1}A^Ty.\\label{eq:solv}<br \/>\n\\end{equation}<br \/>\nThis value of $\\beta$ is uniquely determined and hence yields the global minimum of $SSE(\\beta)$.<\/p>\n<p>Note that when $A$ is an invertible square matrix, \\eqref{eq:solv} reduces to $\\beta = A^{-1} y$, which will have $SSE(\\beta)=0$; just as we should expect from linear algebra.<\/p>\n<h3>Implementation concerns<\/h3>\n<p>Using a naive implementation of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Cramer%27s_rule\">Cramer&#8217;s rule<\/a> to invert an $n\\times n$ matrix takes $O(n!)$ time and is totally impractical for problems of interest. Using Gaussian elimination instead takes $O(n^3)$ time (by being <a href=\"https:\/\/en.wikipedia.org\/wiki\/Cholesky_decomposition\">clever<\/a> one can cut the runtime in half for inverting the symmetric matrix $A^TA$). The remaining multiplications require $O(n^2m)$ time, and since $m\\geq n$ (otherwise $\\ker A\\neq 0$), the multiplications dominate, so the algorithm runs in $O(n^2m)$ time.<\/p>\n<p>We may also be concerned about errors in the data. Suppose that the &#8220;correct&#8221; value of $y$ is, in fact, $y+e$. For simplicity suppose that $A$ is invertible. 
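The closed-form solution \eqref{eq:solv} is easy to sanity check numerically. Below is a minimal NumPy sketch (the data and variable names are our own invention); note that in practice one solves the normal equations as a linear system or, better, calls a least-squares routine, rather than explicitly inverting $A^TA$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: m = 50 samples in R^3 with known coefficients beta_true.
m, n = 50, 3
A = rng.normal(size=(m, n))
beta_true = np.array([2.0, -1.0, 0.5])
y = A @ beta_true + 0.01 * rng.normal(size=m)

# Normal equations: beta = (A^T A)^{-1} A^T y.
# We solve the linear system instead of inverting A^T A.
beta = np.linalg.solve(A.T @ A, A.T @ y)

# np.linalg.lstsq minimizes the same SSE, more stably.
beta_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(beta)        # agrees with beta_lstsq
print(beta_lstsq)  # and recovers beta_true up to the noise
```

Both routes minimize the same $SSE(\beta)$; `lstsq` works via an orthogonal factorization of $A$ and avoids squaring the condition number, which matters for the stability issues discussed next.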
We might hope that if $\\lVert e\\rVert$ is very small relative to $\\lVert y\\rVert$ then the error between $A^{-1}y$ and $A^{-1}(y+e)$ will be small relative to $A^{-1}y$.<br \/>\nWe can calculate<br \/>\n\\begin{align}<br \/>\n\\frac{\\lVert A^{-1} e \\rVert}{\\lVert A^{-1} y \\rVert}\/\\frac{\\lVert e\\rVert}{\\lVert y\\rVert} &amp; = \\frac{\\lVert A^{-1} e\\rVert}{\\lVert e\\rVert }\\cdot \\frac{\\lVert y\\rVert }{\\lVert A^{-1}y\\rVert } \\label{eq:cond}<br \/>\n\\end{align}<br \/>\nAs we try to maximize this expression over all $e$ and $y$ (so we can replace $y$ with $Ay$), we obtain the product of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Operator_norm\">operator norms<\/a> $\\lVert A^{-1}\\rVert \\cdot \\lVert A \\rVert$. While we won&#8217;t go into operator norms here, we will point out that when $A$ is diagonal, then $\\lVert A\\rVert$ is the maximum of the absolute values of the diagonal entries, so that \\eqref{eq:cond} becomes $|M|\/|m|$ where $|M|$ is the maximum of the absolute values of the diagonal entries and $|m|$ is the minimum of the absolute values of these entries. In particular, if $A$ is the 2&#215;2 diagonal matrix with diagonal entries $10$ and $0.1$, inversion can multiply relative errors by a factor of 100.<\/p>\n<p>In practice, the standard approach is not to invert the matrix, but rather to construct a matrix decomposition from which it is easy to solve the system of equations by back substitution. Note that this would not help the numerical stability problem with the diagonal matrix above.<\/p>\n<h3>Beyond simple linear functions<\/h3>\n<p>At this point we have seen how to find an approximation to a function $f(x)$ of the form $\\widehat{f}(x)=\\sum_{i=1}^n \\beta_i \\pi_i(x)$ that minimizes the sum of squared errors.<\/p>\n<p>What happens if our function is just the constant function $f(x)=10$? In this case, our approximation will do rather poorly. 
In particular, $\\widehat{f}(0)=0$, so this will never be a very good approximation about this point.<\/p>\n<p>What we could do in this case is assume that our function instead has the form: $$\\widehat{f}(x)=\\sum_{i=1}^n \\beta_i \\pi_i(x)+\\beta_{n+1}\\cdot 1.$$ We can now solve this new approximation problem by just changing the vectors $x_i=(\\pi_1(x_i),\\cdots,\\pi_n(x_i))$ into new $n+1$-dimensional vectors $(\\pi_1(x_i),\\cdots,\\pi_n(x_i),1)$ and applying linear regression as before. For non-degenerate collections of vectors $&#123;x_i&#125;$, we will end up with the desired function $\\widehat{f}(x)=10$. This is sometimes called <em>the bias trick<\/em>.<\/p>\n<p>Here is an example of such a linear regression being fit to more and more data points. For this example, we used python&#8217;s <a href=\"http:\/\/scikit-learn.org\/stable\/\">scikit-learn package<\/a> and rendered the images using <a href=\"https:\/\/matplotlib.org\/\">matplotlib<\/a> (see the <a href=\"https:\/\/github.com\/JustinNoel1\/ML-Course\/blob\/master\/linear-regression\/python\/linreg.py\">code<\/a>).<br \/>\n <figure id=\"attachment_367\" aria-describedby=\"caption-attachment-367\" style=\"width: 640px\" class=\"wp-caption alignnone\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"367\" data-permalink=\"https:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/04\/linear-regression\/compressed_linreg_normal\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_linreg_normal.gif?fit=640%2C480&amp;ssl=1\" data-orig-size=\"640,480\" data-comments-opened=\"1\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Linear regression\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;Fitting to more and more data points&lt;\/p&gt;\n\" data-medium-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_linreg_normal.gif?fit=300%2C225&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_linreg_normal.gif?fit=640%2C480&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_linreg_normal.gif?resize=640%2C480\" alt=\"\" width=\"640\" height=\"480\" class=\"size-full wp-image-367\" \/><figcaption id=\"caption-attachment-367\" class=\"wp-caption-text\">Fitting to more and more data points<\/figcaption><\/figure> In this case, the mean squared error approaches the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Bayes_error_rate\">Bayes error rate<\/a>.<\/p>\n<p>More generally, if we think our $f(x)$ can be well approximated by a function of the form $\\widehat{f}(x)=\\sum_{i=1}^N \\beta_i f_i(x)$ for <em>any<\/em> fixed set of $N$ functions $&#92;{f_i&#92;}$, we can just construct a new sample matrix whose $i$th row is the $N$-vector $(f_1(x_i),\\cdots,f_N(x_i))$ and apply linear regression.<\/p>\n<p>For example, if we are looking for a function $\\widehat{f}\\colon \\Bbb R\\to \\Bbb R$ of the form $\\widehat{f}(x)=\\beta_1 x^2+\\beta_2 x +\\beta_3$, we just have to replace the samples $x_1,\\cdots, x_m$ with the vectors $(x_1^2, x_1, 1), \\cdots, (x_m^2, x_m, 1)$ and apply linear regression as before. 
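Concretely, the feature transformation for this quadratic example takes only a couple of lines. A minimal NumPy sketch with made-up data (the true coefficients below are our own choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples from f(x) = 3x^2 - 2x + 1.
x = rng.uniform(-1, 1, size=100)
y = 3 * x**2 - 2 * x + 1 + 0.05 * rng.normal(size=100)

# Replace each sample x_i by the feature vector (x_i^2, x_i, 1)
# and run ordinary least squares on the transformed matrix.
A = np.column_stack([x**2, x, np.ones_like(x)])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(beta)  # approximately [3, -2, 1]
```

The regression itself is unchanged; only the sample matrix is different.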
This is an example of <em>polynomial regression<\/em>. For example, we have the following (see the <a href=\"https:\/\/github.com\/JustinNoel1\/ML-Course\/blob\/master\/linear-regression\/python\/polyreg.py\">code<\/a>):<br \/>\n<figure id=\"attachment_365\" aria-describedby=\"caption-attachment-365\" style=\"width: 640px\" class=\"wp-caption alignnone\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"365\" data-permalink=\"https:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/10\/parameter-estimation\/compressed_polyreg_normal\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?fit=640%2C480&amp;ssl=1\" data-orig-size=\"640,480\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Polynomial regression\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;High degree polynomials fit the data better&lt;\/p&gt;\n\" data-medium-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?fit=300%2C225&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?fit=640%2C480&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?resize=640%2C480\" alt=\"\" width=\"640\" height=\"480\" class=\"size-full wp-image-365\" \/><figcaption id=\"caption-attachment-365\" class=\"wp-caption-text\">High degree polynomials fit the data better<\/figcaption><\/figure> 
These examples were constructed as before, but we first transformed the data using <a href=\"http:\/\/scikit-learn.org\/stable\/modules\/generated\/sklearn.preprocessing.PolynomialFeatures.html\">PolynomialFeatures<\/a>. Alternatively, it is very easy to just code this transformation yourself.<\/p>\n<p>We could similarly try to find <a href=\"https:\/\/en.wikipedia.org\/wiki\/Fourier_series\">Fourier approximations<\/a> for a function which we expect has periodic behavior. We could also use <a href=\"https:\/\/en.wikipedia.org\/wiki\/Wavelet\">wavelets<\/a> or <a href=\"https:\/\/en.wikipedia.org\/wiki\/Spline_(mathematics)\">splines<\/a>. Clearly, this list goes on and on.<\/p>\n<h3>Warning<\/h3>\n<p>If we choose a suitably expressive class of functions to define our generalized linear regression problem, then there is a good chance that the resulting model will fit the data perfectly. For example, if we have any function $f\\colon \\Bbb R \\to \\Bbb R$ and we know its values on $n$ distinct points, then we can perfectly match these values by a polynomial of degree at most $n-1$ (e.g., 2 points determine a line, 3 points determine a quadratic curve, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Newton_polynomial\">etc.<\/a>). While this will perfectly fit our given data, it becomes quite unlikely that it will match any new data points. This is a classic example of <em>overfitting<\/em>. 
We can see this phenomenon in the following example:<figure id=\"attachment_361\" aria-describedby=\"caption-attachment-361\" style=\"width: 640px\" class=\"wp-caption alignnone\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"361\" data-permalink=\"https:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/10\/parameter-estimation\/polyreg_var_normal\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/polyreg_var_normal.gif?fit=640%2C480&amp;ssl=1\" data-orig-size=\"640,480\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Degree 10 polynomial regression\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;This estimator has high variance&lt;\/p&gt;\n\" data-medium-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/polyreg_var_normal.gif?fit=300%2C225&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/polyreg_var_normal.gif?fit=640%2C480&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/polyreg_var_normal.gif?resize=640%2C480\" alt=\"\" width=\"640\" height=\"480\" class=\"size-full wp-image-361\" \/><figcaption id=\"caption-attachment-361\" class=\"wp-caption-text\">This estimator has high variance<\/figcaption><\/figure> Here the degree 10 polynomial will fit any 30 points relatively well, but will drastically fail to generalize to the entire dataset.<\/p>\n<h3>A Maximum Likelihood Justification for using SSE<\/h3>\n<p>While the above 
exposition is elementary, one might wonder where the choice to minimize the sum of squared errors comes from.<\/p>\n<p>Let us suppose that we have a function of the form $f(x)=g(x)+\\epsilon$ where $\\epsilon$ is an $\\Bbb R$-valued random variable<sup id=\"fnref-214-5\"><a href=\"#fn-214-5\" class=\"jetpack-footnote\">3<\/a><\/sup> $\\epsilon \\sim N(0, \\sigma^2)$ and $g(x)$ is a deterministic function. We could justify this by saying that the output variables are not recorded exactly, but there is some normally distributed measurement error, or that some additional non-linear random factors contribute to the value of $f(x)$. Then the conditional density function of $f(x)$ is $$p(y|x)=\\frac{e^{-(g(x) -y)^2\/2\\sigma^2}}{\\sqrt{2\\pi \\sigma^2}}.$$<\/p>\n<p>The <em>maximum likelihood estimate<\/em> (MLE) (see <a href=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/10\/parameter-estimation\/\">here<\/a> for more details) for a vector of values $y=(y_1,\\cdots,y_m)$ associated to a sequence of independent samples $x=(x_1,\\cdots,x_m)$ is then<br \/>\n\\begin{align}<br \/>\n\\textrm{argmax}_y p(y|x) &amp;= \\textrm{argmax}_y \\prod_{i=1}^m p(y_i|x_i) \\\\<br \/>\n&amp;= \\textrm{argmax}_y \\log \\prod_{i=1}^m p(y_i|x_i) \\\\<br \/>\n&amp;= \\textrm{argmax}_y \\sum_{i=1}^m -(g(x_i) -y_i)^2\/2\\sigma^2 \\\\<br \/>\n&amp;= \\textrm{argmin}_y \\sum_{i=1}^m (g(x_i) -y_i)^2,<br \/>\n\\end{align}<br \/>\nwhere in the third step we dropped the additive constant $-\\frac{m}{2}\\log(2\\pi\\sigma^2)$, which does not affect the argmax.<\/p>\n<p>So in this case, the MLE for $y$ is the choice of $y$ which minimizes the sum of squared errors.<\/p>\n<h3>A frequentist description<\/h3>\n<p>Suppose that our data samples $(x,y)$ are drawn from a fixed distribution, i.e., they are the values of a vector-valued random variable $X$ and a random variable $Y$, with joint pdf $p(x,y)$. 
Then we can regard linear regression as attempting to find<br \/>\n\\begin{align}<br \/>\n\\textrm{argmin}_\\beta E((X \\cdot \\beta-Y)^2) &amp;= \\textrm{argmin}_{\\beta} \\int (x\\cdot\\beta-y)^2 p(x,y) dx dy \\\\<br \/>\n&amp;=\\textrm{argmin}_{\\beta} \\int (x\\cdot\\beta-y)^2 p(y|x) p(x) dx dy.<br \/>\n\\end{align}<\/p>\n<h2>Bayesian approach<\/h2>\n<p>Here we will consider a Bayesian approach to the polynomial regression problem. First we consider the case where we approximate data coming from a degree 3 polynomial using a Gaussian model that expects a degree 3 polynomial and is trained on all 200 samples. In other words, we are assuming a conditional distribution of the form<br \/>\n$$p(y|x,\\theta)=1\/\\sqrt{2\\pi \\sigma^2}\\cdot e^{-\\frac{(y-\\sum_{i=0}^3 \\beta_i x^i)^2}{2\\sigma^2}},$$ where $\\theta$ is the vector of parameters in this model. For our prior, we suppose that the parameters are independent random variables where $\\beta_i\\sim N(0, 100)$ and $\\sigma$ is given by a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Half-normal_distribution\">half normal distribution<\/a> with standard deviation 1 (since this can only take non-negative values). To calculate the posterior $p(\\theta | D)$ we apply Bayes rule. Unfortunately Bayes rule has this nasty integral over the parameter space in the denominator:<br \/>\n$$ \\int_\\theta p(D|\\theta)p(\\theta)d\\theta.$$ Rather than try to calculate this integral directly, we will approximate it using <a href=\"https:\/\/en.wikipedia.org\/wiki\/Monte_Carlo_integration\">Monte Carlo integration<\/a>, which is sampled using an algorithm which is completely <a href=\"https:\/\/arxiv.org\/abs\/1111.4246\">NUTS<\/a>. All of this is implemented using the package <a href=\"http:\/\/docs.pymc.io\/notebooks\/getting_started.html\">PyMC3<\/a>. 
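The weighting idea behind this Monte Carlo approximation can be sketched by hand for a toy one-parameter model. To be clear, this naive scheme (draw parameters from the prior, weight by likelihood) is only meant to illustrate the approximation, and all names and numbers below are our own:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy model: y ~ N(mu, 1) with prior mu ~ N(0, 10^2); 50 observations.
data = rng.normal(loc=2.0, scale=1.0, size=50)

# Draw parameter samples from the prior p(theta).
thetas = rng.normal(loc=0.0, scale=10.0, size=20000)

# Unnormalized posterior weights z_theta proportional to p(D | theta).
# Work in log space and subtract the max for numerical stability.
log_lik = -0.5 * ((data[None, :] - thetas[:, None]) ** 2).sum(axis=1)
z = np.exp(log_lik - log_lik.max())
weights = z / z.sum()

# Posterior mean estimate E[theta | D] as a weighted average.
post_mean = np.sum(weights * thetas)
print(post_mean)  # close to the sample mean of the data
```

With a nearly flat prior, the posterior mean lands essentially on the sample mean. NUTS explores the parameter space far more efficiently than this naive prior sampling, which is what makes it practical once there are many parameters.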
This is overkill for this situation, but it is nice to see how the general algorithm is implemented (see <a href=\"https:\/\/github.com\/JustinNoel1\/ML-Course\/blob\/master\/bayes\/bayesian-regression\/python\/bayreg.py\">here<\/a> for the code).<\/p>\n<p>In effect, what we are doing is using a special technique to generate samples $S$ of $\\theta$ according to the prior distribution $p(\\theta)$. For each $\\theta\\in S$, we calculate $z_\\theta =p(D|\\theta)$ and see that $$p(\\theta|D)\\approx \\frac{z_\\theta}{\\sum_{\\theta' \\in S} z_{\\theta'}}.$$ As the number of samples goes up, the accuracy of this estimate increases.<\/p>\n<p>We can look at the output here:<br \/>\n<figure id=\"attachment_371\" aria-describedby=\"caption-attachment-371\" style=\"width: 640px\" class=\"wp-caption middle\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"371\" data-permalink=\"https:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/04\/linear-regression\/bayreg-2\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/bayreg-1.png?fit=640%2C480&amp;ssl=1\" data-orig-size=\"640,480\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Bayesian polynomial regression\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;A degree 3 approximation&lt;\/p&gt;\n\" data-medium-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/bayreg-1.png?fit=300%2C225&amp;ssl=1\" 
data-large-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/bayreg-1.png?fit=640%2C480&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/bayreg-1.png?resize=640%2C480\" alt=\"\" width=\"640\" height=\"480\" class=\"size-full wp-image-371\" srcset=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/bayreg-1.png?w=640&amp;ssl=1 640w, https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/bayreg-1.png?resize=300%2C225&amp;ssl=1 300w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px\" \/><figcaption id=\"caption-attachment-371\" class=\"wp-caption-text\">A degree 3 approximation<\/figcaption><\/figure><br \/>\nBy taking posterior samples of the parameters we obtain the curves in grey. The mean of these samples generates the black curve in the middle. Alternatively, we can sample from $p(y|x)$ (that is, we average over all of the choices of parameters). Taking the mean values for each $y$ given an $x$, we obtain the red curve. We have marked a band of one standard deviation about these samples.<\/p>\n<p>The summary statistics for our parameter values are listed here. Note that the &#8216;true value&#8217; of our function is $$f(x)=<br \/>\n7 x^3 - 9.1 x^2 + 3.5 x - 0.392,$$ with additive noise of standard deviation 0.3. 
We note that none of the true coefficients fall into the 95% credibility intervals, which is not very promising.<\/p>\n<table border=\"1\" class=\"dataframe\">\n<thead>\n<tr style=\"text-align: right;\">\n<th><\/th>\n<th>mean<\/th>\n<th>sd<\/th>\n<th>mc_error<\/th>\n<th>hpd_2.5<\/th>\n<th>hpd_97.5<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<th>coeff__0<\/th>\n<td>-0.209607<\/td>\n<td>0.080981<\/td>\n<td>0.001872<\/td>\n<td>-0.370318<\/td>\n<td>-0.056474<\/td>\n<\/tr>\n<tr>\n<th>coeff__1<\/th>\n<td>1.914993<\/td>\n<td>0.704315<\/td>\n<td>0.020752<\/td>\n<td>0.562936<\/td>\n<td>3.260617<\/td>\n<\/tr>\n<tr>\n<th>coeff__2<\/th>\n<td>-5.556357<\/td>\n<td>1.651007<\/td>\n<td>0.050628<\/td>\n<td>-8.678463<\/td>\n<td>-2.410120<\/td>\n<\/tr>\n<tr>\n<th>coeff__3<\/th>\n<td>4.794343<\/td>\n<td>1.099424<\/td>\n<td>0.033513<\/td>\n<td>2.710534<\/td>\n<td>6.909714<\/td>\n<\/tr>\n<tr>\n<th>sigma<\/th>\n<td>0.291662<\/td>\n<td>0.015164<\/td>\n<td>0.000186<\/td>\n<td>0.263677<\/td>\n<td>0.321718<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We can gain further information by studying the pdfs of the resulting variables:<\/p>\n<p style=\"text-align:center\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"374\" data-permalink=\"https:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/04\/linear-regression\/trace\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/trace.png?fit=526%2C842&amp;ssl=1\" data-orig-size=\"526,842\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"PDFs of coefficient variables\" 
data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/trace.png?fit=187%2C300&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/trace.png?fit=526%2C842&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/trace.png?resize=526%2C842\" alt=\"\" width=\"526\" height=\"842\" class=\"alignnone size-full wp-image-374\" srcset=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/trace.png?w=526&amp;ssl=1 526w, https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/trace.png?resize=187%2C300&amp;ssl=1 187w\" sizes=\"auto, (max-width: 526px) 85vw, 526px\" \/> <\/p>\n<p>Now we examine how the Bayesian method does with respect to our overfitting problem above. Here we try to fit a degree 10 polynomial to only 30 data points. We then obtain the following:<figure id=\"attachment_372\" aria-describedby=\"caption-attachment-372\" style=\"width: 640px\" class=\"wp-caption middle\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"372\" data-permalink=\"https:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/04\/linear-regression\/bayreg10\/\" data-orig-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/bayreg10.png?fit=640%2C480&amp;ssl=1\" data-orig-size=\"640,480\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"Degree 10 Bayesian polynomial regression\" 
data-image-description=\"\" data-image-caption=\"&lt;p&gt;Again there is a lot of variance in the additional models.&lt;\/p&gt;\n\" data-medium-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/bayreg10.png?fit=300%2C225&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/bayreg10.png?fit=640%2C480&amp;ssl=1\" src=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/bayreg10.png?resize=640%2C480\" alt=\"\" width=\"640\" height=\"480\" class=\"size-full wp-image-372\" srcset=\"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/bayreg10.png?w=640&amp;ssl=1 640w, https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/bayreg10.png?resize=300%2C225&amp;ssl=1 300w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px\" \/><figcaption id=\"caption-attachment-372\" class=\"wp-caption-text\">Again there is a lot of variance in the additional models.<\/figcaption><\/figure><\/p>\n<p>Here we see the samples for the polynomial curves again have wild variance, but our posterior predictions are actually quite reasonable. Looking at the summary statistics in the table below, we note that the model has very low confidence in almost all of the parameter values and huge credibility intervals that mostly contain the true values. 
By averaging over all of these parameters we reduce the variance in our predictions and obtain a much better model.<\/p>\n<table border=\"1\" class=\"dataframe\">\n<thead>\n<tr style=\"text-align: right;\">\n<th><\/th>\n<th>mean<\/th>\n<th>sd<\/th>\n<th>mc_error<\/th>\n<th>hpd_2.5<\/th>\n<th>hpd_97.5<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<th>coeff__0<\/th>\n<td>-0.053526<\/td>\n<td>0.200482<\/td>\n<td>0.002303<\/td>\n<td>-0.469330<\/td>\n<td>0.317603<\/td>\n<\/tr>\n<tr>\n<th>coeff__1<\/th>\n<td>0.658498<\/td>\n<td>1.898380<\/td>\n<td>0.022927<\/td>\n<td>-3.246255<\/td>\n<td>4.222251<\/td>\n<\/tr>\n<tr>\n<th>coeff__2<\/th>\n<td>-2.835666<\/td>\n<td>5.895880<\/td>\n<td>0.084508<\/td>\n<td>-13.723288<\/td>\n<td>9.131853<\/td>\n<\/tr>\n<tr>\n<th>coeff__3<\/th>\n<td>2.316211<\/td>\n<td>8.093137<\/td>\n<td>0.126718<\/td>\n<td>-13.588722<\/td>\n<td>18.071517<\/td>\n<\/tr>\n<tr>\n<th>coeff__4<\/th>\n<td>1.342521<\/td>\n<td>8.552999<\/td>\n<td>0.104256<\/td>\n<td>-15.235386<\/td>\n<td>18.123408<\/td>\n<\/tr>\n<tr>\n<th>coeff__5<\/th>\n<td>-0.630845<\/td>\n<td>8.596230<\/td>\n<td>0.099357<\/td>\n<td>-16.968628<\/td>\n<td>16.577238<\/td>\n<\/tr>\n<tr>\n<th>coeff__6<\/th>\n<td>-1.251081<\/td>\n<td>8.598991<\/td>\n<td>0.103051<\/td>\n<td>-17.388627<\/td>\n<td>15.975167<\/td>\n<\/tr>\n<tr>\n<th>coeff__7<\/th>\n<td>-0.712717<\/td>\n<td>8.881522<\/td>\n<td>0.102497<\/td>\n<td>-18.123559<\/td>\n<td>16.398549<\/td>\n<\/tr>\n<tr>\n<th>coeff__8<\/th>\n<td>0.373434<\/td>\n<td>8.985303<\/td>\n<td>0.104398<\/td>\n<td>-16.930835<\/td>\n<td>18.813871<\/td>\n<\/tr>\n<tr>\n<th>coeff__9<\/th>\n<td>0.753457<\/td>\n<td>8.428446<\/td>\n<td>0.094453<\/td>\n<td>-16.404371<\/td>\n<td>16.786885<\/td>\n<\/tr>\n<tr>\n<th>coeff__10<\/th>\n<td>0.835288<\/td>\n<td>7.222892<\/td>\n<td>0.089429<\/td>\n<td>-13.213493<\/td>\n<td>15.239204<\/td>\n<\/tr>\n<tr>\n<th>sigma<\/th>\n<td>0.276957<\/td>\n<td>0.041169<\/td>\n<td>0.000551<\/td>\n<td>0.205731<\/td>\n<td>0.360664<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h
4>Exercise<\/h4>\n<p>How did our choice of prior bias the predictions of the coefficients? What would be some other reasonable choices of prior?<\/p>\n<div class=\"footnotes\">\n<hr \/>\n<ol>\n<li id=\"fn-214-1\">\nNote that minimizing the sum is equivalent to minimizing the average.&#160;<a href=\"#fnref-214-1\">&#8617;<\/a>\n<\/li>\n<li id=\"fn-214-4\">\nRecall that on $\\Bbb R^m$ we have a scalar product $\\langle v,w\\rangle=v^Tw$ (here we view vectors as column vectors). This formula tells us that $$\\langle v,Ax\\rangle =v^TAx=x^TA^Tv=\\langle x,A^Tv\\rangle =\\langle A^Tv,x\\rangle $$ (note that we used that every 1&#215;1 matrix is symmetric). This implies that $$\\langle x, A^TA x\\rangle =\\langle Ax, Ax \\rangle \\geq 0$$ for all $x$. Moreover, this quantity is zero if and only if $Ax=0$.&#160;<a href=\"#fnref-214-4\">&#8617;<\/a>\n<\/li>\n<li id=\"fn-214-5\">\nSo our function $f$ is of the form $f\\colon \\Bbb R^n\\times \\Omega \\to \\Bbb R$ where $f(x,-)\\colon \\Omega \\to \\Bbb R$ is a measurable function on some probability space which is distributed normally with mean $g(x)$ and variance $\\sigma^2$.<br \/>\n<!-- [^6]: We should observe that in this case our integral is actually of a pretty reasonable form. Although it might be possible to just calculate it, the MC integration technique can be applied much more generally. -->&#160;<a href=\"#fnref-214-5\">&#8617;<\/a>\n<\/li>\n<\/ol>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Prediction is very difficult, especially about the future. &#8211; Niels Bohr The problem Suppose we have a list of vectors (which we can think of as samples) $x_1, \\cdots, x_m\\in \\Bbb R^n$ and a corresponding list of output scalars $y_1, \\cdots, y_m \\in \\Bbb R$ (which we can regard as a vector $y\\in \\Bbb R^m$). 
&hellip; <a href=\"https:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/04\/linear-regression\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Linear Regression&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[6,3],"tags":[],"class_list":["post-214","post","type-post","status-publish","format-standard","hentry","category-regression","category-supervised-learning"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p9dIpN-3s","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":405,"url":"https:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/18\/tensor-calculus\/","url_meta":{"origin":214,"position":0},"title":"Tensor Calculus","author":"Justin Noel","date":"October 18, 2017","format":false,"excerpt":"Introduction I will assume that you have seen some calculus, including multivariable calculus. 
That is, you know how to differentiate a differentiable function $f\colon \Bbb R \to \Bbb R$, to obtain a new function $$\frac{\partial f}{\partial x} \colon \Bbb R \to \Bbb R.$$ You also know how to differentiate a\u2026","rel":"","context":"In &quot;Supplementary material&quot;","block_context":{"text":"Supplementary material","link":"https:\/\/www.nullplug.org\/ML-Blog\/category\/supplementary-material\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":508,"url":"https:\/\/www.nullplug.org\/ML-Blog\/2017\/11\/09\/problem-set-4\/","url_meta":{"origin":214,"position":1},"title":"Problem Set 4","author":"Justin Noel","date":"November 9, 2017","format":false,"excerpt":"Problem Set 4 This is to be completed by November 16th, 2017. Exercises Datacamp Complete the lessons: a. Supervised Learning in R: Regression b. Supervised Learning in R: Classification c. Exploratory Data Analysis (If you did not already do so) Let $\lambda\geq 0$, $X\in \Bbb R^n\otimes \Bbb R^m$, $Y\in \Bbb\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"https:\/\/www.nullplug.org\/ML-Blog\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":486,"url":"https:\/\/www.nullplug.org\/ML-Blog\/2017\/11\/03\/problem-set-3\/","url_meta":{"origin":214,"position":2},"title":"Problem Set 3","author":"Justin Noel","date":"November 3, 2017","format":false,"excerpt":"Problem Set 3 This is to be completed by November 9th, 2017. Exercises [Datacamp](https:\/\/www.datacamp.com\/home) Complete the lesson \"Introduction to Machine Learning\". This should have also included \"Exploratory Data Analysis\". This has been added to the next week's assignment. MLE for the uniform distribution. 
(Source: Kaelbling\/Murphy) Consider a uniform distribution centered\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"https:\/\/www.nullplug.org\/ML-Blog\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":344,"url":"https:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/10\/parameter-estimation\/","url_meta":{"origin":214,"position":3},"title":"Parameter Estimation","author":"Justin Noel","date":"October 10, 2017","format":false,"excerpt":"\u2026the statistician knows\u2026that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world. - George Box (JASA, 1976, Vol.\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"https:\/\/www.nullplug.org\/ML-Blog\/category\/general\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?resize=350%2C200 1x, https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?resize=525%2C300 1.5x"},"classes":[]},{"id":35,"url":"https:\/\/www.nullplug.org\/ML-Blog\/2017\/09\/26\/supervised-learning\/","url_meta":{"origin":214,"position":4},"title":"Supervised Learning","author":"Justin Noel","date":"September 26, 2017","format":false,"excerpt":"A big computer, a complex algorithm, and a long time does not equal science. - Robert Gentleman Examples Before getting into what supervised learning precisely is, let's look at some examples of supervised learning tasks: Identifying breast cancer. A sample study. Image classification. 
List of last year's ILSVRC Winners Threat\u2026","rel":"","context":"In &quot;Supervised Learning&quot;","block_context":{"text":"Supervised Learning","link":"https:\/\/www.nullplug.org\/ML-Blog\/category\/supervised-learning\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":33,"url":"https:\/\/www.nullplug.org\/ML-Blog\/2017\/09\/26\/machine-learning-overview\/","url_meta":{"origin":214,"position":5},"title":"Machine Learning Overview","author":"Justin Noel","date":"September 26, 2017","format":false,"excerpt":"Science is knowledge which we understand so well that we can teach it to a computer; and if we don't fully understand something, it is an art to deal with it. Donald Knuth Introduction First Attempt at a Definition One says that an algorithm learns if its performance improves with\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"https:\/\/www.nullplug.org\/ML-Blog\/category\/general\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/web.stanford.edu\/class\/cs234\/images\/header2.png?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/web.stanford.edu\/class\/cs234\/images\/header2.png?resize=350%2C200 1x, https:\/\/i0.wp.com\/web.stanford.edu\/class\/cs234\/images\/header2.png?resize=525%2C300 1.5x, https:\/\/i0.wp.com\/web.stanford.edu\/class\/cs234\/images\/header2.png?resize=700%2C400 
2x"},"classes":[]}],"_links":{"self":[{"href":"https:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/posts\/214","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/comments?post=214"}],"version-history":[{"count":10,"href":"https:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/posts\/214\/revisions"}],"predecessor-version":[{"id":479,"href":"https:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/posts\/214\/revisions\/479"}],"wp:attachment":[{"href":"https:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/media?parent=214"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/categories?post=214"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/tags?post=214"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}