{"id":33,"date":"2017-09-26T15:52:45","date_gmt":"2017-09-26T15:52:45","guid":{"rendered":"http:\/\/www.nullplug.org\/ML-Blog\/?p=33"},"modified":"2017-10-19T07:37:18","modified_gmt":"2017-10-19T07:37:18","slug":"machine-learning-overview","status":"publish","type":"post","link":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/09\/26\/machine-learning-overview\/","title":{"rendered":"Machine Learning Overview"},"content":{"rendered":"<blockquote><p>\n  Science is knowledge which we understand so well that we can teach it to a computer; and if we don&#8217;t fully understand something, it is an art to deal with it. <a href=\"https:\/\/en.wikiquote.org\/wiki\/Donald_Knuth\">Donald Knuth<\/a>\n<\/p><\/blockquote>\n<h2>Introduction<\/h2>\n<h4>First Attempt at a Definition<\/h4>\n<p>One says that an algorithm <em>learns<\/em> if its performance improves with additional experience.<\/p>\n<p>This is not a precise definition, so let&#8217;s try to make it so.<\/p>\n<h4>Definition<\/h4>\n<p>An algorithm will be a function<sup id=\"fnref-33-0\"><a href=\"#fn-33-0\" class=\"jetpack-footnote\">1<\/a><\/sup> $f\\colon D\\to T$ whose individual values can be calculated by a computer (i.e., a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Turing_machine\">Turing machine<\/a>) in a finite amount of time using a finite amount of memory. We can refer to $D$ as the space of inputs or samples or examples and to $T$ as the space of outputs or labels.<\/p>\n<p>For the following definition we will want to be able to form a sum indexed by the elements of $D$. For this purpose we could suppose that our input space is finite or that the domain is a measure space and that the algorithm will be a measurable function. For simplicity and convenience, we will pretend that $D$ is finite. That is, if $D$ is infinite, then we have actually implicitly restricted to a finite subset (e.g., $D=\\Bbb R$ should really be interpreted as the finite subset of floating point numbers expressed in some fixed number of bits). 
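As a small illustration of the finiteness point (a sketch; the 64-bit float width is just one common choice of representation):

```python
import math

# The "real numbers" a computer actually works with form a finite set:
# a 64-bit IEEE 754 float has at most 2**64 distinct bit patterns, so
# the effective domain D is finite, as claimed above.
num_bit_patterns = 2 ** 64

# The representable reals are also discrete: math.ulp(x) is the gap
# between x and the next representable float, and it is never zero.
gap_at_one = math.ulp(1.0)

print(num_bit_patterns)  # an upper bound on |D| for 64-bit floats
print(gap_at_one > 0)    # True
```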
This poses no real restriction in applications (all real-world problem domains are effectively finite).<\/p>\n<h4>Definition<\/h4>\n<p>A performance measure of an algorithm will be a function $P\\colon D\\times T \\to \\Bbb R$. For two algorithms $f_1$ and $f_2$, we will say that $f_1$ <em>performs better than<\/em> $f_2$ (relative to $P$) if $$\\sum_{d\\in D} P(d,f_1(d)) > \\sum_{d\\in D} P(d, f_2(d)).$$ We similarly define &#8216;performs worse than&#8217; and &#8216;performs at least as well as&#8217;, etc.<\/p>\n<p>Alternatively, we may be presented with a <em>cost<\/em> function $C\\colon D\\times T\\to \\Bbb R$ which we want to <em>minimize<\/em>. One can postcompose any performance measure (resp. cost function) with an order-reversing map to obtain a cost function (resp. performance measure). This allows us to pass between the two notions.<\/p>\n<h4>Definition<\/h4>\n<p>A <em>machine learning algorithm<\/em> is an algorithm $f_-\\colon 2^E \\times D\\to T$ that learns with respect to a specified performance measure $P$ (defined below). Here $2^E$ is the set of all subsets of $E$. We call an element $S\\in 2^E$ (i.e., a subset $S\\subseteq E$) a set of <em>training data<\/em>. For such an $S$ and $f_-$, let $f_S=f_-(S,-) \\colon D\\to T$ denote the resulting algorithm (which we will say is <em>trained<\/em> on $S$).<\/p>\n<h4>Example<\/h4>\n<p>Consider the problem of finding a linear function $f\\colon D\\subset \\Bbb R\\to \\Bbb R$ which closely approximates a fixed function $F\\colon D\\subset \\Bbb R\\to \\Bbb R$. We can regard the function $F$ as a subset $S$ of $E=D\\times \\Bbb R$ by $F\\mapsto S= &#92;{(s,F(s)) | s\\in D&#92;}$<sup id=\"fnref-33-1\"><a href=\"#fn-33-1\" class=\"jetpack-footnote\">2<\/a><\/sup>. We can define a <em>cost<\/em> function $C\\colon D\\times \\Bbb R\\to \\Bbb R$ by $$C(d,t)=(F(d)-t)^2.$$ We can reformulate the problem of minimizing the cost $C(d,f(d))$ as maximizing the performance function $P(d,t)=\\frac{1}{1+C(d,t)}$. 
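The definitions above can be sketched in a few lines of Python; the finite domain, the target $F$, and the two candidate functions below are made up for illustration:

```python
# A finite input space D, a fixed target F, the cost C(d,t) = (F(d)-t)**2,
# the derived performance P(d,t) = 1/(1+C(d,t)), and a comparison of two
# candidate algorithms by their total performance over D.

D = [0.0, 0.5, 1.0, 1.5, 2.0]      # a finite input space (illustrative)
F = lambda d: 2.0 * d + 1.0        # the fixed function to approximate

def C(d, t):
    return (F(d) - t) ** 2         # cost of outputting t on input d

def P(d, t):
    return 1.0 / (1.0 + C(d, t))   # performance measure derived from C

def total_performance(f):
    return sum(P(d, f(d)) for d in D)

f1 = lambda d: 2.0 * d + 1.0       # matches F exactly
f2 = lambda d: d                   # a poor approximation

# f1 performs better than f2 relative to P:
print(total_performance(f1) > total_performance(f2))  # True
```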
The method of <a href=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/04\/linear-regression\/\">Linear Regression<\/a> will define a machine learning algorithm $f_-\\colon 2^E\\times D\\to \\Bbb R$.<\/p>\n<p>One can play with this problem using this <a href=\"https:\/\/www.geogebra.org\/m\/xC6zq7Zv\">demo<\/a>.<\/p>\n<h4>Second Attempt at a Definition<\/h4>\n<p>A machine learning algorithm $f\\colon 2^E\\times D\\to T$ <em>learns<\/em> (relative to a performance measure $P$) if for all proper subset inclusions  $S_1\\subset S_2\\subseteq E$, $f_{S_2}$ performs better than $f_{S_1}$.<\/p>\n<p>This is at least clearly defined, but it does not cover many examples of learning algorithms. For example, the method of linear regression will perform poorly when given a small training sample containing outliers (those $d\\in D$ such that $F(d)$ takes on an unlikely value under some assumed probability distribution associated to $F$). So, the act of adjoining outliers to our training data will typically decrease the performance of this machine learning algorithm.<\/p>\n<h4>Final Definition<\/h4>\n<p>A machine learning algorithm $f\\colon 2^E\\times D\\to T$ <em>learns<\/em> (relative to a performance measure $P$) if for each pair of natural numbers $m&lt;n$,  the average of the performances of $f_{S}$ over the subsets $S\\subseteq E$ of cardinality $n$ is better than the average of the performances over the subsets of cardinality $m$.<\/p>\n<h4>Crucial remark<\/h4>\n<p>Without a specified performance measure, we have no obvious goal in machine learning. We can only evaluate machine learning algorithms relative to a performance measure<sup id=\"fnref-33-3\"><a href=\"#fn-33-3\" class=\"jetpack-footnote\">3<\/a><\/sup> and an algorithm that performs well with respect to one measure may perform poorly with respect to another. 
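A toy illustration of this remark, with made-up error profiles: the same two predictors are ranked oppositely by a squared-error-based measure and an absolute-error-based one:

```python
# Two hypothetical predictors, described by their per-input errors.
errors_f1 = [1.0, 1.0, 1.0, 1.0, 1.0]  # small error on every input
errors_f2 = [0.0, 0.0, 0.0, 0.0, 3.0]  # perfect except one bad input

def perf_neg_squared(errors):
    # performance measure obtained from squared-error cost by negation
    return -sum(e ** 2 for e in errors)

def perf_neg_absolute(errors):
    # performance measure obtained from absolute-error cost by negation
    return -sum(abs(e) for e in errors)

# Under squared error f1 wins (-5 > -9); under absolute error f2 wins (-3 > -5).
print(perf_neg_squared(errors_f1) > perf_neg_squared(errors_f2))    # True
print(perf_neg_absolute(errors_f2) > perf_neg_absolute(errors_f1))  # True
```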
So the field of machine learning depends on first finding a good performance measure for the task at hand and <em>then<\/em> finding a machine learning algorithm that performs well with respect to this measure.<\/p>\n<h4>Remarks<\/h4>\n<p>Now that we have arrived at a definition of learning that seems to cover at least one standard example, I will unveil some unfortunate truths:<\/p>\n<ul>\n<li>In practice, we often do not have a performance measure that is defined over all of $D\\times T$; instead, we only have it over a subset. <\/li>\n<li>Machine learning algorithms are often presented without proofs that they learn in the sense above. <\/li>\n<\/ul>\n<h2>Taxonomy of machine learning<\/h2>\n<p>We can roughly divide up the field of machine learning into three branches:<\/p>\n<ol>\n<li><a href=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/09\/26\/supervised-learning\/\">Supervised Learning<\/a><\/li>\n<li>Unsupervised Learning<\/li>\n<li>Reinforcement learning<\/li>\n<\/ol>\n<h2>Supervised learning<\/h2>\n<p>Before getting into what supervised learning precisely is, let&#8217;s look at some examples of supervised learning tasks:<\/p>\n<ol>\n<li><a href=\"https:\/\/www.kaggle.com\/uciml\/breast-cancer-wisconsin-data\">Identifying breast cancer<\/a>. 
<!--- ![Image of cells](http:\/\/curiousily.com\/assets\/1.diagnosing_breast_cancer_files\/biopsy.jpg) -->\n<ul>\n<li>A <a href=\"https:\/\/www.kaggle.com\/gargmanish\/basic-machine-learning-with-cancer\/data\">sample study<\/a>.<\/li>\n<\/ul>\n<\/li>\n<li><a href=\"http:\/\/cs.stanford.edu\/people\/karpathy\/ilsvrc\/\">Image classification<\/a>.\n<ul>\n<li>List of last year&#8217;s ILSVRC <a href=\"http:\/\/image-net.org\/challenges\/LSVRC\/2016\/results\">Winners<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"https:\/\/www.kaggle.com\/c\/passenger-screening-algorithm-challenge\">Threat assessment<\/a><br \/>\n<!--- ![Best to be secure](https:\/\/media.giphy.com\/media\/pNTqjthDxPDBm\/giphy.gif) --><\/li>\n<li><a href=\"https:\/\/translate.google.com\/#\">Language Translation<\/a> <!--- ![Alignment problems](https:\/\/devblogs.nvidia.com\/wp-content\/uploads\/2015\/07\/Figure6_sample_translations1-624x282.png) -->\n<ul>\n<li><a href=\"https:\/\/www.nytimes.com\/2016\/12\/14\/magazine\/the-great-ai-awakening.html\">NYTimes: Google&#8217;s neural machine translation<\/a><\/li>\n<\/ul>\n<\/li>\n<li><a href=\"https:\/\/qph.ec.quoracdn.net\/main-qimg-1e398bd3e145227faf1cb31373dc4ec8.webp\">Identifying faces in images<\/a><\/li>\n<\/ol>\n<h4>Definition<\/h4>\n<p><em>Supervised learning<\/em> is concerned with the construction of machine learning algorithms $f\\colon 2^{D\\times T}\\times D\\to T$. The subsets of $D\\times T$ are referred to as subsets of <em>labeled examples<\/em>.<\/p>\n<p>We often assume that the labeled examples arise from restricting some presumed function $F\\colon D\\to T$ whose values we know on $E\\subset D$. We can then train $f$ on pairs $&#92;{(s, F(s)) | s\\in E&#92;}$. 
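A minimal sketch of this setup, with a made-up $F$ and $E$: form the labeled pairs $(s, F(s))$ and fit a line by ordinary least squares (in practice $F$ is unknown and only the labeled pairs are given):

```python
# Labeled examples from a presumed (normally unknown) function F on a
# finite E, then an ordinary least-squares line fit on those pairs.

E = [0.0, 1.0, 2.0, 3.0, 4.0]          # illustrative choice of E
F = lambda s: 3.0 * s - 2.0            # illustrative choice of F
pairs = [(s, F(s)) for s in E]         # the training data S

n = len(pairs)
mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / \
        sum((x - mean_x) ** 2 for x, _ in pairs)
intercept = mean_y - slope * mean_x

f_S = lambda d: slope * d + intercept  # the algorithm trained on S
print(round(slope, 6), round(intercept, 6))  # recovers 3.0 and -2.0
```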
We can then devise a cost function which measures the distance between the learned $f$ and the presumed $F$ (e.g., <a href=\"https:\/\/en.wikipedia.org\/wiki\/Norm_(mathematics)#p-norm\">$L^2$-distance<\/a>).<\/p>\n<p>Supervised learning has been the most successful of the three branches in applications to real-world problems. The existence of labeled examples usually leads to well-defined performance metrics that transform supervised learning tasks into two subtasks:<\/p>\n<ol>\n<li>Find an appropriate parametrized class of functions to choose $f$ from.<\/li>\n<li>Solve the optimization problem of finding the best function in that class. <\/li>\n<\/ol>\n<h4>Example<\/h4>\n<p>The first example of a supervised learning task is the linear regression problem mentioned above.<\/p>\n<h2>Unsupervised learning<\/h2>\n<p>First some examples:<\/p>\n<ol>\n<li>Some sample techniques in unsupervised learning can be seen at this beautiful <a href=\"http:\/\/colah.github.io\/posts\/2014-10-Visualizing-MNIST\/\">blog post<\/a>.<\/li>\n<li><a href=\"https:\/\/blog.acolyer.org\/2016\/04\/21\/the-amazing-power-of-word-vectors\/\">Word vector embeddings<\/a>.<\/li>\n<li><a href=\"https:\/\/danieltakeshi.github.io\/2015\/01\/03\/independent-component-analysis-a-gentle-introduction\/\">Independent component analysis<\/a><\/li>\n<\/ol>\n<h4>Definition<\/h4>\n<p><em>Unsupervised learning<\/em> is concerned with the construction of machine learning algorithms of the form $f\\colon 2^D\\times D\\to T$.<\/p>\n<p>In this case, there is no obvious measure of performance for such an algorithm. The choice of performance measure essentially defines the unsupervised learning task.<\/p>\n<p>Often, one can vaguely state that the goal of an unsupervised learning task is to summarize the data or determine its essential features. 
In other words, suppose that we have a function $i\\colon T\\to D$; then we can state the goal as finding an $f\\colon D\\to T$ such that $i\\circ f$ <em>approximates the identity function<\/em>. That is, our unsupervised learning task is to find a compression algorithm! In this case, we have transformed the unsupervised learning task into a supervised learning task (because we have as many labels as we like of the identity function).<\/p>\n<p>The only task remaining (and it is the hard one) is to make sense of the word &#8216;approximates&#8217; above by choosing an appropriate measure. For example, how do you quantitatively measure the quality of a video compression algorithm? Somehow we can usually see the difference between a terrible algorithm and a good one, but it can be difficult to turn this intuition into a well-defined measure.<\/p>\n<h4>Example<\/h4>\n<p>One of the typical tasks in unsupervised learning is to identify &#8216;clusters&#8217; of data points in a high-dimensional space. We may be looking for difficult-to-identify relationships between data points. Once we have identified a classification that we deem to be &#8216;meaningful&#8217;, our algorithm should be able to sort new data points into the correct classes. 
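The clustering task can be sketched with a tiny $k$-means on made-up one-dimensional data; this is only an illustration of the idea, and real implementations treat dimensions, initialization, and convergence much more carefully:

```python
import random

# Made-up data drawn around two centers, then Lloyd's algorithm with k=2.
random.seed(0)
data = [random.gauss(0.0, 0.5) for _ in range(50)] + \
       [random.gauss(10.0, 0.5) for _ in range(50)]

centers = [min(data), max(data)]       # crude initialization
for _ in range(10):                    # Lloyd's iterations
    clusters = [[], []]
    for x in data:
        # assign each point to its nearest center
        i = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
        clusters[i].append(x)
    # move each center to the mean of its assigned points
    centers = [sum(c) / len(c) for c in clusters]

# New data points are sorted into a class by their nearest learned center:
classify = lambda x: min(range(2), key=lambda i: abs(x - centers[i]))
print(sorted(round(c) for c in centers))  # ≈ [0, 10]
```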
For example, one can look at demographic data to identify clusters of voters with similar voting patterns and then try to identify the correct approach to appeal to each group.<\/p>\n<h2>Reinforcement learning<\/h2>\n<p>Reinforcement learning is applied to tasks where one must choose actions whose outcomes eventually lead to various rewards\/punishments which are used to encourage or discourage certain choices:<\/p>\n<ol>\n<li><a href=\"https:\/\/www.youtube.com\/watch?v=V1eYniJ0Rnk\">Breakout<\/a> <\/li>\n<li>One of the more difficult challenges is <a href=\"https:\/\/www.youtube.com\/watch?v=0yI2wJ6F8r0&amp;feature=youtu.be\">Montezuma&#8217;s revenge<\/a> where there is very little feedback and one of the implicit motivations is curiosity.<\/li>\n<li>Another truly amazing achievement is <a href=\"https:\/\/deepmind.com\/research\/alphago\/\">AlphaGo<\/a>, or its successor <a href=\"https:\/\/deepmind.com\/blog\/alphago-zero-learning-scratch\/\">AlphaGo Zero<\/a>.<\/li>\n<li><a href=\"https:\/\/www.youtube.com\/watch?v=aT_wCH0r9yE\">Self-driving cars<\/a><\/li>\n<\/ol>\n<p>To construct a reasonably precise definition, let&#8217;s describe a typical reinforcement learning task<sup id=\"fnref-33-4\"><a href=\"#fn-33-4\" class=\"jetpack-footnote\">4<\/a><\/sup>. In this case, the task is to construct an <em><a href=\"https:\/\/en.wikipedia.org\/wiki\/Intelligent_agent\">agent<\/a><\/em> (the name comes from the field of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Artificial_intelligence\">artificial intelligence<\/a>). The agent learns a <em>policy<\/em>, which is a mapping from <em>perceived states<\/em> to <em>actions<\/em>. Paired with the agent is an <em>environment<\/em>. The environment maintains an <em>environment state<\/em> and, when given an action, returns an <em>observation<\/em> and a <em>reward<\/em> to the agent, which it then uses to update its policy and perceived state and choose an action. 
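The interaction just described can be sketched with a toy environment; the two-armed bandit, the names, and the numbers below are all invented for illustration:

```python
import random

# A toy agent/environment loop: the environment is a two-armed bandit,
# the agent's perceived state is a running estimate of each action's
# value, and its policy is greedy with occasional exploration.

random.seed(1)
ARM_MEANS = [0.3, 0.7]                 # hidden environment state

def environment_step(action):
    # returns an (observation, reward) pair for the chosen action
    reward = 1.0 if random.random() < ARM_MEANS[action] else 0.0
    return action, reward              # observation: which arm was pulled

estimates = [0.0, 0.0]                 # agent's perceived state
counts = [0, 0]

def policy():
    if random.random() < 0.1:          # explore occasionally
        return random.randrange(2)
    return max(range(2), key=lambda a: estimates[a])  # otherwise exploit

for _ in range(2000):                  # the feedback loop
    action = policy()
    observation, reward = environment_step(action)
    # update the perceived state from the observation and reward
    counts[observation] += 1
    estimates[observation] += (reward - estimates[observation]) / counts[observation]

print(estimates[1] > estimates[0])     # the agent learns that arm 1 pays more
```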
This creates a feedback loop:<br \/>\n<figure style=\"width: 908px\" class=\"wp-caption aligncenter\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" src=\"https:\/\/i0.wp.com\/web.stanford.edu\/class\/cs234\/images\/header2.png?resize=840%2C324\" width=\"840\" height=\"324\" class=\"size-medium\" \/><figcaption class=\"wp-caption-text\">Reinforcement learning dynamics<\/figcaption><\/figure><\/p>\n<p>We can break up the agent&#8217;s task into two parts:<\/p>\n<ol>\n<li>It must take in an observation and reward and update its perceived state. To update its perceived state it might maintain a <em>model<\/em> of its environment and use the observation and reward to update this model.<\/li>\n<li>It must take in a sequence of observations and rewards and update its policy. The policy is supposed to choose an allowable action, and this decision could be made by defining a <em>value<\/em> function, which assigns a value to each possible action from the current state, and choosing the action that maximizes the value. Since the agent is operating with <a href=\"https:\/\/en.wikipedia.org\/wiki\/Perfect_information\">imperfect information<\/a>, the value function will only be an estimate, and the agent may instead choose to <em>explore<\/em> by selecting a lower-valued action in the hope that the action is being undervalued. We can model this exploration possibility by saying that the policy assigns a probability to each possible action from a fixed perceived state. <\/li>\n<\/ol>\n<p>Writing the general task out as a learning algorithm will get quite cumbersome. So let&#8217;s just hit the main points.<\/p>\n<ul>\n<li>The environment will be encoded as a probability space $S_e \\times A\\times S_e$, which encodes the probability of transitioning from one environment state to another given a fixed action. 
<\/li>\n<li>The policy will be encoded as a probability space $S_p \\times A$, which encodes the probability of choosing an action from a given perceived state.<\/li>\n<li>We will also require some type of model, which will be a probability space $S_p\\times A\\times O\\times R\\times S_p$, which will encode the probability of moving from one perceived state to another given an action, an observation, and a reward. <\/li>\n<li>We will be given an environmental feedback function, which will be a random variable $F\\colon S_e\\times A\\times S_e \\to O\\times R$. <\/li>\n<\/ul>\n<p>All of these components can then be combined and we can form the expected reward of all sequences of length $n$ of tuples of the above variables<sup id=\"fnref-33-5\"><a href=\"#fn-33-5\" class=\"jetpack-footnote\">5<\/a><\/sup>. The <em>reinforcement learning<\/em> task is to learn the policy and the model so that we can maximize the expected reward over sequences of length $n$ as $n$ goes to $\\infty$.<\/p>\n<p>In this generality, the task is extremely difficult. We could simplify things by assuming the environment provides complete feedback about the perceived state, so that the observation is the perceived state (for example, a game with perfect information). But even then this task can be extremely difficult if there are many potential perceived states and many potential actions. For example, in the game <a href=\"https:\/\/en.wikipedia.org\/wiki\/Go_(game)\">Go<\/a> there are two players (black and white) who take alternating turns placing pieces on a 19&#215;19 board, occasionally removing captured pieces. As an approximate upper bound on the perceived state space, we can see that there are no more than $3^{19\\cdot 19}\\approx 10^{172}$ possible board configurations (each square is either black, white, or blank), and from each of these positions we have no more than $19^2=361$ possible actions to choose from. 
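These back-of-the-envelope bounds are easy to check numerically:

```python
import math

# 19*19 board points bound the number of actions, and 3**(19*19)
# colourings (black/white/blank) bound the board configurations.
board_points = 19 * 19
print(board_points)                          # 361

# log10(3**361) = 361 * log10(3) ≈ 172, so 3**361 ≈ 10**172.
print(round(board_points * math.log10(3)))   # 172
```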
Searching through all possible chains of configurations and actions is completely intractable. Even encoding an optimal policy in tabular form is impossible. This is part of the reason that performing well on this game was such a high hurdle for the AI community.<\/p>\n<div class=\"footnotes\">\n<hr \/>\n<ol>\n<li id=\"fn-33-0\">\nWe could also allow partially defined functions.&#160;<a href=\"#fnref-33-0\">&#8617;<\/a>\n<\/li>\n<li id=\"fn-33-1\">\nIndeed, if you look under the hood, this is how functions are actually <a href=\"https:\/\/en.wikipedia.org\/wiki\/Function_(mathematics)#Definition\">defined<\/a>.<br \/>\n<!-- [^2]: Technically this function will only be partially defined. If $E$ contains two elements $(d,a)$ and $(d,b)$ with $a\\neq b$ (i.e., $E$ does not consist of the input-output pairs of a well-defined function), then linear regression will fail. Note that we can't define the performance measure as above in this case. While this may seem like a mere technicality, it can happen in a real-world environment in which two different real numbers are mapped to the same floating point representation $d$ then such an error can occur. 
More generally, if $d_1-d_2$ is very small and $a_1-a_2$ is very large, then a standard linear regression algorithm trained on $(d_1,a_1)$ and $(d_2, a_2)$ will produce drastically different functions if we perturb $d_1$ slightly. This makes such an algorithm numerically unstable. -->&#160;<a href=\"#fnref-33-1\">&#8617;<\/a>\n<\/li>\n<li id=\"fn-33-3\">\nOkay, runtime and memory usage are important too.&#160;<a href=\"#fnref-33-3\">&#8617;<\/a>\n<\/li>\n<li id=\"fn-33-4\">\nI must admit this first attempt at finding a definition of reinforcement learning that fits into the above framework is a bit clumsy and accurately reflects my limited knowledge in this area.&#160;<a href=\"#fnref-33-4\">&#8617;<\/a>\n<\/li>\n<li id=\"fn-33-5\">\nAs a non-trivial exercise, write this down for sequences of length 1 and 2.&#160;<a href=\"#fnref-33-5\">&#8617;<\/a>\n<\/li>\n<\/ol>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Science is knowledge which we understand so well that we can teach it to a computer; and if we don&#8217;t fully understand something, it is an art to deal with it. Donald Knuth Introduction First Attempt at a Definition One says that an algorithm learns if its performance improves with additional experience. 
This is not &hellip; <a href=\"http:\/\/www.nullplug.org\/ML-Blog\/2017\/09\/26\/machine-learning-overview\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Machine Learning Overview&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[1],"tags":[],"class_list":["post-33","post","type-post","status-publish","format-standard","hentry","category-general"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p9dIpN-x","jetpack_likes_enabled":true,"jetpack-related-posts":[{"id":35,"url":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/09\/26\/supervised-learning\/","url_meta":{"origin":33,"position":0},"title":"Supervised Learning","author":"Justin Noel","date":"September 26, 2017","format":false,"excerpt":"A big computer, a complex algorithm, and a long time does not equal science. - Robert Gentleman Examples Before getting into what supervised learning precisely is, let's look at some examples of supervised learning tasks: Identifying breast cancer. A sample study. 
Image classification. List of last year's ILSVRC Winners Threat\u2026","rel":"","context":"In &quot;Supervised Learning&quot;","block_context":{"text":"Supervised Learning","link":"http:\/\/www.nullplug.org\/ML-Blog\/category\/supervised-learning\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":63,"url":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/09\/26\/computer-science-background\/","url_meta":{"origin":33,"position":1},"title":"Computer Science Background","author":"Justin Noel","date":"September 26, 2017","format":false,"excerpt":"If you find that you're spending almost all your time on theory, start turning some attention to practical things; it will improve your theories. If you find that you're spending almost all your time on practice, start turning some attention to theoretical things; it will improve your practice. - Donald\u2026","rel":"","context":"In &quot;Supplementary material&quot;","block_context":{"text":"Supplementary material","link":"http:\/\/www.nullplug.org\/ML-Blog\/category\/supplementary-material\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":61,"url":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/09\/26\/probability-and-statistics-background\/","url_meta":{"origin":33,"position":2},"title":"Probability and Statistics Background","author":"Justin Noel","date":"September 26, 2017","format":false,"excerpt":"Statistics - A subject which most statisticians find difficult, but in which nearly all physicians are expert. - Stephen S. Senn Introduction For us, we will regard probability theory as a way of logically reasoning about uncertainty. 
I realize that this is not a precise mathematical definition, but neither is\u2026","rel":"","context":"In &quot;Supplementary material&quot;","block_context":{"text":"Supplementary material","link":"http:\/\/www.nullplug.org\/ML-Blog\/category\/supplementary-material\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":344,"url":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/10\/10\/parameter-estimation\/","url_meta":{"origin":33,"position":3},"title":"Parameter Estimation","author":"Justin Noel","date":"October 10, 2017","format":false,"excerpt":"\u2026the statistician knows\u2026that in nature there never was a normal distribution, there never was a straight line, yet with normal and linear assumptions, known to be false, he can often derive results which match, to a useful approximation, those found in the real world. - George Box (JASA, 1976, Vol.\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"http:\/\/www.nullplug.org\/ML-Blog\/category\/general\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?resize=350%2C200 1x, https:\/\/i0.wp.com\/www.nullplug.org\/ML-Blog\/wp-content\/uploads\/2017\/10\/compressed_polyreg_normal.gif?resize=525%2C300 1.5x"},"classes":[]},{"id":572,"url":"http:\/\/www.nullplug.org\/ML-Blog\/2018\/01\/19\/572\/","url_meta":{"origin":33,"position":4},"title":"Problem Set  12","author":"Justin Noel","date":"January 19, 2018","format":false,"excerpt":"Problem Set 12 This is to be completed by January 25th, 2018. Exercises Datacamp Complete the lesson: a. Python Data Science Toolbox (Part I) Let $S\\subset \\Bbb R^n$ with $|S|<\\infty$. Let $\\mu=\\frac{1}{|S|}\\sum_{x_i\\in S} x_i$. 
Show that $$ \\frac{1}{|S|}\\sum_{(x_i,x_j)\\in S\\times S} ||x_i-x_j||^2 = 2\\sum_{x_i\\in S} ||x_i-\\mu||^2.$$ Prove that the $K$-means clustering\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"http:\/\/www.nullplug.org\/ML-Blog\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":118,"url":"http:\/\/www.nullplug.org\/ML-Blog\/2017\/09\/27\/additional-sources\/","url_meta":{"origin":33,"position":5},"title":"Additional Sources","author":"Justin Noel","date":"September 27, 2017","format":false,"excerpt":"Textbooks Machine Learning: A probabilistic perspective\u00a0by Kevin Murphy. The material in this book is closest to what we will cover in the course, but is unfortunately not available for free. Written by an academic and a practitioner of machine learning, this text is full of real world examples and applications,\u2026","rel":"","context":"In &quot;Supplementary material&quot;","block_context":{"text":"Supplementary 
material","link":"http:\/\/www.nullplug.org\/ML-Blog\/category\/supplementary-material\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"_links":{"self":[{"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/posts\/33","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/comments?post=33"}],"version-history":[{"count":11,"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/posts\/33\/revisions"}],"predecessor-version":[{"id":57,"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/posts\/33\/revisions\/57"}],"wp:attachment":[{"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/media?parent=33"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/categories?post=33"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.nullplug.org\/ML-Blog\/wp-json\/wp\/v2\/tags?post=33"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}