## Problem Set 6

This is to be completed by November 30th, 2017.

### Exercises

1. Datacamp
• Complete the lesson:
a. Text Mining: Bag of Words
2. Exercises from Elements of Statistical Learning
• Complete exercises:
a. 4.2
b. 4.6
3. Run the perceptron learning algorithm by hand for the two-class classification problem with $(X,Y)$-pairs (given by bitwise or): $((0,0),0), ((1,0),1), ((0,1),1), ((1,1),1)$.

4. R Lab:

• Update the LDA Classifier from last week as follows.
a. After fitting an LDA Classifier, produce a function which projects an input sample onto the hyperplane containing the class centroids.
b. Update the classifier to use these projections for classification. Compare the runtimes of prediction of the two methods when the number of features is large relative to the number of classes.
• Construct a perceptron classifier for two-class classification. Put an upper bound on the number of steps.
a. Evaluate the perceptron on the above problem and on the bitwise xor problem: $((0,0),0), ((1,0),1), ((0,1),1), ((1,1),0)$.
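The perceptron of problems 3 and 4 can be sketched as follows. This is a minimal Python illustration of the idea (the lab itself asks for R); the function name and default step bound are just choices for this sketch. On the bitwise-or data the algorithm converges, while on xor, which is not linearly separable, it exhausts the step bound.

```python
import numpy as np

def perceptron(X, y, max_steps=1000):
    """Perceptron for two-class labels y in {0, 1}, with an upper bound
    on the number of passes.  Returns (weights, bias, converged)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    t = np.where(y == 1, 1, -1)          # map labels {0,1} -> {-1,+1}
    for _ in range(max_steps):
        updated = False
        for xi, ti in zip(X, t):
            if ti * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += ti * xi
                b += ti
                updated = True
        if not updated:                  # a full pass with no mistakes
            return w, b, True
    return w, b, False

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
y_or = np.array([0, 1, 1, 1])
y_xor = np.array([0, 1, 1, 0])

w, b, ok = perceptron(X, y_or)                       # converges
_, _, ok_xor = perceptron(X, y_xor, max_steps=100)   # hits the step bound
```

Note that counting a point exactly on the decision boundary as a mistake (`<= 0`) is what forces the algorithm to move off the all-zero initialization.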

## Problem Set 5

This is to be completed by November 23rd, 2017.

### Exercises

1. Datacamp
• Complete the lesson:
a. Machine Learning Toolbox
2. R Lab:
• Write a function in R that will take in a vector of discrete variables and will produce the corresponding one-hot encodings.
• Write a function in R that will take in a matrix $X$ of samples and a vector $Y$ of classes (in $(1,\dots,K)$) and produces a function which classifies a new sample according to the LDA rule (do not use R’s built-in machine learning facilities).
• Do the same for QDA.
• Apply your models to the MNIST dataset for handwritten digit classification. There are various ways to get this dataset, but perhaps the easiest is to pull it in through the keras package. Besides, having keras available is useful anyway. You may need to reduce the dimension of the data and/or the number of samples to get this to work in a reasonable amount of time.
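The first lab task (one-hot encoding) is simple enough to illustrate concisely. Below is a Python sketch of the idea (the lab asks for R); the function name and the convention of ordering columns by sorted level are choices made here, not requirements.

```python
import numpy as np

def one_hot(labels):
    """One-hot encode a vector of discrete labels.
    Rows are samples; columns are the sorted distinct levels."""
    levels = sorted(set(labels))
    index = {lvl: j for j, lvl in enumerate(levels)}
    out = np.zeros((len(labels), len(levels)), dtype=int)
    for i, lvl in enumerate(labels):
        out[i, index[lvl]] = 1   # exactly one 1 per row
    return out, levels
```

The R version would naturally return a matrix whose columns are named by the factor levels; `model.matrix` does something similar, but the point of the exercise is to build it by hand.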

## Problem Set 4

This is to be completed by November 16th, 2017.

### Exercises

1. Datacamp
• Complete the lessons:
a. Supervised Learning in R: Regression
b. Supervised Learning in R: Classification
c. Exploratory Data Analysis (If you did not already do so)
2. Let $\lambda\geq 0$, $X\in \Bbb R^n\otimes \Bbb R^m$, $Y\in \Bbb R^n$, and $\beta \in \Bbb R^m$ suitably regarded as matrices.
• Identify when $$\textrm{argmin}_\beta (X\beta-Y)^t(X\beta-Y)+\lambda \beta^t\beta$$ exists, and determine it in these cases.
• How does the size of $\lambda$ affect the solution? When might it be desirable to set $\lambda$ to be positive?
3. Bayesian approach to linear regression. Suppose that $\beta\sim N(0,\tau^2)$, and the distribution of $Y$ conditional on $X$ is $N(X\beta,\sigma^2I)$, i.e., $\beta$, $X$, and $Y$ are vector valued random variables. Show that, after seeing some data $D$, the MAP and mean estimates of the posterior distribution for $\beta$ correspond to solutions of the previous problem.

4. R Lab:

• Write a linear regression function that takes in a matrix of $x$-values and a corresponding vector of $y$-values and returns a function derived from the linear regression fit.
• Write a function that takes in a non-negative number (the degree), a vector of $x$-values and a corresponding vector of $y$-values and returns a function derived from the polynomial regression fit.
• Write a function that takes in a number $n$, a vector of $x$-values, and a corresponding vector of $y$-values and returns a function of the form: $$f(x)=\sum_{i=0}^n a_i \sin(ix)+b_i\cos(ix).$$
• Generate suitable testing data for the three functions constructed above and plot the fitted functions.
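For problem 2, whenever $X^tX+\lambda I$ is invertible (always the case for $\lambda>0$, and for $\lambda=0$ exactly when $X$ has full column rank) the argmin is $\hat\beta=(X^tX+\lambda I)^{-1}X^tY$. The lab's regression functions then all reduce to solving these normal equations against different design matrices. A Python sketch of this (the lab asks for R; function names are illustrative):

```python
import numpy as np

def ridge_fit(X, Y, lam=0.0):
    """Minimizer of (Xb - Y)^t (Xb - Y) + lam * b^t b,
    i.e. b = (X^t X + lam I)^{-1} X^t Y via the normal equations."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)

def poly_fit(degree, x, y, lam=0.0):
    """Polynomial regression: build the Vandermonde design matrix
    with columns 1, x, ..., x^degree, fit, and return the fitted
    function."""
    X = np.vander(x, degree + 1, increasing=True)
    beta = ridge_fit(X, y, lam)
    return lambda t: np.vander(np.atleast_1d(t), degree + 1,
                               increasing=True) @ beta
```

The trigonometric fit in the third bullet works the same way, with design-matrix columns $\sin(ix)$ and $\cos(ix)$ for $i=0,\dots,n$ in place of the monomials.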

## Problem Set 3

This is to be completed by November 9th, 2017.

### Exercises

1. [Datacamp](https://www.datacamp.com/home)
• Complete the lesson “Introduction to Machine Learning”.
• This should have also included “Exploratory Data Analysis”; it has been added to next week’s assignment.
2. MLE for the uniform distribution.
• (Source: Kaelbling/Murphy) Consider a uniform distribution centered on 0 with width $2a$. The density function is given by: $$p(x) = \frac{\chi_{[-a,a]}(x)}{2a}.$$
a. Given a data set $x_1,\cdots, x_n,$ what is the maximum likelihood estimate $a_{MLE}$ of $a$?
b. What probability would the model assign to a new data point $x_{n+1}$ using $a_{MLE}$?
c. Do you see any problem with the above approach? Briefly suggest (in words) a better approach.
3. Calculate the expected value and mode of $\theta$ when $\theta \sim \textrm{Beta}(\alpha, \beta)$.
4. Change of variables:
• Let $X\colon S\to T_1$ be a discrete valued random variable with pmf $p_X\colon T_1\to [0,1]$ and let $Y\colon T_1\to T_2$ be a function. Derive the pmf $p_{Y\circ X}\colon T_2\to [0,1]$ in terms of $p_X$ and $Y$.
• Let $X^n\colon S^{\times n}\to \{0,1\}^{\times n}$ be the random variable whose values give $n$ independent samples of a Bernoulli random variable $X$ with parameter $\theta$ (i.e., $p_X(1)=\theta$). Show that $$p_{X^n}(v_1,\cdots, v_n)=\theta^{\sum v_i}(1-\theta)^{n-\sum v_i}.$$ Now let $\sigma \colon \{0,1\}^{\times n}\to \{0,\dots,n\}$ be defined by taking the sum of the entries. The composite $\sigma\circ X^{n}$ is called a Binomial random variable with parameters $n$ and $\theta$. Determine $p_{\sigma\circ X^n}(k)$.
• Let $X\colon S\to \Bbb R$ be a random variable with piecewise continuous pdf $p_X$ and let $Y\colon \Bbb R\to \Bbb R$ be a differentiable monotonic function. Show that $Y\circ X$ is a random variable and determine $p_{Y\circ X}$.
5. Uninformative prior for log-odds ratio:
• (Source: Murphy) Let $$\phi = \textrm{logit}(\theta) = \log \frac{\theta}{1-\theta}.$$ Show that if $p(\phi)\propto 1,$ then $p(\theta)\propto \textrm{Beta}(0,0)$.
6. R Lab:
• Construct and apply a Naive Bayes classifier for a specific text classification problem (e.g., spooky author identification) from scratch. In other words, do not use any modeling libraries. Feel free to use any libraries you like to get the data into an acceptable format.
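The core of the lab, a from-scratch Naive Bayes text classifier, can be sketched as follows in Python (the class name, smoothing choice, and toy data below are illustrative; for the actual lab you would fit this to a real text corpus such as the spooky-author data):

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words counts, with add-one
    (Laplace) smoothing.  No modeling libraries, as the lab requires."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = sorted({w for d in docs for w in d.split()})
        n = len(labels)
        self.log_prior = {}
        self.log_lik = {}
        for c in self.classes:
            class_docs = [d for d, l in zip(docs, labels) if l == c]
            counts = Counter(w for d in class_docs for w in d.split())
            total = sum(counts.values())
            self.log_prior[c] = math.log(len(class_docs) / n)
            # P(w | c) with add-one smoothing over the shared vocabulary
            self.log_lik[c] = {
                w: math.log((counts[w] + 1) / (total + len(self.vocab)))
                for w in self.vocab
            }
        return self

    def predict(self, doc):
        def score(c):  # log posterior up to a constant; unseen words skipped
            return self.log_prior[c] + sum(
                self.log_lik[c][w] for w in doc.split() if w in self.log_lik[c])
        return max(self.classes, key=score)
```

A toy usage example: fitting on a handful of labeled strings and classifying new ones.

```python
docs = ["spooky ghost haunt", "ghost spooky night",
        "sunny day park", "park day walk"]
labels = ["horror", "horror", "calm", "calm"]
nb = NaiveBayes().fit(docs, labels)
nb.predict("spooky ghost")   # classified as "horror"
```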