## Problem Set 3

This is to be completed by November 9th, 2017.

### Exercises

1. [Datacamp](https://www.datacamp.com/home
• Complete the lesson “Introduction to Machine Learning”.
• This should have also included “Exploratory Data Analysis”. This has been added to the next week’s assignment.
2. MLE for the uniform distribution.
• (Source: Kaelbling/Murphy) Consider a uniform distribution centered on 0 with width 2a. The density function is given by: $$p(x) = \frac{\chi_{[-a,a]}}{2a}.$${
a. Given a data set $x_1,\cdots, x_n,$ what is the maximum likelihood estimate $a_{MLE}$ of $a$?
b. What probability would the model assign to a new data point $x_{n+1}$ using $a_{MLE}$?
c. Do you see any problem with the above approach? Briefly suggest (in words) a better approach.
3. Calculate the expected value and mode of $\theta$ when $\theta \sim \textrm{Beta}(\alpha, \beta)$,
4. Change of variables:
• Let $X\colon S\to T_1$ be a discrete valued random variable with pmf $p_X\colon T_1\to [0,1]$ and let $Y\colon T_1\to T_2$ be a function. Derive the pmf $p_{Y\circ X}\colon T_2\to [0,1]$ in terms of $p_X$ and $Y$.
• Let $X^n\colon S^{\times n}\to \{0,1\}^{\times n}$ be the random variable whose values give $n$-independent samples of a Bernoulli random variable $X$ with parameter $\theta$ (i.e., $p_X(1)=\theta$). Show that $$p_{X^n}(v_1,\cdots, v_n)=\theta^{\sum{v_i}}(1-\theta)^{n-\sum v_i}.$$ Now let $\sigma \colon \{0,1\}^{\times n}\to \{0,…n\}$ be defined by taking the sum of the entries. The composite $\sigma\circ X^{n}$ is called a Binomial random variable with parameters $n$ and $\theta.$ Determine $p_{\sigma\circ X^n}(k)$.
• Let $X\colon S\to \Bbb R$ be a random variable with piecewise continuous pdf $p_X$ and let $Y\colon \Bbb R\to \Bbb R$ be a differentiable monotonic function. Show that $Y\circ X$ is a random variable and determine $p_{Y\circ X}$.
5. Uninformative prior for log-odds ratio:
• (Source: Murphy) Let $$\phi = \textrm{logit}(\theta) = \log \frac{\theta}{1-\theta}.$$ Show that if $p(\phi)\propto 1,$ then $p(\theta)\propto \textrm{Beta}(0,0)$.
6. R Lab:
• Construct and apply a Naive Bayes classifier for a specific text classification problem (e.g., spooky author identification) from scratch. In other words, do not use any modeling libraries. Feel free to use any libraries you like to get the data into an acceptable format.

## Problem Set 2

This is to be completed by November 2nd, 2017.

### Exercises

1. Datacamp
• Complete the lesson “Data Visualization in R”.
2. Probabilities are sensitive to the form of the question that was used to generate the answer:
• (Source: Minka, Murphy.) My neighbor has two children. Assuming that the gender of a child is like a coin flip, it is most likely, a priori, that my neighbor has one boy and one girl, with probability 1/2. The other possibilities—two boys or two girls—have probabilities 1/4 and 1/4.
a. Suppose I ask him whether he has any boys, and he says yes. What is the probability that one child is a girl?
b. Suppose instead that I happen to see one of his children run by, and it is a boy. What is the probability that the other child is a girl?
3. Legal reasoning
• (Source: Peter Lee, Murphy) Suppose a crime has been committed. Blood is found at the scene for which there is no innocent explanation. It is of a type which is present in 1% of the population.
a. The prosecutor claims: “There is a 1% chance that the defendant would have the crime blood type if he were innocent. Thus there is a 99% chance that he guilty”. This is known as the prosecutor’s fallacy. What is wrong with this argument?
b. The defender claims: “The crime occurred in a city of 800,000 people. The blood type would be found in approximately 8000 people. The evidence has provided a probability of just 1 in 8000 that the defendant is guilty, and thus has no relevance.” This is known as the defender’s fallacy. What is wrong with this argument?
4. Bayes rule for medical diagnosis
• (Source: Koller, Murphy.) After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don’t have the disease). The good news is that this is a rare disease, striking only one in 10,000 people.
a. What are the chances that you actually have the disease? (Show your calculations as well as giving the final result.)
5. Conditional independence (Source: Koller.)
• Let $H\in {1,\cdots, K}$ be a discrete random variable, and let $e_1$ and $e_2$ the observed values of two other random variables $E_1$ and $E_2$. Suppose we wish to calculate the vector
$$P(H|e_1, e_2) = (P(H=1|e_1,e_2),\cdots, P(H=K|e_1, e_2)).$$
a. Which of the following sets of numbers are sufficient for the calculation?
i. $P(e_1, e_2), P(H), P(e_1| H), P(e_2, H)$.
ii. $P(e_1, e_2), P(H), P(e_1, e_2 | H)$.
iii. $P(e_1|H), P(e_2|H), P(H)$.

b. Now suppose we now assume $E_1\perp E_2 | H$ (i.e., $E_1$ and $E_2$ are independent given $H$). Which of the above 3 sets are sufficient now?

6. R lab
• Estimate the value of $\pi$ by taking uniform random samples from the square $[-1,1]\times [1,1]$ and seeing which lie in the disc $x^2+y^2\leq 1$.
• A company is trying to determine why their employees leave and why they stay. They have a list of roughly 15000 employee records here.
a. Download this dataset and load it in R (this may require setting up a Kaggle account if you don’t already have one).
b. Examine the dataset and see if you need to transform any of the features=columns (e.g., are there factors that were not recognized as such, is there missing data?).
c. Randomly shuffle the rows and cut the dataset into two pieces with 10000 entries in a data frame called train and the remaining entries in a data frame called valid.
d. Study the train data frame and see if you can find any features that predict whether or not an employee will leave.
e. Make a hypothesis about how you can predict whether an employee will leave by studying the train data.
f. Once you have fixed this hypothesis evaluate how well your criteria work on the valid data frame.
g. Justify your proposal with data and charts. Save at least one of these charts to a pdf file to share with management.

## Introduction

I will assume that you have seen some calculus, including multivariable calculus. That is you know how to differentiate a differentiable function $f\colon \Bbb R \to \Bbb R$, to obtain a new function $$\frac{\partial f}{\partial x} \colon \Bbb R \to \Bbb R.$$ You also know how to differentiate a multivariable function $f\colon \Bbb R^m \to \Bbb R^n$, to obtain a map $$Jf\colon \Bbb R^m \to \mathrm{Hom}(\Bbb R^m, \Bbb R^n),$$ which takes a point $x\in \Bbb R^m$ to the Jacobian at $x$, which is an $(m,n)$-matrix; alternatively you can think of the Jacobian as a matrix of functions with the $(i,j)$-entry being $\frac{\partial f_i}{\partial x_j}\colon \Bbb R^m \to \Bbb R.$

The Jacobian is a convenient form for packaging the derivative of a vector valued function with respect to a list of variables arranged into a vector. Many of us learn the basic results (the product rule, the sum rule, the chain rule) very well for single variable calculus and, hopefully, pretty well for multivariable calculus. We may have less experience using these compact forms in practice.

While this knowledge is very useful for considering vector valued functions indexed by vectors of variables, it can be a bit tricky to differentiate matrix valued functions with respect to matrices or vectors of variables and get a useful compact form.

### Potential objection

A bright student might observe that an element of a matrix is an element of $\Bbb R^m\otimes \Bbb R^n\cong \Bbb R^{mn}$ and hence we can always regard an $(m,n)$-matrix as an $mn$-vector1. This is true, but we usually organize data into matrices because that is the most natural structure for them to have, e.g., the Jacobian above. Here are some similar naturally occurring structures:

1. We have a table with 100 entries. On each row is data from some sample, say temperature and pressure. This naturally fits into a $(100,2)$-matrix, whose rows are the samples and whose columns are the respective measurements. So we can recover each datum, by using two numbers: the row index and the column index.
2. We have a list of 100 grayscale images 640×480 images. To recover a datum (the darkness of a pixel in a sample) in this table, we would use three integers:
• the image number,
• the $x$-coordinate, and
• the $y$-coordinate.
This is encoded as a tensor of shape $(100,640,480)$, i.e., an element of $\Bbb R^{100}\otimes \Bbb R^{640} \otimes \Bbb R^{480}$. While we could encode this as a $307200000$-vector, it would make it much more difficult to understand and meaningfully manipulate.
3. We have a list of 100 color (i.e., RGB) images 640×480 images. To recover a datum (the strength of a particular color (red/green/blue) of a pixel in a sample) in this table, we would use four integers:
• the image number,
• the $x$-coordinate,
• the $y$-coordinate, and
• which color we are interested in (e.g., red = 0, green = 1, blue = 2).
This is encoded as a tensor of shape $(100,640,480,3)$, i.e., an element of $\Bbb R^{100}\otimes \Bbb R^{640} \otimes \Bbb R^{480}\otimes \Bbb R^3$.
4. We have a list of 100 color (i.e., RGB) videos each consisting of 18000 frames of 640×480 images. To recover a datum (the strength of a particular color (red/green/blue) of a pixel in a sample at a particular frame in the table) in this table, we would use five integers:
• the image number,
• the frame number,
• the $x$-coordinate,
• the $y$-coordinate, and
• which color we are interested in (e.g., red = 0, green = 1, blue = 2).
This is encoded as a tensor of shape $(100,18000,640,480,3)$, i.e., an element of $\Bbb R^{100}\otimes \Bbb R^{18000}\otimes \Bbb R^{640} \otimes \Bbb R^{480}\otimes \Bbb R^3$. Again, it would be very unnatural to work with this as a single vector.

### Definition

Suppose we have a function $F$ taking values in tensors of shape $(n_1,\cdots, n_\ell)$ which depends on variables indexed by a tensor $X$ of shape $(m_1,\cdots, m_k)$. The derivative $\frac{\partial F}{\partial X}$ is a function from $\Bbb R^{m_1}\otimes \cdots \otimes \Bbb R^{m_k}$ to tensors of shape $(n_1,\cdots,n_\ell,m_1,\cdots,m_k)$ whose $(i_1,\cdots, i_\ell, j_1, \cdots, j_k)$-entry is $\frac{\partial F_{i_1,\cdots, i_\ell}}{\partial X_{j_1,\cdots, j_k}}(-)$.

### Some scandalous abuses of notation

We will typically abuse notation and let $n$-vectors also denote column vectors, i.e., tensors of shape $(n,1)$.

For convenience, many authors (including this one), will sometimes take tensors of the form $(\cdots, 1, \cdots)$ and regard them as tensors of the form $(\cdots,\cdots)$ (i.e., we omit the 1 term by ‘squeezing the tensor’. This abuse of notation is both convenient and confusing. Avoiding this abuse tends to cause ones to accumulate in the shape of our tensors and these one’s do not add any value to our interpretation. The best way to follow what is going on is to check it for yourself.

A less scandalous abuse of notation is to regularly add transposes to tensors of $(1,1)$ when convenient.

## Some results

Since the only way to get used to tensor differentiation is to do some work with it yourself, we leave the following results as exercises.

### Exercise

1. Suppose that $f\colon \Bbb R^m\to \Bbb R^n$ is a smooth function depending on an $m$-vector $X$ of variables. Show that $\frac{\partial f}{\partial X}=Jf$.
2. Suppose that $f\colon \Bbb R^m\to \Bbb R^n$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=Ax$ for an $(n,m)$-matrix $A$. Show that $\frac{\partial f}{\partial X}=A$ (a constant function).
3. Suppose that $f\colon \Bbb R^m\to \Bbb R^n$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=x^TA$ for an $(m,n)$-matrix $A$. Show that $\frac{\partial f}{\partial X}$ is a function valued in $(1,n,m)$-tensors whose $(1,i,j)$ entry is $(A^T)_{i,j}$.
4. Suppose that $f\colon \Bbb R^m\to \Bbb R$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=x^TA x$ for an $(m,m)$-matrix $A$. Show that $\frac{\partial f}{\partial X}$ is a function valued in $(1,m)$-tensors, whose $(1,i)$-entry is $((A+A^T)x)_i$.
5. Suppose that $f\colon \Bbb R^m\to \Bbb R$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=(Ax-y)^T(Ax-y)$ for an $(m,m)$-matrix $A$ and a constant $(m,1)$-matrix $y$. Show that $\frac{\partial f}{\partial X}$ is a function valued in $(1,m)$-tensors, whose entries give $2(Ax-y)^TA.$
6. Suppose that $f\colon \Bbb R^m\to \Bbb R^m$ is a smooth function depending on an $m$-vector $X$ of variables of the form $f(x)=2(Ax-y)^TA$ for an $(m,m)$-matrix $A$ and a constant $(m,1)$-matrix $y$. Here we regard the function as taking values in $(m)$-tensors. Show that $\frac{\partial f}{\partial X}$ is the $(m,m)$-tensor $2A^TA$.