# Problem Set 2

This is to be completed by November 2nd, 2017.

### Exercises

1. Datacamp
• Complete the lesson “Data Visualization in R”.
2. Probabilities are sensitive to the form of the question that was used to generate the answer:
• (Source: Minka, Murphy.) My neighbor has two children. Assuming that the gender of a child is like a coin flip, it is most likely, a priori, that my neighbor has one boy and one girl, with probability 1/2. The other possibilities—two boys or two girls—have probabilities 1/4 and 1/4.
a. Suppose I ask him whether he has any boys, and he says yes. What is the probability that one child is a girl?
b. Suppose instead that I happen to see one of his children run by, and it is a boy. What is the probability that the other child is a girl?
3. Legal reasoning
• (Source: Peter Lee, Murphy.) Suppose a crime has been committed. Blood is found at the scene for which there is no innocent explanation. It is of a type which is present in 1% of the population.
a. The prosecutor claims: “There is a 1% chance that the defendant would have the crime blood type if he were innocent. Thus there is a 99% chance that he is guilty”. This is known as the prosecutor’s fallacy. What is wrong with this argument?
b. The defender claims: “The crime occurred in a city of 800,000 people. The blood type would be found in approximately 8000 people. The evidence has provided a probability of just 1 in 8000 that the defendant is guilty, and thus has no relevance.” This is known as the defender’s fallacy. What is wrong with this argument?
4. Bayes rule for medical diagnosis
• (Source: Koller, Murphy.) After your yearly checkup, the doctor has bad news and good news. The bad news is that you tested positive for a serious disease, and that the test is 99% accurate (i.e., the probability of testing positive given that you have the disease is 0.99, as is the probability of testing negative given that you don’t have the disease). The good news is that this is a rare disease, striking only one in 10,000 people.
a. What are the chances that you actually have the disease? (Show your calculations as well as giving the final result.)
5. Conditional independence (Source: Koller.)
• Let $H\in \{1,\ldots, K\}$ be a discrete random variable, and let $e_1$ and $e_2$ be the observed values of two other random variables $E_1$ and $E_2$. Suppose we wish to calculate the vector
$$P(H|e_1, e_2) = (P(H=1|e_1,e_2),\ldots, P(H=K|e_1, e_2)).$$
a. Which of the following sets of numbers are sufficient for the calculation?
i. $P(e_1, e_2), P(H), P(e_1| H), P(e_2, H)$.
ii. $P(e_1, e_2), P(H), P(e_1, e_2 | H)$.
iii. $P(e_1|H), P(e_2|H), P(H)$.

b. Now suppose we assume $E_1\perp E_2 | H$ (i.e., $E_1$ and $E_2$ are independent given $H$). Which of the above 3 sets are sufficient now?
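
As a notation reminder for part (b), the assumption $E_1\perp E_2 | H$ is usually written out as the factorization below. The symbols $a$, $b$, and $h$ are introduced here only to range over the possible values of $E_1$, $E_2$, and $H$; they are not part of the original problem statement.

$$P(E_1 = a, E_2 = b | H = h) = P(E_1 = a | H = h)\,P(E_2 = b | H = h) \quad \text{for all } a, b, h.$$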

6. R lab
• Estimate the value of $\pi$ by taking uniform random samples from the square $[-1,1]\times [-1,1]$ and seeing which lie in the disc $x^2+y^2\leq 1$.
• A company is trying to determine why their employees leave and why they stay. They have a list of roughly 15000 employee records here.
a. Download this dataset and load it in R (this may require setting up a Kaggle account if you don’t already have one).
b. Examine the dataset and see whether any of the features (columns) need to be transformed (e.g., are there factors that were not recognized as such, or is there missing data?).
c. Randomly shuffle the rows and cut the dataset into two pieces with 10000 entries in a data frame called train and the remaining entries in a data frame called valid.
d. Study the train data frame and see if you can find any features that predict whether or not an employee will leave.
e. Make a hypothesis about how you can predict whether an employee will leave by studying the train data.
f. Once you have fixed this hypothesis, evaluate how well your criteria work on the valid data frame.
g. Justify your proposal with data and charts. Save at least one of these charts to a PDF file to share with management.
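
The mechanical parts of the R lab (the Monte Carlo estimate, the shuffle and split, and saving a chart to a file) might be sketched roughly as follows. This is a minimal sketch, not a reference solution. The function name `estimate_pi`, the sample size, the seed, and the output file name are arbitrary choices. The data frame `employees` and its columns `score` and `left` are synthetic placeholders, since the actual file and column names are not given above; replace them with the downloaded data. The sketch deliberately leaves the hypothesis and its evaluation (parts d to f) open.

```r
# Part 1: Monte Carlo estimate of pi.
# Sample n points uniformly from the square [-1, 1] x [-1, 1]. The fraction
# that lands in the unit disc approximates (disc area) / (square area) = pi / 4.
estimate_pi <- function(n) {
  x <- runif(n, min = -1, max = 1)
  y <- runif(n, min = -1, max = 1)
  4 * mean(x^2 + y^2 <= 1)
}

set.seed(1)                  # arbitrary seed, only for reproducibility
pi_hat <- estimate_pi(1e6)   # more samples give a tighter estimate

# Part 2: shuffle the rows, split into train/valid, and save one chart.
# ASSUMPTION: `employees` is a synthetic stand-in for the real dataset;
# `score` and `left` are hypothetical column names, not the real ones.
n_rows <- 15000
employees <- data.frame(
  score = runif(n_rows),                          # hypothetical numeric feature
  left  = rbinom(n_rows, size = 1, prob = 0.25)   # hypothetical 0/1 outcome
)

# sample(n) returns a random permutation of 1..n, so indexing with it
# shuffles the rows without dropping or repeating any of them.
shuffled <- employees[sample(nrow(employees)), ]
train <- shuffled[1:10000, ]
valid <- shuffled[10001:nrow(shuffled), ]

# Open a PDF graphics device, draw the plot, then close the device so the
# file is written. The file name is hypothetical; tempdir() keeps the sketch
# from writing into the working directory.
chart_file <- file.path(tempdir(), "train_chart.pdf")
pdf(chart_file)
boxplot(score ~ left, data = train,
        xlab = "left (placeholder)", ylab = "score (placeholder)")
dev.off()
```

Closing the device with `dev.off()` is what flushes the chart to disk; forgetting it is a common reason for an empty or unreadable PDF.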