Assignment #7: Bayesian Statistics and LDA
Notice: This assignment is due Thursday, November 2nd at 11:59pm.
Please submit your solutions via Compass. You should submit your assignment as a typeset PDF. Please do not include scanned or photographed equations as they are difficult for us to grade. Any diagrams you draw should be drawn using vector graphics—please do not handdraw a figure and include it as an image in your PDF.
1. Priors, Posteriors, and Conjugacy [45 pts]
Note: We are not expecting you to memorize the exact distributions and formulas used in this part of the assignment. Rather, we are trying to get you to understand the definitions of (conjugate) prior distributions and posterior distributions with a more concrete example.
We are expecting you to understand the process here well enough that you could answer similar questions about different distributions on the exam. We would give you all the necessary information in order to perform the computation, just like we have in this question.
Suppose that you read a study that estimates that 10% of the population is lefthanded, with a standard deviation of +/ 5%. In your class of 98 students, you observe that 11 of them are lefthanded.
Let $\theta$ be the probability of being lefthanded in the population (that is, assume lefthandedness follows a Bernoulli distribution with parameter $\theta$).

[6 pts] If you ignore the study, what is your best guess for $\theta$ based on your population sample? What statistical estimation technique did you use?
 You know that the Beta distribution
$\displaystyle P(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha  1} (1\theta)^{\beta  1}$
is a conjugate prior to the Bernoulli (we proved this in class).

[2 pts] According to the above reference, what is the mean of the Beta distribution?

[2 pts] According to the above reference, what is the variance of the Beta distribution?

[10 pts] Describe how to set the parameters of a Beta distribution to incorporate your prior knowledge from the study, and find those parameters.


[10 pts] What is the posterior mean of $\theta$ if you incorporate your prior from above with the data you observed in your sample?
 Let’s prove that the Gamma distribution
$\displaystyle P(\lambda \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)} \lambda^{\alpha1}e^{\beta\lambda}$
is a conjugate prior for the Poisson distribution
$\displaystyle P(x \mid \lambda) = \frac{\lambda^x e^{\lambda}}{x!}.$Let your data be $X = (x_1, x_2, \ldots, x_N)$ and assume each $x_i$ is independent from the others.

[5 pts] What is the likelihood function? (Hint: Try rewriting the likelihood function so that the product only occurs in the denominator.)

[10 pts] You know that the posterior distribution is
$\displaystyle P(\lambda \mid X, \alpha, \beta) \propto P(X \mid \lambda) P(\lambda \mid \alpha, \beta).$Expand this formula with the definition of the likelihood function you gave above and the definition of the Gamma distribution, and simplify to show that the posterior is also a Gamma distribution. What are the parameters for that new Gamma distribution?

2. Graphical Models and Plate Notation [25 pts]

A Naive Bayes classifier is a classification model that assumes that each document $\mathbf{w_d}$ in a corpus is associated with some class label $c_d$. All of the words in the document are then drawn from a unigram language model associated with the class $c_d$. If we smooth these language models with a Dirichlet prior with parameter vector $\beta$, and we assume we have only two classes, we would have the following joint distribution:
$\displaystyle P(\mathbf{W}, \mathbf{C}, \phi_0, \phi_1 \mid \theta, \beta) = P(\phi_0 \mid \beta) P(\phi_1 \mid \beta) \prod_{d=1}^M P(c_d \mid \theta) \prod_{i=1}^{N_d} P(w_{d,i} \mid \phi_{c_d})$
[5 pts] Draw a twoclass Naive Bayes classifier with a $Dirichlet(\beta)$ prior on the unigram language models using plate notation. Use the joint distribution given above to help guide your diagramming.

[5 pts] What are the random variables being modeled? What are the parameters of the model?


Suppose you have a dataset where each document is associated with a realvalued response variable $y_d$. For example, if the dataset consists of movie reviews, the response variable might be the average rating.

[5 pts] Design a modified LDA model for such a review dataset. Describe the generative process (how each random variable is generated) in words (see section 2 of this note for an example of what we’re looking for).

[5 pts] Draw your modified model using plate notation.

[5 pts] Write down the joint distribution of the random variables for your model according to your plate diagram.

3. Latent Dirichlet Allocation [30 pts]

[5 pts] Briefly explain why we cannot perform exact posterior inference in the LDA model.

[5 pts] In plain English, explain the central idea behind variational inference. In math notation, write down the optimization problem to be solved (you do not have to simplify the equation).

[5 pts] Briefly explain the goal of a Markov chain Monte Carlo algorithm.

[5 pts] In plain English, explain how one iteration of Gibbs sampling proceeds.

[10pts] What are the advantages of variational inference over Gibbs sampling? What are the advantages of Gibbs sampling over variational inference?