CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

Assignment #7: Bayesian Statistics and LDA

Notice: This assignment is due Thursday, November 2nd at 11:59pm.

Please submit your solutions via Compass. You should submit your assignment as a typeset PDF. Please do not include scanned or photographed equations as they are difficult for us to grade. Any diagrams you draw should be drawn using vector graphics—please do not hand-draw a figure and include it as an image in your PDF.

1. Priors, Posteriors, and Conjugacy [45 pts]

Note: We are not expecting you to memorize the exact distributions and formulas used in this part of the assignment. Rather, we are trying to get you to understand the definitions of (conjugate) prior distributions and posterior distributions with a more concrete example.

We are expecting you to understand the process here well enough that you could answer similar questions about different distributions on the exam. We would give you all the necessary information in order to perform the computation, just like we have in this question.

Suppose that you read a study that estimates that 10% of the population is left-handed, with a standard deviation of 5%. In your class of 98 students, you observe that 11 of them are left-handed.

Let $\theta$ be the probability of being left-handed in the population (that is, assume left-handedness follows a Bernoulli distribution with parameter $\theta$).
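For reference, the Bernoulli model here is the standard one: each person is left-handed ($x = 1$) with probability $\theta$ and not ($x = 0$) otherwise,

$$P(x \mid \theta) = \theta^{x} (1 - \theta)^{1 - x}, \qquad x \in \{0, 1\}.$$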

  1. [6 pts] If you ignore the study, what is your best guess for $\theta$ based on your population sample? What statistical estimation technique did you use?

  2. You know that the Beta distribution

    $$P(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$

    is a conjugate prior to the Bernoulli (we proved this in class; a worked reminder of the general Beta-Bernoulli update appears at the end of this part).

    1. [2 pts] According to a standard reference, what is the mean of the Beta distribution?

    2. [2 pts] According to a standard reference, what is the variance of the Beta distribution?

    3. [10 pts] Describe how to set the parameters of a Beta distribution to incorporate your prior knowledge from the study, and find those parameters.

  3. [10 pts] What is the posterior mean of $\theta$ if you combine your prior from above with the data you observed in your sample?

  4. Let’s prove that the Gamma distribution

    $$P(\lambda \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \lambda^{\alpha - 1} e^{-\beta\lambda}$$

    is a conjugate prior for the Poisson distribution

    $$P(x \mid \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}.$$

    Let your data be $X = (x_1, x_2, \ldots, x_N)$ and assume each $x_i$ is independent of the others.

    1. [5 pts] What is the likelihood function? (Hint: Try rewriting the likelihood function so that the product only occurs in the denominator.)

    2. [10 pts] You know that the posterior distribution is

      $$P(\lambda \mid X, \alpha, \beta) \propto P(X \mid \lambda)\, P(\lambda \mid \alpha, \beta).$$

      Expand this formula with the definition of the likelihood function you gave above and the definition of the Gamma distribution, and simplify to show that the posterior is also a Gamma distribution. What are the parameters for that new Gamma distribution?
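As a template for this style of derivation, here is the analogous Beta-Bernoulli computation we did in class, restated for convenience. With $k = \sum_{i=1}^{N} x_i$ successes in $N$ Bernoulli trials,

$$P(\theta \mid X, \alpha, \beta) \propto \left[ \prod_{i=1}^{N} \theta^{x_i} (1 - \theta)^{1 - x_i} \right] \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \theta^{\alpha + k - 1} (1 - \theta)^{\beta + N - k - 1},$$

which is proportional to a $\mathrm{Beta}(\alpha + k, \beta + N - k)$ density. Your Gamma-Poisson proof should follow the same pattern: multiply the likelihood by the prior, collect the powers of $\lambda$ and the factors of $e^{-\lambda}$, and read off the parameters of the resulting Gamma distribution.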

2. Graphical Models and Plate Notation [25 pts]

  1. A Naive Bayes classifier is a classification model that assumes that each document $\mathbf{w}_d$ in a corpus is associated with some class label $c_d$. All of the words in the document are then drawn from a unigram language model associated with the class $c_d$. If we smooth these language models with a Dirichlet prior with parameter vector $\beta$, and we assume we have only two classes, we would have the following joint distribution (a code sketch of this generative process appears at the end of this part):

    $$P(\mathbf{W}, \mathbf{C}, \phi_0, \phi_1 \mid \theta, \beta) = P(\phi_0 \mid \beta)\, P(\phi_1 \mid \beta) \prod_{d=1}^{M} \left[ P(c_d \mid \theta) \prod_{i=1}^{N_d} P(w_{d,i} \mid \phi_{c_d}) \right]$$
    1. [5 pts] Draw a two-class Naive Bayes classifier with a $\mathrm{Dirichlet}(\beta)$ prior on the unigram language models using plate notation. Use the joint distribution given above to help guide your diagramming.

    2. [5 pts] What are the random variables being modeled? What are the parameters of the model?

  2. Suppose you have a dataset where each document is associated with a real-valued response variable $y_d$. For example, if the dataset consists of movie reviews, the response variable might be the average rating.

    1. [5 pts] Design a modified LDA model for such a review dataset. Describe the generative process (how each random variable is generated) in words (see section 2 of the accompanying course note for an example of what we’re looking for).

    2. [5 pts] Draw your modified model using plate notation.

    3. [5 pts] Write down the joint distribution of the random variables for your model according to your plate diagram.
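To make the generative story in part 1 concrete, below is a minimal simulation sketch of the two-class Naive Bayes model. It is illustrative only: the vocabulary size, document count, and hyperparameter values are made up, and a fixed document length is assumed for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustration-only sizes and hyperparameters (not specified by the assignment).
V = 5              # vocabulary size
M = 4              # number of documents
N_d = 10           # words per document (fixed here for simplicity)
beta = np.ones(V)  # symmetric Dirichlet prior over word distributions
theta = 0.5        # class prior parameter: P(c_d = 1)

# Draw one unigram language model per class: phi_c ~ Dirichlet(beta).
phi = [rng.dirichlet(beta), rng.dirichlet(beta)]

docs, labels = [], []
for d in range(M):
    c_d = int(rng.binomial(1, theta))            # c_d ~ Bernoulli(theta)
    words = rng.choice(V, size=N_d, p=phi[c_d])  # w_{d,i} ~ phi_{c_d}
    labels.append(c_d)
    docs.append(words.tolist())

print(labels)
print(docs[0])
```

Each sampling statement mirrors one factor in the joint distribution: the two Dirichlet draws produce $\phi_0$ and $\phi_1$, the Bernoulli draw per document produces $c_d$, and each word draw corresponds to $P(w_{d,i} \mid \phi_{c_d})$.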

3. Latent Dirichlet Allocation [30 pts]

  1. [5 pts] Briefly explain why we cannot perform exact posterior inference in the LDA model.

  2. [5 pts] In plain English, explain the central idea behind variational inference. In math notation, write down the optimization problem to be solved (you do not have to simplify the equation).

  3. [5 pts] Briefly explain the goal of a Markov chain Monte Carlo algorithm.

  4. [5 pts] In plain English, explain how one iteration of Gibbs sampling proceeds. (An illustrative code sketch appears at the end of this part.)

  5. [10 pts] What are the advantages of variational inference over Gibbs sampling? What are the advantages of Gibbs sampling over variational inference?
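For question 4, the following is a minimal sketch of one sweep of collapsed Gibbs sampling for LDA, intended only to illustrate the "resample one variable given all the others" idea. It is not a reference implementation; the corpus, topic count, and hyperparameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustration-only corpus: each document is a list of word ids.
docs = [[0, 1, 1, 3], [2, 2, 3, 0], [1, 3, 3, 4]]
V, K = 5, 2              # vocabulary size and number of topics (made-up values)
alpha, beta = 0.1, 0.01  # symmetric Dirichlet hyperparameters (made-up values)

# Randomly initialize a topic for every token and build the count tables.
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
n_dk = np.zeros((len(docs), K))  # topic counts per document
n_kw = np.zeros((K, V))          # word counts per topic
n_k = np.zeros(K)                # total token count per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        n_dk[d, k] += 1
        n_kw[k, w] += 1
        n_k[k] += 1

# One full Gibbs iteration: resample each token's topic given all others.
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        # Remove this token's current assignment from the counts.
        n_dk[d, k] -= 1
        n_kw[k, w] -= 1
        n_k[k] -= 1
        # Full conditional P(z_{d,i} = k | everything else), with theta and
        # phi integrated out: (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta).
        p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
        k = int(rng.choice(K, p=p / p.sum()))
        # Record the new assignment and add the token back into the counts.
        z[d][i] = k
        n_dk[d, k] += 1
        n_kw[k, w] += 1
        n_k[k] += 1

print(z)  # topic assignments after one iteration
```

Note how each token is first removed from the count tables, then reassigned by sampling from its full conditional, then added back; one iteration of Gibbs sampling is one such pass over every token.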