CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

Assignment #6: Mixture Models and EM Algorithm

Notice: This assignment is due Thursday, October 26th at 11:59pm.

Please submit your solutions via Compass. You should submit your assignment as a typeset PDF. Please do not include scanned or photographed equations as they are difficult for us to grade.

1. Mixture Models [35 pts]

Author H and author T are co-authoring a paper in the following way:

  1. Each word is written independently.

  2. When writing a word, they would first toss a coin to decide who will write the word. The coin is known to show up as “heads” 80% of the time. If the coin shows up as “heads”, then author H would write the word; otherwise, author T would write the word.

  3. If it is author H’s turn to write, he would “write” the word by simply drawing a word according to the word distribution $p(w \mid H)$. Similarly, if it is author T’s turn to write, he would “write” the word by drawing a word according to the word distribution $p(w \mid T)$.

Suppose the two distributions $p(w \mid H)$ and $p(w \mid T)$ are defined as follows:

Word $w$      $p(w \mid H)$   $p(w \mid T)$
the           0.3             0.3
computer      0.1             0.2
data          0.1             0.1
baseball      0.2             0.1
game          0.2             0.1
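
For intuition, the generative process described above can be simulated directly. Below is a minimal sketch in Python (the variable and function names are illustrative only), using the 80% coin bias and the two word distributions from the table:

    import random

    # Word distributions from the table above.
    p_w_given_H = {"the": 0.3, "computer": 0.1, "data": 0.1, "baseball": 0.2, "game": 0.2}
    p_w_given_T = {"the": 0.3, "computer": 0.2, "data": 0.1, "baseball": 0.1, "game": 0.1}
    p_heads = 0.8  # probability the coin shows "heads", i.e., author H writes the word

    def write_word():
        """Generate one word: toss the coin, then draw from the chosen author's distribution."""
        dist = p_w_given_H if random.random() < p_heads else p_w_given_T
        words, probs = zip(*dist.items())
        return random.choices(words, weights=probs, k=1)[0]

    # Simulate a long paper and inspect the empirical word frequencies.
    paper = [write_word() for _ in range(10000)]
    for w in p_w_given_H:
        print(w, paper.count(w) / len(paper))
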
  1. [5 pts] What is the probability that they would write “the” as the first word of the paper? Show your calculation.

  2. [5 pts] What is the probability that they would write “the” as the second word of the paper? Show your calculation.

  3. [10 pts] Suppose we observe that the first word they wrote is “data”. What is the probability that this word was written by author H? Show your calculation.

  4. [10 pts] Imagine that we observe a very long paper written by them (e.g., with more than 10,000 words). Among the 5 words shown in the table above (i.e., “the”, “computer”, “data”, “baseball”, “game”), which one would you expect to occur least frequently in the paper? Briefly explain why.

  5. [5 pts] Suppose we don’t know $p(w \mid H)$, but have observed a paper $D$ known to be written solely by author H. That is, the coin somehow always showed up as “heads” when they wrote the paper. Suppose $D =$ “the computer data the computer game the computer data game” and we would like to use the maximum likelihood estimator to estimate $p(w \mid H)$. What would be the estimated probabilities of “computer” and “game”, respectively? Show your calculations.

2. EM Algorithm [35 pts]

The two-component mixture model we discussed for feedback can also be used to estimate the redundancy of one document with respect to another. Specifically, given two documents $D_1$ and $D_2$, and a background language model $p(w \mid C)$, we can use the maximum likelihood estimator to estimate a unigram language model based on $D_1$, which will be denoted by $\theta_1$ (i.e., $p(w \mid \theta_1) = \frac{c(w, D_1)}{|D_1|}$). Now, we can assume that document $D_2$ is generated by sampling words from a two-component multinomial mixture model where one component is $p(w \mid \theta_1)$ and the other is $p(w \mid C)$. Let $\lambda$ denote the probability that $p(w \mid \theta_1)$ would be selected to generate a word in $D_2$ (thus, $1 - \lambda$ would be the probability of selecting the background model $p(w \mid C)$). Let $D_2 = (w_1, w_2, \ldots, w_k)$ where $w_i$ is a word in our vocabulary set $V$. We can then use the mixture model to fit $D_2$ and compute the maximum likelihood estimate of $\lambda$, which can then be used to measure the redundancy of $D_2$ with respect to $D_1$. We can use the EM algorithm to compute this maximum likelihood estimate.
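
For concreteness, the maximum likelihood estimate $p(w \mid \theta_1) = \frac{c(w, D_1)}{|D_1|}$ is just normalized word counting. The following is a minimal sketch in Python (the names and the example document are hypothetical, for illustration only); estimating $\lambda$ itself is the subject of the questions below:

    from collections import Counter

    def mle_unigram(doc_words):
        """Maximum likelihood unigram model: p(w | theta) = c(w, D) / |D|."""
        counts = Counter(doc_words)
        total = len(doc_words)
        return {w: c / total for w, c in counts.items()}

    # Hypothetical example document D_1 (illustration only).
    d1 = "information retrieval ranks documents by relevance to information needs".split()
    theta_1 = mle_unigram(d1)
    print(theta_1["information"])  # c("information", D_1) / |D_1| = 2/9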

  1. [5 pts] Write down the formula to compute the probability of generating a word $w_i$ in document $D_2$ from such a mixture model.

  2. [5 pts] Write down the log-likelihood of the whole document $D_2$, i.e., the log of the probability of observing all the words in $D_2$ being generated from the mixture model.

  3. [10 pts] How many binary hidden variables in total do we need for computing this maximum likelihood estimate using the EM algorithm? Why?

  4. [15 pts] Write down the E-step and M-step updating formulas for estimating $\lambda$.

3. PLSA (Open Research Question) [30 pts]

Suppose you are given a large corpus of sentences where each sentence talks about exactly one tourist site from either the city of Seattle or Chicago. For example, one sentence might talk about the “Willis Tower” in Chicago, whereas another sentence may talk about “Mount Rainier” in Seattle. Your task is to classify each sentence into one of two classes: Seattle or Chicago.

However, one major problem is that you have not been given any labeled training data to train a supervised classifier. Also, the corpus is so large that it is hard for a human to go through it and annotate the data manually. This limits the applicability of supervised classification methods for this problem. In the absence of supervised labels, our only option is to resort to unsupervised techniques such as topic modeling.

The classification task is much easier if a sentence explicitly mentions the name of the city (i.e., “Chicago” or “Seattle”). In that case, let’s assume that the classification task is trivial and we can simply label the sentence with the name of the city explicitly mentioned in it. However, let’s assume that a large number of sentences don’t mention the name of the city explicitly. Given this scenario, answer the following questions.

  1. [5 pts] Can we exploit PLSA to solve this classification problem? How? [Answer within 3 sentences max]

  2. [5 pts] Can we also exploit the fact that some sentences explicitly mention the name of the city? How can we use it in the PLSA model? [Answer within 3 sentences max]

  3. [10 pts] Write pseudocode (five lines max) that extends PLSA to solve the above-mentioned classification problem. [A natural English description is fine]

  4. [5 pts] How would you evaluate your proposed classification method?

  5. [5 pts] Describe one case where your proposed method is expected to perform worse than its average-case performance. Does this have anything to do with the assumptions you make in your method?