# CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

## Assignment #6: Mixture Models and EM Algorithm

Notice: This assignment is due Thursday, October 26th at 11:59pm.

Please submit your solutions via Compass. You should submit your assignment as a typeset PDF. Please do not include scanned or photographed equations as they are difficult for us to grade.

### 1. Mixture Models [35 pts]

Author $H$ and author $T$ are co-authoring a paper in the following way:

1. Each word is written independently.

2. When writing a word, they would first toss a coin to decide who will write the word. The coin is known to show up as “heads” 80% of the time. If the coin shows up as “heads”, then author $H$ would write the word, otherwise, author $T$ would write the word.

3. If it is author $H$’s turn to write, he would “write” the word by simply drawing a word according to word distribution $p(w \mid H)$. Similarly, if it is author $T$’s turn to write, he would “write” the word by drawing a word according to word distribution $p(w \mid T)$.

Suppose the two distributions $p(w \mid H)$ and $p(w \mid T)$ are defined as follows:

| Word $w$ | $p(w \mid H)$ | $p(w \mid T)$ |
|----------|---------------|---------------|
| the      | 0.3           | 0.3           |
| computer | 0.1           | 0.2           |
| data     | 0.1           | 0.1           |
| baseball | 0.2           | 0.1           |
| game     | 0.2           | 0.1           |

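The two-step generative process above can be sketched in Python. This is a minimal illustration, not part of the assignment; the helper names `p_word` and `sample_word` are my own.

```python
import random

P_HEADS = 0.8  # probability the coin comes up "heads" (author H writes)

# Word distributions from the table. Note they sum to less than 1,
# so the full vocabulary presumably contains words not listed here.
p_H = {"the": 0.3, "computer": 0.1, "data": 0.1, "baseball": 0.2, "game": 0.2}
p_T = {"the": 0.3, "computer": 0.2, "data": 0.1, "baseball": 0.1, "game": 0.1}

def p_word(w):
    """Marginal probability of writing word w: sum over the two authors."""
    return P_HEADS * p_H[w] + (1 - P_HEADS) * p_T[w]

def sample_word(rng):
    """Draw one word following the two-step process: first the coin
    chooses the author, then that author's distribution emits a word."""
    dist = p_H if rng.random() < P_HEADS else p_T
    words, probs = zip(*dist.items())
    return rng.choices(words, weights=probs)[0]
```

Because each word is written independently, `p_word` gives the probability for every position in the paper, not just the first.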
1. [5 pts] What is the probability that they would write “the” as the first word of the paper? Show your calculation.

2. [5 pts] What is the probability that they would write “the” as the second word of the paper? Show your calculation.

3. [10 pts] Suppose we observe that the first word they wrote is “data”. What is the probability that this word was written by author $H$? Show your calculation.

4. [10 pts] Imagine that we observe a very long paper written by them (e.g., with more than 10,000 words). Among the 5 words shown in the table above (i.e., “the”, “computer”, “data”, “baseball”, “game”), which one would you expect to occur least frequently in the paper? Briefly explain why.

5. [5 pts] Suppose we don’t know $p(w \mid H)$, but observed a paper $D$ known to be written solely by author $H$. That is, the coin somehow always showed up as “heads” when they wrote the paper. Suppose $D=$“the computer data the computer game the computer data game” and we would like to use the maximum likelihood estimator to estimate $p(w \mid H)$. What would be the estimated probabilities of “computer” and “game”, respectively? Show your calculations.

### 2. EM Algorithm [35 pts]

The two-component mixture model we discussed for feedback can also be used to estimate the redundancy of one document with respect to another. Specifically, given two documents $D_1$ and $D_2$, and a background language model $p(w \mid C)$, we can use the maximum likelihood estimator to estimate a unigram language model based on $D_1$, which will be denoted by $\theta_1$ (i.e., $p(w \mid \theta_1)=\frac{c(w,D_1)}{|D_1|}$). Now, we can assume that document $D_2$ is generated by sampling words from a two-component multinomial mixture model where one component is $p(w \mid \theta_1)$, and the other is $p(w \mid C)$. Let $\lambda$ denote the probability that $p(w \mid \theta_1)$ would be selected to generate a word in $D_2$ (thus, $1-\lambda$ would be the probability of selecting the background model $p(w \mid C)$). Let $D_2=(w_1, w_2, \ldots, w_k)$ where $w_i$ is a word in our vocabulary set $V$. We can then use the mixture model to fit $D_2$ and compute the maximum likelihood estimate of $\lambda$, which can then be used to measure the redundancy of $D_2$ with respect to $D_1$. We can use the EM algorithm to compute the maximum likelihood estimate.
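The estimation procedure described above can be sketched as follows. This is an illustrative implementation under the stated model, not a solution to the questions below; the function names and the fixed iteration count are my own choices.

```python
def mle_lm(doc):
    """Maximum likelihood unigram model: p(w | theta) = c(w, D) / |D|."""
    counts = {}
    for w in doc:
        counts[w] = counts.get(w, 0) + 1
    n = len(doc)
    return {w: c / n for w, c in counts.items()}

def em_lambda(d2_counts, theta1, p_bg, n_iters=100, lam=0.5):
    """EM estimate of lambda, the mixing weight of p(w | theta_1),
    when fitting the two-component mixture to D_2.

    d2_counts: dict mapping word -> count c(w, D_2)
    theta1:    dict giving p(w | theta_1), estimated from D_1
    p_bg:      dict giving the background model p(w | C)
    """
    total = sum(d2_counts.values())  # document length |D_2|
    for _ in range(n_iters):
        # E-step: posterior probability that an occurrence of w was
        # generated by theta_1 rather than by the background model
        post = {w: lam * theta1[w] / (lam * theta1[w] + (1 - lam) * p_bg[w])
                for w in d2_counts}
        # M-step: lambda = expected fraction of D_2's words from theta_1
        lam = sum(c * post[w] for w, c in d2_counts.items()) / total
    return lam
```

A large estimated $\lambda$ means most of $D_2$ is well explained by $D_1$'s language model, i.e., $D_2$ is highly redundant with respect to $D_1$.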

1. [5 pts] Write down the formula to compute the probability of generating a word $w_i$ in document $D_2$ from such a mixture model.

2. [5 pts] Write down the log-likelihood of the whole document $D_2$, i.e., the log of the probability of observing all the words in $D_2$ being generated from the mixture model.

3. [10 pts] How many binary hidden variables in total do we need for computing this maximum likelihood estimate using the EM algorithm? Why?

4. [15 pts] Write down the E-step and M-step updating formulas for estimating $\lambda$.

### 3. PLSA (Open Research Question) [30 pts]

Suppose you are given a large corpus of sentences, where each sentence talks about exactly one tourist site from either Seattle or Chicago. For example, one sentence might talk about the “Willis Tower” in Chicago, whereas another sentence may talk about “Mount Rainier” in Seattle. Your task is to classify each sentence into one of the following classes:

• Class “Chicago”: if the sentence is talking about a tourist spot from Chicago
• Class “Seattle”: if the sentence is talking about a tourist spot from Seattle

However, one major problem is that you have not been given any labeled training data to train a supervised classifier. Also, the corpus is so large that it is impractical for a human to go through it and annotate the data manually. This limits the applicability of supervised classification methods for this problem. In the absence of supervised labels, our only option is to resort to unsupervised techniques like topic modeling.

The classification task is a lot easier if a sentence explicitly mentions the name of the city, i.e., “Chicago” or “Seattle”. In that case, let’s assume the classification task is trivial and we can label the sentence with the name of the city it mentions. However, let’s assume that a large number of sentences don’t mention the name of the city explicitly. Given this scenario, answer the following questions.

1. [5 pts] Can we exploit PLSA to solve this classification problem? How? [Answer within 3 sentences max]

2. [5 pts] Can we also exploit the fact that some sentences explicitly mention the name of the city? How can we use it in the PLSA model? [Answer within 3 sentences max]

3. [10 pts] Write five lines (max) of pseudocode that extend PLSA to solve the above-mentioned classification problem. [A natural English description is fine]

4. [5 pts] How would you evaluate your proposed classification method?

5. [5 pts] Mention one case where your proposed method is expected to perform worse than its average-case performance. Does it have anything to do with the assumptions you are making in your method?