Assignment #4: Language Models for Retrieval, Mixture Models and the EM Algorithm

Notice: This assignment is due Saturday, October 1st at 11:59pm Saturday, October 8th at 11:59pm.

1. KL-Divergence Retrieval Function [10 pts]

Show that the KL-divergence retrieval function covers the query likelihood retrieval function as a special case if we set the query language model to the empirical word distribution in the query (i.e., $p(w\mid\theta_Q) = \frac{c(w,Q)}{|Q|}$, where $c(w,Q)$ is the count of word $w$ in query $Q$, and $|Q|$ is the length of the query).

2. Language Models for Boolean Queries [25 pts]

Intuitively, the semantic structure of a query should affect how we score a document. The two basic semantic structures are conjunctive (i.e., “AND”) and disjunctive (i.e., “OR”) expressions. For a conjunctive query such as $Q = w_1 \text{ AND } w_2$, we would favor a document that matches both terms $w_1$ and $w_2$ much more to one that matches just one of them, while for a disjunctive query such as $Q=w_1 \text{ OR } w_2$, we would not discriminate them so much, though matching both terms should still lead to a higher score than matching just one of them.

1. What is the implied query semantic structure in the basic multinomial query likelihood retrieval model? Is it disjunctive or conjunctive?

2. Now consider general Boolean queries (without negation) of the form of $Q= Q_1 \text{ AND } Q_2 \ldots \text{ AND } Q_k$, where each $Q_i$ is a disjunctive “subquery” $Q_i = w_{i,1} \text{ OR } \ldots w_{i,n_i}$. How can we extend the basic multinomial query likelihood retrieval model to score documents for such Boolean queries? Explain your ideas with formulas.

3. Mixture Models [30 pts]

Author $H$ and author $T$ are co-authoring a paper in the following way:

1. Each word is written independently.

2. When writing a word, they would first toss a coin to decide who will write the word. The coin is known to show up as “heads” 80% of the time. If the coin shows up as “heads”, then author $H$ would write the word, otherwise, author $T$ would write the word.

3. If it is author $H$’s turn to write, he would “write” the word by simply drawing a word according to word distribution $p(w \mid H)$. Similarly, if it is author $T$’s turn to write, he would “write” the word by drawing a word according to word distribution $p(w \mid T)$.

Suppose the two distributions $p(w \mid H)$ and $p(w \mid T)$ are defined as follows:

Word $w$ $p(w \mid H)$ $p(w \mid T)$
the 0.3 0.3
computer 0.1 0.2
data 0.1 0.1
baseball 0.2 0.1
game 0.2 0.1
1. [5 pts] What is the probability that they would write “the” as the first word of the paper? Show your calculation.

2. [5 pts] What is the probability that they would write “the” as the second word of the paper? Show your calculation.

3. [10 pts] Suppose we observe that the first word they wrote is “data”, what is the probability that this word was written by author H? Show your calculation.

4. [5 pts] Imagine that we observe a very long paper written by them (e.g., with more than 10,000 words). Among the 5 words shown in the table above (i.e., “the”, “computer”, “data”, “baseball”, “game”), which one would you expect to occur least frequently in the paper? Briefly explain why.

5. [5 pts] Suppose we don’t know $p(w \mid H)$, but observed a paper $D$ known to be written solely by author $H$. That is, the coin somehow always showed up as “heads” when they wrote the paper. Suppose $D=$“the computer data the computer game the computer data game” and we would like to use the maximum likelihood estimator to estimate $p(w \mid H)$. What would be the estimated probabilities of “computer” and “game”, respectively? Show your calculations.

4. EM Algorithm [35 pts]

The two-component mixture model we discussed for feedback can also be used to estimate the redundancy of one document with respect to another. Specifically, given two documents $D_1$ and $D_2$, and a background language model $p(w \mid C)$, we can use the maximum likelihood estimator to estimate a unigram language model based on $D_1$, which will be denoted by $\theta_1$ (i.e., $p(w \mid \theta_1)=\frac{c(w,D_1)}{|D_1|})$. Now, we can assume that document $D_2$ is generated by sampling words from a two-component multinomial mixture model where one component is $p(w \mid \theta_1)$, and the other is $p(w \mid C)$. Let $\lambda$ denote the probability that $p(w \mid \theta_1)$ would be selected to generate a word in $D_2$ (thus, $1-\lambda$ would be the probability of selecting the background model $p(w \mid C)$). Let $D_2=(w_1, w_2, \ldots, w_k)$ where $w_i$ is a word in our vocabulary set $V$. We can then use the mixture model to fit $D_2$ and compute the maximum likelihood estimate of $\lambda$, which can then be used to measure the redundancy of $D_2$ with respect to $D_1$. We can use the EM algorithm to compute the maximum likelihood estimate.

1. [5 pts] Write down the formula to compute the probability of generating a word $w_i$ in document $D_2$ from such a mixture model.

2. [5 pts] Write down the log-likelihood of the whole document $D_2$, i.e., the probability of observing all the words in $D_2$ being generated from the mixture model.

3. [5 pts] How many binary hidden variables in total do we need for computing this maximum likelihood estimate using the EM algorithm? Why?

4. [20 pts] Write down the E-step and M-step updating formulas for estimating $\lambda$.