CS 598cxz Fall 2016

Assignment #4: Language Models for Retrieval, Mixture Models and the EM Algorithm

Notice: The due date of this assignment has been extended from Saturday, October 1st at 11:59pm to Saturday, October 8th at 11:59pm.

Please submit your solutions via Compass. You should submit your assignment as a PDF.

1. KL-Divergence Retrieval Function [10 pts]

Show that the KL-divergence retrieval function covers the query likelihood retrieval function as a special case if we set the query language model to the empirical word distribution in the query (i.e., $p(w \mid \theta_Q) = \frac{c(w,Q)}{|Q|}$, where $c(w,Q)$ is the count of word $w$ in query $Q$, and $|Q|$ is the length of the query).
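For reference, the KL-divergence retrieval function (stated here under the usual convention from lecture) ranks a document $D$ by the negative KL-divergence between the query model $\theta_Q$ and a smoothed document model $\theta_D$:

```latex
\mathrm{score}(Q, D)
  \;=\; -D(\theta_Q \,\|\, \theta_D)
  \;=\; \sum_{w \in V} p(w \mid \theta_Q) \log p(w \mid \theta_D)
  \;-\; \sum_{w \in V} p(w \mid \theta_Q) \log p(w \mid \theta_Q).
```

Your task is to show what this expression becomes when $p(w \mid \theta_Q)$ is set to the empirical query distribution above.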

2. Language Models for Boolean Queries [25 pts]

Intuitively, the semantic structure of a query should affect how we score a document. The two basic semantic structures are conjunctive (i.e., “AND”) and disjunctive (i.e., “OR”) expressions. For a conjunctive query such as $Q = w_1 \text{ AND } w_2$, we would favor a document that matches both terms $w_1$ and $w_2$ much more than one that matches just one of them, while for a disjunctive query such as $Q = w_1 \text{ OR } w_2$, we would not discriminate as much, though matching both terms should still lead to a higher score than matching just one of them.

  1. What is the implied query semantic structure in the basic multinomial query likelihood retrieval model? Is it disjunctive or conjunctive?

  2. Now consider general Boolean queries (without negation) of the form $Q = Q_1 \text{ AND } Q_2 \text{ AND } \ldots \text{ AND } Q_k$, where each $Q_i$ is a disjunctive “subquery” $Q_i = w_{i,1} \text{ OR } \ldots \text{ OR } w_{i,n_i}$. How can we extend the basic multinomial query likelihood retrieval model to score documents for such Boolean queries? Explain your ideas with formulas.
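As a concrete baseline for these questions, here is a minimal sketch of the basic multinomial query likelihood scoring function, assuming Jelinek-Mercer smoothing with a fixed interpolation weight (the function name and the toy documents are illustrative, not part of the assignment):

```python
import math

def query_likelihood_score(query, doc, collection, lam=0.5):
    """Log query likelihood log p(Q|D), where each query word is scored by a
    smoothed document model interpolated with the background model p(w|C):
        p_s(w|D) = (1 - lam) * c(w,D)/|D| + lam * p(w|C)
    """
    score = 0.0
    for w in query:
        p_doc = doc.count(w) / len(doc)              # ML estimate c(w,D)/|D|
        p_bg = collection.count(w) / len(collection)  # background p(w|C)
        score += math.log((1 - lam) * p_doc + lam * p_bg)
    return score

# A document matching both query words outscores one matching only one of them.
q = ["computer", "data"]
d_both = ["computer", "data", "game"]
d_one = ["computer", "game", "game"]
coll = d_both + d_one
print(query_likelihood_score(q, d_both, coll) > query_likelihood_score(q, d_one, coll))  # True
```

Note that this baseline treats the query as a flat bag of words; the question asks how the scoring should change once the query carries explicit AND/OR structure.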

3. Mixture Models [30 pts]

Author $H$ and author $T$ are co-authoring a paper in the following way:

  1. Each word is written independently.

  2. When writing a word, they would first toss a coin to decide who will write the word. The coin is known to show up as “heads” 80% of the time. If the coin shows up as “heads”, then author $H$ would write the word; otherwise, author $T$ would write the word.

  3. If it is author $H$’s turn to write, he would “write” the word by simply drawing a word according to word distribution $p(w \mid H)$. Similarly, if it is author $T$’s turn to write, he would “write” the word by drawing a word according to word distribution $p(w \mid T)$.

Suppose the two distributions $p(w \mid H)$ and $p(w \mid T)$ are defined as follows:

| Word $w$ | $p(w \mid H)$ | $p(w \mid T)$ |
|----------|---------------|---------------|
| the      | 0.3           | 0.3           |
| computer | 0.1           | 0.2           |
| data     | 0.1           | 0.1           |
| baseball | 0.2           | 0.1           |
| game     | 0.2           | 0.1           |

  1. [5 pts] What is the probability that they would write “the” as the first word of the paper? Show your calculation.

  2. [5 pts] What is the probability that they would write “the” as the second word of the paper? Show your calculation.

  3. [10 pts] Suppose we observe that the first word they wrote is “data”. What is the probability that this word was written by author $H$? Show your calculation.

  4. [5 pts] Imagine that we observe a very long paper written by them (e.g., with more than 10,000 words). Among the 5 words shown in the table above (i.e., “the”, “computer”, “data”, “baseball”, “game”), which one would you expect to occur least frequently in the paper? Briefly explain why.

  5. [5 pts] Suppose we don’t know $p(w \mid H)$, but observed a paper $D$ known to be written solely by author $H$. That is, the coin somehow always showed up as “heads” when they wrote the paper. Suppose $D =$ “the computer data the computer game the computer data game” and we would like to use the maximum likelihood estimator to estimate $p(w \mid H)$. What would be the estimated probabilities of “computer” and “game”, respectively? Show your calculations.
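The maximum likelihood estimator for a unigram model is just normalized counts, $p(w) = \frac{c(w,D)}{|D|}$. A minimal sketch, applied to a hypothetical document (not the $D$ from the problem):

```python
from collections import Counter

def unigram_mle(doc_words):
    """Maximum likelihood unigram model: p(w) = c(w, D) / |D|."""
    counts = Counter(doc_words)
    n = len(doc_words)
    return {w: c / n for w, c in counts.items()}

# Hypothetical example document for illustration only.
doc = "the game the data game the".split()
model = unigram_mle(doc)
print(model["the"])   # 3/6 = 0.5
print(model["game"])  # 2/6 ~ 0.333
```

The same counting procedure applies to the document $D$ in question 5.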

4. EM Algorithm [35 pts]

The two-component mixture model we discussed for feedback can also be used to estimate the redundancy of one document with respect to another. Specifically, given two documents $D_1$ and $D_2$, and a background language model $p(w \mid C)$, we can use the maximum likelihood estimator to estimate a unigram language model based on $D_1$, which will be denoted by $\theta_1$ (i.e., $p(w \mid \theta_1) = \frac{c(w, D_1)}{|D_1|}$). Now, we can assume that document $D_2$ is generated by sampling words from a two-component multinomial mixture model where one component is $p(w \mid \theta_1)$, and the other is $p(w \mid C)$. Let $\lambda$ denote the probability that $p(w \mid \theta_1)$ would be selected to generate a word in $D_2$ (thus, $1 - \lambda$ would be the probability of selecting the background model $p(w \mid C)$). Let $D_2 = (w_1, w_2, \ldots, w_k)$, where $w_i$ is a word in our vocabulary set $V$. We can then use the mixture model to fit $D_2$ and compute the maximum likelihood estimate of $\lambda$, which can then be used to measure the redundancy of $D_2$ with respect to $D_1$. We can use the EM algorithm to compute the maximum likelihood estimate.

  1. [5 pts] Write down the formula to compute the probability of generating a word $w_i$ in document $D_2$ from such a mixture model.

  2. [5 pts] Write down the log-likelihood of the whole document $D_2$, i.e., the log of the probability of observing all the words in $D_2$ being generated from the mixture model.

  3. [5 pts] How many binary hidden variables in total do we need for computing this maximum likelihood estimate using the EM algorithm? Why?

  4. [20 pts] Write down the E-step and M-step updating formulas for estimating λ\lambda.
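The maximizer of the likelihood in $\lambda$ can also be found numerically, which gives a way to check the EM updates you derive: they should converge to the same value. A minimal sketch on a hypothetical toy example (the documents, vocabulary, and grid search are illustrative choices, not part of the assignment):

```python
import math
from collections import Counter

def mle_model(doc):
    """Unigram ML estimate p(w | theta) = c(w, D) / |D|."""
    counts = Counter(doc)
    return {w: c / len(doc) for w, c in counts.items()}

def log_likelihood(lam, d2, theta1, bg):
    """log p(D2 | lambda) = sum_i log[ lam * p(w_i|theta1) + (1-lam) * p(w_i|C) ]."""
    return sum(math.log(lam * theta1.get(w, 0.0) + (1 - lam) * bg[w]) for w in d2)

# Hypothetical toy documents and a uniform background model p(w|C).
d1 = "apple banana apple cherry".split()
d2 = "apple apple banana kiwi".split()
vocab = ["apple", "banana", "cherry", "kiwi"]
bg = {w: 1 / len(vocab) for w in vocab}

theta1 = mle_model(d1)
# Brute-force grid search over lambda in (0, 1); EM should find the same maximizer.
best = max((i / 1000 for i in range(1, 1000)),
           key=lambda l: log_likelihood(l, d2, theta1, bg))
print(round(best, 2))  # 0.33
```

A larger fitted $\lambda$ means more of $D_2$ is explained by $\theta_1$ than by the background, i.e., $D_2$ is more redundant with respect to $D_1$.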