## CS598CXZ Assignment #6: Language Models for Retrieval, Topic Models, and EM Algorithm (due 11:59pm, Thursday, Oct. 30, 2014 )

1. [10 points] KL-divergence retrieval function

Show that the KL-divergence retrieval function covers the query likelihood retrieval function as a special case if we set the query language model to the empirical word distribution in the query (i.e., p(w|θQ) = c(w,Q)/|Q|, where c(w,Q) is the count of word w in query Q, and |Q| is the length of the query).
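For reference, the two retrieval functions involved, as commonly defined in the language modeling retrieval literature (these standard definitions are assumed here, not restated in the assignment; θD denotes the smoothed document language model):

```latex
% KL-divergence retrieval function: rank by the negative KL divergence
% between the query model \theta_Q and the document model \theta_D
\mathrm{score}(Q, D) = -D(\theta_Q \,\|\, \theta_D)
  = -\sum_{w \in V} p(w \mid \theta_Q) \log \frac{p(w \mid \theta_Q)}{p(w \mid \theta_D)}

% Query likelihood retrieval function: rank by the log-likelihood of the
% query under the document language model
\mathrm{score}(Q, D) = \log p(Q \mid \theta_D)
  = \sum_{w \in V} c(w, Q) \log p(w \mid \theta_D)
```

Note that terms depending only on the query (not on the document) do not affect document ranking; this observation is useful for the proof.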

2. [25 points] Language models for Boolean queries

Intuitively, the semantic structure of a query should affect how we score a document. The two basic semantic structures are conjunctive (i.e., "AND") and disjunctive (i.e., "OR") expressions. For a conjunctive query such as Q = w1 AND w2, we would favor a document that matches both terms w1 and w2 much more than one that matches just one of them, while for a disjunctive query such as Q = w1 OR w2, we would not discriminate between them as much, though matching both terms should still lead to a higher score than matching just one. What is the implied query semantic structure in the basic multinomial query likelihood retrieval model? Is it disjunctive or conjunctive?

Now consider general Boolean queries (without negation) of the form Q = Q1 AND Q2 AND ... AND Qk, where each Qi is a disjunctive ``subquery'' Qi = wi1 OR ... OR wi,ni. How can we extend the basic multinomial query likelihood retrieval model to score documents for such a Boolean query? Explain your ideas with formulas.

3. [30 points] Mixture model

Author H and author T are co-authoring a paper in the following way: (1) Each word is written independently. (2) When writing a word, they would first toss a coin to decide who will write the word. The coin is known to show up as HEAD 80% of the time. If the coin shows up as HEAD, then author H would write the word, otherwise, author T would write the word. (3) If it is author H's turn to write, he would ``write'' the word by simply drawing a word according to word distribution p(w|H). Similarly, if it is author T's turn to write, he would ``write'' the word by drawing a word according to word distribution p(w|T). Suppose the two distributions p(w|H) and p(w|T) are defined as follows:

| Word w   | p(w\|H) | p(w\|T) |
|----------|---------|---------|
| the      | 0.3     | 0.3     |
| computer | 0.1     | 0.2     |
| data     | 0.1     | 0.1     |
| baseball | 0.2     | 0.1     |
| game     | 0.2     | 0.1     |
| ...      | ...     | ...     |
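The generative process described above can be sketched as a short simulation. This is only an illustration of steps (1)-(3): the ``...'' rows of the table are unspecified, so the leftover probability mass is lumped into a single hypothetical ``other'' token here (an assumption, not part of the assignment).

```python
import random

rng = random.Random(0)  # fixed seed so the simulation is reproducible

# Word distributions from the table above; the unspecified "..." tail is
# collapsed into a hypothetical "other" token carrying the remaining mass.
p_H = {"the": 0.3, "computer": 0.1, "data": 0.1, "baseball": 0.2, "game": 0.2, "other": 0.1}
p_T = {"the": 0.3, "computer": 0.2, "data": 0.1, "baseball": 0.1, "game": 0.1, "other": 0.2}

def write_word():
    # Step (2): toss the coin -- HEAD (probability 0.8) means author H writes.
    dist = p_H if rng.random() < 0.8 else p_T
    # Step (3): the chosen author draws a word from their own distribution.
    words, probs = zip(*dist.items())
    return rng.choices(words, weights=probs)[0]

# Step (1): each word is generated independently.
paper = [write_word() for _ in range(10_000)]
```

Running the simulation and comparing empirical word frequencies against hand-computed mixture probabilities is a useful sanity check for parts 1, 2, and 4.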

1. [5/30 points] What is the probability that they would write ``the'' as the first word of the paper? Show your calculation.
2. [5/30 points] What is the probability that they would write ``the'' as the second word of the paper? Show your calculation.
3. [10/30 points] Suppose we observe that the first word they wrote is ``data'', what is the probability that this word was written by author H? Show your calculation.
4. [5/30 points] Imagine that we observe a very long paper written by them (e.g., with more than 10,000 words). Among the 5 words shown in the table above (i.e., ``the'', ``computer'', ``data'', ``baseball'', ``game''), which one would you expect to occur least frequently in the paper? Briefly explain why.
5. [5/30 points] Suppose we don't know p(w|H), but observed a paper D known to be written solely by author H. That is, the coin somehow always showed up as HEAD when they wrote the paper. Suppose D="the computer data the computer game the computer data game" and we would like to use the maximum likelihood estimator to estimate p(w|H). What would be the estimated probabilities of "computer" and "game", respectively? Show your calculation.

4. [35 points] EM Algorithm

The two-component mixture model we discussed for feedback can also be used to estimate the redundancy of one document with respect to another. Specifically, given two documents D1 and D2, and a background language model p(w|C), we can use the maximum likelihood estimator to estimate a unigram language model based on D1, which will be denoted by θ1 (i.e., p(w|θ1)=c(w,D1)/|D1|). Now, we can assume that document D2 is generated by sampling words from a two-component multinomial mixture model where one component is p(w|θ1), and the other is p(w|C). Let λ denote the probability that p(w|θ1) would be selected to generate a word in D2 (thus, 1-λ would be the probability of selecting the background model p(w|C)). Let D2=w1 w2 ... wk where wi is a word in our vocabulary set V. We can then use the mixture model to fit D2 and compute the maximum likelihood estimate of λ, which can then be used to measure the redundancy of D2 with respect to D1. We can use the EM algorithm to compute the maximum likelihood estimate.
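As a concrete illustration of the θ1 estimate described above, here is a minimal sketch of the maximum likelihood unigram estimator p(w|θ1) = c(w, D1)/|D1| (the document string is a made-up example, not taken from the assignment):

```python
from collections import Counter

def mle_unigram(doc: str) -> dict:
    """Maximum likelihood unigram language model: p(w|theta) = c(w, D) / |D|."""
    words = doc.split()
    counts = Counter(words)          # c(w, D): count of each word in D
    n = len(words)                   # |D|: total document length in words
    return {w: c / n for w, c in counts.items()}

# Hypothetical document D1, used only to illustrate the estimator.
theta1 = mle_unigram("the data game the data")
```

Here ``the'' and ``data'' each occur 2 times out of 5 words, so each gets probability 0.4, and ``game'' gets 0.2; the probabilities always sum to 1.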

1. [5/35 points] Write down the formula to compute the probability of generating a word wi in document D2.
2. [5/35 points] Write down the log-likelihood of the whole document D2, i.e., the probability of observing all the words in D2 being generated from the mixture model.
3. [5/35 points] How many binary hidden variables in total do we need for computing this maximum likelihood estimate using the EM algorithm? Why?
4. [20/35 points] Write down the E-step and M-step updating formulas for estimating λ.