Please submit your solutions via Compass.
Show that the KL-divergence retrieval function covers the query likelihood retrieval function as a special case if we set the query language model to the empirical word distribution in the query (i.e., p(w|θ_{Q}) = c(w,Q)/|Q|, where c(w,Q) is the count of word w in query Q, and |Q| is the length of the query).
Intuitively, the semantic structure of a query should affect how we score a document. The two basic semantic structures are conjunctive (i.e., "AND") and disjunctive (i.e., "OR") expressions. For a conjunctive query such as Q= w1 AND w2, we would favor a document that matches both terms w1 and w2 much more to one that matches just one of them, while for a disjunctive query such as Q=w1 OR w2, we would not discriminate them so much, though matching both terms should still lead to a higher score than matching just one of them. What is the implied query semantic structure in the basic multinomial query likelihood retrieval model? Is it disjunctive or conjunctive?
Now consider general Boolean queries (without negation) of the form of Q= Q1 AND Q2 ... AND Qk, where each Q_{i} is a disjunctive ``subquery'' Q_{i} =w_{i1} OR ... w_{ini}, how can we extend the basic multinomial query likelihood retrieval model to score documents for such a Boolean query? Explain your ideas with formulas.
Author H and author T are co-authoring a paper in the following way: (1) Each word is written independently. (2) When writing a word, they would first toss a coin to decide who will write the word. The coin is known to show up as HEAD 80% of the time. If the coin shows up as HEAD, then author H would write the word, otherwise, author T would write the word. (3) If it is author H's turn to write, he would ``write'' the word by simply drawing a word according to word distribution p(w|H). Similarly, if it is author T's turn to write, he would ``write'' the word by drawing a word according to word distribution p(w|T). Suppose the two distributions p(w|H) and p(w|T) are defined as follows:
---------------------------- | Word w | p(w|H) | p(w|T) | ---------------------------- | the | 0.3 | 0.3 | |computer| 0.1 | 0.2 | | data | 0.1 | 0.1 | |baseball| 0.2 | 0.1 | | game | 0.2 | 0.1 | | ... | ... | ... | ----------------------------
The two-component mixture model we discussed for feedback can also be used to estimate the redundancy of one document with respect to another. Specifically, given two documents D1 and D2, and a background language model p(w|C), we can use the maximum likelihood estimator to estimate a unigram language model based on D1, which will be denoted by θ_{1} (i.e., p(w|θ_{1})=c(w,D1)/|D1|). Now, we can assume that document D2 is generated by sampling words from a two-component multinomial mixture model where one component is p(w|θ_{1}), and the other is p(w|C). Let λ denote the probability that p(w|θ_{1}) would be selected to generate a word in D2 (thus, 1-λ would be the probability of selecting the background model p(w|C)). Let D2=w1 w2 ... wk where w_{i} is a word in our vocabulary set V. We can then use the mixture model to fit D2 and compute the maximum likelihood estimate of λ, which can then be used to measure the redundancy of D2 with respect to D1. We can use the EM algorithm to compute the maximum likelihood estimate.