## CS598CXZ Assignment #4: Information Retrieval Models (deadline: 11:59pm, Thursday, Oct. 24, 2013, submit your assignment via Wiki)

#### Submit your completed assignment as a PDF or Word attachment to your homework wiki page. Scanned handwritten answers are allowed. Name your file as "yourlastname-yourfirstname-hw4" (e.g., smith-john-hw4.doc or smith-john-hw4.pdf). See the detailed instructions at the end of this page.

1. [20 points] Pivoted Length Normalization The following questions refer to the pivoted length normalization paper [Singhal et al. SIGIR 1996].
1. [5/20 points] According to the paper, what are the two main reasons for doing document length normalization (i.e., normalizing term weights to penalize long documents)? What do you think about pre-segmenting all documents into passages of equal lengths as a way to achieve document length normalization?
2. [5/20 points] In order to check the optimality of length normalization of a retrieval function, the authors plotted and compared two curves in Figure 1 (c). Briefly explain how exactly each curve was generated. What was the conclusion drawn from Figure 1 about the cosine normalization method?
3. [5/20 points] The essence of pivoted length normalization is that it assumes that the documents with lengths equal to the "pivot" have the "right scores" (thus no need for normalization), those longer than the pivot would get penalized, and those shorter would get rewarded. The commonly used pivoted length normalizer is (1-s)+s*doclen/average-doc-length. In such a normalizer, what is the implied pivot? What is the meaning and effect of parameter s?
4. [5/20 points] Consider two versions of normalized TF: (1) RawTF/[1-s+s*docLen/avgDocLen], and (2) log(1+log(1+RawTF))/[1-s+s*docLen/avgDocLen]. Suppose we put these TF formulas in a TF-IDF retrieval function, and tune parameter s empirically for both TF formulas using the same test collection (i.e., finding the s value that works the best on this collection), do you think we will end up having about the same optimal value for s? If not, which TF formula would have a higher optimal value for s? Why?

2. [20 points] Classic Probabilistic Retrieval Model.
1. [8/20 points] In the derivation of the Robertson-Sparck-Jones (RSJ) model, a multi-variate Bernoulli model was used to model term presence/absence in a relevant document and a non-relevant document. Suppose, we change the model to a multinomial model (see the slide that covers both models for computing query likelihood). Using a similar independence assumption as we used in deriving RSJ, show that ranking based on probability that a document is relevant to a query Q, i.e., p(R=1|D,Q), is equivalent to ranking based on the following formula:
```            \---            p(w|Q,R=1)
score(Q,D) = >   c(w,D) log ---------
/---            p(w|Q,R=0)
w in V
```
where the sum is taken over all the words in our vocabulary (denoted by V). How many parameters are there in such a retrieval model?
2. [2/20 points] The retrieval function above won't work unless we can estimate all the parameters. Suppose we use the entire collection C={D1,...,Dn} as an approximation of the examples of non-relevant documents. Give the formula for the Maximum Likelihood estimate of p(w|Q,R=0).
3. [2/20 points] Suppose we use the query as the only example of a relevant document. Give the formula for the Maximum Likelihood estimate of p(w|Q,R=1).
4. [3/20 points] One problem with the maximum likelihood estimate of p(w|Q,R=1) is that many words would have zero probability, which limits its accuracy of modeling words in relevant documents. Give the formula for smoothing this maximum likelihood estimate using fixed coefficient linear interpolation (i.e., Jelinek-Mercer) with a collection language model.
5. [5/20 points] With the two estimates you proposed, we should now have a retrieval function that can be used to compute a score for any document D and any query Q. Write down your retrieval function by plugging in the two estimates. Can your retrieval function capture the three major retrieval heuristics (i.e., TF, IDF, and document length normalization)? How?

3. [30 points] Language Models
1. [15/30 points] Show that if we use the query-likelihood scoring method (i.e., p(Q|D)) and the Jelinek-Mercer smoothing method (i.e., fixed co-efficient interpolation with smoothing parameter a) for retrieval, we can rank documents based on the following scoring function:
```            \-------                  (1-a)*c(w,D)
score(Q,D) = >         c(w,Q) log (1+ --------------- )
/-------                   a*p(w|REF)*|D|
w in Q and D

```
where the sum is taken over all the matched query terms in D, |D| is the document length, "a" is the smoothing parameter (i.e., the same as lambda on the lecture slide), p(w|REF) is the probability of word w given by the reference language model estimated using the whole collection.
2. [15/30 points] One way to check whether a retrieval function would over-penalize a long document is to do the following: Imagine that we duplicate a document D k times to generate a new document D' so that each term would have k times more occurrences (naturally, the new document D' is also k times longer than the original D). Intuitively, the score for D' shouldn't be less than the score of D for any query. Check if this is true for both Jelinek-Mercer smoothing and Dirichlet prior smoothing.
4. [20 points] Evaluation of IR Models
In this question, you will play with a Web-based IR experiment system to observe the real performance of some retrieval functions you learnt in the lectures. Please use the IR Virtual Lab system to accomplish the following tasks. (You may find a tutorial for the system here.)
1. [5/20 points] For the datasets "ap8889" and "doe", (1) evaluate the BM25 retrieval function with different parameter settings. (2) plot the parameter sensitivity curves, i.e. plot MAP against each parameter while keeping other parameter(s), if any, fixed, and (3) report the highest MAP value you obtained together with its corresponding parameter setting.
2. [5/20 points] Do the same for the Dirichlet Prior retrieval function.
3. [10/20 points] Pick either one of the two datasets, and compare the performance of the two retrieval functions in terms of MAP. Furthermore, pick 5 easy queries and 5 hard queries within the dataset as measured by BM25 Average Precision (AP), and compare the performance of the two retrieval functions on each query in terms of AP. Are your per-query comparison results consistent with your overall comparison result?
5. [10 points] Improve an existing IR model
In this question, you will continue to work in IR Virtual Lab and try to come up with your own ways to improve the two existing retrieval functions.
You may start with either function. Try to modify the function and evaluate it on the datasets "ap8889" and "doe". Note that you are not required to be restricted to the specific forms of the existing functions. Show your creativity by designing whatever retrieval function you could come up with.
1. [10/10 points] Show the retrieval functions you have tried together with explanations. For each of the two datsets, report the highest MAP value you obtained together with the function id and the implementation of the corresponding retrieval function.
2. The top retrieval functions (each identified by a unique function id) are shown on the leader board, and their creators will get bonus score for this assignment. The rankings will be read at 8:00pm on the assignment due date. In order to get the bonus score, the creators must publish their names, net ids, and the implementations of their retrieval functions together with an explanation on this wiki page within 24 hours after the assignment deadline.

### How to submit your homework

```[assignment 4|^name-of-your-file-attached]