CS598CXZ Assignment #4: IR Evaluation and Vector Space Model
(due 11:59pm, Sunday, Oct. 5, 2014)

Please submit your solutions via Compass.

  1. [25 points] Evaluation Measures
    Suppose a query has a total of 10 relevant documents in a collection of 100 documents. A system has retrieved 8 documents whose relevance status is [+,-,-,+,+,-,-,+] in the order of ranking. A "+" (or "-") indicates that the corresponding document is relevant (or non-relevant); for example, the first document is relevant, the second is non-relevant, and so on.
    1. [15/25 points] Compute the precision, recall, F1, and average precision for this result.
    2. [10/25 points] Assume that a relevant document has a gain value of 1 and a non-relevant document has a gain value of 0. Compute the Cumulative Gain (CG) at 4 documents, the Discounted Cumulative Gain (DCG) at 4 documents, and the normalized DCG (nDCG) at 4 documents. Use base 2 for the logarithmic discounting function. (A reference sketch of these computations appears after this question.)
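
    For reference, the sketch below shows one common way to compute these measures from a ranked list of binary relevance judgments. It is Python, the function and variable names are illustrative only, and the DCG follows the usual convention of no discount at rank 1 and a log2(i) discount at ranks i >= 2.

      import math

      def precision_recall_f1(rels, total_relevant):
          # rels: 0/1 relevance judgments of the retrieved documents, in rank order.
          # total_relevant: number of relevant documents in the whole collection.
          retrieved_relevant = sum(rels)
          precision = retrieved_relevant / len(rels)
          recall = retrieved_relevant / total_relevant
          f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
          return precision, recall, f1

      def average_precision(rels, total_relevant):
          # Sum of the precision at each relevant rank, divided by the total
          # number of relevant documents in the collection.
          hits, ap = 0, 0.0
          for i, r in enumerate(rels, start=1):
              if r:
                  hits += 1
                  ap += hits / i
          return ap / total_relevant

      def cg_at_k(gains, k):
          # Cumulative gain: sum of the gains of the top k documents.
          return sum(gains[:k])

      def dcg_at_k(gains, k):
          # Discounted cumulative gain with a base-2 log discount.
          return sum(g / (1.0 if i == 1 else math.log2(i))
                     for i, g in enumerate(gains[:k], start=1))

      def ndcg_at_k(gains, ideal_gains, k):
          # Normalized DCG: DCG of this ranking divided by DCG of the ideal
          # ranking, where ideal_gains lists the gains of a perfect ranking.
          return dcg_at_k(gains, k) / dcg_at_k(ideal_gains, k)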


  2. [25 points] Pooling
    1. [10/25 points] In a few sentences, explain how pooling works in IR evaluation and why it is used (a small sketch of pool construction follows this question).
    2. [15/25 points] When using a test collection created with pooling to compare two retrieval methods, the common practice is to treat an unjudged document as non-relevant (the "standard strategy"). An alternative would be to remove all unjudged documents from the collection and use only judged documents, i.e., both judged relevant and judged non-relevant documents (the "alternative strategy"). Imagine that you have proposed a new retrieval algorithm (a new ranking function) and would like to reuse a test collection created with pooling to compare your new function with existing ones (which may have contributed to the pool of judged documents). Which of the two strategies would you use in your evaluation? Why? Should you report results using both strategies?
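
    As a point of reference for this question, the sketch below shows the basic mechanics of depth-k pool construction. It is illustrative Python; the function name, run format, and default depth are assumptions, not part of the assignment.

      def build_pool(runs, depth=100):
          # runs: mapping from a participating system's name to its ranked
          # list of document ids for one query.
          # The pool is the union of each system's top-`depth` documents;
          # only pooled documents are judged by assessors.
          pool = set()
          for ranked_docs in runs.values():
              pool.update(ranked_docs[:depth])
          return pool

      # Documents retrieved by a later (new) system but absent from the pool
      # are unjudged: the "standard strategy" treats them as non-relevant,
      # while the "alternative strategy" removes them before scoring.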

  3. [50 points] Pivoted Length Normalization
    The following questions refer to the pivoted length normalization paper [Singhal et al. SIGIR 1996].
    1. [10/50 points] According to the paper, what are the two main reasons for doing document length normalization (i.e., normalizing term weights to penalize long documents)? What do you think about pre-segmenting all documents into passages of equal length as a way to achieve document length normalization?
    2. [15/50 points] To check the optimality of the length normalization of a retrieval function, the authors plotted and compared two curves in Figure 1(c). Briefly explain exactly how each curve was generated. What conclusion about the cosine normalization method was drawn from Figure 1?
    3. [10/50 points] The essence of pivoted length normalization is the assumption that documents whose length equals the "pivot" have the "right" scores (and thus need no normalization), that documents longer than the pivot should be penalized, and that documents shorter than the pivot should be rewarded. The commonly used pivoted length normalizer is (1-s) + s*docLen/avgDocLen. In such a normalizer, what is the implied pivot? What is the meaning and effect of the parameter s?
    4. [15/50 points] Consider two versions of normalized TF: (1) RawTF/[1-s+s*docLen/avgDocLen], and (2) log(1+log(1+RawTF))/[1-s+s*docLen/avgDocLen]. Suppose we plug each of these TF formulas into a TF-IDF retrieval function and tune the parameter s empirically for both formulas on the same test collection (i.e., find the s value that works best on this collection). Do you think we will end up with about the same optimal value of s for both? If not, which TF formula would have a higher optimal value of s? Why? (A sketch of these formulas appears after this question.)
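
    For reference when answering 3.3 and 3.4, the sketch below writes out the pivoted length normalizer and the two TF variants exactly as given above. It is illustrative Python; the function and parameter names are assumptions, not part of the paper or the assignment.

      import math

      def pivoted_normalizer(doc_len, avg_doc_len, s):
          # The commonly used pivoted length normalizer from 3.3.
          return (1 - s) + s * doc_len / avg_doc_len

      def normalized_tf_v1(raw_tf, doc_len, avg_doc_len, s):
          # Version (1): raw TF divided by the pivoted normalizer.
          return raw_tf / pivoted_normalizer(doc_len, avg_doc_len, s)

      def normalized_tf_v2(raw_tf, doc_len, avg_doc_len, s):
          # Version (2): double-log dampened TF divided by the pivoted normalizer.
          return math.log(1 + math.log(1 + raw_tf)) / pivoted_normalizer(doc_len, avg_doc_len, s)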