## CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

# MP1: Two-Stage Smoothing for Information Retrieval

Notice: This assignment is due Tuesday, October 3rd at 11:59pm.

This is a programming assignment, so submission procedures are slightly different from those for regular written assignments. You will upload your code as well as your write-up (in either Markdown or PDF) to your git repository; concrete instructions on how to do so are included in this assignment specification. Double-check that you can see your submission in your git repository!

In this assignment, you will use the MeTA toolkit’s Python bindings to explore the parameter sensitivity of two classical language model-based ranking methods (Dirichlet prior and Jelinek-Mercer) for different query types on real-world test collections, following the standard Cranfield evaluation methodology. You will then implement a two-stage smoothing method for a language model-based ranker and explore this new method’s parameter sensitivity on the same collections.

Warning: This assignment requires basic programming using Python. You should install Python 3 on your system in order to do the assignment.

Please make sure you’ve completed the setup instructions well before the deadline, as we cannot guarantee that last-minute installation questions will be answered (we will prioritize questions about the assignment itself, since the students asking those are, by definition, closer to finishing than you!).

### Setup Instructions

First, install Python 3 if you haven’t already. It is available for Linux, OS X, and Windows; installation instructions can be found on the Python website.

Next, you’ll need to install the supporting library using Python’s pip:
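The package name is not shown in this excerpt; judging from the rest of the assignment, the supporting library is metapy, the MeTA toolkit’s Python bindings, so the command would look something like:

```shell
pip install metapy
```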

Info: On Windows, you may need to add the Python install path and its Scripts sub-directory to your PATH.

Assuming you installed Python to the directory C:\Python36, this would look something like this in a command prompt:
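For example (using the C:\Python36 install directory mentioned above):

```shell
set PATH=%PATH%;C:\Python36;C:\Python36\Scripts
```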

#### EWS Instructions

If you have trouble getting Python or the above libraries installed, you can use the EWS remote machines (we strongly encourage you to use your own machine if you can, however). Once you SSH into an EWS machine, you should be able to use Python 3 by running
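The exact command depends on how the EWS environment is configured; it is typically a module load along these lines:

```shell
module load python3   # the exact module name may differ; run "module avail" to list options
```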

and then install the required library with
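Since you don’t have administrator rights on EWS, a per-user install is the likely form (again assuming the library is metapy):

```shell
pip install --user metapy
```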

### 0. Using git

We will be using git as the version control system for this assignment. If you are unfamiliar with git, here are some good tutorials/resources:

Consider this a good opportunity to learn a tool that is widely used in industry for source control!

### Getting the Assignment Code

Info: Once you’ve logged in, you can either set up an SSH key for public/private key authentication, or just use the password you set if you use the HTTPS protocol for checking out the code. I recommend setting up an SSH key, but you don’t have to do this (skip this note if you don’t care about setting up an SSH key).

To get this set up, go up to the upper right-hand corner of the screen and click on the drop-down arrow. From there, click on Settings.

Then, in the left-hand menu, click SSH Keys.

On this page, there’s a helpful link about generating an SSH key if you’ve not generated one before. After you’ve generated your key, simply copy the public key into the form field and hit submit.

Now that you have your authentication set up, you should be able to check out the repository. Go back to the Projects page by clicking the projects link in the very top nav bar. Then, click on the cs510-fa17-mp1 project.

At the top of this page, there should be a URL given to you underneath the project name, which should look something like this:
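For example, with placeholder host and username (the real URL depends on the course’s GitLab server):

```
git@<gitlab-host>:<your-netid>/cs510-fa17-mp1.git
```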

(You can toggle the URL to be HTTPS if you are using a password instead of an SSH key.)

You should be able to use that URL to clone the repository. If you’re using the command line, that would look something like this:
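For example, substituting the URL shown on your project page (the host and username below are placeholders):

```shell
git clone git@<gitlab-host>:<your-netid>/cs510-fa17-mp1.git
```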

When you have changes to commit, you need to first add the files you changed using
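For example, with the two files this assignment has you modify (your write-up’s filename is up to you):

```shell
git add helpers.py two_stage_smoothing.py   # plus your write-up file
```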

and then you can create a commit with
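For example (the commit message is up to you):

```shell
git commit -m "Implement two-stage smoothing"
```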

Finally, you need to push your changes to the GitLab server.
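Assuming the default remote name origin and the default branch master:

```shell
git push origin master
```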

Let’s get started!

### 1. The Generalized Language Model Retrieval Function [8 pts]

Info: This assignment closely follows the following paper:

ChengXiang Zhai and John Lafferty. 2004. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst. 22, 2 (April 2004), 179-214.

Recall that, under the query-likelihood derivation, we model the probability of relevance with

$\displaystyle P(R=1 \mid D, Q) \approx P(Q \mid R=1, D).$

If we assume a multinomial distribution for the query generation, we arrive at the general formula

$\displaystyle P(Q \mid D) = \prod_{i=1}^{|Q|} p(q_i \mid \theta_D).$

The main difference between retrieval functions of this form comes from their different choice of smoothing method that is applied to the unigram language model $\theta_D$. In general, when we smooth with a collection background language model, we can write this probability as

$\displaystyle p(w \mid \theta_D) = \begin{cases} p_s(w \mid \theta_D) & \text{ if } w \in D\\ \alpha_D p(w \mid C) & \text{ otherwise} \end{cases}$

where $p_s(w \mid \theta_D)$ is the discounted maximum likelihood estimate of observing word $w$ in document $D$ and $\alpha_D$ is a document-specific coefficient that controls the amount of probability mass assigned to unseen words to ensure that all of the probabilities sum to one.

Noting that $\log$ is a monotonic transform (thus leading to equivalent results under ranking), and using the above smoothing formulation, we can show the following:

$\displaystyle \begin{aligned} \log P(Q \mid D) &= \sum_{i=1}^{|Q|} \log p(q_i \mid \theta_D)\\ &= \sum_{w \in V} c(w, Q) \log p(w \mid \theta_D)\\ &= \sum_{w \in D} c(w, Q) \log p_s(w \mid \theta_D) + \sum_{w \not\in D} c(w, Q) \log \alpha_D p(w \mid C)\\ &= \sum_{w \in D} c(w, Q) \log p_s(w \mid \theta_D) + \sum_{w \in V} c(w, Q) \log \alpha_D p(w \mid C) - \sum_{w \in D} c(w, Q) \log \alpha_D p(w \mid C)\\ &= \sum_{w \in D} c(w, Q) \log \frac{p_s(w \mid \theta_D)}{\alpha_D p(w \mid C)} + |Q| \log \alpha_D + \sum_{w \in V} c(w, Q) \log p(w \mid C)\\ &\stackrel{\text{rank}}{=} \sum_{w \in D} c(w, Q) \log \frac{p_s(w \mid \theta_D)}{\alpha_D p(w \mid C)} + |Q| \log \alpha_D \end{aligned}$

This ranking function can be computed efficiently at query time by considering only matched query terms, and it is completely general in that it can be instantiated for any collection language model-based smoothing method.
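To make the derivation concrete, here is a small self-contained Python sketch (not part of the assignment code) that numerically checks the rank equivalence: the full sum over every query term equals the efficient matched-term form plus the dropped document-independent constant. The toy probabilities below are arbitrary, chosen only so that the smoothed model sums to one.

```python
import math

def log_p_full(query_counts, doc_words, p_s, alpha_d, p_c):
    """Direct computation of log P(Q|D), summing over every query term."""
    total = 0.0
    for w, c in query_counts.items():
        if w in doc_words:
            total += c * math.log(p_s[w])
        else:
            total += c * math.log(alpha_d * p_c[w])
    return total

def log_p_efficient(query_counts, doc_words, p_s, alpha_d, p_c):
    """Rank-equivalent form: matched terms only, plus |Q| * log(alpha_D)."""
    q_len = sum(query_counts.values())
    score = q_len * math.log(alpha_d)
    for w, c in query_counts.items():
        if w in doc_words:
            score += c * math.log(p_s[w] / (alpha_d * p_c[w]))
    return score

# Toy setup: vocabulary {a, b, c, d}; the document contains only {a, b}.
p_c = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
doc_words = {"a", "b"}
p_s = {"a": 0.5, "b": 0.3}                    # mass on seen words: 0.8
alpha_d = (1 - 0.8) / (p_c["c"] + p_c["d"])   # normalizes the unseen mass
query = {"a": 1, "c": 2}

# The dropped constant depends only on the query and the collection.
constant = sum(c * math.log(p_c[w]) for w, c in query.items())
full = log_p_full(query, doc_words, p_s, alpha_d, p_c)
efficient = log_p_efficient(query, doc_words, p_s, alpha_d, p_c)
assert abs(full - (efficient + constant)) < 1e-12
```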

1. [2 pts] What is $p_s(w \mid \theta_D)$ for Jelinek-Mercer smoothing? What is $\alpha_D$? Provide a brief explanation for your derivation of $\alpha_D$.

2. [2 pts] What is $p_s(w \mid \theta_D)$ for Dirichlet prior smoothing? What is $\alpha_D$? Provide a brief explanation for your derivation of $\alpha_D$.

3. [4 pts] Consider the following smoothed language model:

$\displaystyle p(w \mid \theta_D) = (1 - \lambda) \frac{c(w, D) + \mu p(w \mid C)} {|D| + \mu} + \lambda p(w \mid C).$

This is called the “two-stage smoothing” method. If we were to use this style of smoothing in the query likelihood model, what is $\alpha_D$ for the two-stage smoothing method? Provide a brief explanation of your derivation.
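As a sanity check on this formula (a sketch, not the assignment implementation), the following function computes the two-stage probability for a single word; summing it over a toy vocabulary confirms the smoothed model is a proper distribution.

```python
def two_stage_prob(count_w_d, doc_len, p_w_c, mu, lam):
    """p(w | theta_D) under two-stage smoothing: Dirichlet prior smoothing
    first, then Jelinek-Mercer interpolation with the collection model."""
    dirichlet = (count_w_d + mu * p_w_c) / (doc_len + mu)
    return (1 - lam) * dirichlet + lam * p_w_c

# Toy check: a 10-word document over a 3-word vocabulary.
p_c = {"a": 0.5, "b": 0.3, "c": 0.2}
counts = {"a": 6, "b": 4, "c": 0}
total = sum(two_stage_prob(counts[w], 10, p_c[w], mu=2000, lam=0.7)
            for w in p_c)
assert abs(total - 1.0) < 1e-12   # probabilities sum to one
```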

### 2. Implementing the Two-Stage Smoothing Method [20 pts]

In your repository you will find the file two_stage_smoothing.py. Near the top of this file you will see a place for you to fill in the definition of the two-stage smoothing method. Since this is a language model-based retrieval function that smooths with a collection language model, we can use the general language-model retrieval function definition.

In metapy, that general retrieval function is provided in an abstract base class metapy.index.LanguageModelRanker. Thus, to specify the two-stage smoothing retrieval function, we only need to provide a definition for the two major components of the general function: $p_s(w \mid \theta_D)$ and $\alpha_D$.

1. [10 pts] Fill in the definition of the member function smoothed_prob(), which takes a metapy.index.ScoreData object and returns $p_s(w \mid \theta_D)$.

We have left you some helpful comments to get you started with the implementation.

2. [10 pts] Fill in the definition of the member function doc_constant(), which takes a metapy.index.ScoreData object and returns $\alpha_D$.

### 3. Parameter Sensitivity of Smoothing Methods for IR [20 pts]

Now let’s investigate the parameter sensitivity of the smoothing methods we discussed in this assignment. In particular, we are interested in modeling how the mean average precision changes as a function of the smoothing parameter(s) for each of three methods.

Note: You may find the Search and IR eval metapy tutorial helpful here.

In your repository, you will find the file helpers.py that has the helper functions compute_map() and create_param_table() that we would like you to finish implementing.

1. [10 pts] Fill in the inner loop of compute_map() to compute the average precision for the query. We’ve left some comments that should help you get started with the implementation.

Now, make the function return the mean average precision across all of the queries in the query set.

2. [10 pts] Look for the function create_param_table() in helpers.py. Fill in its inner loop to properly write a CSV file containing the obtained MAP value for each parameter setting specified.

(It might help to have a look at experiment.py to see how everything fits together; you shouldn’t have to change that file, however.)
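For reference, here is a library-free sketch of average precision and MAP over plain Python lists; the logic inside compute_map() should mirror this, though the function and variable names here are our own and not the ones in helpers.py.

```python
def average_precision(ranking, relevant):
    """ranking: doc ids in ranked order; relevant: set of relevant doc ids."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this recall point
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(per_query_results):
    """per_query_results: list of (ranking, relevant) pairs, one per query."""
    return (sum(average_precision(r, rel) for r, rel in per_query_results)
            / len(per_query_results))

ap = average_precision(["d1", "d2", "d3"], {"d1", "d3"})
assert abs(ap - (1.0 + 2 / 3) / 2) < 1e-12   # (1/1 + 2/3) / 2
```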

### 4. Running Sensitivity Experiments

Because of the “dual role” that smoothing plays in the query likelihood retrieval model, we’re going to model this effect for four different types of queries:

1. short keyword,
2. long keyword,
3. short verbose, and
4. long verbose.

The difference between a long and a short query should hopefully be obvious. When we talk about a “keyword query”, this is a typical web search style query that contains, by and large, a collection of noun phrases. A verbose query, on the other hand, might be a paragraph or more of text describing the documents that the user wishes to retrieve.

Our particular datasets are from TREC, and are old but standard text retrieval benchmark datasets. We will be using the following datasets:

1. News articles from the Associated Press between 1988–1989
2. Articles from Ziff-Davis publications between 1989–1990

For our query sets, we will be using topics 1–150 from TREC, along with the relevance judgments collected from pooling during the associated competitions.

In your repository, we have provided code that can be used to generate CSV files containing the necessary data you’ll need to generate the plots we want. (You’ve filled in part of it already; have a look in experiment.py if you’re curious about the missing pieces.)

To help you generate the data you’ll need, we’ll be using the GitLab CI built in to the GitLab instance we are running for your assignment submission. Navigate to your repository and select Pipelines from the navigation menu on the left-hand side.

On that page (once you’ve pushed to your repository at least once), you should see a list of pipelines like this:

Click on the “passed” icon. You’ll then see a list of jobs like this:

The cranfield job runs automatically by default and just tests whether your TwoStageSmoothing ranker successfully runs on the cranfield dataset we provided (in the data/ directory in your repository). The other jobs do not run by default and require you to manually start them. When you are certain that your implementation is correct, go ahead and click the little “play” button next to each of the jobs to start it. Note that you can have more than one job running at once, so that may save you some time if you generate your data in parallel.

Warning: Running the two-stage smoothing code is considerably slower than running the experiments for the built-in rankers. This is because the built-in rankers are implemented in the C++ library and are thus much more efficient.

Make sure you leave yourself enough time before the deadline to generate the data for your two-stage plots! Just to give you an idea, the experiment to generate the data for the two-stage ranker on long-verbose queries on the ap88-89 dataset takes a little under one hour. Budget your time accordingly.

Each job will generate one CSV file, which you can download once the job successfully completes. When you’re viewing a job, you can see the exact command-line output of the process in the center pane, which will look something like this:

On the right of the screen, you’ll see a pane that looks like this:

• dirichlet-prior-sensitivity-ap88-89.csv
• dirichlet-prior-sensitivity-ziff1-2.csv
• jelinek-mercer-sensitivity-ap88-89.csv
• jelinek-mercer-sensitivity-ziff1-2.csv
• two-stage-sensitivity-ap88-89.csv
• two-stage-sensitivity-ap88-89-long-verbose.csv
• two-stage-sensitivity-ziff1-2.csv
• two-stage-sensitivity-ziff1-2-long-verbose.csv

### 5. Creating Parameter Sensitivity Plots [24 pts]

Now it’s finally time to create our plots! Each of the figures you generate should have the following properties:

1. Each data point clearly marked, with a different marker for each query type
2. A line drawn between each data point for each specific query type
3. A legend indicating what symbols/colors indicate which query types
4. A plot title indicating the method and dataset used
5. A y-axis that ranges from 0 to 0.5.

We would like you to create the following figures for both datasets:

1. [8 pts] For Jelinek-Mercer, make a plot showing the obtained MAP for $\lambda \in \{0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$ for each of the four query types.

That is, there should be one plot for Jelinek-Mercer for each dataset, on which there should be four lines, one for each query type, showing the resulting MAP for each specified value of $\lambda$.

2. [8 pts] For Dirichlet prior, make a plot showing the MAP on our query set for $\mu \in \{100, 500, 1000, 2000, 4000, 8000, 10000\}$ for each of the four query types.

That is, there should be one plot for Dirichlet prior for each dataset, on which there should be four lines, one for each query type, showing the resulting MAP for each specified value of $\mu$.

3. [8 pts] For the two-stage smoothing function, make a plot showing the MAP on our query set for $\lambda = 0.7$ and $\mu \in \{100, 500, 1000, 2000, 4000, 8000, 10000\}$ for each of the four query types.

That is, there should be one plot for the two-stage smoothing method for each dataset, on which there should be four lines, one for each query type, showing the resulting MAP for each specified value of $\mu$.

We recommend using the Python libraries pandas and matplotlib, but you are free to use whatever plotting software you’d like so long as you adhere to the above criteria. If your graphs don’t properly correspond to the CSV data in your repository, you will not receive credit for the graphs.
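As a starting point, here is a sketch using the standard-library csv module and matplotlib. It assumes each CSV has a param column followed by one MAP column per query type; that layout may not match what your jobs actually produce, so adjust the column names accordingly.

```python
import csv
import matplotlib
matplotlib.use("Agg")          # render without a display
import matplotlib.pyplot as plt

def plot_sensitivity(csv_path, title, out_path):
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    params = [float(r["param"]) for r in rows]
    query_types = [col for col in rows[0] if col != "param"]
    # One line per query type, each with its own marker and legend entry.
    for qtype, marker in zip(query_types, ["o", "s", "^", "D"]):
        plt.plot(params, [float(r[qtype]) for r in rows],
                 marker=marker, label=qtype)
    plt.ylim(0, 0.5)           # required y-axis range
    plt.xlabel("smoothing parameter")
    plt.ylabel("MAP")
    plt.title(title)
    plt.legend()
    plt.savefig(out_path)      # a .svg extension produces an SVG file
    plt.close()
```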

Notice: Please ensure you commit your graphs to your repository. If at all possible, please generate an SVG image; this is easy with pandas and matplotlib. If you absolutely cannot generate an SVG, please make sure you generate a high-resolution PNG image.

We’ll be looking for the following files:

• dirichlet-prior-sensitivity-ap88-89.(svg|png)
• jelinek-mercer-sensitivity-ap88-89.(svg|png)
• two-stage-sensitivity-ap88-89.(svg|png)
• dirichlet-prior-sensitivity-ziff1-2.(svg|png)
• jelinek-mercer-sensitivity-ziff1-2.(svg|png)
• two-stage-sensitivity-ziff1-2.(svg|png)

### 6. Evaluate Your Plots [28 pts]

1. [7 pts] Looking at all of the plots, what query types tend to exhibit the most variation across the parameter range? Why might that be?

2. [7 pts] Take a look at your plots for the Jelinek-Mercer and Dirichlet prior smoothing methods a little more closely. What do you notice about the shape of these lines as it relates to the query type? Which query types are similar, and which ones are different?

3. [7 pts] Now look at the two-stage smoothing plots. Are there differences in shape between query types for this method? Provide some brief reasoning as to why (or why not).

4. [7 pts] Suppose you’re implementing a search engine, and that you have no prior knowledge about the types of queries your engine will be receiving (they might be short keyword queries, they might be long verbose queries, or somewhere in between). Based on the experiments you performed here, what smoothing method would you pick, and why? What would you set its default parameters to be?

(Please ignore the speed component in making this decision. The ranker you implemented in this assignment can be just as fast as the built-in Jelinek-Mercer and Dirichlet prior rankers if it is implemented in a compiled language.)

### 7. Submission

Make sure the following are present in your repository when you push your final submission:

1. your write-up as a .md or .pdf file, and
2. helpers.py and two_stage_smoothing.py, updated with your solution code.