## CS598CXZ Assignment #5: Information Retrieval Systems and Experiments (extended deadline: 11:59pm, Thursday, Nov. 14, 2013; submit your assignment via Wiki)

### [100 points] MapReduce programming and word association discovery

The purpose of the second part of this assignment is (1) to give you hands-on experience with MapReduce programming on a Hadoop cluster, and (2) to generate word associations that are potentially useful for improving retrieval accuracy and for other applications. Through this assignment, you will become familiar with the Hadoop File System (HDFS) and learn how to use MapReduce for simple statistical text processing tasks.

Your task for this assignment is to calculate the mutual information of words. Mutual information is commonly used to measure the correlation of two distributions or variables; in text information management, it is also used to measure the correlation of two words. Suppose we have a collection of N documents. For a word A in the collection, we use p(XA), where XA ∈ {0, 1}, to represent the probability that A occurs (XA=1) or does not occur (XA=0) in a document. If word A appears in NA documents, then p(XA=1) = NA/N and p(XA=0) = (N - NA)/N. Similarly, we can define the probability p(XB) for another word B. The mutual information of A and B is then

I(XA; XB) = Σ over XA ∈ {0,1} and XB ∈ {0,1} of p(XA, XB) log [ p(XA, XB) / (p(XA) p(XB)) ]

which requires the joint probability of words A and B, defined as follows:

1. p(XA=1, XB=1): the probability that words A and B co-occur in a document. If NAB documents in the collection contain both A and B, then p(XA=1, XB=1) = NAB/N.
2. p(XA=1, XB=0): the probability that word A occurs in a document but word B does not. It can be calculated as p(XA=1, XB=0) = (NA - NAB)/N.
3. p(XA=0, XB=1): the probability that word B occurs in a document but word A does not, i.e., p(XA=0, XB=1) = (NB - NAB)/N.
4. p(XA=0, XB=0): the probability that neither word occurs, i.e., p(XA=0, XB=0) = (N - NA - NB + NAB)/N.
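To make these estimates concrete, the counts N, NA, NB, and NAB can be tallied from the set of distinct words in each document. The following is a minimal unsmoothed sketch in Python; the toy documents and the helper name `joint_probabilities` are our own illustration, not part of the assignment:

```python
# Sketch: estimate the (unsmoothed) joint probabilities of two words
# from document-level occurrence counts. Toy data for illustration only.

def joint_probabilities(docs, a, b):
    """Return p(XA=1,XB=1), p(XA=1,XB=0), p(XA=0,XB=1), p(XA=0,XB=0)."""
    n = len(docs)
    sets = [set(d.split()) for d in docs]        # distinct words per doc
    n_a = sum(1 for s in sets if a in s)         # NA: docs containing A
    n_b = sum(1 for s in sets if b in s)         # NB: docs containing B
    n_ab = sum(1 for s in sets if a in s and b in s)  # NAB: docs with both
    return (n_ab / n,
            (n_a - n_ab) / n,
            (n_b - n_ab) / n,
            (n - n_a - n_b + n_ab) / n)

docs = ["the cat sat", "the dog sat", "a cat and a dog", "the end"]
p11, p10, p01, p00 = joint_probabilities(docs, "cat", "dog")
# The four cases partition the collection, so they always sum to 1.
```

Note that each word is counted at most once per document: it is document-level presence, not term frequency, that defines these probabilities.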

We also need to apply smoothing in our probability estimation formulas in order to avoid the log 0 problem. For joint probability estimation, we assume that each of the four cases (corresponding to the four combinations of values of XA and XB) gets a 0.25 pseudo count, so in total we introduce 0.25 * 4 = 1 pseudo count. We can then compute the marginal probabilities from the joint probabilities, i.e., p(XA=1) = p(XA=1, XB=0) + p(XA=1, XB=1). For example, p(XA=1, XB=1) = (NAB + 0.25)/(N + 1) and p(XA=1) = (NA + 0.5)/(N + 1). Please use these smoothing formulas in your code.

In general, the computation of mutual information is expensive, especially on a large collection of text documents, as it involves enumerating all word pairs and collecting their co-occurrence counts. It is thus desirable to use MapReduce to solve the problem. In this problem, we ask you to use the Cloud-Computing Testbed (CCT) of the Department of Computer Science at UIUC to write a MapReduce program that computes pairwise mutual information for every pair of words in an Associated Press (AP) news data set, which we have already placed on the CCT. Run your code on the CCT to generate word associations for all words that occur at least 3 times in the whole AP data set but have a document frequency lower than 30% (i.e., each word occurs in fewer than 30% of all the documents). (By now, you should all have an account on the CCT.) Submit all your code as one single zipped file and attach it to your assignment wiki page.
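Once N, NA, NB, and NAB are known for a word pair, the smoothed mutual information is a direct translation of the formulas above. The following Python sketch is for illustration; the function name and example counts are ours, and we use the natural log since the assignment does not fix a base:

```python
import math

def mutual_information(n, n_a, n_b, n_ab):
    """Smoothed mutual information of words A and B given document counts:
    n documents in total, n_a contain A, n_b contain B, n_ab contain both.
    Each of the four (XA, XB) cases gets a 0.25 pseudo count."""
    # Smoothed joint probabilities: add 0.25 to each cell, 1 to N.
    p11 = (n_ab + 0.25) / (n + 1)
    p10 = (n_a - n_ab + 0.25) / (n + 1)
    p01 = (n_b - n_ab + 0.25) / (n + 1)
    p00 = (n - n_a - n_b + n_ab + 0.25) / (n + 1)
    # Smoothed marginals, consistent with the joints (each sums two cells).
    pa1, pa0 = (n_a + 0.5) / (n + 1), (n - n_a + 0.5) / (n + 1)
    pb1, pb0 = (n_b + 0.5) / (n + 1), (n - n_b + 0.5) / (n + 1)
    return (p11 * math.log(p11 / (pa1 * pb1))
          + p10 * math.log(p10 / (pa1 * pb0))
          + p01 * math.log(p01 / (pa0 * pb1))
          + p00 * math.log(p00 / (pa0 * pb0)))
```

Because the smoothed cells form a valid joint distribution, the result is symmetric in A and B and never negative; independence (e.g., each word in half the documents, co-occurring in a quarter of them) gives a mutual information of 0.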

The AP data set is available in the following Hadoop File System (HDFS) file:

```
/home/czhai/598f13/apsrc.txt
```
Note that this is an HDFS path, not a regular Linux file path. Thus in order to see its content, you must use a "hadoop dfs" command:
```
hadoop dfs -cat /home/czhai/598f13/apsrc.txt | more
```
In general, you use "hadoop dfs" followed by commands that are very similar to the corresponding Linux commands. You may store the generated word associations in any HDFS file; please report the path to this file in your written answer to this problem.
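Putting the pieces together, the counting stage of the job can be sketched in a Hadoop Streaming style, with a mapper that emits per-document word and word-pair counts and a reducer that sums them. This is only one possible design, written here as plain Python functions so the logic is testable; it assumes one document per input line, and a real streaming job would read stdin and write tab-separated lines to stdout:

```python
# Sketch of the counting stage in Hadoop Streaming style.
# Assumes one document per input line (the actual AP input format
# on the CCT may differ).
from itertools import combinations

def map_document(line):
    """Emit ('word', 1) for each distinct word and ('w1\tw2', 1) for each
    distinct unordered word pair, so NA and NAB can both be summed later.
    Sorting the words makes the pair key order canonical."""
    words = sorted(set(line.split()))
    for w in words:
        yield (w, 1)
    for w1, w2 in combinations(words, 2):
        yield (w1 + "\t" + w2, 1)

def reduce_counts(key, values):
    """Sum the 1s for a key, yielding (word, NA) or (pair, NAB)."""
    yield (key, sum(values))
```

A second pass (or a final step outside MapReduce) can then apply the document-frequency filters (at least 3 occurrences, document frequency below 30%) and plug N, NA, NB, and NAB into the smoothed mutual information formula.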

If you aren't familiar with Hadoop, you will need to read more about it; please refer to this tutorial for details. Also, the CCT machine name is "altocumulus.cloud.cs.illinois.edu".

### What to turn in

```
[assignment 5 MapReduce|^your-code-file.zip]
```