## CS598CXZ Assignment #3: Basic Concepts in Information Theory (due 11:59pm, Sunday, Sept. 21, 2014)

1. Entropy [30 points]: Consider the random experiment of picking a word from an English text article. Let W be a random variable denoting the word we obtain from the article. Thus W can take any value from the set of words in our vocabulary V={w1, ..., wN}, where each wi is a unique word in the vocabulary, and we have a probability distribution over all the words, denoted {p(W=wi)}, where p(W=wi) is the probability of obtaining word wi. We can then compute the entropy of this variable, i.e., H(W).
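To make the definition concrete, here is a minimal sketch (not part of the assignment) that estimates H(W) from an article's word counts; the sample article is made up for illustration:

```python
import math
from collections import Counter

def entropy(words):
    """Empirical entropy H(W) in bits, estimated from word frequencies."""
    counts = Counter(words)
    total = sum(counts.values())
    # H(W) = -sum over w of p(w) * log2 p(w)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

article = "the cat sat on the mat the cat".split()
print(entropy(article))
```

A uniform distribution over N distinct words gives the largest entropy (log2 N bits), which is worth keeping in mind for the questions below.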

[10/30 points] Suppose we have in total N unique words in our vocabulary. What is the theoretical minimum value of H(W)? What is the theoretical maximum value of H(W)?

[10/30 points] Suppose we have only 6 words in the vocabulary {w1, w2, w3, w4, w5, w6}. Give two sample articles using this small vocabulary set for which H(W) reaches the minimum value and maximum value, respectively.

[10/30 points] Suppose we have two articles A1 and A2 for which H(W)=0. Suppose we concatenate A1 and A2 to form a longer article A3. What is the maximum value that H(W) can take for article A3? Give an example of A1 and an example of A2 for which A3 would have this maximum H(W).

2. Conditional entropy and mutual information [20 points]

[10/20 points] What is the value of conditional entropy H(X|X)?

[10/20 points] What is the value of mutual information I(X;Y) if X and Y are independent? Why?

3. Calculate mutual information of words [50 points]

Mutual information can be used to measure the correlation of two words. Suppose we have a collection of N documents. For a word A in the collection, we use p(XA), where XA ∈ {0, 1}, to represent the probability that A occurs (XA=1) or does not occur (XA=0) in a document. If word A appears in NA documents, then p(XA=1) = NA/N and p(XA=0) = (N - NA)/N. Similarly, we can define the probability p(XB) for another word B. We also need to define the joint probabilities of words A and B as follows:

1. p(XA=1, XB=1): the probability that word A and word B co-occur in a document. If there are NAB documents containing both words A and B in the collection, then p(XA=1, XB=1) = NAB / N.
2. p(XA=1, XB=0): the probability that word A occurs in a document but B does not. It can be calculated as p(XA=1, XB=0) = (NA - NAB) / N.
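The two probabilities defined above follow directly from the document counts; a minimal sketch (the example numbers are made up, and the two probabilities asked for in part A are deliberately left out):

```python
def joint_probs_given(N, N_A, N_AB):
    """The two joint probabilities defined above (no smoothing).

    N    -- total number of documents
    N_A  -- number of documents containing word A
    N_AB -- number of documents containing both A and B
    """
    p_11 = N_AB / N           # p(XA=1, XB=1): A and B co-occur
    p_10 = (N_A - N_AB) / N   # p(XA=1, XB=0): A occurs, B does not
    return p_11, p_10

# e.g. N=100 documents, A appears in 30 of them, both A and B in 10
print(joint_probs_given(100, 30, 10))  # (0.1, 0.2)
```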

A. [10/50 points] Given the values of NA, NB, and NAB for two words A and B in a collection of N documents, can you write down the formulas for the remaining two joint probabilities of A and B, i.e., p(XA=0, XB=1) and p(XA=0, XB=0)?

Next, we will use the CACM test collection to do some real computation. You can download the data from here; very frequent words and very infrequent words have already been removed (the vocabulary size of the original data is very large, so we won't use it for this assignment). The CACM collection is a collection of titles and abstracts from the journal CACM. There are about 3,000 documents in the collection. The data set has been processed into lines: each line contains one document, and the terms of each document are separated by blank spaces.

B. [20/50 points] Using any programming language you like (you may find this relatively easy in Perl or Python), calculate, for each pair of words in the collection, the number of documents that contain both words. Then rank all word pairs by their co-occurrence document counts, and print the 10 largest counts (one count per line). (Hint: you may consider using a hash to store the document count for each word pair.)
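The hint about a hash can be sketched as follows, using a Python dict-based `Counter` keyed by word pairs (the three-document corpus is a stand-in for the CACM file, one document per line):

```python
from collections import Counter
from itertools import combinations

def pair_doc_counts(docs):
    """For each unordered word pair, count the documents containing both words."""
    counts = Counter()
    for line in docs:
        words = sorted(set(line.split()))    # unique words of this document
        for a, b in combinations(words, 2):  # each pair counted at most once per doc
            counts[(a, b)] += 1
    return counts

docs = ["x y z", "x y", "y z x"]             # stand-in for the collection's lines
counts = pair_doc_counts(docs)
for pair, c in counts.most_common(10):       # the 10 largest counts
    print(c)
```

This enumerates all pairs within each document, which is quadratic in document length but fine for short title/abstract documents like CACM's.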

C. [20/50 points] Calculate the mutual information of all possible word pairs in the collection. Rank the word pairs by their mutual information and print the results. How do the top 10 pairs with the highest mutual information differ from the top 10 pairs you got in problem B (i.e., the 10 pairs with the highest co-occurrence counts)? Please write down the top 5 words that have the highest mutual information with the word "programming" in the collection. Do you think your results are reasonable? (Hint: In practice, we need to do some smoothing in our formulas in order to avoid the log 0 problem. For joint probability estimation, we assume that each of the four cases (corresponding to the four combinations of values of XA and XB) gets a 0.25 pseudo count, so in total we introduce 0.25 * 4 = 1 pseudo count. We can then compute the marginal probabilities from the joint probabilities, i.e., p(XA=1) = p(XA=1, XB=0) + p(XA=1, XB=1). For example, p(XA=1, XB=1) = (NAB + 0.25) / (N + 1) and p(XA=1) = (NA + 0.5) / (N + 1). Please use these smoothing formulas in your code.)
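The smoothing scheme in the hint can be sketched for a single word pair as follows (the function name and example counts are illustrative, not part of the assignment):

```python
import math

def mutual_information(N, N_A, N_B, N_AB):
    """I(XA; XB) with the 0.25-pseudo-count smoothing from the hint."""
    # Smoothed joint probabilities: 0.25 pseudo count per cell, 1 in total.
    p = {
        (1, 1): (N_AB + 0.25) / (N + 1),
        (1, 0): (N_A - N_AB + 0.25) / (N + 1),
        (0, 1): (N_B - N_AB + 0.25) / (N + 1),
        (0, 0): (N - N_A - N_B + N_AB + 0.25) / (N + 1),
    }
    # Marginals computed from the smoothed joints, e.g. p(XA=1) = (N_A + 0.5)/(N + 1).
    pa = {1: p[(1, 1)] + p[(1, 0)], 0: p[(0, 1)] + p[(0, 0)]}
    pb = {1: p[(1, 1)] + p[(0, 1)], 0: p[(1, 0)] + p[(0, 0)]}
    # I(XA; XB) = sum over all four cells of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(p[(x, y)] * math.log2(p[(x, y)] / (pa[x] * pb[y]))
               for x in (0, 1) for y in (0, 1))

# e.g. N=100 documents, A in 50, B in 50, both in 25 (independent-looking counts)
print(mutual_information(100, 50, 50, 25))
```

With the pseudo counts, every cell probability is strictly positive, so log 0 never occurs; independent-looking counts give mutual information near 0, while perfectly correlated counts approach 1 bit.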

Please put your solutions to all the problems together with your code and a "readme.txt" file for the last problem into a single zip file and upload it to Compass.