Please submit your solutions via Compass.
[10/30 points] Suppose we have in total N unique words in our vocabulary. What is the theoretical minimum value of H(W)? What is the theoretical maximum value of H(W)?
[10/30 points] Suppose we have only 6 words in the vocabulary {w1, w2, w3, w4, w5, w6}. Give two sample articles using this small vocabulary set for which H(W) reaches the minimum value and maximum value, respectively.
[10/30 points] Suppose we have two articles A1 and A2 for which H(W)=0. Suppose we concatenate A1 and A2 to form a longer article A3. What is the maximum value that H(W) can be for article A3? Give an example of A1 and an example of A2 for which A3 would have this maximum H(W).
[10/20 points] What is the value of conditional entropy H(X|X)?
[10/20 points] What is the value of mutual information I(X;Y) if X and Y are independent? Why?
Mutual information can be used to measure the correlation of two words. Suppose we have a collection of N documents. For a word A in the collection, we use p(X_{A}), where X_{A} ∈ {0, 1}, to denote the probability that A occurs in a document (X_{A}=1) or does not (X_{A}=0). If word A appears in N_{A} documents, then p(X_{A}=1) = N_{A}/N and p(X_{A}=0) = (N - N_{A})/N. Similarly, we can define the probability p(X_{B}) for another word B. In addition, we need the joint probability of words A and B: if N_{AB} is the number of documents containing both A and B, then p(X_{A}=1, X_{B}=1) = N_{AB}/N and p(X_{A}=1, X_{B}=0) = (N_{A} - N_{AB})/N.
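As a concrete illustration of the estimate p(X_{A}=1) = N_{A}/N, here is a small sketch (the helper name and the assumption that documents are given as lists of terms are mine, not part of the assignment):

```python
from collections import Counter

def occurrence_probabilities(docs):
    """docs: list of documents, each a list of terms.
    Returns {word: p(X_word = 1)} estimated as document frequency / N."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))       # count each word at most once per document
    return {w: n_w / N for w, n_w in df.items()}
```

p(X_{A}=0) then follows as 1 - p(X_{A}=1).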
A. [10/50 points] Given the values of N_{A}, N_{B}, and N_{AB} for two words A and B in a collection of N documents, can you write down the formulas for the remaining two joint probabilities of A and B, i.e., p(X_{A}=0, X_{B}=1) and p(X_{A}=0, X_{B}=0)?
Next, we will use the CACM test collection to do some real computation. You can download the data from here; highly frequent words and very infrequent words have been removed (the vocabulary of the original data is very large, so we won't use it for this assignment). The CACM collection consists of titles and abstracts from the journal CACM. There are about 3,000 documents in the collection. The data set has been processed into lines: each line contains one document, and the terms of each document are separated by blank space.
B. [20/50 points] Using any programming language you like (you may find it relatively easy to use perl or python), for each pair of words in the collection, calculate the number of documents that contain both words. Then rank all the word pairs by their co-occurrence document counts and print the largest 10 counts (one count per line). (Hint: you may consider using a hash table to store the document count for each word pair.)
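One way to implement the hinted hash-based counting is sketched below in Python; it assumes documents are already tokenized into lists of terms, and the function name is illustrative:

```python
from collections import Counter
from itertools import combinations

def top_cooccurrence_counts(docs, k=10):
    """docs: list of documents, each a list of terms.
    Returns the k largest co-occurrence document counts, descending."""
    pair_counts = Counter()                  # hash table keyed by word pair
    for doc in docs:
        terms = sorted(set(doc))             # unique terms, canonical order
        for a, b in combinations(terms, 2):
            pair_counts[(a, b)] += 1         # one count per document
    return [count for _, count in pair_counts.most_common(k)]
```

Sorting the unique terms before forming pairs ensures (a, b) and (b, a) map to the same key, so each unordered pair gets a single counter.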
C. [20/50 points] Calculate the mutual information of all possible word pairs in the collection. Rank the word pairs by their mutual information and print the results. How are the top 10 pairs with the highest mutual information different from the top 10 pairs you got from problem B (i.e., the 10 pairs with the highest co-occurrence counts)? Also write down the top 5 words that have the highest mutual information with the word "programming" in the collection. Do you think your results are reasonable? (Hint: in practice, we need to do some smoothing in our formulas to avoid the log 0 problem. For joint probability estimation, we assume that each of the four cases (corresponding to the four combinations of values of X_{A} and X_{B}) gets a 0.25 pseudo-count, so in total we introduce 0.25*4 = 1 pseudo-count. We can then compute the marginal probabilities from the joint probabilities, i.e., p(X_{A}=1) = p(X_{A}=1, X_{B}=0) + p(X_{A}=1, X_{B}=1). For example, p(X_{A}=1, X_{B}=1) = (N_{AB} + 0.25)/(N + 1) and p(X_{A}=1) = (N_{A} + 0.5)/(N + 1). Please use these smoothing formulas in your code.)
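The smoothing scheme in the hint can be turned into a per-pair computation like the sketch below (the function name is mine; base-2 logarithms are assumed, since any fixed base only rescales the ranking):

```python
import math

def mutual_information(N, N_A, N_B, N_AB):
    """Smoothed I(X_A; X_B): each of the four joint cells gets a
    0.25 pseudo-count, adding 1 pseudo-count in total."""
    p11 = (N_AB + 0.25) / (N + 1)
    p10 = (N_A - N_AB + 0.25) / (N + 1)
    p01 = (N_B - N_AB + 0.25) / (N + 1)
    p00 = (N - N_A - N_B + N_AB + 0.25) / (N + 1)
    # marginals computed from the joint, as the hint prescribes
    pa1, pa0 = p11 + p10, p01 + p00
    pb1, pb0 = p11 + p01, p10 + p00
    mi = 0.0
    for pxy, px, py in [(p11, pa1, pb1), (p10, pa1, pb0),
                        (p01, pa0, pb1), (p00, pa0, pb0)]:
        mi += pxy * math.log2(pxy / (px * py))
    return mi
```

Because every cell carries a positive pseudo-count, no argument of the logarithm can be zero, which is exactly the log 0 problem the hint addresses.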
Please put your solutions to all the problems together with your code and a "readme.txt" file for the last problem into a single zip file and upload it to Compass.