CS 598cxz Fall 2016

Assignment #2: Basic Concepts in Information Theory

Notice: This assignment is due Saturday, September 10th at 11:59pm.

Please submit your solutions via Compass. You should submit your assignment as a PDF, and your accompanying code as either a .zip or .tar.gz archive containing your files.

For example, since my NetID is geigle1, I would submit the following files:

  • geigle1-assignment2.pdf
  • geigle1-assignment2.tar.gz or geigle1-assignment2.zip

1. Entropy [30 pts]

Consider the random experiment of picking a word from an English text article. Let W be a random variable denoting the word that we might obtain from the article. Thus W can take any value from the set of words in our vocabulary V = {w_1, ..., w_N}, where w_i is a unique word in the vocabulary, and we have a probability distribution over all the words, which we can denote as {p(W = w_i)}, where p(W = w_i) is the probability that we would obtain word w_i. Now we can compute the entropy of such a variable, i.e., H(W).
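
For reference, H(W) here is the standard Shannon entropy of the distribution {p(W = w_i)}. In LaTeX notation, assuming log base 2 so that entropy is measured in bits (the choice of base does not affect where the minimum and maximum are attained):

    H(W) = -\sum_{i=1}^{N} p(W = w_i) \log_2 p(W = w_i)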

  1. [10 pts] Suppose we have in total N unique words in our vocabulary. What is the theoretical minimum value of H(W)? What is the theoretical maximum value of H(W)?

  2. [10 pts] Suppose we have only 6 words in the vocabulary {w_1, w_2, w_3, w_4, w_5, w_6}. Give two sample articles using this small vocabulary set for which H(W) reaches the minimum value and the maximum value, respectively.

  3. [10 pts] Suppose we have two articles A_1 and A_2 for which H(W) = 0. Suppose we concatenate A_1 and A_2 to form a longer article A_3. What is the maximum value that H(W) can take for article A_3? Give an example of A_1 and an example of A_2 for which A_3 would have this maximum H(W).

2. Conditional Entropy and Mutual Information [20 pts]

  1. [10 pts] What is the value of the conditional entropy H(X | X)?

  2. [10 pts] What is the value of the mutual information I(X; Y) if X and Y are independent? Why?

3. Mutual Information of Words (Programming Exercise) [50 pts]

Warning: This part of the assignment requires programming, so please make sure you start early on it!

In addition to your PDF submission, you should submit the code you write. Include a README in the root directory of your archive that explains how to run your code for both problems.

Mutual information can be used to measure the correlation of two words. Suppose we have a collection of N documents. For a word A in the collection, we use p(X_A), where X_A ∈ {0, 1}, to represent the probability that A occurs (X_A = 1) in a document or not (X_A = 0). If word A appears in N_A documents, then p(X_A = 1) = N_A / N and p(X_A = 0) = (N - N_A) / N. Similarly, we can define the probability p(X_B) for another word B. We also define the joint probability of words A and B as follows: if A and B co-occur in N_AB documents, then p(X_A = 1, X_B = 1) = N_AB / N and p(X_A = 1, X_B = 0) = (N_A - N_AB) / N.
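
With these marginal and joint probabilities in hand, the mutual information between the occurrence variables X_A and X_B (this is the quantity written I(A; B) in part 3 below) is the usual definition, written here in LaTeX notation for reference:

    I(X_A; X_B) = \sum_{x \in \{0,1\}} \sum_{y \in \{0,1\}} p(X_A = x, X_B = y) \log \frac{p(X_A = x, X_B = y)}{p(X_A = x) \, p(X_B = y)}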

  1. [10 pts] Given the values of N_A, N_B, and N_AB for two words A and B in a collection of N documents, can you write down the formulas for the remaining two joint probabilities of A and B, i.e., p(X_A = 0, X_B = 1) and p(X_A = 0, X_B = 0)?

  2. [20 pts] Next, we will use the CACM test collection to do some real computation. You can download the data here; highly frequent words and very infrequent words have already been removed from it (the vocabulary size of the original data is very large, so we won't use the original collection for this assignment). The CACM collection is a collection of titles and abstracts from the journal CACM. There are about 3,000 documents in the collection. The data set has been processed so that each line contains one document, with the terms of each document separated by spaces.

    Using any programming language you like, calculate, for each pair of words in the collection, the number of documents that contain both words. Then, rank all the word pairs by their co-occurrence document counts (N_AB). Print the 10 largest counts (one pair and its count per line) in the form

    word1 word2 count
    word3 word4 count
    ...
    

    (Hint: you may consider using a hash table to store the document counts for each word pair; one possible way to do this is sketched below.)
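
    As a concrete illustration of the hint, here is a minimal Python sketch of the hash-table approach. The file name cacm.txt is a placeholder for wherever you saved the collection, and it is read in the one-document-per-line format described above; this is only one possible way to organize the computation, not a required one.

     from collections import Counter
     from itertools import combinations

     # Count, for every unordered pair of distinct words, the number of
     # documents that contain both words. "cacm.txt" is a placeholder path:
     # one document per line, terms separated by spaces.
     pair_counts = Counter()
     with open("cacm.txt") as f:
         for line in f:
             words = sorted(set(line.split()))    # unique words in this document
             for a, b in combinations(words, 2):  # each unordered pair once
                 pair_counts[(a, b)] += 1

     # Print the 10 pairs with the largest co-occurrence document counts.
     for (a, b), count in pair_counts.most_common(10):
         print(a, b, count)

    Sorting the unique words of each document first means that each unordered pair is always stored under the same key in the hash table.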

  3. [20 pts] Now, calculate the mutual information I(A; B) of all the possible word pairs in the collection. Rank the word pairs by their mutual information.

    Info: In practice, we need to do some smoothing in our formulas in order to avoid the log 0 problem. For joint probability estimation, we’ll assume that each of the four cases (corresponding to four different combinations of values of X_A and X_B) gets 0.25 “pseudo-counts”. Thus, in total, we introduced 0.25 · 4 = 1 “pseudo-count”. We can then compute marginal probabilities based on the joint probability, i.e., p(X_A = 1) = p(X_A = 1, X_B = 0) + p(X_A = 1, X_B = 1). For example, p(X_A = 1, X_B = 1) = (N_AB + 0.25) / (N + 1) and p(X_A = 1) = (N_A + 0.5) / (N + 1).

    Please use these smoothing formulas in your code; a sketch of the smoothed computation appears after the questions below.

    1. How are the top 10 pairs with the highest mutual information different from the top 10 pairs based on co-occurrence counts (from the previous part)?

    2. Write down the top 5 words which have the highest mutual information with the word “programming” in the collection. Do you think your results are reasonable?
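
    To make the smoothed computation in the Info box concrete, here is a minimal Python sketch of the mutual information calculation. It assumes you have already worked out the four smoothed joint probabilities for a word pair (p(X_A = 1, X_B = 1) is given in the Info box; the other three cells follow from your answer to question 1 of this problem, with 0.25 added to each count and N + 1 in the denominator). The function name and the dictionary layout are illustrative choices, not required ones.

     import math

     def mutual_information(joint):
         """I(X_A; X_B) in bits, given the four smoothed joint probabilities
         joint[(x, y)] = p(X_A = x, X_B = y) for x, y in {0, 1}.
         Because the joints are smoothed, none of them is 0, so log is safe."""
         # Marginals are sums of the joints, as in the Info box above.
         p_a = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
         p_b = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
         return sum(joint[(x, y)] * math.log2(joint[(x, y)] / (p_a[x] * p_b[y]))
                    for x in (0, 1) for y in (0, 1))

    Once you can build the four smoothed joints for a pair from N_A, N_B, N_AB, and N, ranking all word pairs is just a matter of sorting them by the value this function returns.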