Assignment #2: Basic Concepts in Information Theory
Notice: This assignment is due Saturday, September 10th at 11:59pm.
Please submit your solutions via Compass. You should submit your assignment as a PDF, and your accompanying code as either a .zip or .tar.gz containing your files.
For example, since my NetID is geigle1, I would submit the following files:
1. Entropy [30 pts]
Consider the random experiment of picking a word from an English text article. Let $X$ be a random variable denoting the word that we might obtain from the article. Thus $X$ can take any value from the set of words in our vocabulary $V = \{w_1, \dots, w_N\}$, where $w_i$ is a unique word in the vocabulary, and we have a probability distribution over all the words, which we can denote as $p(w_i)$, where $p(w_i)$ is the probability that we would obtain word $w_i$. Now we can compute the entropy of such a variable, i.e., $H(X) = -\sum_{i=1}^{N} p(w_i) \log_2 p(w_i)$.
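As a quick illustration of the definition above, here is a minimal Python sketch that computes $H(X)$ from the empirical word distribution of an article (the function name `entropy` and the tokenized-list input are our own choices, not part of the assignment):

```python
import math
from collections import Counter

def entropy(words):
    """Compute H(X) in bits for the empirical word distribution of an article,
    given the article as a list of word tokens."""
    counts = Counter(words)
    total = sum(counts.values())
    # H(X) = -sum_i p(w_i) * log2 p(w_i)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Four equally likely words give the uniform distribution over 4 outcomes,
# whose entropy is log2(4) = 2 bits.
print(entropy(["a", "b", "c", "d"]))  # 2.0
```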
[10 pts] Suppose we have in total $N$ unique words in our vocabulary. What is the theoretical minimum value of $H(X)$? What is the theoretical maximum value of $H(X)$?
[10 pts] Suppose we have only 6 words in the vocabulary $V = \{w_1, w_2, \dots, w_6\}$. Give two sample articles using this small vocabulary set for which $H(X)$ reaches the minimum value and the maximum value, respectively.
[10 pts] Suppose we have two articles $A$ and $B$ for which $H_A(X) = H_B(X)$. Suppose we concatenate $A$ and $B$ to form a longer article $C$. What is the maximum value $H_C(X)$ can be for article $C$? Give an example of $A$ and an example of $B$ for which $H_C(X)$ would have the maximum value.
2. Conditional Entropy and Mutual Information [20 pts]
[10 pts] What is the value of the conditional entropy $H(X \mid X)$?
[10 pts] What is the value of the mutual information $I(X; Y)$ if $X$ and $Y$ are independent? Why?
3. Mutual Information of Words (Programming Exercise) [50 pts]
Warning: This part of the assignment requires programming, so please make sure you start early on it!
You should, in addition to your PDF submission, submit the code you write. You should have a README in the root directory of your archive file that explains how to run your code for both problems.
Mutual information can be used to measure the correlation of two words. Suppose we have a collection of $N$ documents. For a word $w$ in the collection, we use $p(X_w)$, where $X_w \in \{0, 1\}$, to represent the probability that $w$ occurs ($X_w = 1$) in one document or not ($X_w = 0$). If word $w$ appears in $n_w$ documents, then $p(X_w = 1) = \frac{n_w}{N}$ and $p(X_w = 0) = 1 - \frac{n_w}{N}$. Similarly, we can define the probability $p(X_v)$ for another word $v$. We also define the joint probability of words $w$ and $v$ as follows:
$p(X_w = 1, X_v = 1)$: the probability of word $w$ and word $v$ co-occurring in one document. If there are $n_{wv}$ documents containing both word $w$ and word $v$ in the collection, then $p(X_w = 1, X_v = 1) = \frac{n_{wv}}{N}$.
$p(X_w = 1, X_v = 0)$: the probability that word $w$ occurs in one document but word $v$ does not occur in that document. It can be calculated as $p(X_w = 1, X_v = 0) = p(X_w = 1) - p(X_w = 1, X_v = 1)$.
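The estimates defined so far can be sketched directly from the document counts. The sketch below (function name and argument names are ours) computes only the quantities defined above; the remaining two joint probabilities are left for the first question:

```python
def joint_probabilities(n_w, n_v, n_wv, N):
    """Unsmoothed probability estimates for two words w and v, given:
    n_w, n_v -- number of documents containing w (resp. v)
    n_wv     -- number of documents containing both w and v
    N        -- total number of documents in the collection
    """
    p_w1 = n_w / N      # p(X_w = 1)
    p_v1 = n_v / N      # p(X_v = 1)
    p_11 = n_wv / N     # p(X_w = 1, X_v = 1)
    p_10 = p_w1 - p_11  # p(X_w = 1, X_v = 0)
    return p_w1, p_v1, p_11, p_10
```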
[10 pts] Given the values of $p(X_w = 1)$, $p(X_v = 1)$, and $p(X_w = 1, X_v = 1)$ for two words $w$ and $v$ in a collection of $N$ documents, can you write down the formulas for the remaining two joint probabilities of $w$ and $v$, i.e. $p(X_w = 0, X_v = 1)$ and $p(X_w = 0, X_v = 0)$?
[20 pts] Next, we will use the CACM test collection to do some real computation. You can download the data here; very frequent and very infrequent words have been removed (the vocabulary size of the original data is very large, so we won't use it for this assignment). The CACM collection is a collection of titles and abstracts from the journal CACM. There are about 3,000 documents in the collection. The data set has been processed so that each line contains one document, with the terms of each document separated by blank space.
Using any programming language you like, for each pair of words in the collection, calculate the number of documents that contain both of the two words. Then, rank all the word pairs by their co-occurrence document counts ($n_{wv}$). Print the 10 largest counts (one pair per line) in the form
word1 word2 count
word3 word4 count
...
(Hint: you may consider using a hash table to store the document counts for each word pair)
[20 pts] Now, calculate the mutual information of all the possible word pairs in the collection. Rank the word pairs by their mutual information.
Info: In practice, we need to do some smoothing in our formulas in order to avoid the zero-probability problem. For joint probability estimation, we'll assume that each of the four cases (corresponding to the four different combinations of values of $X_w$ and $X_v$) gets 0.25 "pseudo-counts". Thus, in total, we introduce 1 "pseudo-count". We can then compute marginal probabilities based on the joint probability, i.e. $p(X_w = 1) = p(X_w = 1, X_v = 1) + p(X_w = 1, X_v = 0)$. For example, $p(X_w = 1, X_v = 1) = \frac{n_{wv} + 0.25}{N + 1}$ and $p(X_w = 1) = \frac{n_w + 0.5}{N + 1}$.
Please use these smoothing formulas in your code.
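Putting the smoothing and the mutual information formula together, a minimal sketch (assuming the counts $n_w$, $n_v$, $n_{wv}$, and $N$ as defined above; the function name is ours) could look like:

```python
import math

def mutual_information(n_w, n_v, n_wv, N):
    """Mutual information I(X_w; X_v) in bits, with 0.25 pseudo-counts
    added to each of the four joint cells (1 pseudo-count in total).
    n_w, n_v: document frequencies of w and v; n_wv: co-occurrence count;
    N: total number of documents."""
    # Smoothed joint probabilities.
    p11 = (n_wv + 0.25) / (N + 1)              # X_w = 1, X_v = 1
    p10 = (n_w - n_wv + 0.25) / (N + 1)        # X_w = 1, X_v = 0
    p01 = (n_v - n_wv + 0.25) / (N + 1)        # X_w = 0, X_v = 1
    p00 = (N - n_w - n_v + n_wv + 0.25) / (N + 1)  # X_w = 0, X_v = 0
    # Marginals computed from the joint, as the formulas above require.
    pw1, pw0 = p11 + p10, p01 + p00
    pv1, pv0 = p11 + p01, p10 + p00
    # I(X_w; X_v) = sum over the four cells of p * log2(p / (p_marg_w * p_marg_v))
    return sum(p * math.log2(p / (pw * pv))
               for p, pw, pv in [(p11, pw1, pv1), (p10, pw1, pv0),
                                 (p01, pw0, pv1), (p00, pw0, pv0)])
```

A pair of words that always co-occur should score higher than a pair with the same document frequencies that rarely co-occurs.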
How are the top 10 pairs with the highest mutual information different from the top 10 pairs based on co-occurrence counts (from part b)?
Write down the top 5 words which have the highest mutual information with the word “programming” in the collection. Do you think your results are reasonable?