Assignment #2: Basic Concepts in Information Theory
Notice: This assignment is due Thursday, September 14th at 11:59pm.
Please submit your solutions via Compass. You should submit your assignment as a typeset PDF. Please do not include scanned or photographed equations as they are difficult for us to grade.
1. Entropy [30 pts]
Consider the random experiment of picking a word from an English text article. Let $X$ be a random variable denoting the word that we might obtain from the article. Thus $X$ can take any value from the set of words in our vocabulary $V = \{w_1, w_2, \ldots\}$, where $w_i$ is a unique word in the vocabulary, and we have a probability distribution over all the words, which we can denote as $\{p(X = w_i)\}$, where $p(X = w_i)$ is the probability that we would obtain word $w_i$. Now we can compute the entropy of such a variable, i.e., $H(X) = -\sum_{w_i \in V} p(X = w_i) \log_2 p(X = w_i)$.
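As an aside (not graded), here is a minimal Python sketch of this computation: it estimates each $p(X = w_i)$ by the empirical frequency of $w_i$ in the article and assumes simple whitespace tokenization.

```python
from collections import Counter
from math import log2

def entropy(article: str) -> float:
    """Estimate H(X) for an article: p(X = w) is the empirical frequency
    of word w, and H(X) = -sum over w of p(X = w) * log2(p(X = w))."""
    words = article.lower().split()  # simplifying assumption: whitespace tokens
    total = len(words)
    counts = Counter(words)
    return -sum((c / total) * log2(c / total) for c in counts.values())

print(entropy("the cat saw the dog"))      # mixed words -> positive entropy
print(entropy("buffalo buffalo buffalo"))  # one repeated word -> zero entropy
```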
[10 pts] Suppose we have a total of $N$ unique words in our vocabulary. What is the theoretical minimum value of $H(X)$? What is the theoretical maximum value of $H(X)$?
[10 pts] Suppose we have only 6 words in the vocabulary, $V = \{w_1, w_2, w_3, w_4, w_5, w_6\}$. Give two sample articles using this small vocabulary set for which $H(X)$ reaches the minimum value and the maximum value, respectively.
[10 pts] Suppose we have two articles $A$ and $B$ for which $H_A(X) = H_B(X)$. Suppose we concatenate $A$ and $B$ to form a longer article $C$. What is the maximum value that $H_C(X)$ can take for article $C$? Give an example of $A$ and an example of $B$ for which $H_C(X)$ attains this maximum.
2. Conditional Entropy and Mutual Information [20 pts]
[10 pts] What is the value of the conditional entropy $H(X \mid X)$?
[10 pts] What is the value of the mutual information $I(X; Y)$ if $X$ and $Y$ are independent? Why?
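Not required for the assignment, but the sketch below is a quick numeric sanity check: it builds a joint distribution for two binary variables that factorizes as $p(x, y) = p(x)\,p(y)$ (the marginals are made up) and evaluates $I(X;Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)p(y)}$.

```python
from math import log2

# Made-up marginals for two binary random variables X and Y.
p_x = {0: 0.3, 1: 0.7}
p_y = {0: 0.6, 1: 0.4}

# Under independence the joint factorizes: p(x, y) = p(x) * p(y).
p_xy = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

# I(X;Y) = sum over (x, y) of p(x,y) * log2(p(x,y) / (p(x) * p(y)))
mi = sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())
print(mi)  # 0.0 -- every log term is log2(1) = 0
```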
3. Mutual Information of Words [30 pts]
Mutual information can be used to measure the correlation of two words. Suppose we have a collection of $N$ documents. For a word $w$ in the collection, we use $p(X_w = x)$, where $x \in \{0, 1\}$, to represent the probability that $w$ occurs ($x = 1$) in one document or not ($x = 0$). If word $w$ appears in $k$ documents, then $p(X_w = 1) = k/N$ and $p(X_w = 0) = (N - k)/N$. Similarly, we can define the probability $p(X_v = x)$ for another word $v$. We also define the joint probabilities of words $w$ and $v$ as follows:
$p(X_w = 1, X_v = 1)$: the probability of word $w$ and word $v$ co-occurring in one document. If there are $m$ documents containing both word $w$ and word $v$ in the collection, then $p(X_w = 1, X_v = 1) = m/N$.
$p(X_w = 1, X_v = 0)$: the probability that word $w$ occurs in one document but word $v$ does not occur in that document. It can be calculated as $p(X_w = 1, X_v = 0) = p(X_w = 1) - p(X_w = 1, X_v = 1)$.
[10 pts] Given the values of $p(X_w = 1)$, $p(X_v = 1)$, and $p(X_w = 1, X_v = 1)$ for two words $w$ and $v$ in a collection of $N$ documents, can you write down the formulas for the remaining two joint probabilities of $w$ and $v$, i.e., $p(X_w = 0, X_v = 1)$ and $p(X_w = 0, X_v = 0)$?
[10 pts] Next, we will use the following tables to do some real computation of mutual information. The tables contain the document counts for different words. There are a total of $N = 26{,}394$ documents in the collection.
Table 1 contains the document counts for the words ‘computer’ and ‘program’, derived from the document collection (Hint: if $X_{\text{computer}} = 1$ and $X_{\text{program}} = 1$, then the count is 349. This means there are 349 documents that contain ‘computer’ AND ‘program’):

                 X_computer = 1    X_computer = 0
X_program = 1          349              2,021
X_program = 0        1,041             22,983
Table 2 contains the document counts for the words ‘computer’ and ‘baseball’, derived from the same document collection:

                  X_computer = 1    X_computer = 0
X_baseball = 1           23              2,121
X_baseball = 0        1,367             22,883
Calculate $I(X_{\text{computer}}; X_{\text{program}})$ and $I(X_{\text{computer}}; X_{\text{baseball}})$ using the document counts from Tables 1 and 2.
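To check your arithmetic, here is a minimal sketch, assuming the table layout above; the helper name mutual_information and its argument order (both words, first word only, second word only, neither) are our choices, not part of the assignment.

```python
from math import log2

def mutual_information(n11, n10, n01, n00):
    """I(X;Y) from a 2x2 table of document counts: n11 = documents with
    both words, n10 = first word only, n01 = second word only, n00 = neither."""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # Each row: (joint count, marginal count for x, marginal count for y).
    for nxy, nx, ny in [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]:
        p_xy, p_x, p_y = nxy / n, nx / n, ny / n
        if p_xy > 0:
            mi += p_xy * log2(p_xy / (p_x * p_y))
    return mi

# Table 1 ('computer' and 'program') and Table 2 ('computer' and 'baseball'),
# with 'computer' as the first word.
print(mutual_information(349, 1041, 2021, 22983))
print(mutual_information(23, 1367, 2121, 22883))
```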
[10 pts] Compare the results of $I(X_{\text{computer}}; X_{\text{program}})$ and $I(X_{\text{computer}}; X_{\text{baseball}})$. Do the results conform with your intuition? Explain your intuition.
4. Kullback-Leibler Divergence (KL Divergence) [20 pts]
[5 pts] Please answer the following questions: 1) What is the range of KL Divergence? 2) Under which circumstances is KL Divergence equal to 0?
[10 pts] From the course we know that KL Divergence is not symmetric. Show that this is true by constructing two probability distributions $p$ and $q$ for which $D(p \| q) \neq D(q \| p)$.
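One way to verify your construction numerically is with a short sketch like the one below; the two distributions are made-up examples, and any pair whose probability ratios differ will do.

```python
from math import log2

def kl(p, q):
    # D(p || q) = sum over i of p_i * log2(p_i / q_i);
    # assumes q_i > 0 wherever p_i > 0.
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]  # made-up example distributions
q = [0.9, 0.1]
print(kl(p, q), kl(q, p))  # roughly 0.737 vs. 0.531 -- not symmetric
```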
[5 pts] When calculating $D(p \| q)$, what issues do you run into when an event has probability 0 in distribution $q$? How can you deal with 0 probabilities in this case?
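As a hint toward this part, one common remedy is additive (Laplace) smoothing; this is our assumption about one acceptable approach, not the only one. A minimal sketch:

```python
from math import log2

def smooth(dist, alpha=0.01):
    """Additive (Laplace) smoothing: add alpha to every event and renormalize,
    so no event has probability 0 and every log term stays finite."""
    total = sum(dist) + alpha * len(dist)
    return [(x + alpha) / total for x in dist]

p = [0.5, 0.4, 0.1]
q = [0.7, 0.3, 0.0]  # q assigns probability 0 to the third event

# Without smoothing, the term p_3 * log2(p_3 / 0) is undefined;
# after smoothing q, the divergence is finite.
q_smoothed = smooth(q)
d = sum(pi * log2(pi / qi) for pi, qi in zip(p, q_smoothed) if pi > 0)
print(d)  # a finite value (about 0.27 with these numbers)
```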