CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

Assignment #2: Basic Concepts in Information Theory

Notice: This assignment is due Thursday, September 14th at 11:59pm.

Please submit your solutions via Compass. You should submit your assignment as a typeset PDF. Please do not include scanned or photographed equations as they are difficult for us to grade.

1. Entropy [30 pts]

Consider the random experiment of picking a word from an English text article. Let $W$ be a random variable denoting the word that we might obtain from the article. Thus $W$ can have any value from the set of words in our vocabulary $V=\{w_1,\ldots,w_N\}$, where $w_i$ is a unique word in the vocabulary, and we have a probability distribution over all the words, which we can denote as $\{p(W=w_i)\}$, where $p(W=w_i)$ is the probability that we would obtain word $w_i$. Now we can compute the entropy of such a variable, i.e., $H(W)$.
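
For reference, the entropy of $W$ is given by the standard definition (any logarithm base works as long as it is used consistently; we write base 2 here, with the convention $0 \log_2 0 = 0$):

$$
H(W) = -\sum_{i=1}^{N} p(W=w_i) \log_2 p(W=w_i)
$$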

  1. [10 pts] Suppose our vocabulary contains a total of $N$ unique words. What is the theoretical minimum value of $H(W)$? What is the theoretical maximum value of $H(W)$?

  2. [10 pts] Suppose we have only 6 words in the vocabulary $\{w_1, w_2, w_3, w_4, w_5, w_6\}$. Give two sample articles using this small vocabulary set for which $H(W)$ reaches the minimum value and the maximum value, respectively.

  3. [10 pts] Suppose we have two articles $A_1$ and $A_2$ for which $H(W)=0$. Suppose we concatenate $A_1$ and $A_2$ to form a longer article $A_3$. What is the maximum value that $H(W)$ can take for article $A_3$? Give an example of $A_1$ and an example of $A_2$ for which $A_3$ would have the maximum $H(W)$.

2. Conditional Entropy and Mutual Information [20 pts]
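
As a reminder, for discrete random variables $X$ and $Y$ these quantities are defined by:

$$
H(X \mid Y) = -\sum_{x}\sum_{y} p(x,y) \log_2 p(x \mid y), \qquad I(X;Y) = H(X) - H(X \mid Y)
$$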

  1. [10 pts] What is the value of the conditional entropy $H(X \mid X)$?

  2. [10 pts] What is the value of the mutual information $I(X;Y)$ if $X$ and $Y$ are independent? Why?

3. Mutual Information of Words [30 pts]

Mutual information can be used to measure the correlation of two words. Suppose we have a collection of $N$ documents. For a word $A$ in the collection, we use $p(X_A)$, where $X_A \in \{0, 1\}$, to represent the probability that $A$ occurs in a document ($X_A=1$) or not ($X_A=0$). If word $A$ appears in $N_A$ documents, then $p(X_A=1) = \frac{N_A}{N}$ and $p(X_A=0) = \frac{N-N_A}{N}$. Similarly, we can define the probability $p(X_B)$ for another word $B$. We also define the joint probabilities of words $A$ and $B$ as follows: if $N_{AB}$ documents contain both $A$ and $B$, then $p(X_A=1, X_B=1) = \frac{N_{AB}}{N}$ and $p(X_A=1, X_B=0) = \frac{N_A - N_{AB}}{N}$.
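
With these probabilities, the mutual information of the two occurrence variables is obtained by summing over the four possible value combinations:

$$
I(X_A;X_B) = \sum_{x_A \in \{0,1\}} \sum_{x_B \in \{0,1\}} p(X_A=x_A, X_B=x_B) \log_2 \frac{p(X_A=x_A, X_B=x_B)}{p(X_A=x_A)\,p(X_B=x_B)}
$$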

  1. [10 pts] Given the values of $N_A$, $N_B$, and $N_{AB}$ for two words $A$ and $B$ in a collection of $N$ documents, can you write down the formulas for the remaining two joint probabilities of $A$ and $B$, i.e., $p(X_A=0, X_B=1)$ and $p(X_A=0, X_B=0)$?

  2. [10 pts] Next, we will use the following tables to do some real computation of mutual information. The tables contain the document counts for different words. There are a total of $N = 26{,}394$ documents in the collection.

    Table 1 contains the document counts for the words ‘computer’ and ‘program’, derived from the document collection (Hint: if $A$ is ‘computer’ and $B$ is ‘program’, then $N_{AB} = 349$; this means there are 349 documents that contain both ‘computer’ and ‘program’):

                           $X_{computer}=1$   $X_{computer}=0$
    $X_{program}=1$               349              2,021
    $X_{program}=0$             1,041             22,983

    Table 2 contains the document counts for the words ‘computer’ and ‘baseball’, derived from the same document collection:

                           $X_{computer}=1$   $X_{computer}=0$
    $X_{baseball}=1$               23              2,121
    $X_{baseball}=0$            1,367             22,883

    Calculate $I(X_{computer};X_{program})$ and $I(X_{computer};X_{baseball})$ using the document counts from Tables 1 and 2.

  3. [10 pts] Compare the results of $I(X_{computer};X_{program})$ and $I(X_{computer};X_{baseball})$. Do the results conform with your intuition? Explain your intuition.

4. Kullback-Leibler Divergence (KL Divergence) [20 pts]
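
Recall that for two discrete probability distributions $p$ and $q$ over the same set of events, $D(p||q)$ is defined as:

$$
D(p||q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}
$$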

  1. [5 pts] Please answer the following questions: 1) What is the range of KL divergence? 2) Under which circumstances is KL divergence equal to 0?

  2. [10 pts] From the course we know that KL divergence is not symmetric. Show that this is true by creating two probability distributions $p$ and $q$ for which $D(p||q) \ne D(q||p)$.

  3. [5 pts] When calculating $D(p||q)$, what issues do you run into when an event has 0 probability in distribution $q$? How can you deal with 0 probabilities in this case?