Assignment #2: Basic Concepts in Information Theory
Notice: This assignment is due Thursday, September 14th at 11:59pm.
Please submit your solutions via Compass. You should submit your assignment as a typeset PDF. Please do not include scanned or photographed equations as they are difficult for us to grade.
1. Entropy [30 pts]
Consider the random experiment of picking a word from an English text article. Let $W$ be a random variable denoting the word that we might obtain from the article. Thus $W$ can have any value from the set of words in our vocabulary $V=\{w_1,\ldots,w_N\}$, where $w_i$ is a unique word in the vocabulary, and we have a probability distribution over all the words, which we can denote as $\{p(W=w_i)\}$, where $p(W=w_i)$ is the probability that we would obtain word $w_i$. Now we can compute the entropy of such a variable, i.e., $H(W)$.

[10 pts] Suppose we have in total $N$ unique words in our vocabulary. What is the theoretical minimum value of $H(W)$? What is the theoretical maximum value of $H(W)$?

[10 pts] Suppose we have only 6 words in the vocabulary $\{w_1, w_2, w_3, w_4, w_5, w_6\}$. Give two sample articles using this small vocabulary set for which $H(W)$ reaches the minimum value and maximum value, respectively.

[10 pts] Suppose we have two articles $A_1$ and $A_2$ for which $H(W)=0$. Suppose we concatenate $A_1$ and $A_2$ to form a longer article $A_3$. What is the maximum value can $H(W)$ be for article $A_3$? Give an example of $A_1$ and an example of $A_2$ for which $A_3$ would have the maximum $H(W)$.
2. Conditional Entropy and Mutual Information [20 pts]

[10 pts] What is the value of the conditional entropy $H(X \mid X)$?

[10 pts] What is the value of mutual information $I(X;Y)$ if $X$ and $Y$ are independent? Why?
3. Mutual Information of Words [30 pts]
Mutual information can be used to measure the correlation of two words. Suppose we have a collection of $N$ documents. For a word $A$ in the collection, we use $p(X_A)$, where $X_A \in \{0, 1\}$, to represent the probability that $A$ occurs ($X_A=1$) in one document or not ($X_A=0$). If word $A$ appears in $N_A$ documents, then $p(X_A=1) = \frac{N_A}{N}$ and $p(X_A=0) = \frac{NN_A}{N}$. Similarly, we can define the probability $p(X_B)$ for another word $B$. We also define the joint probability of word $A$ and $B$ as follows:

$p(X_A=1, X_B=1)$: the probability of word $A$ and word $B$ cooccurring in one document. If there are $N_{AB}$ documents containing both word $A$ and $B$ in the collection, then $p(X_A=1, X_B=1) = \frac{N_{AB}}{N}$

$p(X_A=1, X_B=0)$: the probability that word $A$ occurs in one document but $B$ does not occur in that document. It can be calculated as $p(X_A=1, X_B=0) = \frac{N_A  N_{AB}}{N}$.

[10 pts] Given the values of $N_A$, $N_B$, $N_{AB}$ for two words $A$ and $B$ in a collection of $N$ documents, can you write down the formulas for the rest two joint probabilities of $A$ and $B$, i.e. $p(X_A=0, X_B=1)$ and $p(X_A=0, X_B=0)$?

[10 pts] Next, we will use the following tables to do some real computation of Mutual Information. The tables contain the document counts for different words. There are a total of $N = 26,394$ documents in the collection.
Table 1 contains the document counts for words ‘computer’ and ‘program’, derived from the document collection (Hint: If $A = computer$ and $B = program$, then $N_{AB} = 349$. This means there are 349 documents that contain ‘computer’ AND ‘program’):
$X_{computer}=1$ $X_{computer}=0$ $X_{program}=1$ 349 2,021 $X_{program}=0$ 1,041 22,983 Table 2 contains the document counts for words ‘computer’ and ‘baseball’, derived from the same document collection:
$X_{computer}=1$ $X_{computer}=0$ $X_{baseball}=1$ 23 2,121 $X_{baseball}=0$ 1,367 22,883 Calculate $I(X_{computer};X_{program})$ and $I(X_{computer};X_{baseball})$ using the document counts from Table 1 and 2.

[10 pts] Compare the results of $I(X_{computer};X_{program})$ and $I(X_{computer};X_{baseball})$. Do the results conform with your intuition? Explain your intuition.
4. KullbackLeibler Divergence (KL Divergence) [20 pts]

[5 pts] Please answer the following questions: 1) Whats the range of KL Divergence? 2) Under which circumstances is KL Divergence equal to 0?

[10 pts] From the course we know that KL Divergence is not symmetric. Show that this is true by creating two probability distributions p and q, where $D(pq) \ne D(qp)$.

[5 pts] When calculating $D(pq)$, what issues do you run into when an event has 0 probability in distribution q? How can you deal with 0 probabilities in this case?