You should know the following basic concepts which belong to the pre-requisite of the course: Evaluation measures such as Precision, Recall, Average Precision, Mean Average Precision, F measure, Mean Reciprocal Rnank, and nDCG; basic idea of vector space model and the major heuristics for term weighting such as TF-IDF weighting and document length normalization; basic idea of inverted index and why it can help scoring documents quickly; different types of feedback, including pseudo-relevance feedback, relevance feedback, and implicit feedback.
It's especially important to understand the Bayes rule, which has a lot of applications. First, you should remember the Bayes rule. Second, you should know how Bayes rule provides a general way of making probabilistic inference: we would update our prior p(H) based on data likelihood p(E|H) and obtain the so-called posterior probability p(H|E) ("posterior" in the sense that this is "after observing evidence E"). The prior p(H) represents our belief about the hypothesis before we see the evidence E (i.e., data), whereas the posterior p(H|E) represents our updated belief about hypothesis H after we have seen evidence E.
We have seen several examples of using Bayes rule to make probabilistic inference. One case is text categorization, where a hypothesis H is a category of a document and the evidence E is a document. You have worked on such an applicaiton in Assignment One. Make sure that you review that problem and know how to solve that problem. Another case is the E-step in the EM algorithm where we use the Bayes rule to infer the value of the hidden variable based on the current generation of parameter values (the parameter values are initially randomly set). In the mixture model for feedback, the evidence is a word in the feedback document and we have two hypotheses representing whether the word has been generated
using the background word distribution or the feedback topic word distribution, respectively.
You should know that how the E-step in the EM algorithm involves an application of Bayes rule to infer values of hidden variables.
You'll first need to know how to write down a likelihood function for a simple case like when the data is a document with a sequence of words w1 w2 ... wn and we have a unigram language model p(w|theta). In this case, the probability of observing the document (i.e., word sequence w1, ..., wn) from the language model is p(d|theta)=p(w1|theta)*p(w2|theta)*...*p(wn|theta). If we take logarithm of both sides, we have log p(d|theta)= [log p(w1|theta)] +[log p(w2|theta)] +...[log p(wn|theta)]. If we use c(w,d) to represent the count of word w in d, we have log p(d|theta)= sum_{all words W in vocabulary set} c(W,d)*log p(W|theta).
Then you should know that the Maximum Likelihood (ML) estimator is to find an optimal setting of the language model p(w|theta) (i.e., optimal setting of the probability of each word in our vocabulary p(w|theta)) so that p(d|theta), or equivalently log p(d|theta), would achieve the maximum value. In other words, if we set these word probabilities different values than their ML estimate, p(d|theta) would be smaller. The ML estimate is optimal in the sense that it maximizes the likelihood of the observed data, i.e., it finds the parameter setting that best explains the data. However, when the observed data sample is too small (e.g., the title of a document), it may be a biased representation of the entire population (e.g., the whole article), so if we overfit the observed data as the ML estimator would do, our estimated parameter values may not be optimal. For example, we would assign zero probability to all the unseen words (since the ML estimator would try to give as much probability mass to the observed words as possible in order to maximize the likelihood of data). You should go over the ML estimation problem that you worked on in Assignment 1 to make sure that you understand how that problem is solved. However, we won't ask you to do derivatives in the midterm or derive an ML estimator, but you should know the concept and intuition of the ML estimator.
You should know the fact that the ML estimate of a multinomial distribution (i.e., a unigram language model)
would give each word w a probability equal to the relative frequency of the word. That is, if the word distribution is theta, and the observed data is a document d, according to the ML estimator, we would have p(w|theta)=c(w,d)/|d| where
c(w,d) is the count of word w in d, and |d| is the length of document d (i.e., total counts of words in d).
You should remember the formulas for entropy, cross entropy and KL-divergence, and mutual information. Their relation is as follows.
The KL-divergence D(p||q) of two distributions p and q is equal to H(p,q) - H(p). The mutual information I(X;Y)= H(X)-H(X|Y)=H(Y)-H(Y|X). I(X;Y) is also the KL-divergence between p(X,Y) and P(X)P(Y). H(p) is the entropy of p which
measures the randomness of the distribution p, i.e., the more random p is, the higher H(p) is. When p is uniform,
H(p) reaches its maximum. (If you remember the formula of H(p), you should be able to easily see its maximum
value is log M where M is the number of possible values that the random variable p can take.) When p is entirely concentrated on a single value (i.e., it's actually not random at all), H(p) reaches its minimum which is zero.
H(p) can also be interpreted as the minimum number of bits we have to use to compress values following the distribution of p. (Note that we can call p either a random variable or a distribution.)
H(p,q) is the cross entropy, which is of a similar form of the function to H(p) with only a small difference. Pay attention to the difference and make sure you understand why this small difference allows us to interpret H(p,q) as
the minimum number of bits that we have to use to compress values following distribution p if we "mistakenly" thought that the values follow distribution q (i.e., we use q to design optimal coding). As a result, H(p,q) is always at least as large as H(p) (with a wrong distribution for designing an optimal code, we can never do better than using the original true distribution in terms of compression), and the KL-divergence captures the difference and can be interpreted as the number of bits wasted due to using a wrong distribution for coding. You should also know some basic properties of D(p||q), i.e., it's always non-negative. It's zero iff p=q. The mutual information I(X;Y) measures the association of two random variables X and Y. This can be understood from two perspectives: (1) I(X;Y)=H(X)-H(X|Y). This means that I(X;Y) is the reduction of entropy of X if we know Y. Intuitively, if X and Y are independent, there would be no reduction of entropy of X, thus I(X;Y) would be zero, whereas if X is completely determined by Y, then H(X|Y)=0, so I(X;Y)=H(X). Note that the maximum value of I(X;Y) is max{H(X),H(Y)}. If X is completely determined by Y AND at the same time, Y is completely determined by X, then H(X)=H(Y) since they have the same uncertainty; in general, H(X) and H(Y) can be different, though. (2) I(X;Y) is the KL-divergence of P(X,Y), which is the true joint distribution, and
p(X)p(Y), which is the joint distribution if X and Y are independent. Thus it essentially measures how far away p(X,Y) is from the assumed joint distribution under the assumption that X and Y are independent, so the two joint distributions are the same, it would mean that X and Y are indeed independent, and I(X;Y) would be zero. If the two distributions are far away from each other, it would mean that X and Y are far from independent, i.e., they are correlated/associated, and in such a case I(X;Y) would have a higher value.
A statistical language model (SLM) is a distribution over word sequences. Intuitively, it gives us a probability for any sequence of words, thus allows us to compare two sequences of words to see which has a higher probability. In general, SLMs help capture the uncerstanties associated with the use of natural language. For example, in general, non-grammatical sentences would have much smaller probabilities than grammatical sentences. Specialized language models can be used to answer many interesting questions that are directly related to many information management tasks.
While there are many different kinds of SLMs, we are particularly interested in the simplest one, i.e., the unigram language models. This model corresponds to a multinomial distribution over words. According to this model, a piece of text is "generated" by generating each word independently. As a result, the joint probability of generating all the words in a document D=w1 w2 ... wn is simply the product of generating each individual word, i.e., p(D)=p(w1)p(w2)...p(wn). Note that in general, the generation of one word may depend on another. For example, having seen "web search" being generated would make the probability of further generating a word like "engine" much higher. This means that p(w3="engine" |w1="web", w2="search") is much higher than p(w3="engine"). Thus the independence assumption made by the unigram language model doesn't really hold in reality. Indeed, with a bigram LM, we'd have p(D)=p(w1)p(w2|w1)p(w3|w2)...p(wn|wn-1), which would capture local dependency between two adjacent words.