You should know the following basic concepts, which are prerequisites for the course: evaluation measures such as Precision, Recall, Average Precision, Mean Average Precision, F measure, Mean Reciprocal Rank, and nDCG; the basic idea of the vector space model and the major heuristics for term weighting, such as TF-IDF weighting and document length normalization; the basic idea of an inverted index and why it helps score documents quickly; and the different types of feedback, including pseudo-relevance feedback, relevance feedback, and implicit feedback.
It's especially important to understand Bayes rule, which has a lot of applications. First, you should remember the rule itself: p(H|E) = p(E|H)*p(H)/p(E). Second, you should know how Bayes rule provides a general way of making probabilistic inference: we update our prior p(H) based on the data likelihood p(E|H) and obtain the so-called posterior probability p(H|E) ("posterior" in the sense that this is "after observing evidence E"). The prior p(H) represents our belief about the hypothesis before we see the evidence E (i.e., data), whereas the posterior p(H|E) represents our updated belief about hypothesis H after we have seen evidence E.
We have seen the use of Bayes rule to make
probabilistic inference in text categorization, where a
hypothesis H is a category of a document and the evidence E is a
document. You have worked on such an application in Assignment One. Make
sure that you review that problem and know how to solve it.
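The inference described above can be sketched in a few lines of Python. This is only an illustration, not the assignment's solution: the two categories, their priors, and the per-category word probabilities are made-up numbers, and the likelihood p(E|H) is assumed to be a product of per-word probabilities.

```python
import math

def posterior(doc_words, priors, word_probs):
    """Return p(category | document) for every category via Bayes rule."""
    # log p(H) + sum of log p(w | H): the log of prior times likelihood.
    log_joint = {}
    for cat, prior in priors.items():
        log_joint[cat] = math.log(prior) + sum(
            math.log(word_probs[cat][w]) for w in doc_words
        )
    # p(E) is the sum of prior * likelihood over all hypotheses.
    # Subtracting the maximum first keeps the exponentials numerically stable.
    m = max(log_joint.values())
    unnorm = {cat: math.exp(v - m) for cat, v in log_joint.items()}
    z = sum(unnorm.values())
    return {cat: v / z for cat, v in unnorm.items()}

# Hypothetical toy numbers, chosen only for illustration.
priors = {"sports": 0.5, "politics": 0.5}
word_probs = {
    "sports": {"game": 0.6, "vote": 0.1, "team": 0.3},
    "politics": {"game": 0.1, "vote": 0.7, "team": 0.2},
}
post = posterior(["game", "team"], priors, word_probs)
print(post)
```

With equal priors, the posterior is driven entirely by the likelihood: prior times likelihood is 0.5*0.6*0.3 = 0.09 for "sports" and 0.5*0.1*0.2 = 0.01 for "politics", so the posterior of "sports" is 0.09/0.10 = 0.9.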
You'll first need to know how to write down a likelihood function for a simple case, such as when the data is a document with a sequence of words w1 w2 ... wn and we have a unigram language model p(w|theta). In this case, the probability of observing the document (i.e., word sequence w1, ..., wn) from the language model is p(d|theta) = p(w1|theta)*p(w2|theta)*...*p(wn|theta). If we take the logarithm of both sides, we have log p(d|theta) = [log p(w1|theta)] + [log p(w2|theta)] + ... + [log p(wn|theta)]. If we use c(w,d) to represent the count of word w in d, we can group identical words and write log p(d|theta) = sum_{all words w in the vocabulary} c(w,d)*log p(w|theta).
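As a quick check that the two forms of the log-likelihood agree, here is a minimal Python sketch. The model theta and the document are made-up examples, and the function names are hypothetical.

```python
import math
from collections import Counter

def log_likelihood_sequence(words, theta):
    """Sum log p(w_i | theta) over the word positions of the document."""
    return sum(math.log(theta[w]) for w in words)

def log_likelihood_counts(words, theta):
    """Sum c(w, d) * log p(w | theta) over the distinct words of the document.

    Vocabulary words that do not occur in d have c(w, d) = 0 and contribute
    nothing, so iterating over the observed words is enough.
    """
    counts = Counter(words)
    return sum(c * math.log(theta[w]) for w, c in counts.items())

# Hypothetical unigram model and document.
theta = {"text": 0.5, "mining": 0.3, "the": 0.2}
doc = ["text", "mining", "text", "the"]

ll_seq = log_likelihood_sequence(doc, theta)
ll_cnt = log_likelihood_counts(doc, theta)
print(ll_seq, ll_cnt)
```

Both values equal log(0.5*0.3*0.5*0.2) = log(0.015). Working in log space also avoids numerical underflow when the document is long.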
Then you should know that the Maximum Likelihood (ML) estimator finds the setting of the language model p(w|theta) (i.e., the probability of each word in our vocabulary) for which p(d|theta), or equivalently log p(d|theta), achieves its maximum value. In other words, if we set these word probabilities to values different from their ML estimates, p(d|theta) would be smaller. The ML estimate is optimal in the sense that it maximizes the likelihood of the observed data, i.e., it finds the parameter setting that best explains the data. However, when the observed data sample is too small (e.g., the title of a document), it may be a biased representation of the entire population (e.g., the whole article), so if we overfit the observed data as the ML estimator does, our estimated parameter values may not be optimal. For example, we would assign zero probability to all the unseen words (since the ML estimator gives as much probability mass as possible to the observed words in order to maximize the likelihood of the data). You should go over the ML estimation problem that you worked on in Assignment 1 to make sure that you understand how that problem is solved. We won't ask you to take derivatives or derive an ML estimator in the midterm, but you should know the concept and intuition of the ML estimator.
You should know the fact that the ML estimate of a multinomial
distribution (i.e., a unigram language model)
would give each word w a probability equal to the relative frequency of
the word. That is, if the word distribution is theta, and the observed
data is a document d, according to the ML estimator, we would have
p(w|theta)=c(w,d)/|d| where
c(w,d) is the count of word w in d, and |d| is the length of document d
(i.e., total counts of words in d).
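The relative-frequency estimate is easy to compute directly. The sketch below uses a made-up four-word document and hypothetical function names. It also compares the log-likelihood under the ML estimate with that under a uniform model over the same three words, to illustrate that moving away from the ML estimate lowers the likelihood of the observed data.

```python
import math
from collections import Counter

def ml_estimate(words):
    """ML estimate of a unigram model: p(w | theta) = c(w, d) / |d|."""
    counts = Counter(words)
    n = len(words)
    return {w: c / n for w, c in counts.items()}

def log_likelihood(words, theta):
    """log p(d | theta) under a unigram model."""
    return sum(math.log(theta[w]) for w in words)

doc = ["text", "mining", "text", "the"]  # hypothetical document
theta_ml = ml_estimate(doc)

# A competing model that spreads probability uniformly over the seen words.
theta_uniform = {w: 1 / len(theta_ml) for w in theta_ml}

ll_ml = log_likelihood(doc, theta_ml)
ll_uniform = log_likelihood(doc, theta_uniform)

# Words that never occur in doc get no entry, i.e. probability zero.
p_unseen = theta_ml.get("retrieval", 0.0)
print(theta_ml, ll_ml, ll_uniform, p_unseen)
```

Here "text" gets probability 2/4 and the other two words 1/4 each, while an unseen word such as "retrieval" gets zero, which is exactly the overfitting problem that smoothing addresses.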
You should remember the formulas for entropy, cross entropy,
KL-divergence, and mutual information. They are related as follows.
The KL-divergence D(p||q) of two distributions p and q is equal to
H(p,q) - H(p). The mutual information I(X;Y)= H(X)-H(X|Y)=H(Y)-H(Y|X).
I(X;Y) is also the KL-divergence between p(X,Y) and P(X)P(Y). H(p) is
the entropy of p which
measures the randomness of the distribution p, i.e., the more random p
is, the higher H(p) is. When p is uniform,
H(p) reaches its maximum. (If you remember the formula of H(p), you
should be able to easily see its maximum
value is log M where M is the number of possible values that the random
variable p can take.) When p is entirely concentrated on a single value
(i.e., it's actually not random at all), H(p) reaches its minimum which
is zero.
H(p) can also be interpreted as the minimum number of bits we have to
use to compress values following the distribution of p. (Note that we
can call p either a random variable or a distribution.)
H(p,q) is the cross entropy, whose formula has a form similar to that of
H(p), with only a small difference. Pay attention to the difference
and make sure you understand why this small difference allows us to
interpret H(p,q) as
the minimum number of bits that we have to use to compress values
following distribution p if we "mistakenly" thought that the
values follow distribution q (i.e., we use q to design optimal coding).
As a result, H(p,q) is always at least as large as H(p) (with a wrong
distribution for designing an optimal code, we can never do better than
using the original true distribution in terms of compression), and the
KL-divergence captures the difference and can be interpreted as the
number of bits wasted due to using a wrong distribution for
coding. You should also know some basic properties of D(p||q): it's
always non-negative, and it's zero iff p=q. The mutual information
I(X;Y) measures the association of two random variables X and Y. This
can be understood from two perspectives: (1) I(X;Y)=H(X)-H(X|Y). This
means that I(X;Y) is the reduction of entropy of X if we know Y.
Intuitively, if X and Y are independent, there would be no reduction of
entropy of X, thus I(X;Y) would be zero, whereas if X is completely
determined by Y, then H(X|Y)=0, so I(X;Y)=H(X). Note that I(X;Y) can be
at most min{H(X),H(Y)}. If X is completely determined by Y
AND at the same time, Y is completely determined by X, then H(X)=H(Y)
since they have the same uncertainty; in general, H(X) and H(Y) can be
different, though. (2) I(X;Y) is the KL-divergence of P(X,Y), which is
the true joint distribution, and
p(X)p(Y), which is the joint distribution if X and Y are
independent. Thus it essentially measures how far away p(X,Y) is from
the assumed joint distribution under the assumption that X and Y are
independent. If the two joint distributions are the same, it would mean
that X and Y are indeed independent, and I(X;Y) would be zero. If the
two distributions are far away from each other, it would mean that X and
Y are far from independent, i.e., they are correlated/associated, and
in such a case I(X;Y) would have a higher value.
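The quantities above can be checked numerically with a short Python sketch. It uses base-2 logarithms so that values are in bits, represents a distribution as a list of probabilities, and assumes q(x) > 0 wherever p(x) > 0 (otherwise the cross entropy is infinite and the code raises an error). All function names and example distributions are illustrative choices, not taken from the course material.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x), with 0 * log 0 treated as 0."""
    return -sum(px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(p,q) = -sum_x p(x) log2 q(x): weights come from p, code lengths from q."""
    return -sum(px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D(p||q) = H(p,q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

def mutual_information(joint):
    """I(X;Y) as the KL-divergence between p(X,Y) and p(X)p(Y).

    joint[i][j] holds p(X=i, Y=j).
    """
    px = [sum(row) for row in joint]        # marginal of X
    py = [sum(col) for col in zip(*joint)]  # marginal of Y
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

h_p = entropy(p)                    # 1.5 bits
h_uniform4 = entropy([0.25] * 4)    # log2(4) = 2 bits, the maximum for M = 4
h_point = entropy([1.0, 0.0, 0.0])  # 0 bits: no randomness at all
d_pq = kl_divergence(p, q)          # bits wasted by coding p with q
d_pp = kl_divergence(p, p)          # zero, since the two distributions agree

mi_indep = mutual_information([[0.25, 0.25], [0.25, 0.25]])  # independent X, Y
mi_same = mutual_information([[0.5, 0.0], [0.0, 0.5]])       # X determines Y
print(h_p, h_uniform4, h_point, d_pq, d_pp, mi_indep, mi_same)
```

In the last example X and Y determine each other, so I(X;Y) = H(X) = H(Y) = 1 bit. In the independent example the joint distribution equals the product of the marginals, so I(X;Y) is zero.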
A statistical language model (SLM) is a distribution over word sequences. Intuitively, it gives us a probability for any sequence of words, and thus allows us to compare two sequences of words to see which has a higher probability. In general, SLMs help capture the uncertainties associated with the use of natural language. For example, non-grammatical sentences would generally have much smaller probabilities than grammatical sentences. Specialized language models can be used to answer many interesting questions that are directly related to many information management tasks.
While there are many different kinds of SLMs, we are particularly interested in the simplest one, i.e., the unigram language model. This model corresponds to a multinomial distribution over words. According to this model, a piece of text is "generated" by generating each word independently. As a result, the joint probability of generating all the words in a document D=w1 w2 ... wn is simply the product of the probabilities of generating each individual word, i.e., p(D)=p(w1)p(w2)...p(wn). Note that in general, the generation of one word may depend on another. For example, having seen "web search" being generated would make the probability of further generating a word like "engine" much higher. This means that p(w3="engine" |w1="web", w2="search") is much higher than p(w3="engine"). Thus the independence assumption made by the unigram language model doesn't really hold in reality. In contrast, with a bigram LM, we'd have p(D)=p(w1)p(w2|w1)p(w3|w2)...p(wn|wn-1), which would capture local dependency between two adjacent words. More generally, an n-gram language model would capture the dependency of a word on the previous n-1 words. You should know what an n-gram language model is, how many parameters an n-gram language model has, and why smoothing is needed when estimating an n-gram language model. You should know the major smoothing methods and how they work: additive smoothing, absolute discounting, linear interpolation (fixed coefficient), and Dirichlet prior. You are expected to know the formulas of these smoothing methods.
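The four smoothing methods can be sketched for a unigram document model as follows. This is a hedged illustration using the commonly used forms of these formulas. The parameter names (delta, lam, mu), their default values, the reference model p_ref (standing in for a collection language model), and the toy data are all assumptions, and the course materials may parameterize the methods differently.

```python
from collections import Counter

def additive(w, counts, doc_len, vocab_size, delta=1.0):
    """Add delta to every count: (c(w,d) + delta) / (|d| + delta * |V|)."""
    return (counts[w] + delta) / (doc_len + delta * vocab_size)

def absolute_discount(w, counts, doc_len, p_ref, delta=0.7):
    """Subtract delta from each seen count and give the freed mass to p_ref."""
    unique = len(counts)  # number of distinct words seen in the document
    return (max(counts[w] - delta, 0.0) + delta * unique * p_ref[w]) / doc_len

def interpolation(w, counts, doc_len, p_ref, lam=0.5):
    """Fixed-coefficient mix: (1 - lam) * c(w,d)/|d| + lam * p_ref(w)."""
    return (1 - lam) * counts[w] / doc_len + lam * p_ref[w]

def dirichlet(w, counts, doc_len, p_ref, mu=2000.0):
    """Dirichlet prior: (c(w,d) + mu * p_ref(w)) / (|d| + mu)."""
    return (counts[w] + mu * p_ref[w]) / (doc_len + mu)

# Hypothetical document and reference (collection) model.
doc = ["text", "mining", "text", "the"]
p_ref = {"text": 0.1, "mining": 0.1, "the": 0.5, "retrieval": 0.3}
vocab = list(p_ref)
counts = Counter(doc)  # a Counter returns 0 for words it has not seen
n = len(doc)

smoothed = {
    "additive": {w: additive(w, counts, n, len(vocab)) for w in vocab},
    "absolute": {w: absolute_discount(w, counts, n, p_ref) for w in vocab},
    "interpolation": {w: interpolation(w, counts, n, p_ref) for w in vocab},
    "dirichlet": {w: dirichlet(w, counts, n, p_ref) for w in vocab},
}
for name, dist in smoothed.items():
    print(name, dist["retrieval"], sum(dist.values()))
```

In every case the unseen word "retrieval" receives a non-zero probability, and each smoothed distribution still sums to one over the vocabulary. The methods differ in how much mass they move away from the observed words and in whether the unseen words share it equally (additive) or in proportion to the reference model (the other three).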