## Review List for CS598CXZ Midterm

Note: The midterm exam will be a closed book exam and last for 1 hour 15 minutes (2pm-3:15pm) in the classroom. Calculator is not allowed. You are not allowed to bring any paper or written material either. (You can use the back side of the examination sheet as Scratch paper.) The exam will start promptly at 2pm, so please arrive before 2pm.

The best way to prepare for the midterm is to (1) read all the assigned readings ((use the directions given in the Readings page as guidance); (2) go over all the lecture slides and make sure that you can follow them; (3) pay special attention to the topics listed below; (4) ask questions if you have trouble with following the materials (I strongly encourage you to use the Piazza to discuss the materials and help each other learning). The earlier you ask a question, the more likely it will be answered.

## Part I: Background

You are expected to
1. know some basic concepts in IR

You should know the two modes of information access, i.e., "push" and "pull". You should know that in the "pull" mode, a user can further have two complementary ways of finding information, i.e., "querying" and "browsing". You should know what is feedback, including relevance feedback, pseudo-relevance feedback, and implicit feedback. Know the similarity and differences between these three different forms of feedback.

2. have a good understanding of random variable, conditional probability, and the Bayes rule. You need to remember the Bayes Rule.

It's especially important to understand the Bayes rule, which has a lot of applications. First, you should remember the Bayes rule. Second, you should know how Bayes rule provides a general way of making probabilistic inference: we would update our prior p(H) based on data likelihood p(E|H) and obtain the so-called posterior probability p(H|E) ("posterior" in the sense that this is "after observing evidence E"). The prior p(H) represents our belief about the hypothesis before we see the evidence E (i.e., data), whereas the posterior p(H|E) represents our updated belief about hypothesis H after we have seen evidence E.

We have seen several examples of using Bayes rule to make probabilistic inference. One case is text categorization, where a hypothesis H is a category of a document and the evidence E is a document. You have worked on such an applicaiton in Assignment One. Make sure that you review that problem and know how to solve that problem. Another case is the E-step in the EM algorithm where we use the Bayes rule to infer the value of the hidden variable based on the current generation of parameter values (the parameter values are initially randomly set). In the mixture model for feedback, the evidence is a word in the feedback document and we have two hypotheses representing whether the word has been generated using the background word distribution or the feedback topic word distribution, respectively. You should know that how the E-step in the EM algorithm involves an application of Bayes rule to infer values of hidden variables.

3. know the basic idea of maximum likelihood estimation and know how to compute the maximum likelihood estimate of a multinomial distribution (i.e., relative frequency).

You'll first need to know how to write down a likelihood function for a simple case like when the data is a document with a sequence of words w1 w2 ... wn and we have a unigram language model p(w|theta). In this case, the probability of observing the document (i.e., word sequence w1, ..., wn) from the language model is p(d|theta)=p(w1|theta)*p(w2|theta)*...*p(wn|theta). If we take logarithm of both sides, we have log p(d|theta)= [log p(w1|theta)] +[log p(w2|theta)] +...[log p(wn|theta)]. If we use c(w,d) to represent the count of word w in d, we have log p(d|theta)= sum_{all words W in vocabulary set} c(W,d)*log p(W|theta).

Then you should know that the Maximum Likelihood (ML) estimator is to find an optimal setting of the language model p(w|theta) (i.e., optimal setting of the probability of each word in our vocabulary p(w|theta)) so that p(d|theta), or equivalently log p(d|theta), would achieve the maximum value. In other words, if we set these word probabilities different values than their ML estimate, p(d|theta) would be smaller. The ML estimate is optimal in the sense that it maximizes the likelihood of the observed data, i.e., it finds the parameter setting that best explains the data. However, when the observed data sample is too small (e.g., the title of a document), it may be a biased representation of the entire population (e.g., the whole article), so if we overfit the observed data as the ML estimator would do, our estimated parameter values may not be optimal. For example, we would assign zero probability to all the unseen words (since the ML estimator would try to give as much probability mass to the observed words as possible in order to maximize the likelihood of data). You should go over the ML estimation problem that you worked on in Assignment 1 to make sure that you understand how that problem is solved. However, we won't ask you to do derivatives in the midterm or derive an ML estimator, but you should know the concept and intuition of the ML estimator.

You should know the fact that the ML estimate of a multinomial distribution (i.e., a unigram language model) would give each word w a probability equal to the relative frequency of the word. That is, if the word distribution is theta, and the observed data is a document d, according to the ML estimator, we would have p(w|theta)=c(w,d)/|d| where c(w,d) is the count of word w in d, and |d| is the length of document d (i.e., total counts of words in d).

4. know how to compute entropy, cross entropy, mutual information, and KL-divergence, and know their relations. You need to remember the formulas of entropy, KL-divergence, and mutual information.

You should remember the formulas for entropy, cross entropy and KL-divergence, and mutual information. Their relation is as follows. The KL-divergence D(p||q) of two distributions p and q is equal to H(p,q) - H(p). The mutual information I(X;Y)= H(X)-H(X|Y)=H(Y)-H(Y|X). I(X;Y) is also the KL-divergence between p(X,Y) and P(X)P(Y). H(p) is the entropy of p which measures the randomness of the distribution p, i.e., the more random p is, the higher H(p) is. When p is uniform, H(p) reaches its maximum. (If you remember the formula of H(p), you should be able to easily see its maximum value is log M where M is the number of possible values that the random variable p can take.) When p is entirely concentrated on a single value (i.e., it's actually not random at all), H(p) reaches its minimum which is zero. H(p) can also be interpreted as the minimum number of bits we have to use to compress values following the distribution of p. (Note that we can call p either a random variable or a distribution.) H(p,q) is the cross entropy, which is of a similar form of the function to H(p) with only a small difference. Pay attention to the difference and make sure you understand why this small difference allows us to interpret H(p,q) as the minimum number of bits that we have to use to compress values following distribution p if we "mistakenly" thought that the values follow distribution q (i.e., we use q to design optimal coding). As a result, H(p,q) is always at least as large as H(p) (with a wrong distribution for designing an optimal code, we can never do better than using the original true distribution in terms of compression), and the KL-divergence captures the difference and can be interpreted as the number of bits wasted due to using a wrong distribution for coding. You should also know some basic properties of D(p||q), i.e., it's always non-negative. It's zero iff p=q. The mutual information I(X;Y) measures the association of two random variables X and Y. This can be understood from two perspectives: (1) I(X;Y)=H(X)-H(X|Y). This means that I(X;Y) is the reduction of entropy of X if we know Y. Intuitively, if X and Y are independent, there would be no reduction of entropy of X, thus I(X;Y) would be zero, whereas if X is completely determined by Y, then H(X|Y)=0, so I(X;Y)=H(X). Note that the maximum value of I(X;Y) is max{H(X),H(Y)}. If X is completely determined by Y AND at the same time, Y is completely determined by X, then H(X)=H(Y) since they have the same uncertainty; in general, H(X) and H(Y) can be different, though. (2) I(X;Y) is the KL-divergence of P(X,Y), which is the true joint distribution, and p(X)p(Y), which is the joint distribution if X and Y are independent. Thus it essentially measures how far away p(X,Y) is from the assumed joint distribution under the assumption that X and Y are independent, so the two joint distributions are the same, it would mean that X and Y are indeed independent, and I(X;Y) would be zero. If the two distributions are far away from each other, it would mean that X and Y are far from independent, i.e., they are correlated/associated, and in such a case I(X;Y) would have a higher value.

5. know what is POS tagging, what is parsing, and what is syntactic/structural ambiguity.

POS tagging is to assign a syntactic category (e.g., noun or verb) to each word in text. Parsing is to determine the structure of a sentence (e.g., figuring which words go together to form a noun phrase and which word modifies which other word etc). Syntactic/structural ambiguity refers to multiple possible structures of the same sentence or phrase. (Can you think of some sample sentences or phrases that are structurally ambiguous?)

6. know what is a statistical language model, what is a unigram/bigram language model

A statistical language model (SLM) is a distribution over word sequences. Intuitively, it gives us a probability for any sequence of words, thus allows us to compare two sequences of words to see which has a higher probability. In general, SLMs help capture the uncerstanties associated with the use of natural language. For example, in general, non-grammatical sentences would have much smaller probabilities than grammatical sentences. Specialized language models can be used to answer many interesting questions that are directly related to many information management tasks.

While there are many different kinds of SLMs, we are particularly interested in the simplest one, i.e., the unigram language models. This model corresponds to a multinomial distribution over words. According to this model, a piece of text is "generated" by generating each word independently. As a result, the joint probability of generating all the words in a document D=w1 w2 ... wn is simply the product of generating each individual word, i.e., p(D)=p(w1)p(w2)...p(wn). Note that in general, the generation of one word may depend on another. For example, having seen "web search" being generated would make the probability of further generating a word like "engine" much higher. This means that p(w3="engine" |w1="web", w2="search") is much higher than p(w3="engine"). Thus the independence assumption made by the unigram language model doesn't really hold in reality. Indeed, with a bigram LM, we'd have p(D)=p(w1)p(w2|w1)p(w3|w2)...p(wn|wn-1), which would capture local dependency between two adjacent words.

## Part II: Core topics in IR: Evaluation, Retrieval Models, and Scalability

You are expected to know the following:
1. What are the two different purposes of evaluating an IR system/method? (slide 14 of the lecture on "IR: Problem formulation and evaluation") For which purpose is the Cranfield test method suitable for?
2. How does Cranfield test methodology work? (slides 16 & 17 of the lecture on "IR: problem formulation and evaluation")
3. What is pooling? Why do we often choose to do pooling instead of judging all the documents in a collection? What are some possible concerns about pooling? (slides 28, 30, 31 of the lecture on "IR: Problem Formulation and Evaluation")
4. What are the major retrieval measures and how are they computed? Why do we need multiple measures rather than just use one single measure? In particular, know how to compute "precision", "recall", F1 measure, precision at k documents, average precision, mean average precision, mean reciprocal rank, R-precision, nDCG. When should we use arithmetic mean of average precision (i.e., MAP), and when should we use the geometric mean of average precision (i.e., gMAP)? (slides 18-25, slide 33 of the lecture on "IR: Problem Formulation and Evaluaiton")
5. In IR evaluation, why do we need to do statistical significance test? (slides 34-36 of the lecture on "IR: Problem Formulation and Evaluation")
6. Why is ranking fundamental to IR? What does the probability ranking principle say? What are the limitations of probability ranking principle? (Slides 7-11 of the lecture on "IR: Problem Formulation and Evaluation")
7. Know the basic idea of the vector space model, and the major heuristics used for term weighting in the vector space model (slides 7-22 of the lecture on "IR Models: Vector Space")
8. Know the formulas of the pivoted length normalization and BM25 (slide 23 of the lecture on "IR Models:Vector Space"; note the typo in Amit Singhal's original paper; slides 13-15 of lecture on "IR Models: Probabilistic Models")
9. Know how Rocchio feedback method works (slides 24-27 of the lecture on "IR Models: Vector Space").
10. Know how to derive the Robertson-Sparck-Jones (RSJ) model from the initial goal of ranking documents based on p(R=1|Q,D) (i.e., understanding why ranking based on p(R=1|Q,D) is equivalent to ranking based on O(R=1|Q,D), how we can then apply Bayes rule and ignore constant not affecting ranking, how we can use "document generation" to decompose joint probability, and how we eventually obtain RSJ). What assumptions have been made? If we have examples of relevant and non-relevant documents, how can we estimate the parameters of RSJ model (i.e., pi and qi)? When there are no examples available, under what assumptions would RSJ lead to a retrieval function that scores a document based on sum of IDF-like weights of the matched query terms? (slides 7-11 of the lecture on "IR models: Probabilistic Models")
11. Know how to derive the query likelihood scoring function p(Q|D, R=1) from the initial goal of ranking documents based on p(R=1|Q,D) using "query generation". What assumptions have we made? (slide 3 of the lecture on "IR models: Language Models")
12. Know what is smoothing; and know the formulas for Jelinek-Mercer and Dirichlet prior smoothing methods. (slides 8-16 of the lecture on "IR models: Language Models")
13. Know that p(Q|D,R=1) can be instantiated using two different models corresponding to two different query representations (i.e., multi-Bernoulli and multinomial). What are the independence assumptions made in each case? How are the two assumptions different? (Slide 6 of the lecture on "IR models: Language Models")
14. Assuming that document language models will be smoothed with a collection language model as a reference language model, we can rewrite the query likelihood retrieval function as a scoring function similar to a vector space retrieval function with TF-IDF weighting, i.e., the function shown on slide 20 of the lecture on "IR models: Language Models". Study this slide to understand how to do this derivation exactly.
15. What are the two different roles played by smoothing with a collection language model in the query likelihood retrieval method? How can this dual-role of smoothing explain that we need more smoothing for verbose queries than keyword queries? What's the basic idea of two-stage smoothing? (Slides 22-28 of the lecture on "IR models: Language Models")
16. Know the formula of the KL-divergence retrieval function and why it can cover the query likelihood retrieval function as a special case (thus in this sense, generalizing query likelihood). Know how KL-divergence retrieval function can support immediate relevance feedback for the current user (which is an advantage over the query likelihood method). (slides 31 and 33 of the lecture on "IR models: Language Models")
17. Know how the mixture model feedbback method works and how the EM algorithm works in this particular case. Know how to derive the EM algorithm for this simple mixture model using the Q-function. Know that the EM algorithm is a hill-climbing algorithm which can only reach a local maximum. Know why the EM algorithm is guaranteed to converge. (Slides 34, 36 of the lecture on "IR Models: Language Models"; slides 3-15 of the lecture on "Probabilistic Topic Models"; read this note on EM algorithm)