CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

Assignment #1: Background, Probability and Statistics

Notice: This assignment is due Saturday, September 9th at 11:59pm.

Please submit your solutions via Compass. You should submit your assignment as a typeset PDF. Please do not include scanned or photographed equations as they are difficult for us to grade.

1. Information Retrieval Background [25 pts]

Info: The following questions should be review from material covered in the CS 410 prerequisite (or equivalent). If you find yourself having difficulty with these concepts (or have never seen them before) this is OK, but you will have to teach yourself these concepts relatively quickly. This part of the assignment is designed to help you with that.

We encourage you to use the following textbook for reference: Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. The provided link should give you access to the full text after logging in with your NetID and AD password.

(In fact, most of these questions are lifted from the exercises presented in that book!)

  1. [5 pts] Why might using raw term frequency counts with a dot product similarity not produce the best ranking results?

  2. [5 pts] Let d be a document in a corpus. Suppose we add another copy of d to the collection. How does this affect the IDF values for all words in the corpus?

  3. [10 pts] Suppose there are 16 total relevant documents in a collection. Consider the following result, where + indicates a relevant document and - indicates a non-relevant document.

    {+, +, -, +, +, -, -, +, -, -}

    Calculate the following evaluation measures for this ranked list.

    1. Precision
    2. Recall
    3. F1 score
    4. Average precision
  4. [5 pts] Using the same setup as above, assume that the “gain” of a relevant document is 1 and the “gain” of a non-relevant document is 0. Calculate the following:

    1. Cumulative Gain at 7 documents
    2. Normalized Discounted Cumulative Gain at 7 documents (use log2 for the discounting function)
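
The measures above can all be computed mechanically from a binary relevance list. Below is a minimal sketch in Python; the ranked list and relevant-document count used here are hypothetical (deliberately not the ones from question 3), and the DCG discount follows the convention where the first two ranks are undiscounted (gain at rank i is divided by log2(i) for i ≥ 2) — conventions vary, so check which one your course uses.

```python
import math

def precision(ranking):
    """Fraction of retrieved documents that are relevant (1 = relevant, 0 = not)."""
    return sum(ranking) / len(ranking)

def recall(ranking, total_relevant):
    """Fraction of all relevant documents in the collection that were retrieved."""
    return sum(ranking) / total_relevant

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def average_precision(ranking, total_relevant):
    """Sum of precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / total_relevant

def ndcg_at_k(ranking, k, total_relevant):
    """nDCG@k with binary gains; rank 1 is undiscounted, rank i >= 2 is
    discounted by log2(i). Normalized by the ideal ranking's DCG."""
    def dcg(gains):
        return sum(g / math.log2(max(i, 2)) for i, g in enumerate(gains, start=1))
    ideal = [1] * min(k, total_relevant) + [0] * max(0, k - total_relevant)
    return dcg(ranking[:k]) / dcg(ideal)

# Hypothetical ranked list (NOT the one from the problem), with 6
# relevant documents assumed to exist in the whole collection:
ranking = [1, 0, 1, 0, 1]
p = precision(ranking)   # 3 of 5 retrieved are relevant
r = recall(ranking, 6)   # 3 of 6 relevant were retrieved
```

Running the same functions on the list from question 3 (with 16 total relevant documents) reproduces the quantities the question asks for.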

2. Probabilistic Reasoning and Bayes Rule [25 pts]

Consider the problem of detecting email messages that may carry a virus. This problem can be modeled probabilistically by treating each email message as representing an observation of values of the following 4 random variables:

  1. A: whether the message has an attachment (1 for yes);
  2. K: whether the sender is unknown to the receiver (1 for yes);
  3. L: whether the message is not longer than 10 words (1 for yes); and
  4. V: whether the message carries a virus (1 for yes).

Given a message, we can observe the values of A, L, and K, and we want to infer its value of V. In terms of probabilistic reasoning, we are interested in evaluating the conditional probability p(V | A, L, K), and we would say that the message carries a virus if p(V=1 | A, L, K) > p(V=0 | A, L, K).

We make a further assumption that p(A, L, K | V) = p(A | V) p(L | V) p(K | V) for V = 0 and V = 1, i.e., given whether a message carries a virus, the values of A, K, and L are independent.
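
Under this assumption, Bayes' rule gives p(V | A, K, L) ∝ p(V) p(A | V) p(K | V) p(L | V), with the proportionality constant fixed by normalizing over V = 0 and V = 1. A minimal sketch of this computation, using hypothetical probability values (deliberately not the ones you will derive from Table 1):

```python
def posterior_virus(a, k, l, prior, p_a, p_k, p_l):
    """Return p(V=1 | A=a, K=k, L=l) under the conditional independence
    assumption. prior[v] = p(V=v); p_a[v] = p(A=1 | V=v), and similarly
    for p_k and p_l. For an observed value x, p(X=x | V=v) is p_x[v]
    when x == 1 and 1 - p_x[v] when x == 0."""
    def cond(p, x):
        return p if x == 1 else 1 - p
    joint = {}
    for v in (0, 1):
        joint[v] = (prior[v]
                    * cond(p_a[v], a)
                    * cond(p_k[v], k)
                    * cond(p_l[v], l))
    # Normalize the unnormalized joint probabilities over V:
    return joint[1] / (joint[0] + joint[1])

# Hypothetical parameters, purely for illustration:
prior = {0: 0.5, 1: 0.5}
p_a = {0: 0.3, 1: 0.7}   # p(A=1 | V)
p_k = {0: 0.4, 1: 0.8}   # p(K=1 | V)
p_l = {0: 0.5, 1: 0.5}   # p(L=1 | V)
post = posterior_virus(1, 1, 0, prior, p_a, p_k, p_l)
```

Comparing `post` against 0.5 implements the decision rule p(V=1 | A, K, L) > p(V=0 | A, K, L) from above.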

  1. [3 pts] Suppose we observe 12 samples (Table 1):

    sample #   A   K   L   V
    0          1   1   1   1
    1          1   1   0   1
    2          1   1   0   1
    3          1   0   0   0
    4          1   1   1   0
    5          0   1   0   1
    6          0   1   1   1
    7          0   0   0   0
    8          0   1   0   0
    9          0   1   1   0
    10         0   0   0   0
    11         1   0   0   1

    Fill in the following table (Table 2) with conditional probabilities using only the information present in the above 12 samples.

    V   p(A=1 | V)   p(K=1 | V)   p(L=1 | V)   prior p(V)
    0                                           1/2
    1   4/6
  2. [5 pts] With the independence assumption, use Bayes’ rule and the probabilities you just computed in part A to compute the probability that a message M with A=0, K=1, and L=0 carries a virus, i.e., compute p(V=1 | A=0, K=1, L=0) and p(V=0 | A=0, K=1, L=0). Would we conclude that message M carries a virus?

  3. [3 pts] Now, compute p(V=1 | A=0, K=1, L=0) and p(V=0 | A=0, K=1, L=0) directly from the 12 examples in Table 1, just like what you did in problem A. Do you get the same values as in problem B? Why?

  4. [2 pts] Now, ignore Table 1, and consider the possible ways to fill in Table 2. Are there any constraints on these values that we must respect when assigning them? In other words, can we fill in Table 2 with 8 arbitrary values between 0 and 1?

  5. [2 pts] Can you change your conclusion of problem B (i.e., whether message M carries a virus) by changing only the value of A (i.e., whether the message has an attachment) in 1 example of Table 1?

  6. [5 pts] Note that the conditional independence assumption p(A, L, K | V) = p(A | V) p(L | V) p(K | V) helps simplify the computation of p(A, L, K | V). In particular, with this assumption, we can compute p(A, L, K | V) based on p(A | V), p(L | V), and p(K | V). If we were to specify the values for p(A, L, K | V) directly, what is the minimum number of probability values that we would have to specify in order to fully characterize the conditional probability distribution p(A, L, K | V)? Why? Note that all the probability values of a distribution must sum to 1.

  7. [5 pts] Explain why the independence assumption p(A, L, K | V) = p(A | V) p(L | V) p(K | V) does not necessarily hold in reality.
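
The entries of a table like Table 2 are relative-frequency (maximum likelihood) estimates: p(A=1 | V=v) is the number of samples with A=1 and V=v divided by the number of samples with V=v. A sketch of that counting, using a small hypothetical sample set (deliberately not Table 1):

```python
def cond_prob(samples, feature, value, v):
    """Estimate p(feature = value | V = v) by counting:
    #(samples with feature == value and V == v) / #(samples with V == v)."""
    matching_v = [s for s in samples if s["V"] == v]
    hits = [s for s in matching_v if s[feature] == value]
    return len(hits) / len(matching_v)

# Hypothetical samples, each an observation of (A, K, L, V):
samples = [
    {"A": 1, "K": 1, "L": 0, "V": 1},
    {"A": 1, "K": 0, "L": 1, "V": 1},
    {"A": 0, "K": 1, "L": 0, "V": 0},
    {"A": 0, "K": 0, "L": 0, "V": 0},
]
p_a1_given_v1 = cond_prob(samples, "A", 1, 1)  # both V=1 samples have A=1
```

Applying the same counting to the 12 samples of Table 1 fills in Table 2.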

3. Maximum Likelihood Estimation [50 pts]

A Poisson distribution is often used to model word frequency. Specifically, the number of occurrences of a word in a document of fixed length can be assumed to follow a Poisson distribution given by

p(X = x) = \frac{u^x e^{-u}}{x!}, \quad u > 0

where X is the random variable representing the number of times we have seen a specific word w in a document, and u is the parameter of the Poisson distribution (which happens to be its mean). Now, suppose we observe a sample of counts of a word w, {x_1, ..., x_N}, from N documents with the same length (x_i is the count of w in one document). We want to estimate the parameter u of the Poisson distribution for word w. One commonly used method is the maximum likelihood method, in which we choose the value of u that maximizes the likelihood of our data {x_1, ..., x_N}, i.e.,

\hat{u} = \arg\max_u p(x_1, \ldots, x_N \mid u), \quad u > 0
  1. [35 pts] Derive a closed form formula for this estimate.

    (Hint: Write down the log likelihood of {x_1, ..., x_N}, which will be a function of u. Set the derivative of this function w.r.t. u to zero, and solve the equation for u.)

  2. [15 pts] Now suppose u has an exponential prior distribution

    p(u) = \lambda e^{-\lambda u}, \quad u > 0

    where λ is a given parameter. Derive a closed form for the maximum a posteriori estimate, i.e.,

    \hat{u} = \arg\max_u p(x_1, \ldots, x_N \mid u)\, p(u), \quad u > 0

    (Hint: refer to this Wikipedia page and look for the Example section.)
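
Once you have a closed-form estimate, it can be sanity-checked numerically: evaluate the Poisson log likelihood on a grid of u values and confirm the grid maximizer agrees with your formula. A standard-library-only sketch, using a hypothetical sample of word counts:

```python
import math

def poisson_log_likelihood(u, counts):
    """log p(x_1, ..., x_N | u) for i.i.d. Poisson observations;
    lgamma(x + 1) = log(x!)."""
    return sum(x * math.log(u) - u - math.lgamma(x + 1) for x in counts)

def grid_argmax(f, lo, hi, steps=100000):
    """Crude grid search for the maximizer of f on [lo, hi]."""
    best_u, best_val = lo, f(lo)
    for i in range(1, steps + 1):
        u = lo + (hi - lo) * i / steps
        val = f(u)
        if val > best_val:
            best_u, best_val = u, val
    return best_u

# Hypothetical counts of a word w across N = 5 same-length documents:
counts = [2, 0, 3, 1, 4]
u_hat = grid_argmax(lambda u: poisson_log_likelihood(u, counts), 0.01, 10.0)
# u_hat should land (up to grid resolution) at the value your closed
# form predicts for this sample.
```

The same grid-search check works for the MAP estimate in part 2: add log p(u) = log λ − λu to the objective before maximizing.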