Assignment #1: Background, Probability and Statistics
Notice: This assignment is due Saturday, September 9th at 11:59pm.
Please submit your solutions via Compass. You should submit your assignment as a typeset PDF. Please do not include scanned or photographed equations as they are difficult for us to grade.
1. Information Retrieval Background [25 pts]
Info: The following questions should be review from material covered in the CS 410 prerequisite (or equivalent). If you find yourself having difficulty with these concepts (or have never seen them before), that is OK, but you will have to teach yourself these concepts relatively quickly. This part of the assignment is designed to help you with that.
We encourage you to use the following textbook for reference: Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining. The provided link should give you access to the full text after logging in with your NetID and AD password.
(In fact, most of these questions are lifted from the exercises presented in that book!)
[5 pts] Why might using raw term frequency counts with a dot product similarity not produce the best ranking results?
[5 pts] Let $d$ be a document in a corpus. Suppose we add another copy of $d$ to the collection. How does this affect the IDF values for all words in the corpus?
[10 pts] Suppose there are 16 total relevant documents in a collection. Consider the following ranked result, where “+” indicates a relevant document and “−” indicates a non-relevant document.
Calculate the following evaluation measures for this ranked list.
- Average precision
[5 pts] Using the same setup as above, assume that the “gain” of a relevant document is 1 and the “gain” of a non-relevant document is 0. Calculate the following:
- Cumulative Gain at 7 documents
- Normalized Discounted Cumulative Gain at 7 documents (use $\log_2(i+1)$ for the discounting function, where $i$ is the rank)
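These measures are mechanical to compute by hand, but a short script is a useful cross-check of your arithmetic. The sketch below is a minimal implementation under common definitions (average precision as the sum of precision values at relevant ranks divided by the total number of relevant documents, and a log2(i+1) rank discount for DCG). The `ranking` vector is a made-up placeholder, not the assignment's actual ranked list.

```python
import math

# Hypothetical ranked list: 1 = relevant, 0 = non-relevant.
# (Placeholder only -- substitute the assignment's actual ranked list.)
ranking = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
TOTAL_RELEVANT = 16  # total relevant documents in the collection

def average_precision(ranking, total_relevant):
    """Sum of precision values at each rank holding a relevant document,
    divided by the total number of relevant documents in the collection."""
    hits, ap = 0, 0.0
    for i, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            ap += hits / i  # precision at this rank
    return ap / total_relevant

def dcg_at_k(gains, k):
    """Discounted cumulative gain, discounting rank i by log2(i + 1)."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def ndcg_at_k(gains, k, total_relevant):
    """DCG normalized by the DCG of the ideal ranking (all relevant docs first)."""
    ideal = [1] * min(k, total_relevant)
    return dcg_at_k(gains, k) / dcg_at_k(ideal, k)

print(average_precision(ranking, TOTAL_RELEVANT))
print(ndcg_at_k(ranking, 7, TOTAL_RELEVANT))
```

Swapping in the real relevance judgments lets you verify a hand computation rank by rank.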
2. Probabilistic Reasoning and Bayes Rule [25 pts]
Consider the problem of detecting email messages that may carry a virus. This problem can be modeled probabilistically by treating each email message as representing an observation of values of the following 4 random variables:
- $A$: whether the message has an attachment (1 for yes);
- $U$: whether the sender is unknown to the receiver (1 for yes);
- $S$: whether the message is not longer than 10 words (1 for yes); and
- $V$: whether the message carries a virus (1 for yes).
Given a message, we can observe the values of $A$, $U$, and $S$, and we want to infer its value of $V$. In terms of probabilistic reasoning, we are interested in evaluating the conditional probability $P(V \mid A, U, S)$, and we would say that the message carries a virus if $P(V = 1 \mid A, U, S) > P(V = 0 \mid A, U, S)$.
We make a further assumption that $P(A, U, S \mid V) = P(A \mid V)\,P(U \mid V)\,P(S \mid V)$ for $V = 0$ and $V = 1$, i.e., given the status of whether a message carries a virus, the values of $A$, $U$, and $S$ are conditionally independent.
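The setup above is exactly the naive Bayes model: the independence assumption turns the posterior into a product of per-attribute conditionals, normalized across the two values of the virus label. The sketch below illustrates the computation, writing A, U, S for the three observed attributes and V for the virus label; the probability values are made up for illustration and are not the ones from Table 1.

```python
# Illustrative conditional probabilities (made-up values, NOT from Table 1):
# p_feature[v] = (P(A=1|V=v), P(U=1|V=v), P(S=1|V=v))
p_feature = {0: (0.3, 0.4, 0.2), 1: (0.8, 0.7, 0.6)}
p_prior = {0: 0.5, 1: 0.5}  # P(V=0), P(V=1)

def posterior(a, u, s):
    """P(V=1 | A=a, U=u, S=s) via Bayes' rule with the independence
    assumption P(A,U,S|V) = P(A|V) * P(U|V) * P(S|V)."""
    joint = {}
    for v in (0, 1):
        lik = 1.0
        for x, p1 in zip((a, u, s), p_feature[v]):
            lik *= p1 if x == 1 else (1.0 - p1)
        joint[v] = p_prior[v] * lik  # P(V=v) * P(A,U,S | V=v)
    return joint[1] / (joint[0] + joint[1])  # normalize over V

print(posterior(1, 1, 0))  # P(V=1 | A=1, U=1, S=0)
```

The denominator is just $P(A, U, S)$ written as a sum over both values of $V$, which is why no separate estimate of it is needed.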
[3 pts] Suppose we observe 12 samples (Table 1):
sample #   V   A   U   S
   0       1   1   1   1
   1       1   1   1   0
   2       1   1   0   1
   3       1   0   0   0
   4       1   1   1   0
   5       0   1   0   1
   6       0   1   1   1
   7       0   0   0   0
   8       0   1   0   0
   9       0   1   1   0
  10       0   0   0   0
  11       1   0   0   1
Fill in the following table (Table 2) with conditional probabilities using only the information present in the above 12 samples.
 V   prior P(V)   P(A=1|V)   P(U=1|V)   P(S=1|V)
 0      1/2
 1                  4/6
[5 pts] With the independence assumption, use Bayes’ rule and the probabilities you just computed in part A to compute the probability that a message with the given observed values of $A$, $U$, and $S$ carries a virus, i.e., compute $P(V = 1 \mid A, U, S)$ and $P(V = 0 \mid A, U, S)$. Would we conclude that this message carries a virus?
[3 pts] Now, compute $P(V = 1 \mid A, U, S)$ and $P(V = 0 \mid A, U, S)$ directly from the 12 examples in Table 1, just like what you did in problem A. Do you get the same values as in problem B? Why?
[2 pts] Now, ignore Table 1, and consider any values you could fill into Table 2. Are there any constraints on these values that we must respect when assigning them? In other words, can we fill in Table 2 with 8 arbitrary values between 0 and 1?
[2 pts] Can you change your conclusion of problem B (i.e., whether the message carries a virus) by changing only the value of $A$ (i.e., whether the message has an attachment) in one example of Table 1?
[5 pts] Note that the conditional independence assumption helps simplify the computation of $P(V \mid A, U, S)$. In particular, with this assumption, we can compute it based on $P(A \mid V)$, $P(U \mid V)$, and $P(S \mid V)$ (together with the prior $P(V)$). If we were to specify the values for $P(A, U, S \mid V)$ directly, what is the minimum number of probability values that we would have to specify in order to fully characterize the conditional probability distribution $P(A, U, S \mid V)$? Why? Note that all the probability values of a distribution must sum to 1.
[5 pts] Explain why the independence assumption does not necessarily hold in reality.
3. Maximum Likelihood Estimation [50 pts]
A Poisson distribution is often used to model word frequency. Specifically, the number of occurrences of a word in a document with fixed length can be assumed to follow a Poisson distribution given by

$$P(X = x \mid \lambda) = \frac{\lambda^{x} e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \ldots$$

where $X$ is the random variable representing the number of times we have seen a specific word in a document, and $\lambda$ is the parameter of the Poisson distribution (which happens to be its mean). Now, suppose we observe a sample of counts of a word $w$, $x_1, x_2, \ldots, x_n$, from $n$ documents with the same length ($x_i$ is the count of $w$ in one document). We want to estimate the parameter $\lambda$ of the Poisson distribution for word $w$. One commonly used method is the maximum likelihood method, in which we choose a value for $\lambda$ that maximizes the likelihood of our data $x_1, \ldots, x_n$, i.e.,

$$\hat{\lambda} = \arg\max_{\lambda} \prod_{i=1}^{n} P(X = x_i \mid \lambda).$$
[35 pts] Derive a closed-form formula for this estimate $\hat{\lambda}$.
(Hint: Write down the log likelihood of $x_1, \ldots, x_n$, which would be a function of $\lambda$. Set the derivative of this function w.r.t. $\lambda$ to zero, and solve the equation for $\lambda$.)
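The derivation itself is left to you, but whatever closed form you obtain is easy to sanity-check numerically: evaluate the Poisson log-likelihood on a fine grid of $\lambda$ values and confirm that the grid maximizer matches your formula. The counts below are hypothetical sample data, not part of the assignment.

```python
import math

counts = [3, 0, 2, 5, 1, 2, 4, 3]  # hypothetical word counts x_1 .. x_n

def log_likelihood(lam, xs):
    """Poisson log-likelihood: sum_i [x_i * log(lam) - lam - log(x_i!)]."""
    return sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)

# Grid search over candidate lambda values in (0, 10].
grid = [0.01 * k for k in range(1, 1001)]
best = max(grid, key=lambda lam: log_likelihood(lam, counts))
print(best)  # compare against your closed-form estimate on this data
```

Because the log-likelihood is strictly concave in $\lambda$, the grid maximizer is guaranteed to be the grid point nearest the true optimum.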
[15 pts] Now suppose $\lambda$ has a prior exponential distribution

$$p(\lambda \mid \beta) = \beta e^{-\beta \lambda}, \qquad \lambda \ge 0,$$

where $\beta$ is a given parameter. Derive a closed form for the maximum a posteriori estimate, i.e.,

$$\hat{\lambda}_{\text{MAP}} = \arg\max_{\lambda} \; p(\lambda \mid \beta) \prod_{i=1}^{n} P(X = x_i \mid \lambda).$$
(Hint: refer to this Wikipedia page and look for the Example section.)
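The same numerical sanity check works for the MAP estimate: add the log of the exponential prior to the log-likelihood and maximize the sum over a grid. The counts and the value of $\beta$ below are hypothetical choices for illustration only.

```python
import math

counts = [3, 0, 2, 5, 1, 2, 4, 3]  # hypothetical word counts x_1 .. x_n
beta = 2.0                          # hypothetical prior parameter

def log_posterior(lam, xs, beta):
    """log p(lam | beta) + log-likelihood, with p(lam | beta) = beta * exp(-beta*lam)."""
    log_prior = math.log(beta) - beta * lam
    log_lik = sum(x * math.log(lam) - lam - math.lgamma(x + 1) for x in xs)
    return log_prior + log_lik

# Grid search over candidate lambda values in (0, 10].
grid = [0.01 * k for k in range(1, 1001)]
best = max(grid, key=lambda lam: log_posterior(lam, counts, beta))
print(best)  # compare against your closed-form MAP estimate on this data
```

Note how the prior pulls the maximizer below the maximum likelihood estimate for the same counts, which is exactly the behavior your closed form should predict.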