## CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

# Assignment #3: N-Gram Language Model

Notice: This assignment is due Thursday, September 21st at 11:59pm.

This assignment tests your understanding about N-Gram Language Models. In particular, you will work on specific problems related to the estimation of N-Gram Language Model parameters, the issues involved in the estimation process and ways to overcome those issues. Specially, you will deal with different kinds of smoothing techniques, including how smoothing works in practice and what are the commonalities and differences of various smoothing techniques.

### 1. N-Gram Language Model [15 pts]

1. [5 pts] What is the central assumption regarding word dependencies that is made in N-Gram Language Models?

2. [5 pts] Do you think the assumption made in N-Gram Language Models is reasonable? Why?

3. [5 pts] What is the primary benefit of applying the assumption made in N-Gram Language Models?

### 2. Unigram Language Model [15 pts]

Unigram Language Model is a special class of N-Gram Language Model where the next word in the document is assumed to be independent of the previous words generated by the model. Mathematically, this is written as, $P(w_m|w_{m-1},...,w_1)=P(w_m)$.

1. [10 pts] We can estimate the parameters of a Unigram Language Model through Maximum Likelihood Estimation. Given a particular document $d$ and the vocabulary set $V$, the maximum likelihood estimator for a Unigram Language Model is given by the following formula:
$\displaystyle P_{ML}(w|d)=\frac{c(w,d)}{\sum_{w'\in V} c(w',d)}$
Now, let us assume that, we have seen the following document d: “the sun rises in the east and sets in the west”. Assuming that this document was generated by a Unigram Language Model and words in the document $d$ constitute the entire vocabulary, how many parameters are necessary to specify the Unigram Language Model? Estimate the values of all these parameters using the maximum likelihood estimator.
2. [5 pts] Now assume that, the entire vocabulary $V$ consists of the set:
$\displaystyle \{a,the,from,retrieval,sun,rises,in,BM25,east,sets,and,west\}$

If we consider the same document $d$: “the sun rises in the east and sets in the west” and assume again that this document was generated by a Unigram Language Model, how many parameters are necessary to specify the Unigram Language Model in this case? Estimate the values of all these parameters using the maximum likelihood estimator.

### 3. Bigram Language Model [15 pts]

Bigram Language Model is another special class of N-Gram Language Model where the next word in the document depends only on the immediate preceding word. Mathematically, this is written as the conditional probability, $P(w_m|w_{m-1},...,w_1)=P(w_m|w_{m-1})$.

1. [8 pts] Given the same document $d$ from question 2(a) and same vocabulary set $V$ from question 2(b) and assuming the document $d$ is now generated by a Bigram Language Model, how many parameters are necessary to specify the Model? Using the maximum likelihood estimator, estimate the values of the following parameters (assume # to be the start of the sentence marker):

• $P(sun|the)$

• $P(the|in)$

• $P(from|the)$

• $P(BM25|retrieval)$

• $P(east|west)$

• $P(east|rises)$

• $P(sets|\#)$

• $P(the|\#)$

• Do you see any general problem associated with the estimation of the parameters of the Bigram Language Model from problem 3(a)?

• Do you see the same problem in the estimation process for question 2(b)?

• For which model, the problem is more severe? Can you derive some general conclusion based on this comparison?

• What can we do to solve this general problem?

### 4. Smoothing [15 pts]

1. [10 pts] Write down the formula for Dirichlet Prior Smoothing. Then, mathematically prove the following two lemmas:

• Show, in the limit where document length tends to infinity, that a unigram language model smoothed with a Dirichlet prior becomes equivalent to one estimated using the maximum likelihood estimate.

• Show, in the limit where the parameter $\mu$ tends to infinity, that a unigram language model smoothed with a Dirichlet prior becomes equivalent to the background language model used in the smoothing.

2. [5 pts] Point out one advantage of Jelinek-Mercer smoothing over Katz-Backoff smoothing. Explain why.

### 5. Application of Smoothing [40 pts]

Again, Consider the document $d$: “the sun rises in the east and sets in the west”. This time, assume that we have a background word distribution (pre-computed somehow) denoted by $REF$ which is characterized as follows:

• $P_{REF}(a)=0.18$

• $P_{REF}(the)=0.17$

• $P_{REF}(from)=0.13$

• $P_{REF}(retrieval)=0.02$

• $P_{REF}(sun)=0.05$

• $P_{REF}(rises)=0.04$

• $P_{REF}(in)=0.16$

• $P_{REF}(BM25)=0.01$

• $P_{REF}(east)=0.02$

• $P_{REF}(sets)=0.04$

• $P_{REF}(west)=0.02$

• $P_{REF}(and)=0.16$

1. [10 pts] Assume document $d$ is generated by a Unigram Language Model. Estimate the parameters of the Unigram Language Model using Dirichlet Prior Smoothing assuming $\mu=4$. Now, compare this result against the results obtained from 2(b).

2. [10 pts] Repeat problem 5(a) assuming $\mu=0.01$ and $\mu=100$. Compare these results with results from problem 5(a). Do the results match with your intuition? explain why.

3. [20 pts] Repeat problem 5(a) with Jelinek-Mercer smoothing instead of Dirichlet Prior Smoothing assuming $\lambda=\{0.01, 0.5, 0.9\}$ and compare the results obtained for different $\lambda$’s. Also, compare these results with results from problem 5(a) and 5(b). What similarities or differences do you observe?