## CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

# MP4: Hidden Markov Models

Notice: This assignment is due Tuesday, November 14th at 11:59pm Tuesday, November 21st at 11:59pm..

This is a programming assignment, so submission procedures are slightly different than for regular written assignments. You will upload your code as well as your write-up (in either Markdown or PDF) to your git repository. Concrete instructions on how to do so are included in this assignment specification. Make sure that you double-check that you can see your submission in your git repository!

Warning: Please make sure you’ve completed the setup instructions well before the deadline, as we cannot guarantee that you’ll get your installation questions answered at the last minute (since we will prioritize other assignment questions first, since by definition they are closer to finishing than you!).

### 1. Forward Algorithm and Backward Algorithm [30 pts]

In this problem, we are going to generate a sentence using HMM. Suppose there is a hidden variable $S$ that controls the generation of sentence, which takes 2 states $\{V, N\}$. We start from one of the states and then a serious of words is generated as output. The state transition probabilities are as follows:

$A=\{P(V|V)=0.6,~~P(N|V)=0.4,~~~~P(N|N)=0.5,~~P(V|N)=0.5\}$

There are in all 4 words in the vocabulary, $\{hello, ~world, ~print, ~line\}$ and the output probabilities are:

$B=\{P(hello|V) = 0.6, ~P(world|V) = 0.1, ~P(print|V) = 0.2, ~P(line|V) = 0.1$ $~~~~~~~~~~P(hello|N) = 0.1, ~P(world|N) = 0.6, ~P(print|N) = 0.1, ~P(line|N) = 0.2\}$

The start probabilities are:

$\Pi = \{P(V) = 0.5, ~~P(N) = 0.5\}$

Assume that the observed sentence is：

$x="print~~line~~hello~~world"$.

1. [15 pts] Compute $P(x|\lambda)$ using the forward algorithm, i.e., calculate $\alpha_1(V)$, $\alpha_1(N)$, $\alpha_2(V)$, $\alpha_2(N)$, $\alpha_3(V)$, $\alpha_3(N)$, $\alpha_4(V)$, $\alpha_4(N)$. Print out all of them.
2. [15 pts] Compute $P(x|\lambda)$ using the backward algorithm, i.e., calculate $\beta_1(V)$, $\beta_1(N)$, $\beta_2(V)$, $\beta_2(N)$, $\beta_3(V)$, $\beta_3(N)$, $\beta_4(V)$ and $\beta_4(N)$. Print out all of them.

### 2. Viterbi Algorithm[70 pts]

Now let’s consider a more complicated case. Suppose hidden variable $S$ that controls the generation of sentence takes 10 states, which are 10 different Part-of-speech(POS) tags:

$\{RB~~~NN~~~CC~~~VBN~~~JJ~~~IN~~~VBZ~~~DT~~~NNS~~~NNP\}$

The generation process is the same as shown in question #1. A sentence is a list of words are generated as output by a sequence of POS tags. However, the start probabilities($\Pi$), transition probabilities($A$) and output probabilities ($B$) are not given in advance.

We provide a set of sentences that are labeled with Part-of-speech tags in your data/ repository (data/train.json). It contains 38485 sentences and the vocabulary size is 58661. The data is saved in json format. Sentence is saved as a list of words in the field "words", and the corresponding Part-of-speech tags can be found in the field "pos_tags". For example, if you load the data by python in the following way:

with open("data/train.json") as fin:

Then data[i] gives you the $i$-th sentence with its pos tagging. For example, the 120-th sentence is “The landscape is typically lowland rainforest” and its POS tagging is “DT NN VBZ RB JJ NN”:

data[120]["words"] = ["The", "landscape", "is", "typically", "lowland", "rainforest"]
data[120]["pos_tags"] = ["DT", "NN", "VBZ", "RB", "JJ", "NN"]

We are going to estimate start probabilities($\Pi$), transition probabilities($A$) and output probabilities ($B$) based on this data set. Formally, we define start probability of POS tag $t$ as:

$\displaystyle P(t) = \frac{c(\#s, t)}{|D|}$

where $|D|$ is the total number of sentences in the collection and $c(\#s, t)$ is the number of sentences starting with POS tag $t$. Similarly, we define transition probability from $t_i$ to $t_j$ as:

$\displaystyle P(t_j|t_i) = \frac{c(t_i,t_j)}{\sum_{t_{j'} \in S}c(t_i, t_{j'})}$

where $c(t_i, t_j)$ is how many times $t_i$ is followed by $t_j$.

The output probability of generating word $w_j$ by POS tag $t_i$ is defined as:

$\displaystyle P(w_j|t_i) = \frac{c(w_j, t_i)}{c(t_i)}$

where $c(w_j,t_i)$ is the number of times that we see word $w_j$ is labeled with POS tag $t_i$; and $c(t_i)$ is the number of times we see $t_i$.

1. [15 pts] Compute the start probabilities, transition probabilities and output probabilities. Print out all the start probabilities and the top 10 words with the highest output probabilities for each POS tag.

2. [25 pts] Given an observed sequency (sentence), find the most likely state transition path which can be considered as Part-of-speech tagging for the sentence. Implement the Viterbi algorithm using the start probabilities, transition probabilities and output probabilities you get in question #2.a to find the path. Test your algorithm in a test dataset provided in file data/test_0 in the repository (one sentence in each line). Write your tagging result in a new file.

3. [25 pts] Sentences in test_0 only contain words in the vocabulary. However, given a sentence that contains words that are not in the vocabulary, can the algorithm developed in question #2.b deal with it? If not, how can you solve the problem? Explain your solution and implement it. Try it on data/test_1 and save the tagging in a new file.

4. [5 pts] Sentences in data/test_0 and data/test_1 are tagged by another POS tagger; and you can find the tagging in data/test_0_tagging and data/test_1_tagging respectively (the tagging for the sentence in the $i$-th line in test_0 can be found in the $i$-th line in test_0_tagging). Take them as groundtruth and calculate the accuracy for test_0 and test_1 (i.e., two accuracy scores, one for test_0 and the other for test_1). Report the scores in your MD or PDF file.