CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

MP4: Hidden Markov Models

Notice: This assignment is due Tuesday, November 21st at 11:59pm (extended from the original deadline of Tuesday, November 14th at 11:59pm).

This is a programming assignment, so submission procedures are slightly different than for regular written assignments. You will upload your code as well as your write-up (in either Markdown or PDF) to your git repository. Concrete instructions on how to do so are included in this assignment specification. Make sure that you double-check that you can see your submission in your git repository!

Warning: Please make sure you’ve completed the setup instructions well before the deadline, as we cannot guarantee that you’ll get your installation questions answered at the last minute (we will prioritize other assignment questions first, since the students asking those are, by definition, closer to finishing than you!).

1. Forward Algorithm and Backward Algorithm [30 pts]

In this problem, we are going to generate a sentence using an HMM. Suppose there is a hidden variable $S$ that controls the generation of the sentence, which takes 2 states $\{V, N\}$. We start from one of the states, and then a series of words is generated as output. The state transition probabilities are as follows:

$$A = \{P(V|V) = 0.6,\ \ P(N|V) = 0.4,\ \ P(N|N) = 0.5,\ \ P(V|N) = 0.5\}$$

There are 4 words in total in the vocabulary, $\{hello,\ world,\ print,\ line\}$, and the output probabilities are:

$$B = \{P(hello|V) = 0.6,\ P(world|V) = 0.1,\ P(print|V) = 0.2,\ P(line|V) = 0.1,$$
$$P(hello|N) = 0.1,\ P(world|N) = 0.6,\ P(print|N) = 0.1,\ P(line|N) = 0.2\}$$

The start probabilities are:

$$\Pi = \{P(V) = 0.5,\ \ P(N) = 0.5\}$$

Assume that the observed sentence is:

x="print  line  hello  world"x="print~~line~~hello~~world".

  1. [15 pts] Compute $P(x|\lambda)$ using the forward algorithm, i.e., calculate $\alpha_1(V)$, $\alpha_1(N)$, $\alpha_2(V)$, $\alpha_2(N)$, $\alpha_3(V)$, $\alpha_3(N)$, $\alpha_4(V)$, $\alpha_4(N)$. Print out all of them.
  2. [15 pts] Compute $P(x|\lambda)$ using the backward algorithm, i.e., calculate $\beta_1(V)$, $\beta_1(N)$, $\beta_2(V)$, $\beta_2(N)$, $\beta_3(V)$, $\beta_3(N)$, $\beta_4(V)$ and $\beta_4(N)$. Print out all of them. (A sketch of both recursions is given after this list.)
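To make the two recursions concrete, here is a minimal Python sketch, assuming the parameters above are stored as plain dictionaries; all variable names are illustrative, and it uses the common convention $\beta_T(s) = 1$ (check that this matches the convention from lecture):

states = ["V", "N"]
pi = {"V": 0.5, "N": 0.5}
A = {"V": {"V": 0.6, "N": 0.4}, "N": {"V": 0.5, "N": 0.5}}   # A[s][s'] = P(s'|s)
B = {"V": {"hello": 0.6, "world": 0.1, "print": 0.2, "line": 0.1},
     "N": {"hello": 0.1, "world": 0.6, "print": 0.1, "line": 0.2}}
x = ["print", "line", "hello", "world"]

# Forward: alpha_t(s) = P(x_1 .. x_t, S_t = s)
alpha = [{s: pi[s] * B[s][x[0]] for s in states}]
for t in range(1, len(x)):
    alpha.append({s: sum(alpha[t - 1][r] * A[r][s] for r in states) * B[s][x[t]]
                  for s in states})

# Backward: beta_t(s) = P(x_{t+1} .. x_T | S_t = s), with beta_T(s) = 1
beta = [None] * len(x)
beta[-1] = {s: 1.0 for s in states}
for t in range(len(x) - 2, -1, -1):
    beta[t] = {s: sum(A[s][r] * B[r][x[t + 1]] * beta[t + 1][r] for r in states)
               for s in states}

# Both directions should yield the same P(x | lambda).
print(sum(alpha[-1][s] for s in states))
print(sum(pi[s] * B[s][x[0]] * beta[0][s] for s in states))

The two printed values should agree up to floating-point rounding, which is a useful sanity check before reporting the individual alphas and betas.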

2. Viterbi Algorithm [70 pts]

Now let’s consider a more complicated case. Suppose the hidden variable $S$ that controls the generation of the sentence takes 10 states, which are 10 different Part-of-Speech (POS) tags:

$$\{RB \quad NN \quad CC \quad VBN \quad JJ \quad IN \quad VBZ \quad DT \quad NNS \quad NNP\}$$

The generation process is the same as in question #1: a sentence is a list of words generated as output by a sequence of POS tags. However, the start probabilities ($\Pi$), transition probabilities ($A$), and output probabilities ($B$) are not given in advance.

We provide a set of sentences labeled with Part-of-Speech tags in the data/ directory of your repository (data/train.json). It contains 38485 sentences, and the vocabulary size is 58661. The data is saved in JSON format. Each sentence is saved as a list of words in the field "words", and the corresponding Part-of-Speech tags can be found in the field "pos_tags". For example, if you load the data in Python in the following way:

with open("data/train.json") as fin:
  data = json.load(fin)

Then data[i] gives you the $i$-th sentence with its POS tagging. For example, the 120-th sentence is “The landscape is typically lowland rainforest” and its POS tagging is “DT NN VBZ RB JJ NN”:

data[120]["words"] = ["The", "landscape", "is", "typically", "lowland", "rainforest"]
data[120]["pos_tags"] = ["DT", "NN", "VBZ", "RB", "JJ", "NN"]

We are going to estimate the start probabilities ($\Pi$), transition probabilities ($A$) and output probabilities ($B$) based on this data set. Formally, we define the start probability of POS tag $t$ as:

$$P(t) = \frac{c(\#s, t)}{|D|}$$

where $|D|$ is the total number of sentences in the collection and $c(\#s, t)$ is the number of sentences starting with POS tag $t$. Similarly, we define the transition probability from $t_i$ to $t_j$ as:

$$P(t_j|t_i) = \frac{c(t_i, t_j)}{\sum_{t_{j'} \in S} c(t_i, t_{j'})}$$

where $c(t_i, t_j)$ is the number of times $t_i$ is followed by $t_j$.

The output probability of generating word $w_j$ from POS tag $t_i$ is defined as:

$$P(w_j|t_i) = \frac{c(w_j, t_i)}{c(t_i)}$$

where $c(w_j, t_i)$ is the number of times we see word $w_j$ labeled with POS tag $t_i$, and $c(t_i)$ is the number of times we see $t_i$.
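As a rough illustration of these three counting formulas, the sketch below accumulates the counts in one pass over data/train.json and then normalizes them. It assumes the "words"/"pos_tags" layout described above, and all variable names are illustrative rather than required:

import json
from collections import defaultdict

with open("data/train.json") as fin:
    data = json.load(fin)

start_counts = defaultdict(int)                        # c(#s, t)
trans_counts = defaultdict(lambda: defaultdict(int))   # c(t_i, t_j)
emit_counts = defaultdict(lambda: defaultdict(int))    # c(w_j, t_i)
tag_counts = defaultdict(int)                          # c(t_i)

for sent in data:
    words, tags = sent["words"], sent["pos_tags"]
    start_counts[tags[0]] += 1
    for i, (w, t) in enumerate(zip(words, tags)):
        tag_counts[t] += 1
        emit_counts[t][w] += 1
        if i + 1 < len(tags):
            trans_counts[t][tags[i + 1]] += 1

pi = {t: c / len(data) for t, c in start_counts.items()}
A = {ti: {tj: c / sum(row.values()) for tj, c in row.items()}
     for ti, row in trans_counts.items()}
B = {t: {w: c / tag_counts[t] for w, c in row.items()}
     for t, row in emit_counts.items()}

# Top 10 output words per tag (asked for in question #2.a)
top10 = {t: sorted(row.items(), key=lambda kv: -kv[1])[:10] for t, row in B.items()}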

  1. [15 pts] Compute the start probabilities, transition probabilities and output probabilities. Print out all the start probabilities and the top 10 words with the highest output probabilities for each POS tag.

  2. [25 pts] Given an observed sequence (sentence), find the most likely state transition path, which can be considered as the Part-of-Speech tagging for the sentence. Implement the Viterbi algorithm using the start probabilities, transition probabilities and output probabilities you get in question #2.a to find the path (a sketch of the recursion is given after this list). Test your algorithm on the test dataset provided in file data/test_0 in the repository (one sentence per line). Write your tagging result in a new file.

  3. [25 pts] Sentences in test_0 only contain words in the vocabulary. However, given a sentence that contains words that are not in the vocabulary, can the algorithm developed in question #2.b deal with it? If not, how can you solve the problem? Explain your solution and implement it. Try it on data/test_1 and save the tagging in a new file.

  4. [5 pts] Sentences in data/test_0 and data/test_1 are tagged by another POS tagger; you can find the tagging in data/test_0_tagging and data/test_1_tagging respectively (the tagging for the sentence in the $i$-th line of test_0 can be found in the $i$-th line of test_0_tagging). Take them as ground truth and calculate the accuracy for test_0 and test_1 (i.e., two accuracy scores, one for test_0 and the other for test_1). Report the scores in your MD or PDF file.
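For reference, here is a minimal log-space Viterbi sketch. The unk_prob floor for unseen words is a hypothetical placeholder, one possible baseline for question #2.c rather than the required solution; all names are illustrative:

import math

def viterbi(words, states, pi, A, B, unk_prob=1e-8):
    # Log-space scores avoid underflow on long sentences.
    def logp(p):
        return math.log(p) if p > 0 else float("-inf")

    def emit(s, w):
        # unk_prob is a hypothetical fallback for out-of-vocabulary words.
        return B.get(s, {}).get(w, unk_prob)

    # delta[t][s]: best log-probability of any path ending in state s at position t
    # back[t][s]:  predecessor state on that best path
    delta = [{s: logp(pi.get(s, 0.0)) + logp(emit(s, words[0])) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        delta.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, delta[t - 1][r] + logp(A.get(r, {}).get(s, 0.0))) for r in states),
                key=lambda rs: rs[1])
            delta[t][s] = score + logp(emit(s, words[t]))
            back[t][s] = prev

    # Trace back from the best final state to recover the tag sequence.
    tag = max(states, key=lambda s: delta[-1][s])
    path = [tag]
    for t in range(len(words) - 1, 0, -1):
        tag = back[t][tag]
        path.append(tag)
    return list(reversed(path))

For the accuracy scores in question #2.d, one straightforward approach is to count, over all sentences, how many predicted tags match the corresponding line of test_0_tagging (or test_1_tagging) and divide by the total number of tokens.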

3. Triple-Check Your Repository

When you’ve finished all of the above, triple check your repository by navigating to it in GitLab in a browser. Here are a few things to check:

  1. Make sure that you can see your writeup (and that it is the most up-to-date version).

  2. Make sure that all of the files needed for building/running your code are there.

  3. Make sure that your code compiles and runs properly in the CI environment.

  4. Make sure that your README.md is up to date with instructions about how to compile/run your code.

  5. Make sure that you update your .gitlab-ci.yml file.