MP4: Hidden Markov Models
Notice: This assignment is due ~~Tuesday, November 14th at 11:59pm~~ Tuesday, November 21st at 11:59pm.
This is a programming assignment, so submission procedures are slightly different than for regular written assignments. You will upload your code as well as your write-up (in either Markdown or PDF) to your git repository. Concrete instructions on how to do so are included in this assignment specification. Make sure that you double-check that you can see your submission in your git repository!
Warning: Please make sure you’ve completed the setup instructions well before the deadline. We cannot guarantee that your installation questions will be answered at the last minute, since we will prioritize other assignment questions first (by definition, those students are closer to finishing than you!).
1. Forward Algorithm and Backward Algorithm [30 pts]
In this problem, we are going to generate a sentence using an HMM. Suppose there is a hidden variable that controls the generation of the sentence, which takes 2 states. We start from one of the states, and then a series of words is generated as output. The state transition probabilities are as follows:
There are 4 words in total in the vocabulary, and the output probabilities are:
The start probabilities are:
Assume that the observed sentence is:
- [15 pts] Compute $P(O)$ using the forward algorithm, i.e., calculate the forward variables $\alpha_t(i)$ for every position $t$ in the sentence and both states $i$. Print out all of them. (A minimal sketch of both recursions follows this list.)
- [15 pts] Compute $P(O)$ using the backward algorithm, i.e., calculate the backward variables $\beta_t(i)$ for every position $t$ and both states $i$. Print out all of them.
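Below is a minimal NumPy sketch of both recursions. The values of `pi`, `A`, `B`, and the word indices in `obs` are uniform placeholders for illustration only; substitute the start, transition, and output probabilities and the observed sentence given above.

```python
import numpy as np

# Placeholder parameters -- replace with the tables from this assignment.
pi = np.array([0.5, 0.5])                  # start probabilities, one per state
A = np.array([[0.5, 0.5],                  # A[i, j] = P(next state j | current state i)
              [0.5, 0.5]])
B = np.array([[0.25, 0.25, 0.25, 0.25],    # B[i, k] = P(word k | state i)
              [0.25, 0.25, 0.25, 0.25]])
obs = [0, 1, 2, 3]                         # observed sentence as word indices (placeholder)

def forward(pi, A, B, obs):
    """Return alpha[t, i] = P(o_1..o_t, state at t = i)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # induction step
    return alpha

def backward(pi, A, B, obs):
    """Return beta[t, i] = P(o_{t+1}..o_T | state at t = i)."""
    T, N = len(obs), len(pi)
    beta = np.ones((T, N))                             # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) # induction step
    return beta

alpha = forward(pi, A, B, obs)
beta = backward(pi, A, B, obs)
print(alpha)                                # all the alpha values
print(beta)                                 # all the beta values
print(alpha[-1].sum())                      # P(O) from the forward pass
print((pi * B[:, obs[0]] * beta[0]).sum())  # P(O) from the backward pass
```

The last two prints should agree: summing the final row of `alpha` and summing $\pi_i \, b_i(o_1) \, \beta_1(i)$ both give $P(O)$, which is a useful cross-check between the two parts.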
2. Viterbi Algorithm [70 pts]
Now let’s consider a more complicated case. Suppose the hidden variable that controls the generation of the sentence takes 10 states, which are 10 different Part-of-speech (POS) tags:
The generation process is the same as shown in question #1. A sentence is a list of words generated as output by a sequence of POS tags. However, the start probabilities ($\pi$), transition probabilities ($a$), and output probabilities ($b$) are not given in advance.
We provide a set of sentences that are labeled with Part-of-speech tags in your repository (`data/train.json`). It contains 38485 sentences and the vocabulary size is 58661. The data is saved in JSON format. Each sentence is saved as a list of words in the field `"words"`, and the corresponding Part-of-speech tags can be found in the field `"pos_tags"`. For example, if you load the data in Python in the following way:
```python
import json

with open("data/train.json") as fin:
    data = json.load(fin)
```
Then `data[i]` gives you the $i$-th sentence with its POS tagging. For example, the 120-th sentence is “The landscape is typically lowland rainforest” and its POS tagging is “DT NN VBZ RB JJ NN”:

```python
data[120]["words"] = ["The", "landscape", "is", "typically", "lowland", "rainforest"]
data[120]["pos_tags"] = ["DT", "NN", "VBZ", "RB", "JJ", "NN"]
```
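As a quick sanity check after loading, you can recompute the corpus statistics quoted above (a short sketch; it assumes the vocabulary is the set of distinct word strings):

```python
# Count sentences and distinct words in the training data.
vocab = {w for sent in data for w in sent["words"]}
print(len(data))   # expect 38485 sentences
print(len(vocab))  # expect 58661 distinct words
```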
We are going to estimate the start probabilities ($\pi$), transition probabilities ($a$), and output probabilities ($b$) based on this data set. Formally, we define the start probability of POS tag $t$ as:

$$\pi(t) = \frac{C_{\text{start}}(t)}{N}$$

where $N$ is the total number of sentences in the collection and $C_{\text{start}}(t)$ is the number of sentences starting with POS tag $t$. Similarly, we define the transition probability from $t_i$ to $t_j$ as:

$$a(t_i, t_j) = \frac{C(t_i, t_j)}{C(t_i)}$$

where $C(t_i, t_j)$ is how many times $t_i$ is followed by $t_j$. The output probability of generating word $w$ by POS tag $t$ is defined as:

$$b(t, w) = \frac{C(t, w)}{C(t)}$$

where $C(t, w)$ is the number of times that we see word $w$ labeled with POS tag $t$, and $C(t)$ is the number of times we see $t$.
- [15 pts] Compute the start probabilities, transition probabilities, and output probabilities. Print out all the start probabilities and the top 10 words with the highest output probabilities for each POS tag. (A counting sketch follows this list.)
- [25 pts] Given an observed sequence (sentence), find the most likely state transition path, which can be considered as the Part-of-speech tagging for the sentence. Implement the Viterbi algorithm using the start probabilities, transition probabilities, and output probabilities you get in question #2.a to find the path. Test your algorithm on the test dataset provided in file `data/test_0` in the repository (one sentence in each line). Write your tagging result in a new file. (A log-space Viterbi sketch follows this list.)
- [25 pts] Sentences in `test_0` only contain words in the vocabulary. However, given a sentence that contains words that are not in the vocabulary, can the algorithm developed in question #2.b deal with it? If not, how can you solve the problem? Explain your solution and implement it. Try it on `data/test_1` and save the tagging in a new file.
- [5 pts] Sentences in `data/test_0` and `data/test_1` are tagged by another POS tagger; you can find the tagging in `data/test_0_tagging` and `data/test_1_tagging`, respectively (the tagging for the sentence in the $i$-th line of `test_0` can be found in the $i$-th line of `test_0_tagging`). Take them as ground truth and calculate the accuracy for `test_0` and `test_1` (i.e., two accuracy scores, one for `test_0` and the other for `test_1`). Report the scores in your MD or PDF file. (An accuracy sketch follows this list.)
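For question #2.a, the three probability tables fall out of simple counts over the training data under the definitions above. Here is a minimal sketch; the names `start_prob`, `trans_prob`, and `emit_prob` are chosen for illustration:

```python
import json
from collections import Counter, defaultdict

with open("data/train.json") as fin:
    data = json.load(fin)

start_counts = Counter()             # C_start(t)
tag_counts = Counter()               # C(t)
trans_counts = defaultdict(Counter)  # C(t_i, t_j)
emit_counts = defaultdict(Counter)   # C(t, w)

for sent in data:
    words, tags = sent["words"], sent["pos_tags"]
    start_counts[tags[0]] += 1
    for w, t in zip(words, tags):
        tag_counts[t] += 1
        emit_counts[t][w] += 1
    for t_i, t_j in zip(tags, tags[1:]):
        trans_counts[t_i][t_j] += 1

n_sents = len(data)
start_prob = {t: c / n_sents for t, c in start_counts.items()}
trans_prob = {t_i: {t_j: c / tag_counts[t_i] for t_j, c in row.items()}
              for t_i, row in trans_counts.items()}
emit_prob = {t: {w: c / tag_counts[t] for w, c in row.items()}
             for t, row in emit_counts.items()}

print(start_prob)                    # all 10 start probabilities
for t, row in emit_prob.items():     # top-10 output words per POS tag
    top10 = sorted(row.items(), key=lambda kv: -kv[1])[:10]
    print(t, top10)
```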
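For questions #2.b and #2.c, one common way to implement Viterbi is in log space to avoid floating-point underflow, with out-of-vocabulary words handled by backing off to a small floor probability. The sketch below reuses the dictionaries from the estimation sketch above; the OOV floor, the whitespace tokenization, and the output file name are assumptions, not required choices:

```python
import math

NEG_INF = float("-inf")
# Assumed handling for question #2.c: a small floor probability for words
# never seen with a tag. This is one possible fix, not the required one;
# sentences in test_0 never hit this branch.
OOV_LOGP = math.log(1e-8)

def log_emit(t, w):
    """Log output probability b(t, w), with the OOV floor as a fallback."""
    p = emit_prob.get(t, {}).get(w)
    return math.log(p) if p else OOV_LOGP

def viterbi(words, tags):
    """Most likely tag path under the estimated HMM, computed in log space."""
    # V[i][t] = best log-probability of any path ending in tag t at position i
    V = [{}]
    back = [{}]
    for t in tags:
        sp = start_prob.get(t, 0.0)
        V[0][t] = (math.log(sp) if sp else NEG_INF) + log_emit(t, words[0])
        back[0][t] = None
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            best_prev, best_score = None, NEG_INF
            for p in tags:
                tp = trans_prob.get(p, {}).get(t, 0.0)
                if tp > 0.0 and V[i - 1][p] > NEG_INF:
                    score = V[i - 1][p] + math.log(tp)
                    if score > best_score:
                        best_prev, best_score = p, score
            V[i][t] = best_score + log_emit(t, words[i])
            back[i][t] = best_prev
    # Recover the path by following backpointers from the best final tag.
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = back[i][last]
        path.append(last)
    return list(reversed(path))

# Tag every line of the test file ("data/test_0_mytagging" is a
# hypothetical output file name; whitespace tokenization assumed).
tags = sorted(tag_counts)
with open("data/test_0") as fin, open("data/test_0_mytagging", "w") as fout:
    for line in fin:
        words = line.split()
        if words:
            fout.write(" ".join(viterbi(words, tags)) + "\n")
```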
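For question #2.d, a straightforward way to score your output is token-level accuracy, assuming that is the intended granularity (the assignment does not specify token- vs. sentence-level); the prediction file name matches the sketch above:

```python
# Compare predicted tags against the provided ground-truth tagging,
# line by line and token by token.
def accuracy(pred_file, gold_file):
    correct = total = 0
    with open(pred_file) as fp, open(gold_file) as fg:
        for pred_line, gold_line in zip(fp, fg):
            for p, g in zip(pred_line.split(), gold_line.split()):
                correct += (p == g)
                total += 1
    return correct / total

print(accuracy("data/test_0_mytagging", "data/test_0_tagging"))
```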
3. Triple-Check Your Repository
When you’ve finished all of the above, triple check your repository by navigating to it in GitLab in a browser. Here are a few things to check:
- Make sure that you can see your writeup (and that it is the most up-to-date version).
- Make sure that all of the files needed for building/running your code are there.
- Make sure that your code compiles and runs properly in the CI environment.
- Make sure that your `README.md` is up to date with instructions about how to compile/run your code.
- Make sure that you update your `.gitlab-ci.yml` file.