MP3: PLSA and the EM Algorithm
Notice: This assignment is due ~~Tuesday, October 31st at 11:59pm~~ Tuesday, November 7th at 11:59pm.
This is a programming assignment, so submission procedures are slightly
different than for regular written assignments. You will upload your code
as well as your write-up (in either Markdown or PDF) to your
git repository. Concrete instructions on how to do so are included in
this assignment specification. Make sure that you double-check that you
can see your submission in your repository on GitLab.
Partner Assignment: You are explicitly allowed (but not required) to work with at most one other person for this assignment. You are encouraged to use this as an opportunity to “scout” potential project partners. Try to find a partner that complements your skill set.
Everyone will have their own repository for this assignment; if you work in
a group, please designate one member's repository to be used for grading,
and make sure that the other member of the partnership indicates this
(e.g., by providing a link to the canonical repository) in the
README.md file of their repository.
Please also indicate in your writeup in the graded repository who your partner was.
To grant your partner access to your repository (so that you can use it to collaborate), you can go to Settings -> Members in the sidebar on your repository, and then add your partner under a Developer or Master role.
The PLSA algorithm, initially introduced in 1999 by Thomas Hofmann, models a corpus of documents as a mixture of $k$ different "topics" $\theta_1, \dots, \theta_k$, each of which is a multinomial over the words in a fixed vocabulary $V$. Each document $d$ is associated with a distribution $\pi_d$, which is a multinomial over the topics: $\pi_{d,j}$ represents the likelihood that a word in document $d$ is generated from topic $\theta_j$.
Under this model, we have a log likelihood of

$$\log p(C) = \sum_{d \in C} \sum_{w \in V} c(w, d) \log \sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j)$$

where $c(w, d)$ is the number of times word $w$ occurs in document $d$.
If we wished to find the maximum likelihood estimate of the parameters $\pi_{d,j}$ and $p(w \mid \theta_j)$, we would have trouble coming up with an analytical solution due to the summation occurring inside of the logarithm. Thus, we turn to the EM algorithm to come up with a maximum likelihood estimator for the parameters instead.
In lecture, we discussed a modification of the original PLSA model that incorporated a fixed background topic $\theta_B$. In that model, with probability $\lambda_B$ we generated a word from the background, and with probability $1 - \lambda_B$ we generated it from a mixture of $k$ different topics like the original PLSA model. Formally, we have a log likelihood of

$$\log p(C) = \sum_{d \in C} \sum_{w \in V} c(w, d) \log \left[ \lambda_B \, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j) \right]$$
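As a concrete reference point, the background-mixture log likelihood can be computed directly from the term-count matrix. Here is a minimal NumPy sketch; the function name, array shapes, and the dense representation are our own illustrative choices, not part of the assignment (your implementation may be in any language):

```python
import numpy as np

def log_likelihood(counts, pi, topics, background, lam_b):
    """Log likelihood of PLSA with a fixed background topic.

    counts:     (D, V) term counts c(w, d) for each document
    pi:         (D, K) document-topic distributions (rows sum to 1)
    topics:     (K, V) topic-word distributions (rows sum to 1)
    background: (V,)   fixed background distribution p(w | theta_B)
    lam_b:      scalar mixing weight lambda_B
    """
    # p_mix[d, w] = lambda_B * p(w|B) + (1 - lambda_B) * sum_j pi[d, j] * p(w | theta_j)
    p_mix = lam_b * background + (1.0 - lam_b) * pi @ topics
    # Only entries with c(w, d) > 0 contribute, which also avoids log(0).
    mask = counts > 0
    return float(np.sum(counts[mask] * np.log(p_mix[mask])))
```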
1. PLSA With and Without the Background Topic [15 pts]
In this question, we will investigate the difference between these two models.
[5 pts] The second model (the one we covered in lecture) uses a fixed background topic $\theta_B$ with a fixed probability $\lambda_B$ that is specified in advance. How can you configure this model so that it reduces to the first one? (In other words, for what setting of the parameters is the second model equivalent to the first?)
[5 pts] In the second model, $p(w \mid \theta_B)$ is our background language model, which is assumed to be fixed throughout the parameter learning process. Give the maximum likelihood estimate for this language model.
[5 pts] Formulate a hypothesis (as in, a falsifiable statement) about the impact of $\lambda_B$ on the topics found by the model. What happens if $\lambda_B$ is very large? What if it is very small?
Describe how you could test your hypothesis systematically. (In other words, describe an experiment that could prove or refute said hypothesis.)
2. Deriving the EM Algorithm for PLSA with a Background Topic [10 pts]
[5 pts] In the discrete case, it is common to formulate EM algorithms as computing and normalizing expected counts of events.
Let $n(d, j)$ be the number of times we expect to see any word token in document $d$ assigned to topic $j$, and let $n(w, j)$ be the number of times we expect the word type $w$ to be assigned to topic $j$.
Write down the equations for computing $n(d, j)$ and $n(w, j)$. Computing these across the entire corpus is your E-step calculation.
[5 pts] Write down the formula for re-estimating the parameters $\pi_{d,j}$ and $p(w \mid \theta_j)$ in terms of the expected counts you computed above. This is your M-step.
3. Implementing PLSA with a Background Topic [75 pts]
Warning: This part of the assignment is programming heavy, so please start as early as you possibly can. You need to provide yourself enough time to implement the model, debug it, and analyze its results to answer the questions that we ask in this assignment.
If you wait until the last minute, you may end up putting a lot of time and effort into programming your model, but run out of time before you can actually run it to answer the questions in this and the last part of the homework. Don’t do that.
To give you a rough idea, it takes our parallelized solution code approximately 2 minutes to converge on a quad-core machine on the dataset provided. You’ll need to run your code to convergence at least 5 times to answer the questions here, so please budget your time accordingly.
Now, let’s implement the PLSA model with the background topic. You may use any programming language you would like, provided your code can be compiled/run on a Linux machine for grading purposes.
You should use our GitLab CI setup to test this so that we can be certain
to use the exact same environment for grading. In your repository, you will
find a file
.gitlab-ci.yml that you should modify in order to install any
required dependencies for your code, and then run your code on the provided
dataset in your
data/ directory with the specified settings of $k$ and $\lambda_B$. There are helpful comments in that file to help you
understand what’s going on; you can also refer to the previous two MPs for
some more examples, and also GitLab CI’s documentation
for more information.
Make sure you also update the
README.md file to describe in plain
English how to build and run your code for each of the tasks we ask you to
complete here, so that we can replicate your results.
[20 pts] Implement the EM algorithm you derived above. Make sure that you are able to set a specific random seed for your random initialization (that is, the seed you use to initialize your random number generator that is used to create the initial random starting parameters $\pi_{d,j}$ and $p(w \mid \theta_j)$).
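One way to make the random initialization reproducible is to derive all starting parameters from a single seeded generator; a NumPy sketch (the function and array names here are illustrative only):

```python
import numpy as np

def init_params(n_docs, n_topics, vocab_size, seed):
    """Reproducible random starting parameters for PLSA."""
    rng = np.random.default_rng(seed)
    # pi[d] is a multinomial over the K topics (background excluded).
    pi = rng.random((n_docs, n_topics))
    pi /= pi.sum(axis=1, keepdims=True)
    # topics[j] is a multinomial over the entire vocabulary.
    topics = rng.random((n_topics, vocab_size))
    topics /= topics.sum(axis=1, keepdims=True)
    return pi, topics
```

Running with the same seed twice then reproduces the exact same starting point, which makes debugging runs comparable.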
You should terminate your algorithm after 100 iterations, or when the relative change in the log likelihood of the data is less than 0.0001.
The relative change in log likelihood is

$$\Delta_t = \frac{\left| \log p(C)^{(t)} - \log p(C)^{(t-1)} \right|}{\left| \log p(C)^{(t-1)} \right|}$$

where $\log p(C)^{(t)}$ is the log likelihood after iteration $t$.
At the end of each iteration, print both the log likelihood and the relative change in log likelihood.
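The termination and logging logic can be sketched as a small driver loop; here `em_step` stands in for whatever function performs one full E-step and M-step and returns the new log likelihood (the name and structure are our own, not prescribed by the assignment):

```python
def run_em(em_step, max_iters=100, tol=1e-4):
    """Drive EM to convergence.

    em_step() performs one E-step + M-step and returns the new log
    likelihood; we stop after max_iters iterations or when the relative
    change in log likelihood drops below tol.
    """
    prev_ll = None
    for it in range(1, max_iters + 1):
        ll = em_step()
        if prev_ll is None:
            print(f"iteration {it}: log likelihood {ll:.4f}")
        else:
            delta = abs(ll - prev_ll) / abs(prev_ll)
            print(f"iteration {it}: log likelihood {ll:.4f}, relative change {delta:.6f}")
            if delta < tol:
                return ll
        prev_ll = ll
    return prev_ll
```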
Make sure to include instructions in your README.md so that we can run your model with a particular setting for $k$ and $\lambda_B$!
The log likelihood should never decrease between iterations. If this happens, you’ve got a bug somewhere.
Make sure all of your distributions sum to 1 at all times! Remember that your $\pi_d$ distributions will sum to 1 across the $k$ topics (excluding the background) and that your $p(w \mid \theta_j)$ distributions will sum to 1 across the entire vocabulary $V$.
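A cheap sanity check after every M-step catches normalization bugs early; a sketch assuming the dense array shapes from earlier (adapt to your own representation):

```python
import numpy as np

def check_normalized(pi, topics, atol=1e-8):
    """Debugging aid: every pi[d] must sum to 1 over the K topics and
    every topics[j] must sum to 1 over the whole vocabulary."""
    assert np.allclose(pi.sum(axis=1), 1.0, atol=atol), "pi rows not normalized"
    assert np.allclose(topics.sum(axis=1), 1.0, atol=atol), "topic rows not normalized"
```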
You should expect many distributions to be quite sparse, so you might want to take advantage of this to reduce memory consumption by only using memory for values that are non-zero.
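For instance, the term counts $c(w, d)$ themselves are very sparse, so a per-document map of only the words that actually occur avoids allocating the full $D \times V$ grid; a minimal sketch (the helper name is our own):

```python
from collections import Counter

def doc_term_counts(tokenized_docs):
    """Sparse c(w, d): one word -> count mapping per document, so no
    memory is spent on the (many) zero entries of a dense D x V matrix."""
    return [Counter(tokens) for tokens in tokenized_docs]
```

The E-step can then iterate over just these non-zero $(d, w)$ pairs instead of the whole vocabulary for every document.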
[5 pts] Before running your code, predict what the shape of a plot of the log likelihood over each iteration will be. Specifically, let the x-axis be the iteration number and the y-axis be the log likelihood.
[5 pts] Before running your code, predict what the shape of a plot of the relative difference in log likelihood will be. Specifically, let the x-axis be the iteration number and the y-axis be the relative change in log likelihood from above.
[10 pts] Now run your PLSA implementation on the dataset of DBLP abstracts we provided in your repository’s
data/ directory with $k = 20$ topics and the specified setting of $\lambda_B$.
Make both of the plots from above (log likelihood and relative difference in log likelihood) with the actual data now. Briefly explain why each is shaped the way that it is. Did this match your prediction?
[10 pts] Print out the top 10 words in each of the 20 topics you found above. Now change the setting of $\lambda_B$ and print the top 10 words in each of the 20 topics.
Describe the difference between the topics you found with a large vs. a small $\lambda_B$. Is this what you expected? Explain why you observe this phenomenon.
Based on this, if someone wanted to find topics whose top ten words are very descriptive, what would you recommend they use for $\lambda_B$?
[10 pts] Using the background topic with an appropriate setting of $\lambda_B$ is just one way of accomplishing the effect you observed above. Describe another solution you could employ to ensure that you arrive at very discriminative (very distinct) topics even if $\lambda_B$ was set inappropriately. This might be in the form of a model extension, modified data pre-processing, etc. Be creative (but realistic)!
[15 pts] Run your algorithm again with the same settings of $k$ and $\lambda_B$ four separate times, each with a different random seed. Plot the log likelihood over each iteration for each run.
Do all four runs reach the same log likelihood value? Will this be true in general? Why or why not?
If someone wanted to achieve the highest possible log likelihood for your model on their data, what would you suggest?
4. Triple-Check Your Repository
When you’ve finished all of the above, triple check your repository by navigating to it in GitLab in a browser. Here are a few things to check:
Make sure that you can see your writeup (and that it is the most up-to-date version).
Make sure that all of the files needed for building/running your code are there.
Make sure that your code compiles and runs properly in the CI environment.
Make sure that your README.md is up to date with instructions about how to compile/run your code for the different settings of $k$ and $\lambda_B$.
Make sure you’ve included any code you used to generate the plots we have asked you to make.