MP3: PLSA and the EM Algorithm
Notice: This assignment is due ~~Tuesday, October 31st at 11:59pm~~ Tuesday, November 7th at 11:59pm.
This is a programming assignment, so submission procedures are slightly different than for regular written assignments. You will upload your code as well as your writeup (in either Markdown or PDF) to your git repository. Concrete instructions on how to do so are included in this assignment specification. Make sure that you double-check that you can see your submission in your git repository!
Partner Assignment: You are explicitly allowed (but not required) to work with at most one other person for this assignment. You are encouraged to use this as an opportunity to “scout” potential project partners. Try to find a partner that complements your skill set.
Everyone will have their own repository for this assignment; if you work in a group, please designate one member's repository as the one to be used for grading, and make sure that the other member of the partnership indicates this (e.g., by providing a link to the canonical repository) in the README.md file of their repository.
Please also indicate in your writeup in the graded repository who your partner was.
To grant your partner access to your repository (so that you can use it to collaborate), you can go to Settings > Members in the sidebar on your repository, and then add your partner under a Developer or Master role.
The PLSA algorithm, initially introduced in 1999 by Thomas Hofmann, models a corpus of documents $D=\{d_1, d_2, \ldots, d_N\}$ as a mixture of $K$ different “topics” $\theta_1, \theta_2, \ldots, \theta_K$, each of which is a multinomial over words in a fixed vocabulary $V$. Each document $d_i$ is associated with a distribution $\pi_i$ which is a multinomial over the $K$ topics and represents the likelihood that a word in that document is generated from a specific topic.
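To make the generative story concrete, here is a toy sketch of sampling word tokens from a single document's mixture (the function and variable names are our own illustration, not part of the assignment):

```python
import numpy as np

def sample_document(pi_i, theta, vocab, length, rng):
    """Sample `length` word tokens from document d_i under plain PLSA.

    pi_i  : (K,) topic mixture pi_i for document d_i (sums to 1)
    theta : (K, V) topic-word distributions, one row per theta_k
    vocab : list of V word strings, aligned with theta's columns
    rng   : a numpy random Generator
    """
    words = []
    for _ in range(length):
        k = rng.choice(len(pi_i), p=pi_i)       # draw a topic from pi_i
        w = rng.choice(len(vocab), p=theta[k])  # draw a word from theta_k
        words.append(vocab[w])
    return words
```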
Under this model, we have a log likelihood of

$\displaystyle \log p(D \mid \Theta, \Pi) = \sum_{i=1}^{N} \sum_{w \in V} c(w, d_i) \log \sum_{k=1}^{K} \pi_{i,k}\, p(w \mid \theta_k),$

where $c(w, d_i)$ is the number of times word $w$ occurs in document $d_i$ and $\pi_{i,k}$ is the probability of topic $k$ under $\pi_i$.
If we wished to find the maximum likelihood estimate of the parameters $\Theta$ and $\Pi$, we would have trouble coming up with an analytical solution due to the summation occurring inside of the logarithm. Thus, we turn to the EM algorithm to come up with a maximum likelihood estimator for the parameters instead.
In lecture, we discussed a modification of the original PLSA model that incorporated a fixed background topic. In that model, with probability $\lambda$ we generated a word from the background $p(w \mid D)$ and with probability $1 - \lambda$ we generated it from a mixture of $K$ different topics as in the original PLSA model. Formally, we have a log likelihood of

$\displaystyle \log p(D \mid \Theta, \Pi) = \sum_{i=1}^{N} \sum_{w \in V} c(w, d_i) \log \left[ \lambda\, p(w \mid D) + (1 - \lambda) \sum_{k=1}^{K} \pi_{i,k}\, p(w \mid \theta_k) \right],$

where $c(w, d_i)$ counts the occurrences of word $w$ in document $d_i$.
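With the counts and parameters stored as NumPy arrays, this log likelihood can be computed directly; the array layout below is our own convention, not something the assignment prescribes:

```python
import numpy as np

def log_likelihood(counts, pi, theta, p_bg, lam):
    """Log likelihood of the corpus under PLSA with a fixed background topic.

    counts : (N, V) array of word counts c(w, d_i)
    pi     : (N, K) document-topic distributions Pi (rows sum to 1)
    theta  : (K, V) topic-word distributions Theta (rows sum to 1)
    p_bg   : (V,)  fixed background distribution p(w | D)
    lam    : the background interpolation weight lambda
    """
    # Per-word mixture: lambda * p(w|D) + (1 - lambda) * sum_k pi_{i,k} p(w|theta_k)
    mix = lam * p_bg + (1.0 - lam) * (pi @ theta)   # shape (N, V)
    # Tiny epsilon guards against log(0) for words with zero mixture mass.
    return float(np.sum(counts * np.log(mix + 1e-300)))
```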
1. PLSA With and Without the Background Topic [15 pts]
In this question, we will investigate the difference between these two models.

[5 pts] The second model (the one we covered in lecture) uses a fixed background topic with a fixed probability $\lambda$ that is specified in advance. How can you configure this model so that it reduces to the first one? (In other words, for what setting of the parameters is the second model equivalent to the first?)

[5 pts] In the second model, $p(w \mid D)$ is our background language model which is assumed to be fixed throughout the parameter learning process. Give the maximum likelihood estimate for this language model.

[5 pts] Formulate a hypothesis (as in, a falsifiable statement) about the impact of $\lambda$ on the topics $\{\theta_1, \ldots, \theta_K\}$ found by the model. What happens if $\lambda$ is very large? What if it is very small?
Describe how you could test your hypothesis systematically. (In other words, describe an experiment that could prove or refute said hypothesis.)
2. Deriving the EM Algorithm for PLSA with a Background Topic [10 pts]
You may find the following resources helpful:
- Cheng’s note on the EM algorithm,
- My note on the EM algorithm, which includes a derivation of the EM algorithm for this model (see section 4), and
- Qiaozhu Mei’s note on the EM algorithm for PLSA, which includes a different derivation of the EM algorithm for this model.

[5 pts] In the discrete case, it is common to formulate EM algorithms as computing and normalizing expected counts of events.
Let $n_{d,k}$ be the number of times we expect to see any word token in document $d$ assigned to topic $k$, and let $n_{w,k}$ be the number of times we expect the word type $w$ to be assigned the topic $k$.
Write down the equations for computing $n_{d,k}$ and $n_{w,k}$. Computing these across the entire corpus is your E-step calculation.

[5 pts] Write down the formulas for re-estimating the parameters $\Theta$ and $\Pi$ in terms of the expected counts you computed above. This is your M-step.
3. Implementing PLSA with a Background Topic [75 pts]
Warning: This part of the assignment is programming-heavy, so please start as early as you possibly can. You need to give yourself enough time to implement the model, debug it, and analyze its results to answer the questions that we ask in this assignment.
If you wait until the last minute, you may end up putting a lot of time and effort into programming your model, but run out of time before you can actually run it to answer the questions in this and the last part of the homework. Don’t do that.
To give you a rough idea, it takes our parallelized solution code approximately 2 minutes to converge on a quad-core machine on the dataset provided. You’ll need to run your code to convergence at least 5 times to answer the questions here, so please budget your time accordingly.
Now, let’s implement the PLSA model with the background topic. You may use any programming language you would like, provided your code can be compiled/run on a Linux machine for grading purposes.
You should use our GitLab CI setup to test this so that we can be certain to use the exact same environment for grading. In your repository, you will find a file .gitlab-ci.yml that you should modify in order to install any required dependencies for your code, and then run your code on the provided dataset in your data/ directory with $K = 20$ and $\lambda = 0.9$. There are helpful comments in that file to help you understand what’s going on; you can also refer to the previous two MPs for some more examples, and to GitLab CI’s documentation for more information.
Make sure you also update the README.md file to describe in plain English how to build and run your code for each of the tasks we ask you to complete here, so that we can replicate your results.

[20 pts] Implement the EM algorithm you derived above. Make sure that you are able to set a specific random seed for your random initialization (that is, the seed you use to initialize your random number generator that is used to create the initial random starting parameters $\Theta^{(0)}$ and $\Pi^{(0)}$).
You should terminate your algorithm after 100 iterations, or when the relative change in the log likelihood of the data is less than 0.0001.
The relative change in log likelihood is
$\displaystyle \Delta_i = \frac{\log p(D \mid \Theta^{(i-1)}, \Pi^{(i-1)}) - \log p(D \mid \Theta^{(i)}, \Pi^{(i)})}{\log p(D \mid \Theta^{(i-1)}, \Pi^{(i-1)})}.$

At the end of each iteration, print both the log likelihood and the relative change in log likelihood.
Make sure to include instructions in your README so that I can run your model with a particular setting for $K$ and $\lambda$!

Debugging/Implementation Tips

The log likelihood should never decrease between iterations. If this happens, you’ve got a bug somewhere.

Make sure all of your distributions sum to 1 at all times! Remember that your $\pi_i$ distributions will sum to 1 across the $K$ topics (excluding the background) and that your $\theta_k$ distributions will sum to 1 across the entire vocabulary $V$.

You should expect many distributions to be quite sparse, so you might want to take advantage of this to reduce memory consumption by only using memory for values that are nonzero.
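Putting the stopping rule and the sanity checks above together, the outer loop might look like the following sketch. Here `e_step`, `m_step`, and `log_likelihood` are stand-ins for your own implementations (hypothetical names, not provided code):

```python
def run_em(e_step, m_step, log_likelihood, params, max_iters=100, tol=1e-4):
    """Generic EM outer loop with the spec's stopping rule.

    e_step(params)        -> expected counts   (your E-step)
    m_step(counts)        -> updated params    (your M-step)
    log_likelihood(params) -> float            (log p(D | params))
    Stops after max_iters iterations, or when the relative change in
    log likelihood drops below tol (0.0001 per the spec).
    """
    prev_ll = log_likelihood(params)
    for it in range(1, max_iters + 1):
        params = m_step(e_step(params))
        ll = log_likelihood(params)
        # Relative change; both terms are negative, so delta is normally >= 0.
        delta = (prev_ll - ll) / prev_ll
        print(f"iter {it}: log likelihood = {ll:.4f}, delta = {delta:.6f}")
        if ll < prev_ll:
            print("warning: log likelihood decreased -- check for bugs!")
        if delta < tol:
            break
        prev_ll = ll
    return params
```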


[5 pts] Before running your code, predict what the shape of a plot of the log likelihood over each iteration will be. Specifically, let the x-axis be the iteration number and the y-axis be the log likelihood.

[5 pts] Before running your code, predict what the shape of a plot of the relative difference in log likelihood will be. Specifically, let the x-axis be the iteration number and the y-axis be the $\Delta_i$ from above.

[10 pts] Now run your PLSA implementation on the dataset of DBLP abstracts we provided in your repository’s data/ directory with $K = 20$ and $\lambda = 0.9$.

Make both of the plots from above (log likelihood and relative difference in log likelihood) with the actual data now. Briefly explain why each is shaped the way that it is. Did this match your prediction?

[10 pts] Print out the top 10 words in each of the 20 topics you found above with $\lambda = 0.9$. Now set $\lambda = 0.3$ and print the top 10 words in each of the 20 topics.
Describe the difference between the topics you found with a large $\lambda$ vs. a small $\lambda$. Is this what you expected? Explain why you observe this phenomenon.
Based on this, if someone wanted to find topics whose top ten words are very descriptive, what would you recommend they set $\lambda$ to?
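Extracting the top words just amounts to sorting each topic's word distribution; a minimal sketch (assuming theta is a NumPy array and vocab a parallel word list, our own convention):

```python
import numpy as np

def top_words(theta, vocab, n=10):
    """Return the n highest-probability words for each topic.

    theta : (K, V) topic-word distributions, one row per topic
    vocab : list of V word strings, aligned with theta's columns
    """
    tops = []
    for row in theta:
        idx = np.argsort(row)[::-1][:n]   # indices of the n largest entries
        tops.append([vocab[i] for i in idx])
    return tops
```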

[10 pts] Using the background topic with an appropriate setting of $\lambda$ is just one way of accomplishing the effect you observed above. Describe another solution you could employ to ensure that you arrive at very discriminative (very distinct) topics even if $\lambda$ was set inappropriately. This might be in the form of a model extension, modified data preprocessing, etc. Be creative (but realistic)!

[15 pts] Run your algorithm again with $K = 20$ and $\lambda = 0.9$ four separate times, each with a different random seed. Plot the log likelihood over each iteration for each run.
Do all four runs reach the same log likelihood value? Will this be true in general? Why or why not?
If someone wanted to achieve the highest possible log likelihood for your model on their data, what would you suggest?
4. Triple-Check Your Repository
When you’ve finished all of the above, triple-check your repository by navigating to it in GitLab in a browser. Here are a few things to check:

Make sure that you can see your writeup (and that it is the most up-to-date version).

Make sure that all of the files needed for building/running your code are there.

Make sure that your code compiles and runs properly in the CI environment.

Make sure that your README.md is up to date with instructions about how to compile/run your code for the different settings of $\lambda$ and $K$.

Make sure you’ve included any code you used to generate the plots we have asked you to make.