Assignment #5: PLSA and the EM Algorithm
Partner Assignment: You are explicitly allowed to work with at most one other person for this assignment. You are encouraged to use this as an opportunity to “scout” potential project partners. Try to find a partner that complements your skill set.
Notice: This assignment is due Saturday, October 15th at 11:59pm.
Please submit your solutions via Compass. You should submit your assignment as a PDF, and your accompanying code as either a .zip or .tar.gz containing your files.
So, since my NetID is geigle1
, and my partner was czhai
, one of us
would submit the following files:
czhaigeigle1assignment5.pdf
czhaigeigle1assignment5.tar.gz
orczhaigeigle1assignment5.zip
The PLSA algorithm, initially introduced in 1999 by Thomas Hofmann, models a corpus of documents $D=\{d_1, d_2, \ldots, d_N\}$ as a mixture of $K$ different “topics” $\theta_1, \theta_2, \ldots, \theta_K$, each of which is a multinomial over words in a fixed vocabulary $V$. Each document $d_i$ is associated with a distribution $\pi_i$ which is a multinomial over the $K$ topics and represents the likelihood that a word in that document is generated from a specific topic.
Under this model, we have a log likelihood of
If we wished to find the maximum likelihood estimate of the parameters $\Theta$ and $\Pi$, we would have trouble coming up with an analytical solution due to the summation occurring inside of the logarithm. Thus, we turn to the EM algorithm to come up with a maximum likelihood estimator for the parameters instead.
In lecture, we discussed a modification of the original PLSA model that incorporated a fixed background topic. In that model, with probability $\lambda$ we generated a word from the background $p(w \mid D)$ and with probability $1  \lambda$ we generated it from a mixture of K different topics like the original PLSA model. Formally, we have a log likelihood of
1. PLSA With and Without the Background Topic [15 pts]
In this question, we will investigate the difference between these two models.

[5 pts] The second model (the one we covered in lecture) uses a fixed background topic with a fixed probability $\lambda$ that is specified in advance. How can you configure this model so that it reduces to the first one? (In other words, for what setting of the parameters is the second model equivalent to the first?)

[5 pts] In the second model, $p(w \mid D)$ is our background language model which is assumed to be fixed throughout the parameter learning process. Give the maximum likelihood estimate for this language model.

[5 pts] Formulate a hypothesis (as in, a falsifiable statement) about the impact of $\lambda$ on the topics $\{\theta_1, \ldots, \theta_K\}$ found by the model. What happens if $\lambda$ is very large? What if it is very small?
Describe how you could test your hypothesis systematically. (In other words, describe an experiment that could prove or refute said hypothesis.)
2. Deriving the EM Algorithm for PLSA with a Background Topic [10 pts]
You may find the following resources helpful:
 Cheng’s note on the EM algorithm,
 My note on the EM algorithm, which includes a derivation of the EM algorithm for this model (see section 4), and
 Qiaozhu Mei’s note on the EM algorithm for PLSA, which includes a different derivation of the EM algorithm for this model.

[5 pts] In the discrete case, it is common to formulate EM algorithms as computing and normalizing expected counts of events.
Let $n_{d,k}$ be the number of times we expect to see any word token in document $d$ assigned to topic $k$, and let $n_{w,k}$ be the number of times we expect the word type $w$ to be assigned the topic $k$.
Write down the equations for computing $n_{d,k}$ and $n_{w,k}$. Computing these across the entire corpus is your Estep calculation.

[5 pts] Write down the formula for reestimating the parameters $\Theta$ and $\Pi$ in terms of the expected counts you computed above. This is your Mstep.
3. Implementing PLSA with a Background Topic [75 pts]
Warning: This part of the assignment is programming heavy, so please start as early as you possibly can. You need to provide yourself enough time to implement the model, debug it, and analyze its results to answer the questions that we ask in this assignment.
If you wait until the last minute, you may end up putting a lot of time and effort into programming your model, but run out of time before you can actually run it to answer the questions in this and the last part of the homework. Don’t do that.
To give you a rough idea, it takes our solution code approximately 5 minutes to converge on the dataset provided. You’ll need to run your code to convergence at least 5 times to answer the questions here, so budget accordingly.
Now, let’s implement the PLSA model with the background topic. You may use any programming language you would like, provided your code can be compiled/run on a Linux machine for grading purposes. (You can use EWS to test this.) Make sure you provide a README that describes how to build and run your code for each of the tasks we ask you to complete here so we can replicate your results.

[20 pts] Implement the EM algorithm you derived above. Make sure that you are able to set a specific random seed for your random initialization (that is, the seed you use to initialize your random number generator that is used to create the initial random starting parameters $\Theta^{(0)}$ and $\Pi^{(0)}$).
You should terminate your algorithm after 100 iterations, or when the relative change in the log likelihood of the data is less than 0.0001.
The relative change in log likelihood is
$\displaystyle \Delta_i = \frac{\log p(D \mid \Theta^{(i1)}, \Pi^{(i1)})  \log p(D \mid \Theta^{(i)}, \Pi^{(i)})}{\log p(D \mid \Theta^{(i1)}, \Pi^{(i1)})}.$At the end of each iteration, print both the log likelihood and the relative change in log likelihood.
Make sure to include instructions in your
README
so that I can run your model with a particular setting for $K$ and $\lambda$!Debugging/Implementation Tips

The log likelihood should never decrease between iterations. If this happens, you’ve got a bug somewhere.

Make sure all of your distributions sum to 1 at all times! Remember that your $\pi_i$ distributions will sum to 1 across the $K$ topics (excluding the background) and that your $\theta_k$ distributions will sum to 1 across the entire vocabulary $V$.

You should expect many distributions to be quite sparse, so you might want to take advantage of this to reduce memory consumption by only using memory for values that are nonzero.


[5 pts] Before running your code, predict what the shape of a plot of the log likelihood over each iteration will be. Specifically, let the xaxis be the iteration number and the yaxis be the log likelihood.

[5 pts] Before running your code, predict what the shape of a plot of the relative difference in log likelihood will be. Specifically, let the xaxis be the iteration number and the yaxis be the $\Delta_i$ from above.

[10 pts] Now run your PLSA implementation on this dataset of abstracts from DBLP with $K = 20$ and $\lambda = 0.9$.
Make both of the plots from above (log likelihood and relative difference in log likelihood) with the actual data now. Briefly explain why each is shaped the way that it is. Did this match your prediction?

[10 pts] Print out the top 10 words in each of the 20 topics you found above with $\lambda = 0.9$. Now set $\lambda = 0.3$ and print the top 10 words in each of the 20 topics.
Describe the difference between the topics you found with a large $\lambda$ vs a small $\lambda$. Is this what you expected? Explain why you observe this phenomena.
Based on this, if someone wanted to find topics that have a very descriptive top ten words, what would you recommend to them for setting $\lambda$?

[10 pts] Using the background topic with an appropriate setting of $\lambda$ is just one way of accomplishing the effect you observed above. Describe another solution you could employ to ensure that you arrive at very discriminative (very distinct) topics even if $\lambda$ was set inappropriately. This might be in the form of a model extension, modified data preprocessing, etc. Be creative (but realistic)!

[15 pts] Run your algorithm again with $K = 20$ and $\lambda = 0.9$ four separate times, each with a different random seed. Plot the loglikelihood over each iteration for each run.
Do all four runs reach the same log likelihood value? Will this be true in general? Why or why not?
If someone wanted to achieve the highest possible log likelihood for your model on their data, what would you suggest?