CS 598cxz Fall 2016

Assignment #5: PLSA and the EM Algorithm

Partner Assignment: You are explicitly allowed to work with at most one other person for this assignment. You are encouraged to use this as an opportunity to “scout” potential project partners. Try to find a partner that complements your skill set.

Notice: This assignment is due Saturday, October 15th at 11:59pm.

Please submit your solutions via Compass. You should submit your assignment as a PDF, and your accompanying code as either a .zip or .tar.gz containing your files.

For example, since my NetID is geigle1 and my partner's NetID is czhai, one of us would submit the following files:

  • czhai-geigle1-assignment5.pdf
  • czhai-geigle1-assignment5.tar.gz or czhai-geigle1-assignment5.zip

The PLSA algorithm, initially introduced in 1999 by Thomas Hofmann, models a corpus of documents $D = \{d_1, d_2, \ldots, d_N\}$ as a mixture of $K$ different “topics” $\theta_1, \theta_2, \ldots, \theta_K$, each of which is a multinomial over words in a fixed vocabulary $V$. Each document $d_i$ is associated with a distribution $\pi_i$, which is a multinomial over the $K$ topics and represents the likelihood that a word in that document is generated from a specific topic.

Under this model, we have a log likelihood of

$$\log p(D \mid \Theta, \Pi) = \sum_{i=1}^N \sum_{j=1}^{|d_i|} \log \left\{ \sum_{k=1}^K p(z_{i,j} = k \mid \pi_i)\, p(d_{i,j} = w \mid \theta_k) \right\}.$$

If we wished to find the maximum likelihood estimate of the parameters $\Theta$ and $\Pi$, we would have trouble finding an analytical solution due to the summation inside the logarithm. Thus, we turn to the EM algorithm to come up with a maximum likelihood estimator for the parameters instead.

In lecture, we discussed a modification of the original PLSA model that incorporated a fixed background topic. In that model, with probability $\lambda$ we generated a word from the background $p(w \mid D)$, and with probability $1 - \lambda$ we generated it from a mixture of $K$ different topics like the original PLSA model. Formally, we have a log likelihood of

$$\log p(D \mid \Theta, \Pi) = \sum_{i=1}^N \sum_{j=1}^{|d_i|} \log \left\{ \lambda\, p(d_{i,j} = w \mid D) + (1-\lambda) \sum_{k=1}^K p(z_{i,j} = k \mid \pi_i)\, p(d_{i,j} = w \mid \theta_k) \right\}.$$
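
To make the quantities in this expression concrete, here is a minimal sketch (in Python with NumPy, purely as an illustration; you may use any language) of how the second model's log likelihood can be evaluated once the parameters are known. The argument names and the corpus representation here are assumptions for illustration only, not a required interface.

```python
import numpy as np

def log_likelihood(docs, theta, pi, background, lam):
    """Log likelihood of the corpus under PLSA with a fixed background topic.

    docs       -- list of documents; each is a list of (word_id, count) pairs
    theta      -- K x |V| array; theta[k, w] = p(w | theta_k), rows sum to 1
    pi         -- N x K array;   pi[i, k]    = p(z = k | pi_i), rows sum to 1
    background -- length-|V| array; background[w] = p(w | D)
    lam        -- the fixed background weight, lambda
    """
    ll = 0.0
    for i, doc in enumerate(docs):
        for w, count in doc:
            # mixture probability of generating word w in document i
            p_w = lam * background[w] + (1.0 - lam) * np.dot(pi[i], theta[:, w])
            ll += count * np.log(p_w)
    return ll
```

Grouping identical word types and weighting by their counts is equivalent to summing over the word positions $j$ in the formula above.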

1. PLSA With and Without the Background Topic [15 pts]

In this question, we will investigate the difference between these two models.

  1. [5 pts] The second model (the one we covered in lecture) uses a fixed background topic with a fixed probability $\lambda$ that is specified in advance. How can you configure this model so that it reduces to the first one? (In other words, for what setting of the parameters is the second model equivalent to the first?)

  2. [5 pts] In the second model, $p(w \mid D)$ is our background language model which is assumed to be fixed throughout the parameter learning process. Give the maximum likelihood estimate for this language model.

  3. [5 pts] Formulate a hypothesis (as in, a falsifiable statement) about the impact of $\lambda$ on the topics $\{\theta_1, \ldots, \theta_K\}$ found by the model. What happens if $\lambda$ is very large? What if it is very small?

    Describe how you could test your hypothesis systematically. (In other words, describe an experiment that could prove or refute said hypothesis.)

2. Deriving the EM Algorithm for PLSA with a Background Topic [10 pts]


  1. [5 pts] In the discrete case, it is common to formulate EM algorithms as computing and normalizing expected counts of events.

    Let $n_{d,k}$ be the number of times we expect to see any word token in document $d$ assigned to topic $k$, and let $n_{w,k}$ be the number of times we expect the word type $w$ to be assigned the topic $k$.

    Write down the equations for computing $n_{d,k}$ and $n_{w,k}$. Computing these across the entire corpus is your E-step calculation. (A generic, toy illustration of this count-and-normalize pattern appears after this list.)

  2. [5 pts] Write down the formula for re-estimating the parameters $\Theta$ and $\Pi$ in terms of the expected counts you computed above. This is your M-step.
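
As a generic, toy illustration of the count-and-normalize pattern mentioned above (and deliberately not the model you are asked to derive), here is an EM loop for a simple two-topic unigram mixture over a single bag of words; every name in it is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: counts of each word type in a single bag of words.
word_counts = np.array([30, 14, 8, 5, 2], dtype=float)   # |V| = 5
V = len(word_counts)
K = 2

# Random initialization of the two unigram topics and the mixing weights.
theta = rng.random((K, V))
theta /= theta.sum(axis=1, keepdims=True)
mix = np.array([0.5, 0.5])

for _ in range(50):                              # fixed iteration count for the toy example
    # E-step: posterior p(z = k | w) for every word type, then expected counts.
    post = mix[:, None] * theta                  # K x V, unnormalized
    post /= post.sum(axis=0, keepdims=True)      # normalize over topics
    expected = post * word_counts                # expected count of each word type under each
                                                 # topic (the toy analog of n_{w,k})

    # M-step: re-estimate parameters by normalizing the expected counts.
    theta = expected / expected.sum(axis=1, keepdims=True)
    mix = expected.sum(axis=1) / expected.sum()
```

The E-step fills in expected counts using the current parameters; the M-step simply renormalizes those counts.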

3. Implementing PLSA with a Background Topic [75 pts]

Warning: This part of the assignment is programming heavy, so please start as early as you possibly can. You need to provide yourself enough time to implement the model, debug it, and analyze its results to answer the questions that we ask in this assignment.

If you wait until the last minute, you may end up putting a lot of time and effort into programming your model, but run out of time before you can actually run it to answer the questions in this and the last part of the homework. Don’t do that.

To give you a rough idea, it takes our solution code approximately 5 minutes to converge on the dataset provided. You’ll need to run your code to convergence at least 5 times to answer the questions here, so budget accordingly.

Now, let’s implement the PLSA model with the background topic. You may use any programming language you would like, provided your code can be compiled/run on a Linux machine for grading purposes. (You can use EWS to test this.) Make sure you provide a README that describes how to build and run your code for each of the tasks we ask you to complete here so we can replicate your results.

  1. [20 pts] Implement the EM algorithm you derived above. Make sure that you are able to set a specific random seed for your random initialization (that is, the seed for the random number generator used to create the initial random starting parameters $\Theta^{(0)}$ and $\Pi^{(0)}$). A minimal sketch of one way to structure this loop appears after the implementation tips below.

    You should terminate your algorithm after 100 iterations, or when the relative change in the log likelihood of the data is less than 0.0001.

    The relative change in log likelihood is

    $$\Delta_i = \frac{\log p(D \mid \Theta^{(i-1)}, \Pi^{(i-1)}) - \log p(D \mid \Theta^{(i)}, \Pi^{(i)})}{\log p(D \mid \Theta^{(i-1)}, \Pi^{(i-1)})}.$$

    At the end of each iteration, print both the log likelihood and the relative change in log likelihood.

    Make sure to include instructions in your README so that I can run your model with a particular setting for $K$ and $\lambda$!

    Debugging/Implementation Tips

    1. The log likelihood should never decrease between iterations. If this happens, you’ve got a bug somewhere.

    2. Make sure all of your distributions sum to 1 at all times! Remember that your $\pi_i$ distributions will sum to 1 across the $K$ topics (excluding the background) and that your $\theta_k$ distributions will sum to 1 across the entire vocabulary $V$.

    3. You should expect many distributions to be quite sparse, so you might want to take advantage of this to reduce memory consumption by only using memory for values that are non-zero.
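
    To be concrete about the seed requirement and the termination rule, here is a minimal sketch of one way to structure the outer loop, assuming you supply your own initialization, update, and log-likelihood routines as callables (the function names below are placeholders, not a required interface):

```python
import numpy as np

def run_em(init_params, em_update, log_likelihood, seed,
           max_iters=100, tol=1e-4):
    """Generic EM driver implementing the termination rule described above.

    init_params(rng)       -- returns random starting parameters given a NumPy Generator
    em_update(params)      -- performs one full E-step + M-step, returns updated parameters
    log_likelihood(params) -- returns log p(D | Theta, Pi) under the given parameters
    """
    rng = np.random.default_rng(seed)        # fixed seed => reproducible initialization
    params = init_params(rng)
    prev_ll = log_likelihood(params)

    for it in range(1, max_iters + 1):
        params = em_update(params)
        ll = log_likelihood(params)
        delta = (prev_ll - ll) / prev_ll      # relative change, Delta_i from above
        print(f"iteration {it}: log likelihood = {ll:.4f}, relative change = {delta:.6f}")
        if abs(delta) < tol:
            break
        prev_ll = ll
    return params
```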

  2. [5 pts] Before running your code, predict what the shape of a plot of the log likelihood over each iteration will be. Specifically, let the x-axis be the iteration number and the y-axis be the log likelihood.

  3. [5 pts] Before running your code, predict what the shape of a plot of the relative difference in log likelihood will be. Specifically, let the x-axis be the iteration number and the y-axis be the $\Delta_i$ from above.

  4. [10 pts] Now run your PLSA implementation on this dataset of abstracts from DBLP with $K = 20$ and $\lambda = 0.9$.

    Make both of the plots from above (log likelihood and relative difference in log likelihood) with the actual data now. Briefly explain why each is shaped the way that it is. Did this match your prediction?
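
    If you have not produced plots like these before, here is one possible way to generate them with matplotlib, assuming you have collected the per-iteration log likelihood values in a list (the function and file names are illustrative):

```python
import matplotlib.pyplot as plt

def plot_curves(log_likelihoods, out_prefix="plsa"):
    """Plot log likelihood and relative change per iteration from one EM run."""
    iters = range(1, len(log_likelihoods) + 1)

    plt.figure()
    plt.plot(iters, log_likelihoods, marker="o")
    plt.xlabel("iteration")
    plt.ylabel("log likelihood")
    plt.savefig(f"{out_prefix}-loglik.png")

    # relative change Delta_i, starting from the second iteration
    deltas = [(prev - cur) / prev
              for prev, cur in zip(log_likelihoods, log_likelihoods[1:])]
    plt.figure()
    plt.plot(range(2, len(log_likelihoods) + 1), deltas, marker="o")
    plt.xlabel("iteration")
    plt.ylabel("relative change in log likelihood")
    plt.savefig(f"{out_prefix}-delta.png")
```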

  5. [10 pts] Print out the top 10 words in each of the 20 topics you found above with $\lambda = 0.9$. Now set $\lambda = 0.3$ and print the top 10 words in each of the 20 topics. (A sketch of one way to extract the top words from your topic distributions appears at the end of this item.)

    Describe the difference between the topics you found with a large $\lambda$ vs. a small $\lambda$. Is this what you expected? Explain why you observe this phenomenon.

    Based on this, if someone wanted to find topics whose top ten words are very descriptive, what would you recommend to them for setting $\lambda$?
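
    As referenced above, here is a sketch of one way to extract the top words, assuming your topic distributions are stored as a $K \times |V|$ NumPy array and you keep a list mapping word ids back to word strings (both are assumptions about your representation):

```python
import numpy as np

def print_top_words(theta, vocab, n=10):
    """Print the n highest-probability words in each topic.

    theta -- K x |V| array; theta[k, w] = p(w | theta_k)
    vocab -- list of length |V| mapping word id -> word string
    """
    for k, topic in enumerate(theta):
        top = np.argsort(topic)[::-1][:n]     # indices of the n largest entries
        words = ", ".join(vocab[w] for w in top)
        print(f"topic {k}: {words}")
```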

  6. [10 pts] Using the background topic with an appropriate setting of $\lambda$ is just one way of accomplishing the effect you observed above. Describe another solution you could employ to ensure that you arrive at very discriminative (very distinct) topics even if $\lambda$ was set inappropriately. This might be in the form of a model extension, modified data pre-processing, etc. Be creative (but realistic)!

  7. [15 pts] Run your algorithm again with $K = 20$ and $\lambda = 0.9$ four separate times, each with a different random seed. Plot the log likelihood over each iteration for each run.

    Do all four runs reach the same log likelihood value? Will this be true in general? Why or why not?

    If someone wanted to achieve the highest possible log likelihood for your model on their data, what would you suggest?
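
    One possible way to organize these runs, assuming a hypothetical train(seed) function that runs your EM implementation to convergence and returns the per-iteration log likelihoods (that function and its interface are yours to design; the names here are made up):

```python
import matplotlib.pyplot as plt

def compare_seeds(train, seeds=(1, 2, 3, 4)):
    """Run EM once per seed and overlay the log-likelihood curves.

    train -- callable taking a seed and returning a list of per-iteration
             log likelihoods (your implementation; assumed interface)
    """
    plt.figure()
    for seed in seeds:
        lls = train(seed)
        plt.plot(range(1, len(lls) + 1), lls, label=f"seed {seed}")
    plt.xlabel("iteration")
    plt.ylabel("log likelihood")
    plt.legend()
    plt.savefig("plsa-seeds.png")
```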