CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

MP3: PLSA and the EM Algorithm

Notice: This assignment is due Tuesday, November 7th at 11:59pm (extended from the original deadline of Tuesday, October 31st at 11:59pm).

This is a programming assignment, so submission procedures are slightly different than for regular written assignments. You will upload your code as well as your write-up (in either Markdown or PDF) to your git repository. Concrete instructions on how to do so are included in this assignment specification. Make sure that you double-check that you can see your submission in your git repository!

Partner Assignment: You are explicitly allowed (but not required) to work with at most one other person for this assignment. You are encouraged to use this as an opportunity to “scout” potential project partners. Try to find a partner that complements your skill set.

Everyone will have their own repository for this assignment; if you work in a group, please designate one member whose repository will be used for grading, and make sure that the other member indicates this (e.g., by providing a link to the canonical repository) in the README.md file of their own repository.

Please also indicate in your writeup in the graded repository who your partner was.

To grant your partner access to your repository (so that you can use it to collaborate), you can go to Settings -> Members in the sidebar on your repository, and then add your partner under a Developer or Master role.

The PLSA algorithm, initially introduced in 1999 by Thomas Hofmann, models a corpus of documents $D = \{d_1, d_2, \ldots, d_N\}$ as a mixture of $K$ different “topics” $\theta_1, \theta_2, \ldots, \theta_K$, each of which is a multinomial over words in a fixed vocabulary $V$. Each document $d_i$ is associated with a distribution $\pi_i$, which is a multinomial over the $K$ topics and represents the likelihood that a word in that document is generated from a specific topic.
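As a concrete point of reference for this notation, here is a minimal sketch, in Python/NumPy (the assignment itself does not require any particular language), of how the two parameter sets might be stored. The array names, the sizes, and the seeded initialization are illustrative assumptions only.

```python
import numpy as np

# Illustrative sizes only: N documents, K topics, vocabulary of V word types.
N, K, V = 1000, 20, 5000

rng = np.random.default_rng(42)

# theta[k, w] = p(w | theta_k): each row is a multinomial over the vocabulary V.
theta = rng.random((K, V))
theta /= theta.sum(axis=1, keepdims=True)

# pi[d, k] = p(topic k | document d): each row is a multinomial over the K topics.
pi = rng.random((N, K))
pi /= pi.sum(axis=1, keepdims=True)
```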

Under this model, we have a log likelihood of

$$\log p(D \mid \Theta, \Pi) = \sum_{i=1}^N \sum_{j=1}^{|d_i|} \log \left\{ \sum_{k=1}^K p(z_{i,j} = k \mid \pi_i) \, p(d_{i,j} = w \mid \theta_k) \right\}.$$

If we wished to find the maximum likelihood estimate of the parameters $\Theta$ and $\Pi$, we would have trouble deriving an analytical solution because of the summation inside the logarithm. We therefore turn to the EM algorithm to estimate the parameters instead.

In lecture, we discussed a modification of the original PLSA model that incorporates a fixed background topic. In that model, a word is generated from the background model $p(w \mid D)$ with probability $\lambda$, and from a mixture of $K$ different topics (as in the original PLSA model) with probability $1 - \lambda$. Formally, we have a log likelihood of

$$\log p(D \mid \Theta, \Pi) = \sum_{i=1}^N \sum_{j=1}^{|d_i|} \log \left\{ \lambda \, p(d_{i,j} = w \mid D) + (1-\lambda) \sum_{k=1}^K p(z_{i,j} = k \mid \pi_i) \, p(d_{i,j} = w \mid \theta_k) \right\}.$$
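For concreteness, the sketch below evaluates this log likelihood under the same hypothetical NumPy representation as above, assuming a document-word count matrix `counts` of shape (N, V), a fixed background distribution `p_w_D` of length V, and a small epsilon guard against log(0); none of these names come from the assignment.

```python
import numpy as np

def log_likelihood(counts, theta, pi, p_w_D, lam, eps=1e-12):
    """Log likelihood of the corpus under the mixture model with a background topic.

    counts : (N, V) word counts per document
    theta  : (K, V) topic-word distributions
    pi     : (N, K) document-topic distributions
    p_w_D  : (V,)  fixed background word distribution
    lam    : background mixing weight (lambda)
    """
    # mixture[d, w] = lambda * p(w | D) + (1 - lambda) * sum_k pi[d, k] * theta[k, w]
    mixture = lam * p_w_D + (1.0 - lam) * (pi @ theta)
    # Each occurrence of word w in document d contributes log(mixture[d, w]).
    return float(np.sum(counts * np.log(mixture + eps)))
```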

1. PLSA With and Without the Background Topic [15 pts]

In this question, we will investigate the difference between these two models.

  1. [5 pts] The second model (the one we covered in lecture) uses a fixed background topic with a fixed probability $\lambda$ that is specified in advance. How can you configure this model so that it reduces to the first one? (In other words, for what setting of the parameters is the second model equivalent to the first?)

  2. [5 pts] In the second model, $p(w \mid D)$ is our background language model, which is assumed to be fixed throughout the parameter learning process. Give the maximum likelihood estimate for this language model. (One common corpus-wide estimate is sketched after this question list.)

  3. [5 pts] Formulate a hypothesis (as in, a falsifiable statement) about the impact of $\lambda$ on the topics $\{\theta_1, \ldots, \theta_K\}$ found by the model. What happens if $\lambda$ is very large? What if it is very small?

    Describe how you could test your hypothesis systematically. (In other words, describe an experiment that could prove or refute said hypothesis.)
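For reference in the later programming part, here is a minimal sketch, under the same hypothetical (N, V) count matrix as before, of the corpus-wide unigram distribution that is typically used for the fixed background model.

```python
import numpy as np

def background_model(counts):
    """Corpus-wide unigram distribution, a common choice for the fixed p(w | D).

    counts : (N, V) word counts per document
    returns: (V,) array whose entry w is count(w in corpus) / total corpus length
    """
    corpus_counts = counts.sum(axis=0)
    return corpus_counts / corpus_counts.sum()
```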

2. Deriving the EM Algorithm for PLSA with a Background Topic [10 pts]


  1. [5 pts] In the discrete case, it is common to formulate EM algorithms as computing and normalizing expected counts of events.

    Let $n_{d,k}$ be the number of times we expect to see any word token in document $d$ assigned to topic $k$, and let $n_{w,k}$ be the number of times we expect the word type $w$ to be assigned to topic $k$.

    Write down the equations for computing $n_{d,k}$ and $n_{w,k}$. Computing these across the entire corpus is your E-step calculation.

  2. [5 pts] Write down the formula for re-estimating the parameters $\Theta$ and $\Pi$ in terms of the expected counts you computed above. This is your M-step. (A minimal dense sketch of one such E-step/M-step update appears after this list.)
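To make the expected-count formulation concrete, here is a minimal dense NumPy sketch of one possible E-step/M-step update for the model with a background topic, continuing the hypothetical arrays from the earlier sketches; a real implementation would likely exploit sparsity, as suggested in Part 3.

```python
import numpy as np

def em_step(counts, theta, pi, p_w_D, lam, eps=1e-12):
    """One EM iteration for PLSA with a fixed background topic (dense sketch).

    counts : (N, V) word counts per document
    theta  : (K, V) current topic-word distributions
    pi     : (N, K) current document-topic distributions
    p_w_D  : (V,)  fixed background word distribution
    lam    : background mixing weight (lambda)
    """
    # E-step: posterior probability that an occurrence of word w in document d
    # was generated by the background rather than by one of the K topics.
    topic_mix = pi @ theta                                    # (N, V): sum_k pi[d,k] * theta[k,w]
    p_bg = (lam * p_w_D) / (lam * p_w_D + (1.0 - lam) * topic_mix + eps)

    # Occurrences not attributed to the background, divided by the topic mixture so
    # that multiplying by pi[d,k] * theta[k,w] below yields the per-topic posterior.
    ratio = counts * (1.0 - p_bg) / (topic_mix + eps)         # (N, V)

    n_dk = pi * (ratio @ theta.T)       # (N, K): expected count of topic k in document d
    n_kw = theta * (pi.T @ ratio)       # (K, V): expected count of word w under topic k

    # M-step: normalize the expected counts into probability distributions.
    new_pi = n_dk / (n_dk.sum(axis=1, keepdims=True) + eps)
    new_theta = n_kw / (n_kw.sum(axis=1, keepdims=True) + eps)
    return new_theta, new_pi
```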

3. Implementing PLSA with a Background Topic [75 pts]

Warning: This part of the assignment is programming-heavy, so please start as early as you possibly can. Give yourself enough time to implement the model, debug it, and analyze its results to answer the questions that we ask in this assignment.

If you wait until the last minute, you may end up putting a lot of time and effort into programming your model, but run out of time before you can actually run it to answer the questions in this and the last part of the homework. Don’t do that.

To give you a rough idea, it takes our parallelized solution code approximately 2 minutes to converge on a quad-core machine on the dataset provided. You’ll need to run your code to convergence at least 5 times to answer the questions here, so please budget your time accordingly.

Now, let’s implement the PLSA model with the background topic. You may use any programming language you would like, provided your code can be compiled/run on a Linux machine for grading purposes.

You should use our GitLab CI setup to test this so that we can be certain to use the exact same environment for grading. In your repository, you will find a file .gitlab-ci.yml that you should modify to install any required dependencies for your code and then run your code on the provided dataset in your data/ directory with $K = 20$ and $\lambda = 0.9$. There are helpful comments in that file to help you understand what’s going on; you can also refer to the previous two MPs for more examples, and to GitLab CI’s documentation for more information.

Make sure you also update the README.md file to describe in plain English how to build and run your code for each of the tasks we ask you to complete here, so that we can replicate your results.

  1. [20 pts] Implement the EM algorithm you derived above. Make sure that you are able to set a specific random seed for your random initialization (that is, the seed used to initialize the random number generator that creates the initial random starting parameters $\Theta^{(0)}$ and $\Pi^{(0)}$). A minimal end-to-end driver sketch appears after this part's question list.

    You should terminate your algorithm after 100 iterations, or when the relative change in the log likelihood of the data is less than 0.0001.

    The relative change in log likelihood is

    $$\Delta_i = \frac{\log p(D \mid \Theta^{(i-1)}, \Pi^{(i-1)}) - \log p(D \mid \Theta^{(i)}, \Pi^{(i)})}{\log p(D \mid \Theta^{(i-1)}, \Pi^{(i-1)})}.$$

    At the end of each iteration, print both the log likelihood and the relative change in log likelihood.

    Make sure to include instructions in your README so that I can run your model with a particular setting for $K$ and $\lambda$!

    Debugging/Implementation Tips

    1. The log likelihood should never decrease between iterations. If this happens, you’ve got a bug somewhere.

    2. Make sure all of your distributions sum to 1 at all times! Remember that your $\pi_i$ distributions will sum to 1 across the $K$ topics (excluding the background) and that your $\theta_k$ distributions will sum to 1 across the entire vocabulary $V$.

    3. You should expect many distributions to be quite sparse, so you might want to take advantage of this to reduce memory consumption by storing only non-zero values.

  2. [5 pts] Before running your code, predict what the shape of a plot of the log likelihood over each iteration will be. Specifically, let the x-axis be the iteration number and the y-axis be the log likelihood.

  3. [5 pts] Before running your code, predict what the shape of a plot of the relative difference in log likelihood will be. Specifically, let the x-axis be the iteration number and the y-axis be the $\Delta_i$ from above.

  4. [10 pts] Now run your PLSA implementation on the dataset of DBLP abstracts we provided in your repository’s data/ directory with $K = 20$ and $\lambda = 0.9$.

    Make both of the plots from above (log likelihood and relative difference in log likelihood) with the actual data now. Briefly explain why each is shaped the way that it is. Did this match your prediction?

  5. [10 pts] Print out the top 10 words in each of the 20 topics you found above with $\lambda = 0.9$. Now set $\lambda = 0.3$ and print the top 10 words in each of the 20 topics.

    Describe the difference between the topics you found with a large $\lambda$ vs. a small $\lambda$. Is this what you expected? Explain why you observe this phenomenon.

    Based on this, if someone wanted to find topics whose top ten words are very descriptive, what would you recommend to them for setting $\lambda$?

  6. [10 pts] Using the background topic with an appropriate setting of $\lambda$ is just one way of accomplishing the effect you observed above. Describe another solution you could employ to ensure that you arrive at very discriminative (very distinct) topics even if $\lambda$ was set inappropriately. This might be in the form of a model extension, modified data pre-processing, etc. Be creative (but realistic)!

  7. [15 pts] Run your algorithm again with $K = 20$ and $\lambda = 0.9$ four separate times, each with a different random seed. Plot the log likelihood over each iteration for each run.

    Do all four runs reach the same log likelihood value? Will this be true in general? Why or why not?

    If someone wanted to achieve the highest possible log likelihood for your model on their data, what would you suggest?
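Putting the earlier sketches together, here is a minimal, unoptimized driver for the workflow in this part: seeded random initialization, iteration until the stopping rule from question 1, per-iteration printing of the log likelihood and its relative change, and listing the top 10 words of each topic. The helpers `background_model`, `em_step`, and `log_likelihood`, the dense count matrix, and the `vocab` list are all assumptions carried over from the earlier sketches, not requirements of the assignment.

```python
import numpy as np

def run_plsa(counts, vocab, K=20, lam=0.9, seed=1, max_iters=100, tol=1e-4):
    """Outer EM loop with a seeded random initialization and the stopping rule above.

    counts : (N, V) word counts per document
    vocab  : list of the V word types, aligned with the columns of `counts`
    """
    N, V = counts.shape
    rng = np.random.default_rng(seed)    # fixing the seed makes the initialization reproducible

    # Random starting parameters Theta^(0) and Pi^(0), normalized into distributions.
    theta = rng.random((K, V))
    theta /= theta.sum(axis=1, keepdims=True)
    pi = rng.random((N, K))
    pi /= pi.sum(axis=1, keepdims=True)

    p_w_D = background_model(counts)     # fixed throughout learning
    prev_ll = log_likelihood(counts, theta, pi, p_w_D, lam)

    for it in range(1, max_iters + 1):
        theta, pi = em_step(counts, theta, pi, p_w_D, lam)
        ll = log_likelihood(counts, theta, pi, p_w_D, lam)
        delta = (prev_ll - ll) / prev_ll     # the relative change Delta_i defined in question 1
        print(f"iteration {it}: log likelihood = {ll:.2f}, relative change = {delta:.6f}")
        if delta < tol:
            break
        prev_ll = ll

    # Top 10 words per topic (question 5): the highest-probability entries of each theta_k.
    for k in range(K):
        top = np.argsort(theta[k])[::-1][:10]
        print(f"topic {k}: " + ", ".join(vocab[w] for w in top))

    return theta, pi
```

Re-running this sketch with a different `seed` (question 7) or with `lam=0.3` (question 5) only changes the arguments; nothing else needs to change.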

4. Triple-Check Your Repository

When you’ve finished all of the above, triple check your repository by navigating to it in GitLab in a browser. Here are a few things to check:

  1. Make sure that you can see your writeup (and that it is the most up-to-date version).

  2. Make sure that all of the files needed for building/running your code are there.

  3. Make sure that your code compiles and runs properly in the CI environment.

  4. Make sure that your README.md is up to date with instructions about how to compile/run your code for the different settings of $\lambda$ and $K$.

  5. Make sure you’ve included any code you used to generate the plots we have asked you to make.