Software Track Project Ideas
If you are thinking about pursuing the software track for your project, here are some topics you might consider working on to improve MeTA, a toolkit for information retrieval and text analysis.
Information Retrieval Improvements
Information retrieval is a strong focus in MeTA, but there are still a few gaps. The following projects focus on filling those gaps.
Feedback Methods for Retrieval
MeTA currently lacks an implementation of pseudo-relevance feedback for retrieval. There are two main approaches here: (1) Rocchio, which provides feedback in the vector-space retrieval model, and (2) model-based feedback in the KL-divergence retrieval model, which provides feedback in the language modeling approach. There has also been some recent work on axiomatic analysis of pseudo-relevance feedback methods.
The concrete task for this project would be to add pseudo-relevance feedback method(s) to MeTA. In particular, exploring the methods modified using axiomatic analysis would be good.
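As a rough illustration of the Rocchio idea, here is a minimal Python sketch that treats the top-ranked documents as pseudo-relevant and moves the query vector toward their centroid. The function name `rocchio_expand`, the dictionary-based vectors, and the default weights are assumptions made for this example; none of this is MeTA's API, and a real implementation would work against MeTA's index.

```python
from collections import Counter


def rocchio_expand(query_vec, feedback_docs, alpha=1.0, beta=0.75, top_terms=10):
    """Return a Rocchio-updated query vector using positive feedback only.

    query_vec and every entry of feedback_docs map term -> weight
    (for example TF-IDF weights). All names here are hypothetical.
    """
    updated = Counter()
    # Keep the original query, scaled by alpha.
    for term, weight in query_vec.items():
        updated[term] += alpha * weight
    # Add the centroid of the pseudo-relevant documents, scaled by beta.
    if feedback_docs:
        scale = beta / len(feedback_docs)
        for doc in feedback_docs:
            for term, weight in doc.items():
                updated[term] += scale * weight
    # Keep all original query terms plus the strongest new expansion terms.
    expansion = [t for t, _ in updated.most_common() if t not in query_vec]
    keep = set(query_vec) | set(expansion[:top_terms])
    return {t: w for t, w in updated.items() if t in keep}


top_docs = [{"jaguar": 2.0, "cat": 1.0}, {"cat": 1.0, "car": 0.5}]
new_query = rocchio_expand({"jaguar": 1.0}, top_docs, beta=0.5, top_terms=1)
```

Limiting the number of expansion terms keeps the expanded query short, which matters for retrieval speed; how many terms to keep is a tuning choice.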
Query Spelling Correction
One major problem in information retrieval systems occurs when users misspell query terms, which often results in poor retrieval performance. There are many models that can be used to automatically detect and correct spelling mistakes.
The concrete task for this project would be to implement spelling correction method(s) in MeTA.
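One of the simplest baselines is to replace an out-of-vocabulary query term with the closest vocabulary term by edit distance, breaking ties by corpus frequency. The sketch below assumes a hypothetical `vocab_counts` dictionary of term frequencies and scans the whole vocabulary, which is only practical for small vocabularies; it is a starting point, not a recommendation of a particular model.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct(word, vocab_counts, max_distance=2):
    """Return the closest in-vocabulary term, or the word itself if none is close."""
    if word in vocab_counts:
        return word
    best = None
    for candidate, count in vocab_counts.items():
        distance = edit_distance(word, candidate)
        if distance <= max_distance:
            # Prefer smaller distance, then higher frequency, then alphabetical order.
            key = (distance, -count, candidate)
            if best is None or key < best:
                best = key
    return best[2] if best else word


vocab_counts = {"retrieval": 50, "retrieve": 20, "information": 80}
fixed = [correct(w, vocab_counts) for w in "informaton retreival".split()]
```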
Learning to Rank
A very popular approach for formulating a ranking function for information retrieval tasks is to use learning to rank, where a ranking function is constructed by learning weights for different features based on some training data (queries + relevance judgments). MeTA currently has implementations of some of the most popular classifiers, but does not yet leverage them to provide a ranking function based on learning to rank. There are many formulations for learning to rank; this one might be the simplest to implement.
The concrete task for this project would be to add learning to rank method(s) to MeTA. It would be great if you could compare more than one method.
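To make the idea concrete, here is a minimal sketch of one common formulation, the pairwise approach: every pair of documents for the same query with different relevance labels becomes a training example, and a linear model is trained so that the more relevant document scores higher. This is not necessarily the formulation referred to above, and the function names and data layout are assumptions for illustration only.

```python
import math


def train_pairwise(queries, epochs=50, lr=0.1):
    """Learn linear ranking weights with a pairwise logistic loss.

    queries is a list of queries; each query is a list of
    (feature_vector, relevance_label) tuples. Hypothetical layout.
    """
    dim = len(queries[0][0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for docs in queries:
            for xi, yi in docs:
                for xj, yj in docs:
                    if yi <= yj:
                        continue  # only use pairs where doc i is more relevant
                    diff = [a - b for a, b in zip(xi, xj)]
                    margin = sum(wk * dk for wk, dk in zip(w, diff))
                    # Gradient of log(1 + exp(-margin)) with respect to margin.
                    g = -1.0 / (1.0 + math.exp(margin))
                    w = [wk - lr * g * dk for wk, dk in zip(w, diff)]
    return w


def rank(w, docs):
    """Return document indices sorted from highest to lowest score."""
    scores = [sum(wk * xk for wk, xk in zip(w, x)) for x in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])


training = [
    [([3.0, 0.1], 2), ([1.0, 0.9], 1), ([0.2, 0.5], 0)],
    [([2.5, 0.7], 1), ([0.5, 0.2], 0)],
]
weights = train_pairwise(training)
order = rank(weights, [[0.1, 0.3], [2.0, 0.3]])
```

In practice the feature vectors would hold things like retrieval scores from existing rankers, document length, and query term statistics.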
Topic Model Improvements
A major component of MeTA is a library for topic modeling. Here are some projects that would improve MeTA’s topic modeling library.
Hyperparameter optimization for LDA
This project would focus on adding methods for optimizing the Dirichlet prior hyperparameters (the document-topic prior and the topic-word prior) of the LDA topic models in MeTA. Most of the existing methods assume these priors are (1) symmetric and (2) fixed. However, there is work suggesting that setting these hyperparameters properly is very important.
The concrete task for this project would be to add the hyperparameter optimization methods mentioned in the previous paper and Hannah Wallach’s thesis (see chapter 2) to MeTA to improve the topic modeling performance. You should implement a few of these optimization methods and compare their effectiveness in a similar way to the Wallach paper.
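As one example of what such an optimizer can look like, the sketch below applies Minka's fixed-point iteration to an asymmetric document-topic prior, given the per-document topic counts from a sampler. The function names and the plain list-of-lists count layout are assumptions for illustration. Because the counts are integers, the digamma differences in the update can be computed exactly as finite sums, so no special-function library is needed.

```python
def digamma_diff(alpha, n):
    """Psi(alpha + n) - Psi(alpha) for a non-negative integer n, as a finite sum."""
    return sum(1.0 / (alpha + i) for i in range(n))


def update_alpha(doc_topic_counts, alpha, iterations=20):
    """Fixed-point updates for an asymmetric Dirichlet prior over topics.

    doc_topic_counts[d][k] is the number of tokens in document d assigned
    to topic k. Assumes at least one non-empty document.
    """
    alpha = list(alpha)
    doc_lengths = [sum(row) for row in doc_topic_counts]
    for _ in range(iterations):
        alpha_sum = sum(alpha)
        denom = sum(digamma_diff(alpha_sum, n_d) for n_d in doc_lengths)
        alpha = [
            a_k * sum(digamma_diff(a_k, row[k]) for row in doc_topic_counts) / denom
            for k, a_k in enumerate(alpha)
        ]
    return alpha


counts = [[8, 1, 1], [6, 3, 1], [9, 0, 1]]
learned = update_alpha(counts, [0.5, 0.5, 0.5])
```

Topics that are used heavily across documents end up with larger prior weight, which is the asymmetry a fixed symmetric prior cannot express.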
Speed up Collapsed Gibbs Sampling with Sparse Sampling Methods
A lot of effort has been put into speeding up the collapsed Gibbs sampling algorithm for inference in LDA. One approach leverages the sparsity of the full conditional distribution to speed up sampling.
The concrete task for this project would be to speed up the existing Gibbs sampling methods in MeTA (parallel_lda_gibbs) by exploiting the sparsity of the full conditional distribution using the method described in the above paper.
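To show where the sparsity comes from, the sketch below splits the unnormalized full conditional for one token into three parts: a smoothing-only part, a part that is nonzero only for topics present in the document, and a part that is nonzero only for topics in which the word occurs. This is one well-known decomposition and may differ from the scheme in the referenced paper. The names are hypothetical, and a real implementation would cache the first two parts instead of recomputing them per token.

```python
def dense_weights(alpha, beta, vocab_size, n_dk, n_wk, n_k):
    """Unnormalized full conditional for one token, computed over all topics."""
    return [
        (alpha[k] + n_dk[k]) * (beta + n_wk[k]) / (beta * vocab_size + n_k[k])
        for k in range(len(alpha))
    ]


def sparse_entries(alpha, beta, vocab_size, n_dk, n_wk, n_k):
    """The same mass split into three buckets, listed as (topic, mass) pairs."""
    topics = range(len(alpha))
    denom = [beta * vocab_size + n_k[k] for k in topics]
    # Topic-word bucket: only topics in which this word currently occurs.
    q = [(k, (alpha[k] + n_dk[k]) * n_wk[k] / denom[k]) for k in topics if n_wk[k] > 0]
    # Document-topic bucket: only topics currently present in this document.
    r = [(k, n_dk[k] * beta / denom[k]) for k in topics if n_dk[k] > 0]
    # Smoothing bucket: dense, but independent of the document and the word.
    s = [(k, alpha[k] * beta / denom[k]) for k in topics]
    return q + r + s


def sample_topic(u, entries):
    """Pick a topic given a uniform draw u in [0, 1) by walking the buckets."""
    x = u * sum(mass for _, mass in entries)
    for k, mass in entries:
        x -= mass
        if x < 0:
            return k
    return entries[-1][0]  # guard against floating-point round-off


alpha, beta, vocab_size = [0.1] * 4, 0.01, 1000
n_dk, n_wk, n_k = [5, 0, 2, 0], [0, 3, 0, 1], [40, 25, 60, 10]
entries = sparse_entries(alpha, beta, vocab_size, n_dk, n_wk, n_k)
```

Most of the probability mass usually sits in the topic-word bucket, so walking it first means the sampler rarely has to touch the dense part.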
Make “Stochastic Collapsed Variational Inference” Actually Stochastic
This project would focus on MeTA’s current implementation of Stochastic Collapsed Variational Inference for LDA. Currently, the implementation of this method is essentially still a batch implementation: there is no way to feed documents into the method in a streaming fashion.
The concrete task for this project would be to modify the existing SCVB0 method in MeTA to support streaming document collections.
Parallelized CVB0
This project would focus on MeTA’s current implementation of CVB0. The paper that introduced CVB0 also discussed a parallelized implementation, but did not describe how this was actually achieved. We suspect this is essentially the same as was done for AD-LDA, which exists as
The concrete task for this project would be to add a parallelized implementation of CVB0 to MeTA that mirrors
Supervised LDA
There are many modifications to LDA. One of these incorporates document labels into the inference process, where the labels might be discrete (like classes) or real-valued (like ratings). This method is called supervised LDA.
The concrete task for this project would be to add a supervised LDA implementation to MeTA’s topic modeling library.
Latent Aspect Rating Analysis
One extension of LDA from Prof. Zhai’s group is LARA, along with its variant LARAM. These models can be used to mine review corpora for aspects, the scores each reviewed item receives on those aspects, and the importance different reviewers place on them.
The concrete task for this project would be to add an implementation of LARAM to MeTA’s topic modeling library.
Word Embedding Improvements
Word embeddings are a recent addition to MeTA. Currently, MeTA implements the GloVe algorithm for word embeddings, which we discussed in lecture.
Skip-Gram Negative Sampling
Perhaps the most popular word embedding method today is the Skip-Gram model trained with negative sampling (SGNS), which makes it an important baseline for other word embedding methods.
The concrete task for this project would be to add an implementation of skip-gram negative sampling to MeTA’s word embedding library. It would then be good to compare the performance of GloVe with that of SGNS.
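For orientation, here is a minimal sketch of a single stochastic gradient step for skip-gram with negative sampling, written with plain Python lists. The function names, initialization scheme, and learning rate are assumptions for this example. It leaves out everything a real trainer needs, such as drawing negatives from a smoothed unigram distribution, subsampling frequent words, and multithreading.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def init_embeddings(vocab_size, dim, seed=0):
    """Small random input vectors and zero output vectors (a common choice)."""
    rng = random.Random(seed)
    w_in = [[(rng.random() - 0.5) / dim for _ in range(dim)] for _ in range(vocab_size)]
    w_out = [[0.0] * dim for _ in range(vocab_size)]
    return w_in, w_out


def sgns_step(center, context, negatives, w_in, w_out, lr=0.05):
    """One SGD step on -log s(v.u_ctx) - sum over negatives of log s(-v.u_neg)."""
    v = w_in[center]
    grad_v = [0.0] * len(v)
    for word, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = w_out[word]
        g = sigmoid(dot(v, u)) - label  # derivative of the loss w.r.t. the score
        for i in range(len(v)):
            grad_v[i] += g * u[i]   # accumulate using the old output vector
            u[i] -= lr * g * v[i]   # update the output vector in place
    for i in range(len(v)):
        v[i] -= lr * grad_v[i]      # update the input vector last


w_in, w_out = init_embeddings(vocab_size=5, dim=8)
for _ in range(100):
    sgns_step(center=0, context=1, negatives=[2, 3], w_in=w_in, w_out=w_out)
```

After these updates the score for the observed pair (0, 1) is pushed up while the scores for the sampled negatives are pushed down.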
Word Mover’s Distance
It is natural to start to think about embeddings of larger objects than individual words. For example, one can create an embedding for a document and thus compute similarities between documents instead of similarities between words.
The concrete task for this project would be to add an implementation of word mover’s distance to MeTA.
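Computing the exact word mover’s distance requires solving an optimal transport problem, which is too long to sketch here. The example below instead computes a relaxed lower bound in which every word sends all of its mass to the nearest word in the other document, taking the larger of the two directions. It is meant only to convey the idea of comparing documents through word embeddings; the function names and toy embeddings are made up for illustration, and both documents are assumed to be non-empty and fully covered by the embedding table.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nbow(tokens):
    """Normalized bag-of-words weights for a non-empty token list."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return {token: count / len(tokens) for token, count in counts.items()}


def relaxed_wmd(doc_a, doc_b, embeddings):
    """Relaxed lower bound on word mover's distance (not the exact distance)."""
    def one_sided(src, dst):
        # Each source word moves all of its mass to its nearest target word.
        return sum(
            weight * min(euclidean(embeddings[t], embeddings[u]) for u in dst)
            for t, weight in src.items()
        )

    a, b = nbow(doc_a), nbow(doc_b)
    return max(one_sided(a, b), one_sided(b, a))


# Toy two-dimensional embeddings, invented for this example.
embeddings = {
    "obama": [1.0, 0.0], "president": [1.0, 0.1],
    "speaks": [0.0, 1.0], "greets": [0.0, 0.9],
    "banana": [-1.0, 0.0], "fruit": [-1.0, 0.1],
}
near = relaxed_wmd(["obama", "speaks"], ["president", "greets"], embeddings)
far = relaxed_wmd(["obama", "speaks"], ["banana", "fruit"], embeddings)
```

Documents that share no words can still come out as close when their words are near each other in the embedding space, which is the main appeal of this family of distances.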