Assignment #6: Word Embeddings
Notice: This assignment is due Saturday, October 29th at 11:59pm.
Please submit your solutions via Compass. You should submit your assignment as a PDF, and your accompanying code as either a .zip or .tar.gz containing your files.
So, since my NetID is
geigle1, I would submit the following files:
In this assignment you will investigate a different method for computing word embeddings and compare it against a state-of-the-art method based on matrix factorization called GloVe.
Warning: This assignment requires programming using Python. You should install Python 3 on your system in order to do the assignment.
Please make sure you’ve completed the setup instructions well before the deadline, as we cannot guarantee that you’ll get your installation questions answered at the last minute (since we will prioritize other assignment questions first, since by definition they are closer to finishing than you!).
First, install Python 3 if you haven’t already. It should be available on Linux, OS X, and Windows. Instructions for installing Python 3 should be available on the Python website.
Next, you’ll need to install some supporting libraries using Python’s
If you need to, you can use the EWS remote machines if you have trouble getting Python or the above libraries installed (we strongly encourage you to use your own machine if you can, however). Once you SSH into an EWS machine, you should be able to use Python 3 by running
and then install the required libraries with
1. Word Embeddings via Singular Value Decomposition [45 pts]
Info: This assignment follows closely the following paper:
Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3:211–225. (pdf)
Recall that the goal of word embedding methods is to derive a low-dimensional continuous vector representation for words so that words that are syntactically or semantically related are close together in that vector space (and, thus, share a similar representation).
In lecture we discussed a few methods for doing this using probabilistic modeling (word2vec) and matrix factorization (GloVe). Following the idea of word embeddings resulting from factorizing a large, sparse matrix, we can formulate an embedding method via computing a rank- matrix factorization to approximate some large sparse matrix of interest.
In assignment 2 we asked you to compute the mutual information between words based on their co-occurrence within documents modeled using bit vectors. This measure allowed us to compute a score for each pair of words that captured how related they were. We will leverage this basic idea again, but this time we will compute embeddings using pointwise mutual information.
Pointwise mutual information, or PMI, is the (unweighted) term that occurs inside of the summation of mutual information and measures the correlation between two specific events. Specifically, PMI is defined as
and measures the (log) ratio of the joint probability of the two events as compared to the joint probability of the two events assuming they were independent. Thus, PMI is high when the two events and co-occur with higher probability than would be expected if they were independent.
If we suppose that and are words, we can measure how likely we see and together compared to what we would expect of they were unrelated by computing their PMI under some model for the joint probability .
Let represent a collection of observed word-context pairs (with contexts being other words). We can construct by considering the full context of a specific word occurrence as the collection of all word occurrences that appear within a fixed-size window of length before and after it.
For a specific word in position in a large, ordered collection of words , we would have the context as and could thus collect counts (a total of ) of each of those words as appearing in the context of word . We will refer to as the “target word” and the words appearing in the -sized window around as “context words”.
Consider a sample corpus containing only one sentence:
The dog ran around the park.
We can construct by considering each word position and extracting the pairs for . In such a pair, we would call the “target word” and the “context word”.
For example, we would extract the following pairs for () if we let our window size :
Similarly, for , we would extract the following pairs:
Let’s let represent the number of times we observe word type in the context of word type . We can then define as the number of times we see a “target” word in the collection of pairs and as the number of times we see the context word in the collection of pairs . (A common modification is to compute these counts by weighting the contexts by some function of their distance away from the target word. In this assignment, we’ll use the simple harmonic function where represents the distance between the word and a particular context word.)
We can then define the joint probability of a word and a context word as
where is simply the total number of word-context occurrences we see (which might be fractional if we weight counts like mentioned above). Similarly, we can define
and thus the PMI between a word and context word is
If we compute the PMI between all pairs of words in our vocabulary , we will arrive at a large, real-valued matrix. However, some of the values of this matrix will be if the word-context pair is unobserved. To remedy this, we could simply define a modified PMI that is equal to 0 when . An alternative approach is to instead compute the positive pointwise mutual information (PPMI) which truncates all negative PMI values to 0:
The resulting PPMI matrix is very sparse and can be approximated well via a rank- singular value decomposition. We can use the resulting left singular vectors () of dimension as our vector representations for each of the words in our vocabulary.
Please grab the assignment’s skeleton code and data files
(warning: large file, approximately 100MB). This archive is compressed
xz, and can be decompressed using 7-zip on Windows or the
utility on OS X and Linux via
tar xvf assignment6.tar.xz. Once you’ve
decompressed the archive, you should see the following files of interest:
cooccur.npz, which is the (weighted) co-occurrence matrix in NumPy format;
analogies.txt, which is a newline delimited file of analogies (A is to B as C is to D) formatted as
a b c d;
similarity.txt, which is a newline delimited file of words;
make_embeddings.py, which is a script to compute word embeddings using rank- sparse SVD;
query_embeddings.py, which is a simple demo script for you to interactively play around with your generated word embeddings;
analogies.py, which is a simple script to compute the MRR for an analogy solving task;
similarity.py, which is a simple script to perform a user evaluation on a word similarity task
The remainder of the assignment will be using this data and code.
[30 pts] Fill in the blank section of the code (clearly marked) in
make_embeddings.pyto modify the matrix variable PPMI (which currently just contains ) to contain the pointwise mutual information for each pair of words. (The two lines below that convert the PMI matrix into the PPMI matrix by dropping all negative values.)
It is common to evaluate word embeddings on analogy tasks of the form “X is to Y as Z is to __”.
[10 pts] Let be the unit-length embedding for X, be the unit-length embedding for Y, and be the unit-length embedding for Z. Provide a scoring function for ranking words that likely complete this analogy using vector operations like addition, subtraction, norms, and dot product.
[5 pts] We have provided the script
query_embeddings.pyto allow you to try querying your own analogies on your learned word embeddings from part (a).
Try querying for “man is to king as woman is to __”. Do you get what you expect? Why?
Come up with a few analogies of your own (for which all query words occur in our vocabulary, otherwise
query_embeddings.pywill error out). Qualitatively, how well do your word embeddings perform?
2. Evaluating Word Embeddings via an Analogy Task [30 pts]
Info: The data used here is a subset of the data used in the original word2vec paper and is available here.
Now let’s do a more formal evaluation of the PPMI SVD-based word embeddings. Like above, we can formulate the problem of analogy solving as a ranking problem and return a top-k list of words that might answer the analogy.
We could use an evaluation metric like accuracy to assess how well our embeddings answer analogy questions. (If we do so, it is common to eliminate any of the query words from the ranking.) Instead, however, we could compute the MRR (mean reciprocal rank) of the desired word in our ranked list of candidate words.
[10 pts] Why might we prefer to use MRR over accuracy? Give one scenario where MRR is more informative than accuracy.
Why might we prefer accuracy over MRR? Give one scenario where accuracy is more informative than MRR.
[5 pts] In
analogies.py, modify the function
compute_reciprocal_rankto compute the reciprocal rank of a word given a ranked list of words. (If the word does not occur in the list, return 0).
The standard way of computing the embeddings using SVD is to let the embedding matrix be
A more general form of this
where is some weight for the singular values.
analogies.py, compute the MRR for three different embedding methods:
- SVD with (default)
- SVD with (modify
make_embeddings.pyto accomplish this)
- SVD with (modify
- GloVe (provided; run with
--gloveas the first argument instead of a filename for an SVD embedding)
Report these values in a table in your writeup.
[5 pts] How does performance vary for the SVD methods with the setting of ?
[10 pts] Is the best performing method statistically significantly better than the method that gets second place? (Hint: use a -test.)
3. Evaluating Word Embeddings via a Similarity Task [25 pts]
Info: The queries used in this part of the assignment are a subset of the query inventory available here.
Another method for evaluating word embeddings is on similarity tasks. Queries take the form of a single word, and the goal is to return the most similar word to the query word.
While we could evaluate this using a metric like accuracy or MRR if we knew in advance the exact word we wanted to define as “most similar”, another more general method of performing an evaluation like this is to use a user study. Usually, this involves recruiting some experimental subjects and asking them to complete some task using your model. As a result, they produce something that can be used to assess the quality of a method. A very popular approach recently is to recruit these annotators via Amazon Mechanical Turk or other crowdsourcing platforms.
similarity.py we have set up a user task for evaluating word embedding
methods. For each word in a set of query words, we return the top most
similar word (that is not the query word) for four different methods: SVD
with , , , and
GloVe. The user is asked to select the “best” word in the list of up to
four words provided (which will be presented in random order). If two
methods produce the same top word, it is only shown once.
Then, we can compute the “win ratio” for a given method as the number of times the user picked that method’s top word divided by the total number of queries presented to the user.
[5 pts] Run
similarity.pyto simulate running the user study with you as the annotator. (You can’t really do this for research publications, but here we’re just trying to show you what that procedure might look like). You’ll need to invoke it with the paths to your SVD embedding files in the following order: , , and
Report the win ratios reported for each method in a table in your report. Which method performed the best on this task? Is that what you expected?
[10 pts] Is the best performing method statistically significantly better than the second-best?
[5 pts] This user study setup has a potential problem when a user thinks two words in the list are equally good. (Did this happen to you?) Describe how you could address this problem by modifying the user study design.
[5 pts] Another problem could occur when all of the top words are equally bad. (Did this happen to you?) Describe how you could address this problem by modifying the user study design.