MP2: Word Embeddings
Notice: This assignment is due
Tuesday, October 10th
Friday, October 13th at 11:59pm.
This is a programming assignment, so submission procedures are slightly
different than for regular written assignments. You will upload your code
as well as your write-up (in either Markdown or PDF) to your
git repository. Concrete instructions on how to do so are included in
this assignment specification. Make sure that you double-check that you
can see your submission in your
In this assignment you will investigate a different method for computing word embeddings and compare it against a state-of-the-art method based on matrix factorization called GloVe.
Warning: This assignment requires programming using Python. You should install Python 3 on your system in order to do the assignment.
Please make sure you’ve completed the setup instructions well before the deadline, as we cannot guarantee that you’ll get your installation questions answered at the last minute (since we will prioritize other assignment questions first, since by definition they are closer to finishing than you!).
First, install Python 3 if you haven’t already. It should be available on Linux, OS X, and Windows. Instructions for installing Python 3 should be available on the Python website.
Next, you’ll need to install some supporting libraries using Python’s
Info: On Windows, you may need to add the python install path and the
Scripts sub-directory to your path.
Assuming you installed Python to the directory
C:\Python36, this would
look something like this in a command prompt:
You can refer to this StackOverflow post for more information.
If you need to, you can use the EWS remote machines if you have trouble getting Python or the above libraries installed (we strongly encourage you to use your own machine if you can, however). Once you SSH into an EWS machine, you should be able to use Python 3 by running
and then install the required libraries with
1. Word Embeddings via Singular Value Decomposition [45 pts]
Info: This assignment follows closely the following paper:
Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3:211–225. (pdf)
Recall that the goal of word embedding methods is to derive a low-dimensional continuous vector representation for words so that words that are syntactically or semantically related are close together in that vector space (and, thus, share a similar representation).
In lecture we discussed a few methods for doing this using probabilistic modeling (word2vec) and matrix factorization (GloVe). Following the idea of word embeddings resulting from factorizing a large, sparse matrix, we can formulate an embedding method via computing a rank- matrix factorization to approximate some large sparse matrix of interest.
In assignment 2 we asked you to compute the mutual information between words based on their co-occurrence within documents modeled using bit vectors. This measure allowed us to compute a score for each pair of words that captured how related they were. We will leverage this basic idea again, but this time we will compute embeddings using pointwise mutual information.
Pointwise mutual information, or PMI, is the (unweighted) term that occurs inside of the summation of mutual information and measures the correlation between two specific events. Specifically, PMI is defined as
and measures the (log) ratio of the joint probability of the two events as compared to the joint probability of the two events assuming they were independent. Thus, PMI is high when the two events and co-occur with higher probability than would be expected if they were independent.
If we suppose that and are words, we can measure how likely we see and together compared to what we would expect of they were unrelated by computing their PMI under some model for the joint probability .
Let represent a collection of observed word-context pairs (with contexts being other words). We can construct by considering the full context of a specific word occurrence as the collection of all word occurrences that appear within a fixed-size window of length before and after it.
For a specific word in position in a large, ordered collection of words , we would have the context as and could thus collect counts (a total of ) of each of those words as appearing in the context of word . We will refer to as the “target word” and the words appearing in the -sized window around as “context words”.
Consider a sample corpus containing only one sentence:
The dog ran around the park.
We can construct by considering each word position and extracting the pairs for . In such a pair, we would call the “target word” and the “context word”.
For example, we would extract the following pairs for () if we let our window size :
Similarly, for , we would extract the following pairs:
Let’s let represent the number of times we observe word type as a context word of word type . We can then define as the number of times we see a “target” word in the collection of pairs and as the number of times we see the context word in the collection of pairs . (A common modification is to compute these counts by weighting the contexts by some function of their distance away from the target word. In this assignment, we’ll use the simple harmonic function where represents the distance between the word and a particular context word.)
We can then define the joint probability of a word and a context word as
where is simply the total number of word-context occurrences we see (which might be fractional if we weight counts like mentioned above). Similarly, we can define
and thus the PMI between a word and context word is
If we compute the PMI between all pairs of words in our vocabulary , we will arrive at a large, real-valued matrix. However, some of the values of this matrix will be if the word-context pair is unobserved. To remedy this, we could simply define a modified PMI that is equal to 0 when . An alternative approach is to instead compute the positive pointwise mutual information (PPMI) which truncates all negative PMI values to 0:
The resulting PPMI matrix is very sparse and can be approximated well via a rank- singular value decomposition. We can use the resulting left singular vectors () of dimension as our vector representations for each of the words in our vocabulary.
You should have a project on GitLab called
YOUR_USERNAME_HERE/cs510-fa17-mp2. Once you have cloned your
specific repository, you should see the following files of interest:
analogies.txt, which is a newline delimited file of analogies (A is to B as C is to D) formatted as
a b c d;
similarity.txt, which is a newline delimited file of words;
make_embeddings.py, which is a script to compute word embeddings using rank- sparse SVD;
query_embeddings.py, which is a simple demo script for you to interactively play around with your generated word embeddings;
analogies.py, which is a simple script to compute the MRR for an analogy solving task;
similarity.py, which is a simple script to perform a user evaluation on a word similarity task
The remainder of the assignment will be using this data and code.
[24 pts] Fill in the blank section of the code (clearly marked) in
make_embeddings.pyto modify the matrix variable PPMI, which currently just contains
PPMI[w, c](that is, the rows are for and the columns are for ), to contain the pointwise mutual information for each pair of words. (The two lines below that convert the PMI matrix into the PPMI matrix by dropping all negative values.)
- [6 pts] The standard way of computing the embeddings using SVD is
to let the embedding matrix be
A more general form of this is
where is some weight for the singular values.
Using GitLab CI, create and download a separate embedding matrix for (default), , and by modifying the value of in
Once you’ve filled in the code in
make_embeddings.py, you can use it to generate an embedding matrix by using GitLab CI (just like we generated the tables for MP1). If you go to your repository, and then click on Pipelines in the sidebar, you should see a familiar page. Run the make_embeddings job, and it should produce a file called
svd_embeddings.npy. Please make a separate file for the above values of and save those files in your local directory for use later.
It is common to evaluate word embeddings on analogy tasks of the form “X is to Y as Z is to __”.
[10 pts] Let be the unit-length embedding for X, be the unit-length embedding for Y, and be the unit-length embedding for Z. Provide a scoring function for ranking words that likely complete this analogy using vector operations like addition, subtraction, norms, and dot product.
[5 pts] We have provided the script
query_embeddings.pyto allow you to try querying your own analogies on your learned word embeddings from part (b).
Try querying for “man is to king as woman is to __”. Do you get what you expect? Why?
Come up with a few analogies of your own (for which all query words occur in our vocabulary, otherwise
query_embeddings.pywill error out). Qualitatively, how well do your word embeddings perform?
2. Evaluating Word Embeddings via an Analogy Task [30 pts]
Info: The data used here is a subset of the data used in the original word2vec paper and is available here.
Now let’s do a more formal evaluation of the PPMI SVD-based word embeddings. Like above, we can formulate the problem of analogy solving as a ranking problem and return a top-k list of words that might answer the analogy.
We could use an evaluation metric like accuracy to assess how well our embeddings answer analogy questions. (If we do so, it is common to eliminate any of the query words from the ranking.) Instead, however, we could compute the MRR (mean reciprocal rank) of the desired word in our ranked list of candidate words.
[10 pts] Why might we prefer to use MRR over accuracy? Give one scenario where MRR is more informative than accuracy.
Why might we prefer accuracy over MRR? Give one scenario where accuracy is more informative than MRR.
[5 pts] In
analogies.py, modify the function
compute_reciprocal_rankto compute the reciprocal rank of a word given a ranked list of words. (If the word does not occur in the list, return 0).
analogies.py, compute the MRR for four different embedding methods:
- SVD with
- SVD with
- SVD with
- GloVe (provided; run with
gloveas the filename argument)
Report these values in a table in your writeup.
[5 pts] How does performance vary for the SVD methods with the setting of ?
[5 pts] Fill in the code at the bottom of
analogies.pyto run a -test when there are two files/method names provided as arguments to the program.
recip_rankscontains a list of the reciprocal rank scores for every query for the first method, and
recip_rankscontains a list of the reciprocal rank scores for every query for the second method.
You may find this documentation about the
[5 pts] Based on your -test, would you conclude that the best performing method of the four is statistically significantly better than the second best method?
3. Evaluating Word Embeddings via a Similarity Task [25 pts]
Info: The queries used in this part of the assignment are a subset of the query inventory available here.
Another method for evaluating word embeddings is on similarity tasks. Queries take the form of a single word, and the goal is to return the most similar word to the query word.
While we could evaluate this using a metric like accuracy or MRR if we knew in advance the exact word we wanted to define as “most similar”, another more general method of performing an evaluation like this is to use a user study. Usually, this involves recruiting some experimental subjects and asking them to complete some task using your model. As a result, they produce something that can be used to assess the quality of a method. A very popular approach recently is to recruit these annotators via Amazon Mechanical Turk or other crowdsourcing platforms.
similarity.py we have set up a user task for evaluating word embedding
methods. For each word in a set of query words, we return the top most
similar word (that is not the query word) for four different methods: SVD
with , , , and
GloVe. The user is asked to select the “best” word in the list of up to
four words provided (which will be presented in random order). If two
methods produce the same top word, it is only shown once.
Then, we can compute the “win ratio” for a given method as the number of times the user picked that method’s top word divided by the total number of queries presented to the user.
[5 pts] Fill in the code at the bottom of
similarity.pyto perform a -test for the best performing method against the second best performing method.
Note that each
ilost on query
iwon on query
j. The methods are SVD (), SVD (), SVD (), and GloVe, in that order.
[5 pts] Run
similarity.pyto simulate running the user study with you as the annotator. (You can’t really do this for research publications, but here we’re just trying to show you what that procedure might look like). You’ll need to invoke it with the paths to your SVD embedding files in the following order: , , and
Report the win ratios reported for each method in a table in your report. Which method performed the best on this task? Is that what you expected?
[5 pts] According to your -test, Is the best performing method statistically significantly better than the second-best?
[5 pts] This user study setup has a potential problem when a user thinks two words in the list are equally good. (Did this happen to you?) Describe how you could address this problem by modifying the user study design.
[5 pts] Another problem could occur when all of the top words are equally bad. (Did this happen to you?) Describe how you could address this problem by modifying the user study design.
4. Triple-Check Your Repository
At this point, you should triple-check that everything needed for grading is part of your repository, including your write-up itself. I strongly encourage you to try to use Markdown for this (and use GitLab’s built-in support for LaTeX for any math), but as a fallback option you are allowed to commit a typeset PDF (no scanned documents, please).
In particular, you need to have the following things present in your repository:
- your writeup as a
make_embeddings.pyshould be updated with your code to compute the PMI matrix,
analogies.pyshould be updated to compute reciprocal rank properly, as well as contain a proper -test for comparing two methods, and
similarity.pyshould be updated to contain a proper -test for comparing the best and second best method.