CS 598cxz Fall 2016

Assignment #6: Word Embeddings

Notice: This assignment is due Saturday, October 29th at 11:59pm.

Please submit your solutions via Compass. You should submit your assignment as a PDF, and your accompanying code as either a .zip or .tar.gz containing your files.

For example, since my NetID is geigle1, I would submit the following files:

  • geigle1-assignment6.pdf
  • geigle1-assignment6.tar.gz or geigle1-assignment6.zip

In this assignment you will investigate a different method for computing word embeddings and compare it against a state-of-the-art method based on matrix factorization called GloVe.

Warning: This assignment requires programming using Python. You should install Python 3 on your system in order to do the assignment.

Please make sure you’ve completed the setup instructions well before the deadline, as we cannot guarantee that your installation questions will be answered at the last minute (we will prioritize other assignment questions first, since by definition those students are closer to finishing than you!).

Setup Instructions

First, install Python 3 if you haven’t already. It should be available on Linux, OS X, and Windows. Instructions for installing Python 3 should be available on the Python website.

Next, you’ll need to install some supporting libraries using Python’s pip:

pip install --upgrade pip
pip install numpy scipy metapy
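
If you want to sanity-check your installation, the following snippet (purely optional; just a suggestion on our part) should run without errors under Python 3:

import numpy
import scipy
import metapy

print("numpy version:", numpy.__version__)
print("scipy version:", scipy.__version__)
print("metapy imported successfully")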

EWS Instructions

If you have trouble getting Python or the above libraries installed, you can use the EWS remote machines (we strongly encourage you to use your own machine if you can, however). Once you SSH into an EWS machine, you should be able to use Python 3 by running

module load python3

and then install the required libraries with

pip install --user numpy scipy metapy

1. Word Embeddings via Singular Value Decomposition [45 pts]

Info: This assignment is closely based on the following paper:

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3:211–225. (pdf)

Recall that the goal of word embedding methods is to derive a low-dimensional continuous vector representation for words so that words that are syntactically or semantically related are close together in that vector space (and, thus, share a similar representation).

In lecture we discussed a few methods for doing this using probabilistic modeling (word2vec) and matrix factorization (GloVe). Following the idea of word embeddings resulting from factorizing a large, sparse matrix, we can formulate an embedding method via computing a rank-$k$ matrix factorization to approximate some large sparse matrix of interest.

In assignment 2 we asked you to compute the mutual information between words based on their co-occurrence within documents modeled using bit vectors. This measure allowed us to compute a score for each pair of words that captured how related they were. We will leverage this basic idea again, but this time we will compute embeddings using pointwise mutual information.

Pointwise mutual information, or PMI, is the (unweighted) term that occurs inside of the summation of mutual information and measures the correlation between two specific events. Specifically, PMI is defined as

$$PMI(a, b) = \log \frac{p(a,b)}{p(a)p(b)}$$

and measures the (log) ratio of the joint probability of the two events to the joint probability they would have if they were independent. Thus, PMI is high when the two events $a$ and $b$ co-occur with higher probability than would be expected if they were independent.

If we suppose that $a$ and $b$ are words, we can measure how likely we are to see $a$ and $b$ together compared to what we would expect if they were unrelated by computing their PMI under some model for the joint probability $p(a, b)$.
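
As a toy numerical illustration (the probabilities here are made up): if $p(a,b) = 0.01$ while $p(a) = 0.1$ and $p(b) = 0.05$, then independence would predict $p(a)p(b) = 0.005$, so $PMI(a,b) = \log \frac{0.01}{0.005} = \log 2 > 0$, indicating that $a$ and $b$ co-occur more often than independence would suggest.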

Let $D$ represent a collection of observed word-context pairs (with contexts being other words). We can construct $D$ by considering the full context of a specific word occurrence as the collection of all word occurrences that appear within a fixed-size window of length $L$ before and after it.

For a specific word $w_i$ in position $i$ in a large, ordered collection of words $w_1, w_2, \ldots, w_N$, we would have the context $w_{i-L}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+L}$ and could thus collect counts (a total of $2L$) of each of those words as appearing in the context of word $w_i$. We will refer to $w_i$ as the “target word” and the words appearing in the $L$-sized window around $w_i$ as “context words”.

Consider a sample corpus containing only one sentence:

The dog ran around the park.

We can construct $D$ by considering each word position $i$ and extracting the pairs $(w_i, w_{i+k})$ for $-L \leq k \leq L$, $k \neq 0$. In such a pair, we would call $w_i$ the “target word” and $w_{i+k}$ the “context word”.

For example, we would extract the following pairs for $i = 4$ ($w_i = \text{around}$) if we let our window size $L = 2$:

$$(\text{around}, \text{dog}), (\text{around}, \text{ran}), (\text{around}, \text{the}), (\text{around}, \text{park}).$$

Similarly, for $i = 5$, we would extract the following pairs:

$$(\text{the}, \text{ran}), (\text{the}, \text{around}), (\text{the}, \text{park}).$$

Let $n_{w,c}$ represent the number of times we observe word type $c$ in the context of word type $w$. We can then define $n_w = \sum_{c'} n_{w,c'}$ as the number of times we see a “target” word $w$ in the collection of pairs $D$, and $n_c = \sum_{w'} n_{w',c}$ as the number of times we see the context word $c$ in the collection of pairs $D$. (A common modification is to compute these counts by weighting the contexts by some function of their distance from the target word. In this assignment, we’ll use the simple harmonic function $\frac{1}{d}$, where $d$ is the distance between the target word and a particular context word.)
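
To make the pair extraction and harmonic weighting concrete, here is a small sketch (illustrative only; the function name, tokenization, and data structures are made up and need not match the skeleton code):

from collections import defaultdict

def cooccurrence_counts(tokens, L=2):
    # Accumulate harmonically weighted word-context counts n_{w,c}:
    # a context word at distance d from the target contributes 1/d
    # (unweighted counts would simply add 1 instead).
    n = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for k in range(-L, L + 1):
            if k == 0 or not (0 <= i + k < len(tokens)):
                continue
            n[w][tokens[i + k]] += 1.0 / abs(k)
    return n

# Toy usage on the example sentence (lowercased, punctuation stripped):
tokens = "the dog ran around the park".split()
counts = cooccurrence_counts(tokens, L=2)
print(dict(counts["around"]))   # {'dog': 0.5, 'ran': 1.0, 'the': 1.0, 'park': 0.5}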

We can then define the joint probability of a word and a context word as

$$p(w, c) = \frac{n_{w,c}}{|D|}$$

where $|D|$ is simply the total number of word-context occurrences we see (which might be fractional if we weight counts as mentioned above). Similarly, we can define

$$p(w) = \frac{n_w}{|D|}$$

and

$$p(c) = \frac{n_c}{|D|}$$

and thus the PMI between a word $w$ and context word $c$ is

$$PMI(w, c) = \log \frac{p(w,c)}{p(w)p(c)} = \log \frac{n_{w,c} \cdot |D|}{n_w \cdot n_c}.$$

If we compute the PMI between all pairs of words in our vocabulary $V$, we will arrive at a large, real-valued matrix. However, some of the values of this matrix will be $\log 0$ if the word-context pair $(w,c)$ is unobserved. To remedy this, we could simply define a modified PMI that is equal to 0 when $n_{w,c} = 0$. An alternative approach is to instead compute the positive pointwise mutual information (PPMI), which truncates all negative PMI values to 0:

$$PPMI(w,c) = \max(0, PMI(w,c))$$

The resulting PPMI matrix is very sparse and can be approximated well via a rank-$k$ singular value decomposition. We can use the resulting left singular vectors ($U$) of dimension $k$ as our vector representations for each of the words in our vocabulary.
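
To make the above concrete, here is a rough sketch of the PPMI-plus-SVD pipeline (the variable and function names are hypothetical, and this is not the structure of the provided make_embeddings.py):

import numpy as np
from scipy.sparse.linalg import svds

def ppmi_svd_embeddings(counts, k=50):
    # counts: a |V| x |V| array of (possibly weighted) word-context counts n_{w,c}
    total = counts.sum()                      # |D|
    n_w = counts.sum(axis=1, keepdims=True)   # target-word totals
    n_c = counts.sum(axis=0, keepdims=True)   # context-word totals

    # PMI(w,c) = log( n_{w,c} * |D| / (n_w * n_c) ); unobserved pairs give log(0)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (n_w * n_c))
    pmi[~np.isfinite(pmi)] = 0.0   # treat unobserved pairs as PMI = 0
    ppmi = np.maximum(pmi, 0.0)    # PPMI: clip negative values to 0

    # rank-k truncated SVD; the rows of U are k-dimensional word vectors
    U, S, Vt = svds(ppmi, k=k)
    return U, S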

Please grab the assignment’s skeleton code and data files (warning: large file, approximately 100MB). This archive is compressed with xz, and can be decompressed using 7-zip on Windows or the tar utility on OS X and Linux via tar xvf assignment6.tar.xz. Once you’ve decompressed the archive, you should see the following files of interest:

Data:

Code:

The remainder of the assignment will be using this data and code.

  1. [30 pts] Fill in the blank section of the code (clearly marked) in make_embeddings.py to modify the matrix variable PPMI (which currently just contains $n_{w,c}$) to contain the pointwise mutual information for each pair of words. (The two lines below that section convert the PMI matrix into the PPMI matrix by dropping all negative values.)

  2. It is common to evaluate word embeddings on analogy tasks of the form “X is to Y as Z is to __”.

    1. [10 pts] Let $e_x$ be the unit-length embedding for X, $e_y$ be the unit-length embedding for Y, and $e_z$ be the unit-length embedding for Z. Provide a scoring function for ranking words that likely complete this analogy using vector operations like addition, subtraction, norms, and dot product.

    2. [5 pts] We have provided the script query_embeddings.py to allow you to try querying your own analogies on your learned word embeddings from question 1.

      Try querying for “man is to king as woman is to __”. Do you get what you expect? Why?

      Come up with a few analogies of your own (for which all query words occur in our vocabulary, otherwise query_embeddings.py will error out). Qualitatively, how well do your word embeddings perform?

2. Evaluating Word Embeddings via an Analogy Task [30 pts]

Info: The data used here is a subset of the data used in the original word2vec paper and is available here.

Now let’s do a more formal evaluation of the PPMI SVD-based word embeddings. Like above, we can formulate the problem of analogy solving as a ranking problem and return a top-k list of words that might answer the analogy.

We could use an evaluation metric like accuracy to assess how well our embeddings answer analogy questions. (If we do so, it is common to eliminate any of the query words from the ranking.) Instead, however, we could compute the MRR (mean reciprocal rank) of the desired word in our ranked list of candidate words.
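
For reference, the mean reciprocal rank over a set $Q$ of analogy queries is simply the average of the per-query reciprocal ranks:

$$MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{rank_i}$$

where $rank_i$ is the position of the desired word in the ranked list for query $i$ (and $\frac{1}{rank_i}$ is taken to be 0 if the desired word does not appear in the list at all).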

  1. [10 pts] Why might we prefer to use MRR over accuracy? Give one scenario where MRR is more informative than accuracy.

    Why might we prefer accuracy over MRR? Give one scenario where accuracy is more informative than MRR.

  2. [5 pts] In analogies.py, modify the function compute_reciprocal_rank to compute the reciprocal rank of a word given a ranked list of words. (If the word does not occur in the list, return 0).

  3. The standard way of computing the embeddings using SVD is to let the embedding matrix $E$ be

    $$E = U_k \cdot \Sigma_k.$$

    A more general form of this is

    $$E = U_k \cdot \Sigma_k^p$$

    where $p \in [0,1]$ is some weight for the singular values (see the short sketch at the end of this question).

    Using analogies.py, compute the MRR for four different embedding methods:

    • SVD with $p = 1$ (default)
    • SVD with $p = 0.5$ (modify make_embeddings.py to accomplish this)
    • SVD with $p = 0$ (modify make_embeddings.py again)
    • GloVe (provided; run with --glove as the first argument instead of a filename for an SVD embedding)

    Report these values in a table in your writeup.

    1. [5 pts] How does performance vary for the SVD methods with the setting of $p$?

    2. [10 pts] Is the best performing method statistically significantly better than the method that gets second place? (Hint: use a $t$-test.)
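
As mentioned in question 3, here is a minimal sketch of the $\Sigma_k^p$ weighting (the variable names are hypothetical and depend on how make_embeddings.py stores its SVD factors):

def weighted_embeddings(U, S, p=1.0):
    # U: |V| x k matrix of left singular vectors; S: length-k vector of singular values
    # E = U_k * Sigma_k^p; p = 1 is the standard choice, p = 0 discards the singular values
    return U * (S ** p)

For the significance questions in this assignment, one option (our suggestion, not a prescribed procedure) is a paired $t$-test over per-query scores, e.g. scipy.stats.ttest_rel applied to the two methods' per-query reciprocal ranks.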

3. Evaluating Word Embeddings via a Similarity Task [25 pts]

Info: The queries used in this part of the assignment are a subset of the query inventory available here.

Another method for evaluating word embeddings is on similarity tasks. Queries take the form of a single word, and the goal is to return the most similar word to the query word.

While we could evaluate this using a metric like accuracy or MRR if we knew in advance the exact word we wanted to define as “most similar”, another more general method of performing an evaluation like this is to use a user study. Usually, this involves recruiting some experimental subjects and asking them to complete some task using your model. As a result, they produce something that can be used to assess the quality of a method. A very popular approach recently is to recruit these annotators via Amazon Mechanical Turk or other crowdsourcing platforms.

In similarity.py we have set up a user task for evaluating word embedding methods. For each word in a set of query words, we return the top most similar word (that is not the query word) for four different methods: SVD with $p = 1.0$, $p = 0.5$, $p = 0.0$, and GloVe. The user is asked to select the “best” word in the list of up to four words provided (which will be presented in random order). If two methods produce the same top word, it is only shown once.

Then, we can compute the “win ratio” for a given method as the number of times the user picked that method’s top word divided by the total number of queries presented to the user.

  1. [5 pts] Run similarity.py to simulate running the user study with you as the annotator. (You can’t really do this for research publications, but here we’re just trying to show you what that procedure might look like.) You’ll need to invoke it with the paths to your SVD embedding files in the following order: $p = 1.0$, $p = 0.5$, and $p = 0.0$.

    Report the win ratios reported for each method in a table in your report. Which method performed the best on this task? Is that what you expected?

  2. [10 pts] Is the best performing method statistically significantly better than the second-best?

  3. [5 pts] This user study setup has a potential problem when a user thinks two words in the list are equally good. (Did this happen to you?) Describe how you could address this problem by modifying the user study design.

  4. [5 pts] Another problem could occur when all of the top words are equally bad. (Did this happen to you?) Describe how you could address this problem by modifying the user study design.