## CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

# MP2: Word Embeddings

Notice: This assignment is due Friday, October 13th at 11:59pm (extended from the original due date of Tuesday, October 10th).

This is a programming assignment, so submission procedures are slightly different from those for regular written assignments. You will upload your code as well as your write-up (in either Markdown or PDF) to your git repository. Concrete instructions on how to do so are included in this assignment specification. Make sure to double-check that you can see your submission in your git repository!

In this assignment you will investigate a method for computing word embeddings based on singular value decomposition and compare it against GloVe, a state-of-the-art method based on matrix factorization.

Warning: This assignment requires programming using Python. You should install Python 3 on your system in order to do the assignment.

Please make sure you’ve completed the setup instructions well before the deadline, as we cannot guarantee that last-minute installation questions will be answered (we will prioritize other assignment questions, since the students asking them are, by definition, closer to finishing than you!).

### Setup Instructions

First, install Python 3 if you haven’t already. It should be available on Linux, OS X, and Windows. Instructions for installing Python 3 should be available on the Python website.

Next, you’ll need to install some supporting libraries using Python’s pip:

Info: On Windows, you may need to add the python install path and the Scripts sub-directory to your path.

Assuming you installed Python to the directory C:\Python36, this would look something like this in a command prompt:

#### EWS Instructions

You can use the EWS remote machines if you have trouble getting Python or the libraries above installed (we strongly encourage you to use your own machine if you can, however). Once you SSH into an EWS machine, you should be able to use Python 3 by running

and then install the required libraries with

### 1. Word Embeddings via Singular Value Decomposition [45 pts]

Info: This assignment follows closely the following paper:

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3:211–225. (pdf)

Recall that the goal of word embedding methods is to derive a low-dimensional continuous vector representation for words so that words that are syntactically or semantically related are close together in that vector space (and, thus, share a similar representation).

In lecture we discussed a few methods for doing this using probabilistic modeling (word2vec) and matrix factorization (GloVe). Following the idea of word embeddings resulting from factorizing a large, sparse matrix, we can formulate an embedding method via computing a rank-$k$ matrix factorization to approximate some large sparse matrix of interest.

In assignment 2 we asked you to compute the mutual information between words based on their co-occurrence within documents modeled using bit vectors. This measure allowed us to compute a score for each pair of words that captured how related they were. We will leverage this basic idea again, but this time we will compute embeddings using pointwise mutual information.

Pointwise mutual information, or PMI, is the (unweighted) term that occurs inside of the summation of mutual information and measures the correlation between two specific events. Specifically, PMI is defined as

$\displaystyle PMI(a, b) = \log \frac{p(a,b)}{p(a)p(b)}$

and measures the (log) ratio of the joint probability of the two events to the probability they would have if they were independent. Thus, PMI is high when the two events $a$ and $b$ co-occur with higher probability than would be expected if they were independent.

If we suppose that $a$ and $b$ are words, we can measure how likely we are to see $a$ and $b$ together, compared to what we would expect if they were unrelated, by computing their PMI under some model for the joint probability $p(a, b)$.

Let $D$ represent a collection of observed word-context pairs (with contexts being other words). We can construct $D$ by considering the full context of a specific word occurrence as the collection of all word occurrences that appear within a fixed-size window of length $L$ before and after it.

For a specific word $w_i$ in position $i$ in a large, ordered collection of words $w_1, w_2, \ldots, w_N$, we would have the context as $w_{i-L},\ldots,w_{i-1},w_{i+1},\ldots,w_{i+L}$ and could thus collect counts (a total of $2L$) of each of those words as appearing in the context of word $w_i$. We will refer to $w_i$ as the “target word” and the words appearing in the $L$-sized window around $w_i$ as “context words”.

Consider a sample corpus containing only one sentence:

The dog ran around the park.

We can construct $D$ by considering each word position $i$ and extracting the pairs $(w_i, w_{i+k})$ for $-L \leq k \leq L; k\neq 0$. In such a pair, we would call $w_i$ the “target word” and $w_{i+k}$ the “context word”.

For example, we would extract the following pairs for $i = 4$ ($w_i = \text{around}$) if we let our window size $L = 2$:

$\displaystyle (\text{around}, \text{dog}), (\text{around}, \text{ran}), (\text{around}, \text{the}), (\text{around}, \text{park}).$

Similarly, for $i = 5$, we would extract the following pairs:

$\displaystyle (\text{the}, \text{ran}), (\text{the}, \text{around}), (\text{the}, \text{park}).$
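The pair-extraction procedure above can be sketched in a few lines of Python (an illustrative snippet, not part of the assignment's provided code):

```python
def context_pairs(words, L):
    """Extract (target, context) pairs with a window of L words on each side."""
    pairs = []
    for i, target in enumerate(words):
        # Context positions: up to L before and L after position i.
        for j in range(max(0, i - L), min(len(words), i + L + 1)):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

words = "the dog ran around the park".split()
pairs = context_pairs(words, L=2)
# Pairs whose target is "around" (position i = 4 in the 1-indexed sentence):
print([p for p in pairs if p[0] == "around"])
# [('around', 'dog'), ('around', 'ran'), ('around', 'the'), ('around', 'park')]
```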

Let $n_{w,c}$ represent the number of times we observe word type $c$ as a context word of word type $w$. We can then define $n_w = \sum_{c'} n_{w,c'}$ as the number of times we see the target word $w$ in the collection of pairs $D$ and $n_c = \sum_{w'} n_{w',c}$ as the number of times we see the context word $c$ in the collection of pairs $D$. (A common modification is to compute these counts by weighting the contexts by some function of their distance from the target word. In this assignment, we’ll use the simple harmonic function $\frac{1}{d}$, where $d$ is the distance between the target word and a particular context word.)
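The harmonic weighting can be sketched as follows (again purely illustrative; the provided make_embeddings.py has its own implementation):

```python
from collections import defaultdict

def weighted_counts(words, L):
    """n_{w,c} with each context occurrence weighted by 1/d, where d is its
    distance from the target word (the harmonic weighting described above)."""
    n_wc = defaultdict(float)
    for i, target in enumerate(words):
        for j in range(max(0, i - L), min(len(words), i + L + 1)):
            if j != i:
                n_wc[(target, words[j])] += 1.0 / abs(i - j)
    return n_wc

n_wc = weighted_counts("the dog ran around the park".split(), L=2)
# "dog" is 2 positions away from "around", so it contributes 1/2:
print(n_wc[("around", "dog")])   # 0.5
D = sum(n_wc.values())           # |D|: total (fractional) mass of pairs
```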

We can then define the joint probability of a word and a context word as

$\displaystyle p(w, c) = \frac{n_{w,c}}{|D|}$

where $|D|$ is simply the total number of word-context occurrences we see (which might be fractional if we weight counts as mentioned above). Similarly, we can define

$\displaystyle p(w) = \frac{n_w}{|D|}$

and

$\displaystyle p(c) = \frac{n_c}{|D|}$

and thus the PMI between a word $w$ and context word $c$ is

$\displaystyle PMI(w, c) = \log \frac{p(w,c)}{p(w)p(c)} = \log \frac{n_{w,c} \cdot |D|}{n_w \cdot n_c}.$

If we compute the PMI between all pairs of words in our vocabulary $V$, we will arrive at a large, real-valued matrix. However, some of the values of this matrix will be $\log 0 = -\infty$ wherever the word-context pair $(w,c)$ is unobserved. To remedy this, we could simply define a modified PMI that is equal to 0 when $n_{w,c} = 0$. An alternative approach is to instead compute the positive pointwise mutual information (PPMI), which truncates all negative PMI values to 0:

$\displaystyle PPMI(w,c) = \max(0, PMI(w,c))$

The resulting PPMI matrix is very sparse and can be approximated well via a rank-$k$ singular value decomposition. We can use the resulting left singular vectors ($U$) of dimension $k$ as our vector representations for each of the words in our vocabulary.
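As a dense-matrix sketch of the formulas above (the assignment's make_embeddings.py works with a sparse matrix and its own variable names, so treat this only as an illustration):

```python
import numpy as np

def ppmi_matrix(counts):
    """PPMI from a dense count matrix with counts[w, c] = n_{w,c}."""
    D = counts.sum()                      # |D|
    n_w = counts.sum(axis=1)              # row sums: n_w
    n_c = counts.sum(axis=0)              # column sums: n_c
    with np.errstate(divide="ignore"):    # unobserved pairs give log 0 = -inf
        pmi = np.log(counts * D) - np.log(np.outer(n_w, n_c))
    return np.maximum(pmi, 0.0)           # truncate negatives (and -inf) to 0
```

A rank-$k$ truncated SVD of the resulting matrix (for example via scipy.sparse.linalg.svds) then yields the $U_k$ and $\Sigma_k$ factors.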

You should have a project on GitLab called YOUR_USERNAME_HERE/cs510-fa17-mp2. Once you have cloned your specific repository, you should see the following files of interest:

Data:

• analogies.txt, which is a newline delimited file of analogies (A is to B as C is to D) formatted as a b c d;
• similarity.txt, which is a newline delimited file of words;

Code:

• make_embeddings.py, which is a script to compute word embeddings using rank-$k$ sparse SVD;
• query_embeddings.py, which is a simple demo script for you to interactively play around with your generated word embeddings;
• analogies.py, which is a simple script to compute the MRR for an analogy solving task;
• similarity.py, which is a simple script to perform a user evaluation on a word similarity task

The remainder of the assignment will be using this data and code.

1. [24 pts] Fill in the blank section of the code (clearly marked) in make_embeddings.py to modify the matrix variable PPMI, which currently just contains PPMI[w, c] $= n_{w,c}$ (that is, the rows are for $w$ and the columns are for $c$), so that it contains the pointwise mutual information for each pair of words. (The two lines below the marked section then convert the PMI matrix into the PPMI matrix by dropping all negative values.)

2. [6 pts] The standard way of computing the embeddings using SVD is to let the embedding matrix $E$ be
$\displaystyle E = U_k \cdot \Sigma_k.$

A more general form of this is

$\displaystyle E = U_k \cdot (\Sigma_k)^p$

where $p \in [0,1]$ is some weight for the singular values.
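In numpy, this scaling amounts to a single broadcasted multiply (`U_k` and `S_k` are illustrative names for the truncated SVD factors, not variables from the provided script):

```python
import numpy as np

def embed(U_k, S_k, p):
    """Scale column i of U_k by sigma_i ** p (broadcasts across rows)."""
    return U_k * (S_k ** p)

U = np.array([[0.6, 0.8], [0.8, -0.6]])   # toy left singular vectors
S = np.array([4.0, 1.0])                  # toy singular values
# p = 0 gives sigma ** 0 = 1, so the embedding is just U_k itself:
assert np.allclose(embed(U, S, 0.0), U)
```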

Using GitLab CI, create and download a separate embedding matrix for $p = 1$ (default), $p = 0.5$, and $p = 0$ by modifying the value of $p$ in make_embeddings.py.

Once you’ve filled in the code in make_embeddings.py, you can use it to generate an embedding matrix by using GitLab CI (just like we generated the tables for MP1). If you go to your repository, and then click on Pipelines in the sidebar, you should see a familiar page. Run the make_embeddings job, and it should produce a file called svd_embeddings.npy. Please make a separate file for the above values of $p$ and save those files in your local directory for use later.

3. It is common to evaluate word embeddings on analogy tasks of the form “X is to Y as Z is to __”.

1. [10 pts] Let $e_x$ be the unit-length embedding for X, $e_y$ be the unit-length embedding for Y, and $e_z$ be the unit-length embedding for Z. Using vector operations such as addition, subtraction, norms, and dot products, provide a scoring function for ranking words by how likely they are to complete this analogy.

2. [5 pts] We have provided the script query_embeddings.py to allow you to try querying your own analogies on the word embeddings you generated in part 2 above.

Try querying for “man is to king as woman is to __”. Do you get what you expect? Why?

Come up with a few analogies of your own (for which all query words occur in our vocabulary, otherwise query_embeddings.py will error out). Qualitatively, how well do your word embeddings perform?

### 2. Evaluating Word Embeddings via an Analogy Task [30 pts]

Info: The data used here is a subset of the data used in the original word2vec paper and is available here.

Now let’s do a more formal evaluation of the PPMI SVD-based word embeddings. As above, we can formulate analogy solving as a ranking problem and return a top-$k$ list of words that might answer the analogy.

We could use an evaluation metric like accuracy to assess how well our embeddings answer analogy questions. (If we do so, it is common to eliminate any of the query words from the ranking.) Instead, however, we could compute the MRR (mean reciprocal rank) of the desired word in our ranked list of candidate words.
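As a small illustration of the metric (the rankings below are made up):

```python
def reciprocal_rank(answer, ranked):
    """1 / (1-indexed rank of answer in ranked), or 0 if it is absent."""
    return 1.0 / (ranked.index(answer) + 1) if answer in ranked else 0.0

# MRR is the mean of the per-query reciprocal ranks:
rr = [reciprocal_rank("queen", ["king", "queen", "prince"]),  # 1/2
      reciprocal_rank("paris", ["paris", "lyon"]),            # 1
      reciprocal_rank("cat",   ["dog", "fish"])]              # 0
mrr = sum(rr) / len(rr)                                       # 0.5
```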

1. [10 pts] Why might we prefer to use MRR over accuracy? Give one scenario where MRR is more informative than accuracy.

Why might we prefer accuracy over MRR? Give one scenario where accuracy is more informative than MRR.

2. [5 pts] In analogies.py, modify the function compute_reciprocal_rank to compute the reciprocal rank of a word given a ranked list of words. (If the word does not occur in the list, return 0).

3. Using analogies.py, compute the MRR for four different embedding methods:

• SVD with $p = 1$
• SVD with $p = 0.5$
• SVD with $p = 0$
• GloVe (provided; run with glove as the filename argument)

Report these values in a table in your writeup.

1. [5 pts] How does performance vary for the SVD methods with the setting of $p$?

2. [5 pts] Fill in the code at the bottom of analogies.py to run a $t$-test when there are two files/method names provided as arguments to the program.

Note that recip_ranks[0] contains a list of the reciprocal rank scores for every query for the first method, and recip_ranks[1] contains a list of the reciprocal rank scores for every query for the second method.
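A paired $t$-test on per-query scores can be run with scipy (the numbers below are made up purely for illustration):

```python
from scipy import stats

# Hypothetical per-query reciprocal ranks for two methods evaluated on the
# same queries (so the samples are paired).
method_a = [1.0, 0.5, 1.0, 0.25, 1.0]
method_b = [0.5, 0.5, 0.33, 0.25, 0.5]
t_stat, p_value = stats.ttest_rel(method_a, method_b)
# Conclude a significant difference when p_value falls below your chosen
# significance level (commonly 0.05).
```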

3. [5 pts] Based on your $t$-test, would you conclude that the best performing method of the four is statistically significantly better than the second best method?

### 3. Evaluating Word Embeddings via a Similarity Task [25 pts]

Info: The queries used in this part of the assignment are a subset of the query inventory available here.

Another method for evaluating word embeddings is on similarity tasks. Queries take the form of a single word, and the goal is to return the most similar word to the query word.

While we could evaluate this using a metric like accuracy or MRR if we knew in advance the exact word we wanted to define as “most similar”, a more general method of performing such an evaluation is a user study. Usually, this involves recruiting some experimental subjects and asking them to complete some task using your model; as a result, they produce something that can be used to assess the quality of a method. A popular approach in recent years is to recruit these annotators via Amazon Mechanical Turk or other crowdsourcing platforms.

In similarity.py we have set up a user task for evaluating word embedding methods. For each word in a set of query words, we return the single most similar word (that is not the query word) for four different methods: SVD with $p = 1.0$, $p = 0.5$, $p = 0.0$, and GloVe. The user is asked to select the “best” word in the list of up to four words provided (which will be presented in random order). If two methods produce the same top word, it is only shown once.

Then, we can compute the “win ratio” for a given method as the number of times the user picked that method’s top word divided by the total number of queries presented to the user.

1. [5 pts] Fill in the code at the bottom of similarity.py to perform a $t$-test for the best performing method against the second best performing method.

Note that each wins[i][j] is 0 if method i lost on query j or 1 if method i won on query j. The methods are SVD ($p=1$), SVD ($p=0.5$), SVD ($p=0$), and GloVe, in that order.

2. [5 pts] Run similarity.py to simulate running the user study with you as the annotator. (You can’t really do this for research publications, but here we’re just trying to show you what that procedure might look like.) You’ll need to invoke it with the paths to your SVD embedding files in the following order: $p=1.0$, $p=0.5$, and $p=0.0$.

Report the win ratios reported for each method in a table in your report. Which method performed the best on this task? Is that what you expected?

3. [5 pts] According to your $t$-test, is the best performing method statistically significantly better than the second-best?

4. [5 pts] This user study setup has a potential problem when a user thinks two words in the list are equally good. (Did this happen to you?) Describe how you could address this problem by modifying the user study design.

5. [5 pts] Another problem could occur when all of the top words are equally bad. (Did this happen to you?) Describe how you could address this problem by modifying the user study design.

At this point, you should triple-check that everything needed for grading is part of your repository, including your write-up itself. I strongly encourage you to try to use Markdown for this (and use GitLab’s built-in support for LaTeX for any math), but as a fallback option you are allowed to commit a typeset PDF (no scanned documents, please).

In particular, you need to have the following things present in your repository:

1. your writeup as a .md or .pdf file,
2. make_embeddings.py should be updated with your code to compute the PMI matrix,
3. analogies.py should be updated to compute reciprocal rank properly, as well as contain a proper $t$-test for comparing two methods, and
4. similarity.py should be updated to contain a proper $t$-test for comparing the best and second best method.