CS 510: Advanced Information Retrieval (Fall 2017)

Instructor: ChengXiang Zhai

Assignment #5: Word Embeddings

Notice: This assignment is due Thursday, October 19th at 11:59pm.

Please submit your solutions via Compass. You should submit your assignment as a typeset PDF. Please do not include scanned or photographed equations as they are difficult for us to grade.

Warning: Please answer as concisely as possible and limit each answer to a maximum of three sentences.

1. Brown Clusters [35 pts]

  1. [5 pts] List three applications of Brown clusters.

  2. [5 pts] When performing Brown clustering, we merge the pair of classes whose merge minimizes the loss of mutual information. Intuitively speaking, what kind of words are merged together into the same class?

  3. [5 pts] We discussed that Brown clustering does not directly specify at what level of the tree it is best to stop merging clusters (i.e. the total number of clusters). In a machine learning setting, how could you find the optimal number of clusters experimentally?

  4. [5 pts] How can the information that is learned in an unsupervised way in Brown clusters be utilized in a supervised learning setting?

  5. [5 pts] How is the cut-off depth (i.e. the number of clusters) related to the generality of the word clusters? Describe one scenario where it would be useful to have more general clusters. Describe one scenario where it would be useful to have more specific clusters.

  6. [10 pts] Show that maximizing the log likelihood of the bigram class model is the same as maximizing the mutual information of adjacent class pairs in Brown clustering (both quantities are written out below for reference).
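
For reference, one standard way to write the two quantities (notation assumed here: $C(w_i)$ denotes the class of word $w_i$, and class probabilities are relative frequencies estimated from the corpus $w_1, \dots, w_n$):

    \log L(C) = \sum_{i=2}^{n} \log \big( P(C(w_i) \mid C(w_{i-1})) \, P(w_i \mid C(w_i)) \big)

    I(C) = \sum_{c, c'} p(c, c') \log \frac{p(c, c')}{p(c)\, p(c')}

where $p(c, c')$ is the probability that a word in class $c$ is immediately followed by a word in class $c'$.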

2. Word Embeddings (Theory) [35 pts]

  1. [5 pts] What is the distributional hypothesis and how is it incorporated into Skip-Gram and CBOW?

  2. [10 pts] In negative sampling we try to learn a model by contrasting actual data with samples from a noise distribution. In the case of word embeddings, what is the actual data and what is the noise distribution? How can we obtain samples for each? (A small sampling sketch is given after this question list.)

  3. The following questions relate to subsampling of frequent words: In “Distributed Representations of Words and Phrases and their Compositionality” by Mikolov et al., the probability of a word being discarded is defined as $P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$, where $f(w_i)$ is the frequency of word $w_i$ and $t$ is a chosen threshold, typically around $10^{-5}$. (A numerical sketch of this formula is given after this question list.)

    1. [10 pts] How does the subsampling affect very frequent and infrequent words? How does the subsampling affect the ranking of words when ordered by frequency?

    2. [10 pts] How does the subsampling affect the size of the context window?
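
For the negative-sampling question above, a minimal sketch of how positive and negative examples can be obtained (hypothetical Python/NumPy code with made-up counts; the noise distribution shown, the unigram distribution raised to the 3/4 power, is the choice used by Mikolov et al. and is included purely as an illustration):

    import numpy as np

    # Hypothetical toy vocabulary with raw corpus counts (illustration only).
    vocab = ["the", "cat", "sat", "on", "mat"]
    counts = np.array([5000.0, 120.0, 80.0, 3000.0, 60.0])

    # Noise distribution: unigram counts raised to the 3/4 power, renormalized.
    noise = counts ** 0.75
    noise /= noise.sum()

    rng = np.random.default_rng(0)

    # A positive example is an observed (center, context) pair from the corpus;
    # negative examples pair the same center word with words drawn from the
    # noise distribution and are labeled as non-co-occurring.
    center, context = "cat", "sat"   # assumed to be observed in the data
    k = 5                            # negatives drawn per positive pair
    negatives = rng.choice(vocab, size=k, p=noise)
    print(f"positive: ({center}, {context})  negatives: {list(negatives)}")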
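
For the subsampling questions above, a small numerical sketch of the discard probability $P(w_i) = 1 - \sqrt{t / f(w_i)}$ (hypothetical Python code; the relative frequencies are made up for illustration):

    import math

    t = 1e-5  # threshold, as suggested by Mikolov et al.

    # Hypothetical relative frequencies f(w_i) (fraction of all corpus tokens).
    frequencies = {"the": 5e-2, "learning": 1e-4, "embeddings": 1e-6}

    for word, f in frequencies.items():
        # Discard probability; negative values (very rare words) are clipped to 0.
        p_discard = max(0.0, 1.0 - math.sqrt(t / f))
        print(f"{word:<12} f={f:.0e}  P(discard)={p_discard:.3f}")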

3. Word Embeddings (Practice) [30 pts]

Imagine you trained word embeddings on a corpus and obtained the following vector space with words $w_i$ and dimensions $w_{ij}$:

$w_i$       $w_{i1}$  $w_{i2}$  $w_{i3}$  $w_{i4}$  $w_{i5}$  $w_{i6}$  $w_{i7}$  $w_{i8}$  $w_{i9}$  $w_{i10}$
the 0.131 0.001 0.023 0.918 0.991 0.912 0.787 0.675 0.787 0.987
cat 0.911 0.891 0.912 0.016 0.099 0.189 0.777 0.776 0.853 0.992
for 0.112 0.009 0.032 0.819 0.971 0.932 0.788 0.677 0.777 0.988
data 0.954 0.919 0.881 0.812 0.901 0.990 0.012 0.002 0.014 0.909
mouse 0.912 0.881 0.922 0.019 0.100 0.199 0.011 0.003 0.016 0.898
it 0.142 0.010 0.026 0.820 0.917 0.923 0.781 0.611 0.722 0.977
dog 0.922 0.882 0.931 0.011 0.101 0.193 0.769 0.762 0.841 0.989
also 0.121 0.004 0.021 0.919 0.981 0.917 0.790 0.617 0.712 0.969
computer 0.912 0.923 0.899 0.853 0.910 0.991 0.022 0.010 0.016 0.912
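
The nearest-neighbor questions below use cosine similarity; as a reminder, it can be computed from two rows of the table as in the following minimal sketch (hypothetical Python/NumPy code; the two example vectors are made up and are not rows of the table):

    import numpy as np

    def cosine_similarity(u, v):
        """Cosine of the angle between two embedding vectors."""
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Two made-up 10-dimensional vectors, purely for illustration.
    u = np.array([0.9, 0.9, 0.9, 0.0, 0.1, 0.2, 0.8, 0.8, 0.8, 1.0])
    v = np.array([0.1, 0.0, 0.0, 0.9, 1.0, 0.9, 0.8, 0.6, 0.7, 1.0])
    print(cosine_similarity(u, v))
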
  1. [5 pts] What are the three nearest neighbors of ‘cat’? List the neighbors with their cosine similarity scores.

  2. [5 pts] What are the three nearest neighbors of ‘computer’? List the neighbors with their cosine similarity scores.

  3. [5 pts] What are the three nearest neighbors of ‘the’? List the neighbors with their cosine similarity scores.

  4. [10 pts] If you were to cluster the words into three semantically coherent clusters, which words would you cluster together? List each cluster as a set of words and describe their semantics briefly (hint: the clusters don’t have to be disjoint). Does this clustering correspond to the clustering that can be obtained from the vector space, e.g. by using an unsupervised clustering method such as k-means?

  5. [5 pts] Do you see a problem with methods that perform hard clustering (e.g. k-means) as compared to fuzzy clustering (e.g. EM clustering) in the context of word ambiguity?