Dataset
Take a large corpus of text and convert it into (input, output) pairs using a rolling window of fixed size.
- Example dataset: "the quick brown fox jumped over the lazy dog". With window size m = 1 (one word on each side of the center word), the windows are "the quick brown", "quick brown fox", "brown fox jumped", etc. (a minimal sketch of this windowing follows).
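A minimal sketch of this windowing in Python (the function name make_pairs and the parameter m are illustrative, not from a particular library):

```python
def make_pairs(tokens, m=1):
    """Yield (context_words, center_word) for each position in the corpus."""
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - m):i] + tokens[i + 1:i + 1 + m]
        yield (context, center)

tokens = "the quick brown fox jumped over the lazy dog".split()
for context, center in make_pairs(tokens):
    print(context, "->", center)
# ['quick'] -> the
# ['the', 'brown'] -> quick
# ['quick', 'fox'] -> brown
# ...
```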
Goal
Train a distributed vector representation of words. Why? Unlike one-hot vectors, distributed representations place similar words close together in vector space. To learn these representations, we'll construct a toy prediction task and train on it in an unsupervised setting.

Word2Vec

Continuous Bag-of-Words (CBOW) Model
Using the outer (context) words, predict the center word.
- Outer word vectors are summed.
- Example: the dataset would be ([the, brown], quick),
([quick, fox], brown),
([brown, jumped], fox),
... (a minimal sketch of one CBOW step follows).
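A minimal sketch of one CBOW prediction step, assuming a toy vocabulary and small random U (outer) and V (center) matrices; all names and sizes here are illustrative:

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog"]
idx = {w: i for i, w in enumerate(vocab)}
d = 4                                     # embedding dimension (toy size)
rng = np.random.default_rng(0)
U = rng.normal(size=(len(vocab), d))      # outer-word vectors u_i (rows)
V = rng.normal(size=(len(vocab), d))      # center-word vectors v_i (rows)

def cbow_probs(context, U, V):
    """Sum the outer-word vectors, then softmax a score for every center word."""
    h = sum(U[idx[w]] for w in context)   # summed context representation
    scores = V @ h                        # dot product v_w . h for every word w
    e = np.exp(scores - scores.max())     # numerically stable softmax
    return e / e.sum()

p = cbow_probs(["the", "brown"], U, V)
print(vocab[int(p.argmax())])             # the model's guess for the center word
```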

Skip-Gram Model
Using the center word, predict the outer words.
- Example: the dataset would be
(quick, the),
(quick, brown),
(brown, quick),
(brown, fox),
... (a minimal sketch of pair generation follows).
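A minimal sketch of skip-gram pair generation (the name skipgram_pairs is illustrative):

```python
def skipgram_pairs(tokens, m=1):
    """Yield one (center, outer) pair for each context word in the window."""
    for i, center in enumerate(tokens):
        for j in range(max(0, i - m), min(len(tokens), i + m + 1)):
            if j != i:
                yield (center, tokens[j])

print(list(skipgram_pairs("the quick brown fox".split())))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]
```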
U Matrix (Outer words)
Outer word lookup matrix: row ui is the outer-word vector for word i.
u0 may correspond to "aardvark", u1 to "a", un to "zebra", etc.
V Matrix (Center words)
Center word lookup matrix: row vi is the center-word vector for word i.
v0 may correspond to "aardvark", v1 to "a", vn to "zebra", etc.
Training U & V Matrices
We train the U and V matrices jointly, learning vectors for all of the words in our vocabulary.
- To compute the final word vector for each word wi, we simply sum: wi = ui + vi (see the two-line sketch below).
- Alternatively, we could concatenate ui and vi.
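Continuing the toy sketch above, both combination options are one line each:

```python
W_sum = U + V                             # w_i = u_i + v_i (summed)
W_cat = np.concatenate([U, V], axis=1)    # w_i = [u_i ; v_i] (concatenated, dim 2d)
```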
Objective Function - Softmax Cross-Entropy Loss
For each word t = 1 ... T, predict the surrounding words within a window of size m. We'll minimize the average negative log-likelihood:

$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log p(w_{t+j} \mid w_t; \theta)$$
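A minimal sketch of this objective for the skip-gram case, reusing the toy U, V, idx, and skipgram_pairs from the sketches above (it averages over pairs rather than positions t, a small simplification):

```python
def naive_softmax_loss(tokens, U, V, m=1):
    """Average negative log-likelihood of outer words given center words."""
    total, n = 0.0, 0
    for center, outer in skipgram_pairs(tokens, m):
        scores = U @ V[idx[center]]               # u_w . v_c for every word w
        e = np.exp(scores - scores.max())         # stable softmax terms
        total -= np.log(e[idx[outer]] / e.sum())  # -log p(o | c)
        n += 1
    return total / n

print(naive_softmax_loss("the quick brown fox".split(), U, V))
```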
One-Hot Vectors
Vectors that contain a single 1 along one dimension and zeros everywhere else. Example: in a 5-word vocabulary, the third word would be [0, 0, 1, 0, 0].
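For instance, continuing with the toy vocab above (np.eye is a common way to build these):

```python
one_hot = np.eye(len(vocab))[idx["brown"]]
print(one_hot)   # [0. 0. 1. 0. 0. 0. 0. 0.]: a 1 only at the index of "brown"
```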
p(o | c)
Uses the softmax function, which converts the scores for all classes into a probability distribution:

$$p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in \text{Vocab}} \exp(u_w^\top v_c)}$$

Here o indexes the outer (context) word and c the center word.
Why Dot Product?
- Dot products are larger when the two vectors are more similar to each other (a quick numeric illustration follows this list).
- Thus, this model pushes the vectors uo and vc to be similar to each other for words that co-occur.
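A quick numeric illustration (the vectors are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 0.0])
b = np.array([1.0, 2.1, 0.1])   # points in nearly the same direction as a
c = np.array([-1.0, 0.5, 3.0])  # points in a very different direction
print(a @ b, a @ c)             # 5.2 vs 0.0: the similar pair scores higher
```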
Objective Function - Negative Sampling Loss
- Problem with softmax cross-entropy: calculating p(o | c) is expensive. Specifically, calculating the denominator of the softmax requires a pass over the entire vocabulary!
- Fix: train a binary logistic classifier to separate the true (center, outer) pair from K randomly sampled "negative" words, so each update touches only K + 1 words instead of the whole vocabulary. For one pair, we minimize:

$$J_{\text{neg}}(u_o, v_c) = -\log \sigma(u_o^\top v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^\top v_c)$$

where σ is the sigmoid function and the K negatives u_k are drawn from a noise distribution over the vocabulary (a minimal sketch follows).
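A minimal sketch of this loss for one (center, outer) pair, reusing the toy U, V, idx, vocab, and rng above. Uniform negative sampling is a simplification here; word2vec samples from the unigram distribution raised to the 3/4 power:

```python
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, outer, U, V, K=5):
    """-log sigma(u_o . v_c) minus sum over K negatives of log sigma(-u_k . v_c)."""
    v_c = V[idx[center]]
    loss = -np.log(sigmoid(U[idx[outer]] @ v_c))  # push the true pair together
    for k in rng.choice(len(vocab), size=K):      # simplification: uniform noise
        loss -= np.log(sigmoid(-U[k] @ v_c))      # push sampled negatives apart
    return loss

print(neg_sampling_loss("quick", "brown", U, V))
```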

