BoW to BERT

Word vectors have evolved over the years to know the difference between “record the play” vs “play the record”. They have evolved from a one-hot world where every word was orthogonal to every other word, to a place where word vectors morph to suit the context. Slapping a BoW on word vectors is the usual way to build a document vector for tasks such as classification. But BERT does not need a BoW as the vector shooting out of the top [CLS] token is already primed for the specific classification objective…

Natural Language Processing (NLP) has seen a renaissance over the past decade. There is a lot of excitement over embeddings, transformers, and language models that can understand and work with speech/text as people do. The progress has been nothing less than phenomenal. Functionality like summarizing documents, machine translation, completing sentences, and conversing is now being attempted with some success. In fact it has been an evolution from NLP to NLU (Natural Language Understanding). BERT (Bidirectional Encoder Representations from Transformers) model due to Devlin et. al. happens to be the current leader in this space.

Traditional NLP makes no bones about understanding the words/text that it is processing. It does not even try. But this does not make it entirely useless. In fact for classifying generic documents one-hot word vectors tied together with a BoW (Bag-of-Words) have not done badly over the years. Sure, understanding the semantics/context of words and classifying the documents accordingly would be better. That is what new language models like BERT bring to the table – albeit with some expense.

The objective of this post is to look at the evolution of word vectors from one-hot to contextual. We start with words with no friends, as one-hot vectors that are the basis of an orthogonal word space as large as the size of the vocabulary. We move to word embeddings that incorporate co-occurrence statistics enabling words to have a fixed friends circle. And finally to BERT embeddings that account for context as well thereby letting the words make new friends or unfriend the current ones depending on the situation. Friendship is measured with cosine similarity. We pull the same & similar words out of different sentences/contexts and see whether the cosine similarity of their corresponding word vectors jives with our understanding of the meaning of those words in those sentences. That is basically the gist of this post.

Finally, the focus here is on what these advances in word embeddings mean to us, how we can apply and benefit from them, rather than on the technical details of how those advances have materialized. We go through some code snippets here but the complete code to reproduce the results is on github.

We start with a brief review of BoW before getting to word vectors. We will need it in the upcoming posts as well to build document vectors from one-hot and fastText word vectors. In the next post in this series, we will be evaluating such document vectors against BERT [CLS] token vector for several classification tasks.

1. Bag of Words

In the Bag-of-Words (BoW) approach the document vector is a weighted sum of the numerical vectors of the words making up the document. The weight can simply be the frequency count of that word in that document, a tf-idf value, or other variations.

Equation 1: The BoW vector for a document is a weighted sum of word-vectors When w_i is one-hot then p = N. When w_i is obtained from fastText, Glove, BERT etc… p << N

A glaring shortcoming of the BoW vectors clearly is that the order of words in the document makes no difference as the following image shows.

2. Word Vectors

There are options for the word vectors w_i in Equation 1. Traditional NLP started with using one-hot vectors, while the recent (well starting 2003) entrants experimented with others in order to address the shortcomings of these one-hot vectors.

2.1 Long and one-hot

In traditional NLP, each word is a vector orthogonal to every other word. The word vector length p is equal to N, the dimensionality of the word space. The ith word’s vector has 1 at ith location and 0 every where else. Hence the name one-hot vectors. Simply, the ith word vector is the basis vector for the ith dimension in this word space of N dimensions.

A document then is simply a point/vector in this N dimensional word-space with { W^j_i } as the coordinates. In fact W^j_i in Equation 1 can be the frequency count of the ith word in the jth document or a variation of it like tf-idf. The following figure demonstrates the shred-bag-tag operation that turns a document into a vector in word-space.

Figure 2. In word space, a BoW vector for a document is a point. It is a weighted (the word count in this case) sum of the one-hot word vectors making up the document.

Some issues with one-hot word vectors

Too Long: They are as long as the size of the vocabulary in the text corpus being processed. This is computational disadvantage
Meaningless Tokens: Words are treated as just tokens with no meaning and no relationships with other words. You can replace a word good in the corpus with say some junk word like gobbledeegook and BoW does not care – so long as this new word gobbledeegook is not already present in the corpus. In fact you can replace each of the N words in the corpus with some random non-conflicting word, and BoW will give you the same document vectors.
No Context/Position Awareness: There is only one vector per word and every word vector is orthogonal to all others. So there is no question of context/position dependent word vector, or relationships with other words. But in reality we know that words are complex beings.
- There are synonyms and antonyms of various shades for a given word. good/better/best & good/bad/worse for example
- Position dependent meaning for the same word in a sentence. “record the play” vs “play the record“. The meanings for play and record change as they switch places
- Polysemy – bright can mean shining, or intelligent for example
- Homophony – run can be the noun run as in baseball, or the verb run

2.2 Short, dense, and fixed

The efforts for better numerical vectors for words started with word2vec in 2013 and soon were followed by Glove and fastText. We use fastText word embeddings in the examples here, but the conclusions apply equally well to others. All of these word embeddings are derived based on Distributional Hypothesis that states:

semantically related words will have similar co-occurrence patterns with other words

According to this hypothesis, if two words generally keep the same company (of other words), then those two words are semantically related. The algorithms utilize vast amounts of text such as wikipedia to figure out the co-occurrence patterns for each word against all other words. The numerical vector representation thus obtained for a word encodes this as part of the optimization process. The word vectors obtained here are an improvement over one-hot vectors in two respects.

2.2.1 Word vectors are shorter now

The length p of the word vectors obtained here is much, much smaller than N. While N (the length of the one-hot word vectors) can run into hundreds of thousands, p is more like 50 or 100 or 300. This is a huge computational advantage.

2.2.2 Words have friends now

In the one-hot world all words are independent of each other – not a useful scenario. The word good is orthogonal to both bad and better. The word vectors obtained under the distributional hypothesis remedy this somewhat. They enable words to relate to each other somewhat mimicking an understanding of text.

Consider the words – holiday, vacation, and paper for example. The words holiday and vacation occur together with other words such as travel, beach etc… They share a similar company of words – so they would have similar co-occurrence patterns with other words. So their word vectors reflect this by being more similar/parallel. More than what? More similar/parallel than they are with a word like paper. This aligns with our natural understanding of these words as well. Reading in the 300-dim (p = 300) word vectors from fastText we can readily compute the cosine similarity for these words. Running fasttext_word_similarity.py with

pipenv run python ./fasttext_word_similarity.py holiday vacation paper

shows a much larger similarity between holiday and vacation as expected.

Cosine Similarity: holiday & vacation : 0.7388389
Cosine Similarity: holiday & paper : 0.2716892
Cosine Similarity: vacation & paper : 0.27176374

From the perspective of a larger task such as the classification of documents, it is NOT helpful to have all word vectors as orthogonal. The BoW document vectors are built from word vectors as in Equation 1. Two sentences with similar but different words will exhibit zero cosine similarity when one-hot word vectors are used. But a non-zero similarity with fastText word vectors. For example, the sentence “have a fun vacation” would have a BoW vector that is more parallel to “enjoy your holiday” compared to a sentence like “study the paper“. Running fasttext_sentence_similarity.py we see a larger cosine similarity for the first two sentences.

pipenv run python ./fasttext_sentence_similarity.py

# words not found in fasttext.. 0
Cosine Similarity: enjoy your holiday & have a fun vacation : 0.72311985
Cosine Similarity: enjoy your holiday & study the paper : 0.5743288
Cosine Similarity: have a fun vacation & study the paper : 0.51478416

2.2.3 But still, no context awareness

The issues associated with context and position of the word are not addressed by distributional hypothesis. This was the last shortcoming we listed in Section 2.1 with one-hot word vectors. Even while a word has advantageously gained some relationships with other words, a word is still represented by a single same vector, no matter where it appears in text.

For example the word pitch in cricket is different from pitch in music. The word pitch in cricket keeps a very different company compared to pitch in music. So they really do NOT share a lot of common friends, and so should have different representations. They would, if the training corpus for cricket pitch came from the sports section, and the training corpus for the music pitch came from the arts section. But that is not how it is and such an approach will not scale, given the variety of contexts and words. The embedding obtained for a word here is an average over all contexts it appears in and so loses context… This leads to the latest crop of language models that BERT belongs to.

3. Short, dense, and context sensitive

The language modeling tools such as ELMO, GPT-2 and BERT allow for obtaining word vectors that morph knowing their place and surroundings. Refer to the excellent series of articles by Jay Alammar The Illustrated BERT, ELMO, and co. (How NLP Cracked Transfer Learning, The Annotated Transformer etc… for insights into how this has been achieved. We jump directly to using BERT to see for ourselves that it generates context, and position aware vectors for words that make some sense.

The pre-trained BERT models can be downloaded and they have scripts to run BERT and get the word vectors from any and all layers. The base case BERT model that we use here employs 12 layers (transformer blocks) and yields word vectors with p = 768. The script getBertWordVectors.sh below reads in some sentences and generates word embeddings for each word in each sentence, and from every one of 12 layers.

#!/bin/bash
# Extract BERT word embeddings. getBertWordVectors.sh

input_file=./bert_sentences.txt
output_file=./bertWordVectors.jsonl

BERT_BASE_DIR="$PRE_TRAINED_HOME/bert/uncased_L-12_H-768_A-12"
bert_master=./bert_master

pipenv run python $bert_master/extract_features.py \
  --input_file=$input_file \
  --output_file=$output_file \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --layers=-1,-2,-3,-4,-5,-6,-7,-8,-9,-10,-11,-12 \
  --max_seq_length=128 \
  --batch_size=8

We pull the embeddings from the 11th layer (Why 11th? You can try others too. Bert As a Service uses the 11th layer :)) for our work and compute cosine similarity.

# bert_similarity.py
import random as rn
import numpy as np
import json

bertVectors, sentences = {}, []
with open("./bertWordVectors.jsonl") as f:
    jsonlines = f.readlines()
    jsonlines = [x.strip() for x in jsonlines]
    for j, jsonline in enumerate(jsonlines):
        json_content = json.loads(jsonline)
        allTokens = [feature['token'] for feature in json_content['features']]
        sentences.append(' '.join(allTokens[1:-1]))  # Exclude CLS & SEP
        bertVectors[j] = {}
        for i, token in enumerate(allTokens[1:-1]):
            wv = np.array(json_content['features'][i+1]['layers'][1]['values'])
            bertVectors[j][token] = wv/np.linalg.norm(wv)

def checkPairs (iStart, word):
    print (sentences[iStart], ' <=> ', sentences[iStart+1] , '\t\t\t<=> ', round(np.dot(bertVectors[iStart][word], bertVectors[iStart+1][word]),3))
    print (sentences[iStart], ' <=> ', sentences[iStart+2] , '\t\t\t<=> ', round(np.dot(bertVectors[iStart][word], bertVectors[iStart+2][word]),3))

checkPairs (12, 'play')

Our goal here is to show that the BERT word vectors morph themselves based on context. Take the following three sentences for example.

record the play
play the record
play the game

The word play in the second sentence should be more similar to play in the third sentence and less similar to play in the first. We can come up with any number of triplets like the above to test how well BERT embeddings do. Here are a bunch of such triplets and the results show that BERT is able to figure out context the word is being used in.

Figure 3. BERT embeddings are contextual. Each row show three sentences. The sentence in the middle expresses the same context as the sentence on its right, but different from the one on its left. All three sentences in the row have a word in common. The numbers show the computed cosine-similarity between the indicated word pairs. BERT embedding for the word in the middle is more similar to the same word on the right than the one on the left.

When classification is the larger objective, there is no need to build a BoW sentence/document vector from the BERT embeddings. The [CLS] token at the start of the document contains a representation fine tuned for the specific classification objective. But for a clustering task we do need to work with the individual BERT word embeddings and perhaps with a BoW on top to yield a document vector we can use. We will get to these in future posts.

3. Conclusions

We have tracked the evolution of word vectors from long/sparse/one-hot to short/dense/dynamic. It was a journey from a place where a word had no friends (one-hot, orthogonal), to places with a friends circle (e.g. fastText), and currently to a place where it can adapt and find friends (e.g. BERT). These advances have progressively improved our ability to model text.

With that we conclude this post. We move on to evaluating these word vectors for several classification tasks in the next post.