Word Bags vs Word Sequences for Text Classification

Sequence-respecting approaches have an edge over bag-of-words implementations when the sequence of words is material to classification. Here, Long Short-Term Memory (LSTM) neural nets working on word sequences are evaluated against Naive Bayes working on tf-idf vectors, using a synthetic text corpus built for the purpose.

Consider the two one-liners: “Eat to Live” and “Live to Eat”. They contain the same words, but in different order, leading to a big difference in meaning. Bag-of-words based models cannot tell these documents apart and so place them in the same cluster or classification bucket. Word embeddings are of no help either, as pre-trained or custom word vectors are a function of the word alone, without any consideration of that word’s position in the sentence. All the NLP exercises we considered in earlier posts (classification, clustering) used this bag-of-words approach to turn documents into numerical vectors (with or without word embeddings) and hence suffer from the same deficiency. This is where the promise of deep learning with Long Short-Term Memory (LSTM) neural networks can be put to the test.
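
To make this concrete, here is a minimal sketch (using scikit-learn’s CountVectorizer; not part of this post’s code) showing that the two one-liners collapse to identical bag-of-words vectors:

# Illustrative only: the bag-of-words counts for the two phrases are identical,
# so no count- or tf-idf-based classifier can tell them apart
from sklearn.feature_extraction.text import CountVectorizer
docs = ['Eat to Live', 'Live to Eat']
vectors = CountVectorizer().fit_transform(docs).toarray()
print(vectors)    # [[1 1 1]
                  #  [1 1 1]]  <- one count each of 'eat', 'live', 'to' in both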

Figure 1. (A) The meaning implied by the specific sequence of words is destroyed in a bag-of-words approach. (B) Sequence-respecting models have an edge when a play on words changes the meaning and the associated classification label.

LSTM neural nets are an offshoot of Recurrent Neural Nets (RNN), whose hallmark is learning and predicting from sequences, such as sentences and speech that are sequences of words, or monitoring data that evolves with time. LSTM nets remedy some shortcomings RNNs have in learning from long sequences and so are more effective. A lot has been written about RNN and LSTM networks, so we will not delve into the details here. See Karpathy’s article and Colah’s blog for an excellent introduction to RNNs, LSTMs, and what they can do, along with any number of articles on Medium and those written by Jason Brownlee at Machine Learning Mastery.

This post attempts to classify synthetic word sequences with LSTM and with Naive Bayes using tf-idf vectors. LSTM is implemented via Keras with a TensorFlow backend, and Naive Bayes via scikit-learn. While we go through some code snippets here, the full code for reproducing the results can be downloaded from github. The main imports are as follows.

import numpy as np
import os
import json
import random as rn
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'   # must be set before importing tensorflow to take effect
import keras
import tensorflow as tf

In the code snippet below, we fix the random seeds for numpy, python, and tensorflow so we can get reproducible results. In addition, we set PYTHONHASHSEED=0 as an environment variable in the shell before running the simulations.

# All this for reproducibility
np.random.seed(1)                   # numpy seed
rn.seed(1)                          # python random module seed
tf.set_random_seed(1)               # tensorflow graph-level seed
# Run single-threaded to avoid non-determinism from thread scheduling
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
keras.backend.set_session(sess)

1. Construct a Text Corpus

This post is for illustrative purposes. We want to drive home the point that bag-of-words based classifiers fail badly when they have to classify text where the sequence of words is the chief differentiator among the classes, as seen in Figure 1 above. To do that easily, we construct a text corpus without worrying about the meaning of the constructed sentences.

# Build the corpus and sequences
with open ('words.txt' , 'r') as f:
  words = sorted(list(set(f.read().lower().strip().split(','))))
X, labels = [], []
labelToName = { 0 : 'ordered', 1 : 'reversed', 2 : 'unordered' }
namesInLabelOrder = ['ordered', 'reversed', 'unordered']
nWords = len(words)
sequenceLength=15
for i in range(0, nWords-sequenceLength):
  X.append(words[i:i+sequenceLength])
  labels.append(0)
for i in range(nWords-sequenceLength, nWords):
  X.append(words[i:nWords] + words[0:sequenceLength + i -nWords])
  labels.append(0)
nSegments = len(X)
for i in range(nSegments):
  X.append(X[i][::-1])
  labels.append(1)
for i in range(nSegments):
  randIndices = np.random.randint(0, nWords, size=sequenceLength)
  X.append([words[j] for j in randIndices])
  labels.append(2)
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1).split(X, labels)
train_indices, test_indices = next(sss)
  • Pick a set of words, such as the 1000 most common English words, and sort them; after removing duplicates we are left with 995 unique words. Lines 2-6
  • Make 3 separate classes of sentences from these words.
    • ordered: Take 15 words in sequence from this sorted list, i.e. words 0..14, 1..15, and so on, wrapping around at the end of the list. We get 995 such sentences belonging to the ordered class. Lines 9-14
    • reversed: Each one of the above sequences is reversed. Lines 16-18
    • unordered: Pick 15 words at random to form a sequence. Again 995 such sequences, so all three classes are balanced. Lines 19-22
  • Shuffle, and split to get the train and test data sets. We set aside 20% of all the data exclusively for testing. Lines 23-24 (a quick sanity check on the corpus follows this list)
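
As a quick sanity check (illustrative; not part of the original script), we can verify that the corpus built above is balanced and that every sequence is exactly 15 words long:

# Sanity check on the corpus built above (illustrative)
from collections import Counter
print(Counter(labels))      # expect 995 sequences for each of the labels 0, 1 and 2
assert all(len(sentence) == sequenceLength for sentence in X)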

The table below shows sample sequences formed by this means and the classes they belong to.

ordered:   ['a', 'able', 'about', 'above', 'act', 'add', 'afraid', 'after', 'again', 'against', 'age', 'ago', 'agree', 'air', 'all']
reversed:  ['all', 'air', 'agree', 'ago', 'age', 'against', 'again', 'after', 'afraid', 'add', 'act', 'above', 'about', 'able', 'a']
unordered: ['atom', 'ease', 'try', 'boat', 'sleep', 'trouble', 'see', 'push', 'take', 'who', 'cold', 'choose', 'winter', 'own', 'side']

ordered:   ['able', 'about', 'above', 'act', 'add', 'afraid', 'after', 'again', 'against', 'age', 'ago', 'agree', 'air', 'all', 'allow']
reversed:  ['allow', 'all', 'air', 'agree', 'ago', 'age', 'against', 'again', 'after', 'afraid', 'add', 'act', 'above', 'about', 'able']
unordered: ['miss', 'hour', 'fear', 'crop', 'farm', 'especially', 'had', 'under', 'lost', 'true', 'equal', 'me', 'red', 'very', 'i']

ordered:   ['young', 'a', 'able', 'about', 'above', 'act', 'add', 'afraid', 'after', 'again', 'against', 'age', 'ago', 'agree', 'air']
reversed:  ['air', 'agree', 'ago', 'age', 'against', 'again', 'after', 'afraid', 'add', 'act', 'above', 'about', 'able', 'a', 'young']
unordered: ['wing', 'mouth', 'special', 'plane', 'person', 'pattern', 'design', 'water', 'moon', 'happy', 'chart', 'contain', 'leg', 'system', 'count']

2. Document = String of Words

With each document treated as a string of words, we build the data structures that can be consumed as input by various sequence-respecting models such as RNN and LSTM in the Keras library.

2.1 Encoding

Use the Keras text processor on all the sentences/sequences so it can generate a word index and encode each sequence accordingly (Lines 2-4 below). Note that we do not need padding, as all our sequences are exactly 15 words long.

# Encode the documents
kTokenizer = keras.preprocessing.text.Tokenizer()
kTokenizer.fit_on_texts(X)
Xencoded = np.array([np.array(xi) for xi in kTokenizer.texts_to_sequences(X)])
labels = np.array(labels)
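
A quick look at the encoded data (again illustrative; not in the original script) confirms that each document is now a 15-long vector of integer word indices, which is why no padding step is needed:

# Inspect the encoded sequences (illustrative)
print(len(kTokenizer.word_index))    # vocabulary size: 995 words in this corpus
print(Xencoded.shape)                # (2985, 15): 3 x 995 sequences, 15 indices each
print(Xencoded[0])                   # the first 'ordered' sequence as word indices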

2.2 LSTM Implementation

We use the simplest of LSTM models, with an embedding layer, an LSTM layer, and an output layer, as illustrated in Figure 2 below.

Figure 2. A simple model for LSTM. Input sequences of length 15 are fed in batches of 32. The embedding layer reduces the size of the 1-hot vectors, while the magic of LSTM results in softmax predictions for the 3 possible classes.

Fifteen words (each a 1-hot vector) in sequence are fed to an embedding layer that learns the weights to reduce the 995-long 1-hot vectors to 248-long numerical vectors. This sequence of 248-long vectors is fed to the LSTM layer, which works its magic and activates the output layer to yield a 3-long numerical vector via softmax. The index of the largest value is then the predicted class.

# Build the LSTM model
def getModel():
    units1, units2 = int (nWords/4), int (nWords/8)
    model = keras.models.Sequential()
    model.add(keras.layers.embeddings.Embedding(input_dim = len(kTokenizer.word_index)+1,output_dim=units1,input_length=sequenceLength, trainable=True))
    model.add(keras.layers.LSTM(units = units2, return_sequences =False))
    model.add(keras.layers.Dense(len(labelToName), activation ='softmax'))
    model.compile(optimizer='adam', loss = 'categorical_crossentropy', metrics=['acc'])
    return model
  • Line #5: The embedding layer builds word vectors that are 1/4th the size of the input 1-hot vectors. That is, the LSTM layer is fed numerical vectors of size 248, as our input vectors are 995 long.
  • Line #6: The LSTM layer outputs a 124-long vector that gets fed to the final layer with softmax activation.
  • Line #7: The final output layer yields a vector that is as long as the number of labels, and the argmax of that vector is the predicted class label. (A usage sketch for inspecting the model follows this list.)
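
A quick way to double-check the layer shapes and parameter counts (a usage sketch; not part of the training script) is to build the model and print its summary. The resulting output is shown in Section 4.1 below.

# Build the model and inspect its layers (usage sketch)
model = getModel()
model.summary()    # embedding (None, 15, 248) -> LSTM (None, 124) -> dense (None, 3)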

With the data and the model in hand, we are ready to train the model and test the predictions. Because this is a multi-class classification, we convert the labels to 1-hot vectors in Lines 3-4 below.

train_x = Xencoded[train_indices]
test_x = Xencoded[test_indices]
train_labels = keras.utils.to_categorical(labels[train_indices], len(labelToName))
test_labels = keras.utils.to_categorical(labels[test_indices], len(labelToName))

The 80% training portion is further split into validation (20% of this 80%, i.e. 16% of the overall data) and training (64% of the overall data) sets, and this is repeated across multiple training/validation simulations.

# Train and test over multiple train/validation sets
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=2, mode='auto', restore_best_weights=False)
sss2 = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=1).split(train_x, train_labels)
for i in range(10):
    train_indices_2, val_indices = next(sss2)
    model = getModel()
    model.fit(x=train_x[train_indices_2], y=train_labels[train_indices_2], epochs=50, batch_size=32, shuffle=True, validation_data = (train_x[val_indices], train_labels[val_indices]), verbose=2, callbacks=[early_stop])
    test_loss, test_accuracy = model.evaluate(test_x, test_labels, verbose=2)
    print (test_loss, test_accuracy)
    predicted = model.predict(test_x, verbose=2)
    predicted_labels = predicted.argmax(axis=1)
    print (confusion_matrix(labels[test_indices], predicted_labels))
    print (classification_report(labels[test_indices], predicted_labels, digits=4, target_names=namesInLabelOrder))
  • Line #2: If the validation loss does not decrease in any 5 consecutive epochs, we bail out and stop training.
  • Line #3: We set up 10 different train/validation splits to loop over.
  • Line #7: Training is done in batches of 32 sentences, for a maximum of 50 epochs, with the early-stopping callback in place as discussed.
  • Line #11: Get the actual predicted labels from the softmax output vectors. (A sketch for collecting per-split scores follows this list.)
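
Figure 3B below reports an f1-score for each of the ten splits. One simple way to collect those numbers (an assumption on our part; the loop above only prints the classification report) is with scikit-learn's f1_score, appending to a list inside the loop:

# Collect per-split macro f1 scores for plotting (illustrative assumption)
from sklearn.metrics import f1_score
f1Scores = []
# inside the loop above, right after predicted_labels is computed:
#     f1Scores.append(f1_score(labels[test_indices], predicted_labels, average='macro'))
# and after all 10 splits:
#     print('mean f1 over the splits:', np.mean(f1Scores))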

3. Document = Bag of Words

Building vectors with the bag-of-words approach is straightforward with scikit-learn’s vectorizers. We have done this many times in earlier posts. Here we use the tf-idf vectorizer for encoding.

3.1 Encoding

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(analyzer=lambda x: x, min_df=1).fit(X)
Xencoded=vectorizer.transform(X)

The transform in the last line above yields an array of sparse tf-idf numerical vectors that can be fed to any number of traditional classifiers such as Naive Bayes and Support Vector Machines. Note that the vectorizer is handed the already tokenized word lists, so a pass-through analyzer (analyzer=lambda x: x) is used. Here we pick Naive Bayes, the simplest of them all.

3.2 Naive Bayes Implementation

We use scikit-learn’s API for this in the code snippet below. The same training/test split used with the LSTM is employed here for training and prediction.

from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
train_x = Xencoded[train_indices]
test_x = Xencoded[test_indices]
train_labels = labels[train_indices]
test_labels = labels[test_indices]
model.fit(train_x, train_labels)
predicted_labels = model.predict(test_x)
print (confusion_matrix(labels[test_indices], predicted_labels))
print (classification_report(labels[test_indices], predicted_labels, digits=4, target_names=namesInLabelOrder))
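
For a direct single-number comparison with the LSTM runs, the same macro-averaged f1 can be computed for these Naive Bayes predictions (illustrative; it matches the macro average in the report in Section 4.2):

# Macro f1 for the Naive Bayes predictions (illustrative)
from sklearn.metrics import f1_score
print(f1_score(test_labels, predicted_labels, average='macro'))    # about 0.23, as reported below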

4. Results

The entire code is on github, available for download as lstm.py and nb.py for the two approaches contrasted here.

4.1 LSTM

A sample run with

#!/bin/bash
PYTHONHASHSEED=0 ; pipenv run python lstm.py

yields:

Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 15, 248)           247008    
_________________________________________________________________
lstm_1 (LSTM)                (None, 124)               185008    
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 375       
=================================================================
Total params: 432,391
Trainable params: 432,391
Non-trainable params: 0
_________________________________________________________________
None
Train on 1910 samples, validate on 478 samples
Epoch 1/50
 - 3s - loss: 1.1036 - acc: 0.2890 - val_loss: 1.1027 - val_acc: 0.2322
Epoch 2/50
 - 3s - loss: 1.0813 - acc: 0.4429 - val_loss: 1.1533 - val_acc: 0.2343
Epoch 3/50
..
..
Epoch 11/50
 - 3s - loss: 8.5663e-04 - acc: 1.0000 - val_loss: 0.6081 - val_acc: 0.9079
Epoch 00011: early stopping

Confusion Matrix:

[[190   4   5]
 [  0 199   0]
 [ 14  28 157]]

Classification Report:

              precision    recall  f1-score   support

     ordered     0.9314    0.9548    0.9429       199
    reversed     0.8615    1.0000    0.9256       199
   unordered     0.9691    0.7889    0.8698       199

   micro avg     0.9146    0.9146    0.9146       597
   macro avg     0.9207    0.9146    0.9128       597
weighted avg     0.9207    0.9146    0.9128       597

showing that early stopping was triggered in the 11th epoch, while getting an f1-score of 0.91. The rate of convergence for a sample split, and the f1-scores obtained with the ten different splits, are shown in Figure 3 below.

Figure 3. Results with LSTM. (A) Training is essentially complete in 7 epochs, but patience keeps us going until 12 epochs or so. (B) Very good f1-scores in all cases; the average is close to 0.9.

4.2 Naive Bayes

Running Naive Bayes is likewise simple, with PYTHONHASHSEED=0 ; pipenv run python nb.py. Unfortunately this yields an average f1-score of only 0.23, as we see below, a far cry from the 0.9 obtained with LSTM.

Confusion Matrix:

[[ 36 103  60]
 [ 92  39  68]
 [ 62  75  62]]

Classification Report:

              precision    recall  f1-score   support

     ordered     0.1895    0.1809    0.1851       199
    reversed     0.1797    0.1960    0.1875       199
   unordered     0.3263    0.3116    0.3188       199

   micro avg     0.2295    0.2295    0.2295       597
   macro avg     0.2318    0.2295    0.2305       597
weighted avg     0.2318    0.2295    0.2305       597

Figure 4 lays out the confusion matrices obtained with LSTM and Naive Bayes side by side for comparison. The diagonal dominance in the case of LSTM is indicative of its better predictions and larger f1-scores compared to Naive Bayes.

Figure 4. (A) LSTM does extremely well, correctly predicting labels for ordered and reversed sequences. Some hesitation can be seen in predicting the label for unordered sequences. (B) Naive Bayes using tf-idf vectors has great difficulty figuring out the classes and gets them mostly wrong.

5. Conclusions and Next Steps

That concludes our demonstration that sequence-respecting approaches do better at NLP tasks when the sequence of words is material to the task at hand. A string-of-words treatment of documents thus has advantages over the traditional bag-of-words treatment when the order of words is significant. Deep learning models such as LSTM respect the sequence of words and hence can be expected to do better. We constructed a synthetic text corpus and showed that LSTM achieved an f1-score above 0.9 on the 3-class problem, while a Naive Bayes classifier working with tf-idf vectors managed only 0.23.

The artificial nature of the text corpus is the reason for the extreme underperformance of Naive Bayes compared to LSTM. But as we said, the purpose of this post is illustrative, i.e. to highlight when and why sequence-respecting approaches hold an edge over traditional bag-of-words approaches. In the next post in this series we will repeat these tests against a real text corpus and see where we stand.
