Sequence-respecting approaches have an edge over bag-of-words implementations when the said sequence is material to classification. Long Short-Term Memory (LSTM) neural networks working on word sequences are evaluated against Naive Bayes working with tf-idf vectors on a synthetic text corpus for classification effectiveness.
Consider the two one-liners: “Eat to Live” and “Live to Eat”. They contain the same words, but in different order, leading to a big difference in meaning. Bag-of-words based models cannot tell these documents apart, and so place them in the same cluster or classification bucket. Word embeddings are of no help either, as the pre-trained or custom word vectors are a function of the word alone, without any consideration of the position of that word in a sentence. All the NLP exercises we have considered in earlier posts (classification, clustering) have used this bag-of-words approach to turn documents into numerical vectors (with or without word embeddings) and hence suffer from the same deficiency. This is where the promise of deep learning with Long Short-Term Memory (LSTM) neural networks can be put to the test.
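As a minimal sketch (not from the post’s code), vectorizing these two phrases with scikit-learn’s CountVectorizer shows why any bag-of-words model sees them as the same document:

# Illustrative check: "Eat to Live" vs "Live to Eat" under bag-of-words
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Eat to Live", "Live to Eat"]
vectors = CountVectorizer().fit_transform(docs).toarray()
print(vectors)                            # both rows count the same words: eat, live, to
print((vectors[0] == vectors[1]).all())   # True -- word order is lost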
LSTM neural nets are an offshoot of Recurrent Neural Nets (RNN), whose hallmark is learning and predicting from sequences, such as sentences/speech that are sequences of words, or monitoring data that evolves with time. LSTM neural nets remedy some shortcomings associated with learning from long sequences and so do better than plain RNNs in that setting. A lot has been written about RNN and LSTM networks, so we will not delve into the details here. See Karpathy’s article and Colah’s blog for an excellent introduction to RNNs, LSTMs, and what they can do, and of course any number of articles on Medium and those written by Jason Brownlee at Machine Learning Mastery.
This post attempts to classify synthetic word sequences with LSTM, and with Naive Bayes using tf-idf vectors. The LSTM is implemented via Keras with a TensorFlow backend, and Naive Bayes via SciKit. While we go through some code snippets here, the full code for reproducing the results can be downloaded from GitHub. The main imports are as follows.
import numpy as np
import os
import json
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedShuffleSplit
import random as rn
import keras
import tensorflow as tf
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
In the code snippet below we fix the random seeds for numpy and tensorflow so we can get reproducible results. In addition, we set PYTHONHASHSEED=0 as an environment variable in the shell before running the simulations.
# All this for reproducibility
np.random.seed(1)
rn.seed(1)
tf.set_random_seed(1)
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
keras.backend.set_session(sess)
1. Construct a Text Corpus
This post is for illustrative purposes. We want to drive home the point that bag-of-words based classifiers fail badly when faced with having to classify text where the sequence of words is the chief differentiator among the classes, as seen in Figure 1 above. In order to do that easily, we construct a text corpus without worrying about the meaning of the constructed sentences.
 1  # Build the corpus and sequences
 2  with open('words.txt', 'r') as f:
 3      words = sorted(list(set(f.read().lower().strip().split(','))))
 4  X, labels = [], []
 5  labelToName = { 0 : 'ordered', 1 : 'reversed', 2 : 'unordered' }
 6  namesInLabelOrder = ['ordered', 'reversed', 'unordered']
 7  nWords = len(words)
 8  sequenceLength = 15
 9  for i in range(0, nWords - sequenceLength):
10      X.append(words[i:i+sequenceLength])
11      labels.append(0)
12  for i in range(nWords - sequenceLength, nWords):
13      X.append(words[i:nWords] + words[0:sequenceLength + i - nWords])
14      labels.append(0)
15  nSegments = len(X)
16  for i in range(nSegments):
17      X.append(X[i][::-1])
18      labels.append(1)
19  for i in range(nSegments):
20      randIndices = np.random.randint(0, size=sequenceLength, high=nWords)
21      X.append([words[j] for j in randIndices])
22      labels.append(2)
23  sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1).split(X, labels)
24  train_indices, test_indices = next(sss)
- Pick a set of words, e.g. the 1000 most common English words, and sort them. After removing duplicates we are left with 995 unique words. Lines 2-6
- Make 3 separate classes of sentences from these words.
- ordered: Take 15 words in sequence, i.e. words 0-14, 1-15, and so on, wrapping around at the end of the list (e.g. word 994 followed by words 0-13). We get 995 such sentences that belong to this ordered class. Lines 9-14
- reversed: Each one of the above sequences is reversed. Lines 16-18
- unordered: Pick 15 words at random to form a sequence. Again 995 such sequences, so all three classes are balanced (see the quick check after this list). Lines 19-22
- Shuffle, and split to get the train and test data sets. We set aside 20% of all the data to be used exclusively for testing. Lines 23-24
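As a quick sanity check (a small sketch added here, assuming the corpus was built exactly as in the snippet above), we can confirm the corpus size and the class balance:

# Sanity check on corpus size and class balance (illustrative, not part of the original script)
from collections import Counter
print(len(X))           # 2985 documents in total, i.e. 3 x 995
print(Counter(labels))  # 995 sequences per class: ordered (0), reversed (1), unordered (2)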
The table below shows sample sequences formed by this means and the classes they belong to.
ordered | reversed | unordered |
[‘a’, ‘able’, ‘about’, ‘above’, ‘act’, ‘add’, ‘afraid’, ‘after’, ‘again’, ‘against’, ‘age’, ‘ago’, ‘agree’, ‘air’, ‘all’] | [‘all’, ‘air’, ‘agree’, ‘ago’, ‘age’, ‘against’, ‘again’, ‘after’, ‘afraid’, ‘add’, ‘act’, ‘above’, ‘about’, ‘able’, ‘a’] | [‘atom’, ‘ease’, ‘try’, ‘boat’, ‘sleep’, ‘trouble’, ‘see’, ‘push’, ‘take’, ‘who’, ‘cold’, ‘choose’, ‘winter’, ‘own’, ‘side’] |
[‘able’, ‘about’, ‘above’, ‘act’, ‘add’, ‘afraid’, ‘after’, ‘again’, ‘against’, ‘age’, ‘ago’, ‘agree’, ‘air’, ‘all’, ‘allow’] | [‘allow’, ‘all’, ‘air’, ‘agree’, ‘ago’, ‘age’, ‘against’, ‘again’, ‘after’, ‘afraid’, ‘add’, ‘act’, ‘above’, ‘about’, ‘able’] | [‘miss’, ‘hour’, ‘fear’, ‘crop’, ‘farm’, ‘especially’, ‘had’, ‘under’, ‘lost’, ‘true’, ‘equal’, ‘me’, ‘red’, ‘very’, ‘i’] |
[‘young’, ‘a’, ‘able’, ‘about’, ‘above’, ‘act’, ‘add’, ‘afraid’, ‘after’, ‘again’, ‘against’, ‘age’, ‘ago’, ‘agree’, ‘air’] | [‘air’, ‘agree’, ‘ago’, ‘age’, ‘against’, ‘again’, ‘after’, ‘afraid’, ‘add’, ‘act’, ‘above’, ‘about’, ‘able’, ‘a’, ‘young’] | [‘wing’, ‘mouth’, ‘special’, ‘plane’, ‘person’, ‘pattern’, ‘design’, ‘water’, ‘moon’, ‘happy’, ‘chart’, ‘contain’, ‘leg’, ‘system’, ‘count’] |
2. Document = String of Words
With each document as a string of words, we build the data structures that can be consumed as input by the various sequence-respecting models in the Keras library, such as RNNs and LSTMs.
2.1 Encoding
We use the Keras text processor on all the sentences/sequences so it can generate a word index and encode each sequence accordingly (Lines 2-4 below). Note that we do not need padding, as all our sequences are exactly 15 words long.
1  # Encode the documents
2  kTokenizer = keras.preprocessing.text.Tokenizer()
3  kTokenizer.fit_on_texts(X)
4  Xencoded = np.array([np.array(xi) for xi in kTokenizer.texts_to_sequences(X)])
5  labels = np.array(labels)
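To see what the Keras Tokenizer does with pre-split token lists, here is a toy illustration (a sketch with made-up two-document input; the exact integer indices depend on word frequencies and insertion order):

# Toy illustration of the Keras Tokenizer on already-tokenized documents
from keras.preprocessing.text import Tokenizer
docs = [['live', 'to', 'eat'], ['eat', 'to', 'live']]
tok = Tokenizer()
tok.fit_on_texts(docs)               # builds the word index from the token lists
print(tok.word_index)                # e.g. {'live': 1, 'to': 2, 'eat': 3}
print(tok.texts_to_sequences(docs))  # e.g. [[1, 2, 3], [3, 2, 1]] -- word order is preserved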
2.2 LSTM Implementation
We use the simplest of LSTM models: an embedding layer, an LSTM layer, and an output layer, as illustrated in Figure 2 below.
Fifteen words (where each word is a 1-hot vector over the 995-word vocabulary) are fed in sequence to an embedding layer that learns the weights to reduce each word from a 995-long 1-hot vector to a 248-long numerical vector. This sequence of 248-long vectors is fed to the LSTM layer, which does its magic and activates the output layer to yield a 3-long numerical vector via softmax. The index of the largest value is then the predicted class.
1  # Build the LSTM model
2  def getModel():
3      units1, units2 = int(nWords/4), int(nWords/8)
4      model = keras.models.Sequential()
5      model.add(keras.layers.embeddings.Embedding(input_dim=len(kTokenizer.word_index)+1, output_dim=units1, input_length=sequenceLength, trainable=True))
6      model.add(keras.layers.LSTM(units=units2, return_sequences=False))
7      model.add(keras.layers.Dense(len(labelToName), activation='softmax'))
8      model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
9      return model
- Line #5: The embedding layer builds word vectors that are one quarter the size of the input vocabulary. That is, the LSTM layer is fed numerical vectors of size 248, as our vocabulary is 995 words.
- Line #6: The LSTM layer outputs a 124-long vector that gets fed to the final layer with softmax activation.
- Line #7: The final output layer yields a vector that is as long as the number of labels, and the argmax of that vector is the predicted class label. (See the parameter-count check below.)
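These layer sizes can be checked against the parameter counts that Keras reports in the model summary shown in Section 4.1. A back-of-the-envelope sketch, assuming the standard parameter formulas for Embedding, LSTM and Dense layers:

# Parameter counts for getModel() with nWords = 995
vocab = 995 + 1                 # word index size + 1 for the reserved index 0
embed_dim = 995 // 4            # 248
lstm_units = 995 // 8           # 124
n_classes = 3

embedding_params = vocab * embed_dim                          # 996 * 248 = 247008
lstm_params = 4 * lstm_units * (embed_dim + lstm_units + 1)   # 4 * 124 * 373 = 185008
dense_params = lstm_units * n_classes + n_classes             # 124 * 3 + 3 = 375
print(embedding_params, lstm_params, dense_params)
print(embedding_params + lstm_params + dense_params)          # 432391, matching the summary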
With the data and model in hand, we are ready to train the model and test the predictions. Because this is a multi-class classification, we convert the labels to 1-hot vectors in Lines 3-4 below.
1  train_x = Xencoded[train_indices]
2  test_x = Xencoded[test_indices]
3  train_labels = keras.utils.to_categorical(labels[train_indices], len(labelToName))
4  test_labels = keras.utils.to_categorical(labels[test_indices], len(labelToName))
The remaining 80% of the data is further split into a training set (64% of the overall data) and a validation set (20% of this 80%, i.e. 16% overall) in each of the multiple training/validation simulations.
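For our corpus the set sizes work out as follows (a quick arithmetic sketch; the exact rounding is handled by scikit-learn, and the numbers match the “Train on 1910 samples, validate on 478 samples” line in the run log of Section 4.1):

# Rough arithmetic for the train / validation / test sizes
total = 3 * 995               # 2985 documents in the corpus
n_test = round(0.2 * total)   # 597 held out for testing, 199 per class
n_rest = total - n_test       # 2388 left for the training/validation splits
n_val = round(0.2 * n_rest)   # ~478 used for validation in each split
n_train = n_rest - n_val      # ~1910 used for training
print(total, n_test, n_train, n_val)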
 1  # Train and test over multiple train/validation sets
 2  early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=5, verbose=2, mode='auto', restore_best_weights=False)
 3  sss2 = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=1).split(train_x, train_labels)
 4  for i in range(10):
 5      train_indices_2, val_indices = next(sss2)
 6      model = getModel()
 7      model.fit(x=train_x[train_indices_2], y=train_labels[train_indices_2], epochs=50, batch_size=32, shuffle=True, validation_data=(train_x[val_indices], train_labels[val_indices]), verbose=2, callbacks=[early_stop])
 8      test_loss, test_accuracy = model.evaluate(test_x, test_labels, verbose=2)
 9      print(test_loss, test_accuracy)
10      predicted = model.predict(test_x, verbose=2)
11      predicted_labels = predicted.argmax(axis=1)
12      print(confusion_matrix(labels[test_indices], predicted_labels))
13      print(classification_report(labels[test_indices], predicted_labels, digits=4, target_names=namesInLabelOrder))
- Line #2: If the validation loss does not decrease over 5 consecutive epochs, we bail out and stop training.
- Line #3: We set up 10 different train/validation splits to loop over.
- Line #7: Training is done in batches of 32 sentences, for a maximum of 50 epochs, with the early stopping callback in place as discussed.
- Line #11: Get the predicted labels from the softmax output vectors via argmax.
3. Document = Bag of Words
Building vectors with the bag-of-words approach is straightforward with SciKit’s vectorizers. We have done this many times in earlier posts. Here we use the tf-idf vectorizer for encoding.
3.1 Encoding
1  from sklearn.feature_extraction.text import TfidfVectorizer
2  vectorizer = TfidfVectorizer(analyzer=lambda x: x, min_df=1).fit(X)
3  Xencoded = vectorizer.transform(X)
The result from Line #3 above is a matrix of sparse tf-idf numerical vectors that can be fed to any number of traditional classifiers such as Naive Bayes, Support Vector Machines, etc. Here we pick Naive Bayes, the simplest of them all.
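This encoding is exactly where the sequence information disappears: an ordered sequence and its reversal contain the same tokens with the same counts, so their tf-idf vectors are identical. A small check (a sketch, assuming X and the identity-analyzer vectorizer defined above):

# An ordered document and its reversed twin get identical tf-idf vectors
ordered_doc = X[0]
reversed_doc = X[0][::-1]
v1, v2 = vectorizer.transform([ordered_doc, reversed_doc]).toarray()
print((v1 == v2).all())   # True -- the classifier has nothing to separate these two classes with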
3.2 Naive Bayes Implementation
We use SciKit’s API for this in the code snippet below. The same training/test data used with LSTM is employed here for training and prediction.
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
train_x = Xencoded[train_indices]
test_x = Xencoded[test_indices]
train_labels = labels[train_indices]
test_labels = labels[test_indices]
model.fit(train_x, train_labels)
predicted_labels = model.predict(test_x)
print(confusion_matrix(labels[test_indices], predicted_labels))
print(classification_report(labels[test_indices], predicted_labels, digits=4, target_names=namesInLabelOrder))
4. Results
The entire code is on GitHub for download, as lstm.py and nb.py for the two approaches we are contrasting here.
4.1 LSTM
A sample run with
#!/bin/bash
export PYTHONHASHSEED=0
pipenv run python lstm.py
yields:
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 15, 248)           247008
_________________________________________________________________
lstm_1 (LSTM)                (None, 124)               185008
_________________________________________________________________
dense_1 (Dense)              (None, 3)                 375
=================================================================
Total params: 432,391
Trainable params: 432,391
Non-trainable params: 0
_________________________________________________________________
None
Train on 1910 samples, validate on 478 samples
Epoch 1/50
 - 3s - loss: 1.1036 - acc: 0.2890 - val_loss: 1.1027 - val_acc: 0.2322
Epoch 2/50
 - 3s - loss: 1.0813 - acc: 0.4429 - val_loss: 1.1533 - val_acc: 0.2343
Epoch 3/50
..
..
Epoch 11/50
 - 3s - loss: 8.5663e-04 - acc: 1.0000 - val_loss: 0.6081 - val_acc: 0.9079
Epoch 00011: early stopping

Confusion Matrix:
[[190   4   5]
 [  0 199   0]
 [ 14  28 157]]

Classification Report:
              precision    recall  f1-score   support

     ordered     0.9314    0.9548    0.9429       199
    reversed     0.8615    1.0000    0.9256       199
   unordered     0.9691    0.7889    0.8698       199

   micro avg     0.9146    0.9146    0.9146       597
   macro avg     0.9207    0.9146    0.9128       597
weighted avg     0.9207    0.9146    0.9128       597
showing that early stopping was triggered in the 11th epoch while reaching an f1-score of 0.91. The rate of convergence for a sample split, and the f1-scores obtained with the ten different splits, are shown in Figure 3 below.
4.2 Naive Bayes
Running Naive Bayes is likewise simple, with PYTHONHASHSEED=0 exported as before and pipenv run python nb.py. Unfortunately, this yields an average f1-score of just 0.23, as we see below. That is a far cry from the 0.91 obtained with LSTM.
Confusion Matrix:
[[ 36 103  60]
 [ 92  39  68]
 [ 62  75  62]]

Classification Report:
              precision    recall  f1-score   support

     ordered     0.1895    0.1809    0.1851       199
    reversed     0.1797    0.1960    0.1875       199
   unordered     0.3263    0.3116    0.3188       199

   micro avg     0.2295    0.2295    0.2295       597
   macro avg     0.2318    0.2295    0.2305       597
weighted avg     0.2318    0.2295    0.2305       597
Figure 4 lays out the confusion matrices obtained with LSTM and Naive Bayes side by side for comparison. The diagonal dominance in the case of LSTM is indicative of its better predictions and larger f1-scores compared to Naive Bayes.
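The same conclusion can be read off numerically by summing the diagonals (a quick check using the confusion matrices reported above):

import numpy as np

# Confusion matrices from the runs above (rows: true class, columns: predicted class)
lstm_cm = np.array([[190, 4, 5], [0, 199, 0], [14, 28, 157]])
nb_cm   = np.array([[36, 103, 60], [92, 39, 68], [62, 75, 62]])

print(np.trace(lstm_cm) / lstm_cm.sum())   # ~0.915, matching the LSTM micro avg of 0.9146
print(np.trace(nb_cm) / nb_cm.sum())       # ~0.229, matching the Naive Bayes micro avg of 0.2295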
5. Conclusions and Next Steps
That concludes our demonstration that sequence-respecting approaches do better at NLP tasks when the said sequence is material to the task at hand. A string-of-words representation of documents thus has advantages over the traditional bag-of-words representation when the sequence of words is significant. Deep learning models such as LSTM respect the sequence of words and hence can be expected to do better. We constructed a synthetic text corpus and showed that LSTM achieved a 3-class f1-score of over 0.9, while a Naive Bayes classifier working with tf-idf vectors yielded only 0.23.
The artificial nature of the text corpus is the reason for the extreme underperformance of Naive Bayes compared to LSTM. But as we said, the purpose of this post is illustrative, i.e. to highlight when and why sequence-respecting approaches hold an edge over traditional bag-of-words approaches. In the next post in this series we will repeat these tests against a real text corpus and see where we stand.