Sentiment Analysis with Word Bags and Word Sequences

For generic text, bag-of-words approaches are very efficient at text classification. For the binary classification task studied here, an LSTM working with word sequences is on par in quality with an SVM using tf-idf vectors. But performance is a different matter…

The bag-of-words approach to turning documents into numerical vectors ignores the sequence of words in the documents. Classifiers working with such vectors can then be expected to pay a price for not accounting for the difference a specific sequence of words makes to the meaning, and hence to the implied target class. That, at any rate, is the argument for the newer deep learning approaches such as LSTM (Long Short-Term Memory) neural nets that can work with sequences. In the previous article we showed that the naive Bayes classifier using word-bag vectors (tf-idf, to be specific) took a drubbing at the hands of LSTM (an F1-score of 0.23 for naive Bayes/tf-idf vs 0.91 for LSTM) when the sequence of words was the deciding factor for classification.

But that text corpus was artificial. It was constructed to bring out the best in sequence-respecting models such as LSTM, and the worst in others that ignore the said sequence at their peril. Does this outperformance on the part of LSTM extend to a real-life text corpus, where the sequence of words may not be the deciding factor for classification? That is the question we explore here. We start with a simpler binary classification task in this post and consider a multilabel classification task in a later post. We use Support Vector Machines (SVM) with tf-idf vectors as the proxy for the bag-of-words approach and LSTM for the sequence-respecting approach. SVM is implemented via SciKit and LSTM via Keras. While we go through some code snippets here, the full code for reproducing the results can be downloaded from github.

1. Tokenize the Movie Reviews

The text corpus, large movie reviews from Stanford, is often used for binary sentiment classification – i.e. is the movie good or bad based on the reviews. The positive and negative reviews are downloaded to disk into separate directories. Here is the code snippet to ‘clean’ the documents and tokenize them for analysis.

  • Lines #10-11: Tokenization. Remove all punctuation and NLTK stop words. Make sure all words/tokens start with a letter, and only retain those between 3 and 15 characters long.
  • Lines #15-24: Loop through the movie review files in each folder and tokenize.
  • Line #25: Taking note of the number of words in each document helps us choose a reasonable sequence length for LSTM later. The percentile stats on nTokens show that over 86% of the documents have fewer than 200 words in them.
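The cleaning described above can be sketched as follows. The stop-word set here is a small stand-in for NLTK's English list (the post uses `nltk.corpus.stopwords.words('english')`), so treat the exact list as an assumption:

```python
import re

# Small stand-in for NLTK's English stop-word list (an assumption;
# the original uses nltk.corpus.stopwords.words('english')).
STOP_WORDS = {'the', 'a', 'an', 'and', 'is', 'it', 'of', 'to', 'in', 'this'}

def tokenize(text):
    """Drop punctuation and stop words; keep only tokens that start
    with a letter and are 3 to 15 characters long."""
    # first char is a letter, then 2-14 more alphanumerics => 3-15 chars total
    tokens = re.findall(r'\b[a-z][a-z0-9]{2,14}\b', text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(tokenize("This is a great movie!"))   # ['great', 'movie']
```

Counting `len(tokenize(doc))` per document gives the nTokens stats used later to pick the 200-long sequence length.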

2. Pack Bags and Sequences

LSTM works with word sequences as input, while the traditional classifiers work with word bags such as tf-idf vectors. Having each document in hand as a list of tokens, we are ready for either.

2.1 Tf-Idf Vectors for SVM

We use Scikit’s Tf-Idf Vectorizer to build the vocabulary and the document vectors from the tokens.
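A minimal sketch of this step. Since the documents are already token lists, one way to wire them in is to pass an identity `analyzer` so the vectorizer skips its own tokenization; that wiring is an assumption about the original pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, already tokenized as in Section 1.
docs_tokens = [['good', 'movie'], ['bad', 'boring', 'movie']]

# analyzer=identity: the vectorizer consumes our tokens as-is
vectorizer = TfidfVectorizer(analyzer=lambda doc: doc)
X = vectorizer.fit_transform(docs_tokens)
print(X.shape)   # (n_documents, vocabulary_size) -> (2, 4)
```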

2.2 Sequences for LSTM

The text processor in Keras turns each document into a sequence of integers, where the integer value indicates the actual word as per the {word:index} dictionary that the same processing generates. We use 200-long sequences, as the stats on the tokens show that over 86% of the documents have fewer than 200 words. In Line #8 in the code below, the documents with fewer than 200 words are ‘post’ padded with the index value 0, which is ignored by the embedding layer (mask_zero=True is set in the definition of the embedding layer in Section 3).
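A sketch of this step with Keras' text utilities; the toy documents are illustrative, and any Tokenizer settings beyond the defaults are assumptions:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

docs = ['good fun film', 'dull slow plot dull']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)                 # builds the {word: index} dictionary
sequences = tokenizer.texts_to_sequences(docs)

# Pad/truncate every document to 200 integers; index 0 is the pad value
# that the embedding layer skips when mask_zero=True.
padded = pad_sequences(sequences, maxlen=200, padding='post')
print(padded.shape)   # (2, 200)
```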

3. Models

LSTM is implemented via Keras while SVM is implemented via SciKit. Both work with the same train/test split, so the comparison is fair. Twenty percent of the overall corpus (i.e. 10,000 documents) is set aside for test, while training is done on the remaining 40,000 documents.

3.1 LSTM

As in the earlier article, we use the simplest possible LSTM model, with an embedding layer, one LSTM layer and the output layer.

Figure 1. A simple LSTM model for binary classification.

The embedding layer in Figure 1 reduces the number of features from 98089 (the number of unique words in the corpus) to 300. The LSTM layer outputs a 150-long vector that is fed to the output layer for classification. The model itself is defined quite simply below.

  • Line #4: The embedding layer is trained to convert the 98089-long one-hot vectors to dense 300-long vectors.
  • Line #6: The dropout fields help prevent overfitting.
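The model of Figure 1 can be sketched as below. The layer sizes (98089-word vocabulary, 300-long embeddings, 150-long LSTM output) come from the text; the specific dropout rates and the softmax output are assumptions consistent with the description:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

VOCAB_SIZE = 98089 + 1   # +1 for the padding index 0
SEQ_LEN = 200

model = Sequential([
    Input(shape=(SEQ_LEN,)),
    # Dense 300-long word vectors; mask_zero makes downstream layers
    # ignore the 0-padded positions.
    Embedding(VOCAB_SIZE, 300, mask_zero=True),
    # dropout/recurrent_dropout rates are assumptions to curb overfitting
    LSTM(150, dropout=0.2, recurrent_dropout=0.2),
    Dense(2, activation='softmax'),   # one output per class label
])
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
print(model.output_shape)   # (None, 2)
```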

Training is done with early stopping (Line #6 in the code below) to prevent overtraining. The final output layer yields a vector that is as long as the number of labels, and the argmax of that vector is the predicted class label.
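A runnable sketch of the training loop. A tiny model and random data stand in for the real 40,000-document run so the snippet executes end to end; the monitored metric and patience are assumptions:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping

# Tiny stand-in model/data; the real run uses the 98089-word vocabulary,
# 200-long sequences, and 40,000 training documents.
model = Sequential([
    Input(shape=(10,)),
    Embedding(50, 8, mask_zero=True),
    LSTM(4),
    Dense(2, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='adam')

X = np.random.randint(1, 50, size=(20, 10))
y = np.eye(2)[np.random.randint(0, 2, size=20)]   # one-hot labels

# Stop when validation loss fails to improve; patience is an assumption.
early_stop = EarlyStopping(monitor='val_loss', patience=2,
                           restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=3, batch_size=8,
          callbacks=[early_stop], verbose=0)

preds = model.predict(X, verbose=0)
labels = preds.argmax(axis=1)   # argmax of the output vector is the class
```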

3.2 SVM

The model for SVM is much less involved, as there are far fewer moving parts and parameters to decide upon. That is always a good thing, of course.
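The whole SVM pipeline fits in a few lines. Toy documents stand in for the 50,000 reviews, and LinearSVC is an assumption; any SciKit SVM over tf-idf vectors matches the setup described:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the movie reviews: 1 = positive, 0 = negative
train_docs = ['a great fun film', 'wonderful acting great story',
              'dull boring plot', 'terrible waste of time']
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

clf = LinearSVC()                 # choice of SVM variant is an assumption
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(['great story', 'boring waste'])
print(clf.predict(X_test))        # [1 0]
```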

4. Simulations

The confusion matrix and the F1-scores obtained are what we are interested in. With the predicted labels in hand from either approach we use SciKit’s API to compute them.
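The metric computation itself is a one-liner per metric; the labels below are illustrative, while the real run feeds the 10,000 test-set predictions from either model:

```python
from sklearn.metrics import confusion_matrix, f1_score

# Illustrative true/predicted labels standing in for the test set
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted
f1 = f1_score(y_true, y_pred)
print(cm)    # [[2 1]
             #  [1 2]]
print(f1)    # 0.666...
```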

While we have gone through some snippets in a different order, the complete code, lstm_movies.py for running LSTM and svm_movies.py for running SVM, is on github. As indicated in the previous article, various random seeds are initialized for repeatability.

4.1 LSTM

Running LSTM yields about 0.87 as the F1-score, converging in 6 epochs thanks to early stopping.

4.2 SVM

Running SVM yields 0.90 as the F1-score.

5. Conclusions

Clearly, both SVM at 0.90 as the F1-score and LSTM at 0.87 have done very well for binary classification. The confusion matrices show excellent diagonal dominance as expected.

Figure 2. Both LSTM and SVM have done very well for this binary sentiment classification exercise

While they are essentially equal on the quality side, LSTM takes much longer to train: about 2 hours, as opposed to less than a second for SVM. That is too big a difference to be ignored.

With that we conclude this post. In the next post we go over the results for a multilabel classification exercise and the impact of external word-embeddings such as fasttext.
