Multiclass Classification with Word Bags and Word Sequences

SVM with Tf-idf vectors edges out LSTM in quality and performance for classifying the 20-newsgroups text corpus.

A document is a specific sequence of words, but not all sequences of words are documents. Teaching an algorithm the difference is a tall order, and taking the sequence of words into account for text analysis is in general computationally expensive. Deep learning approaches such as LSTM allow us to model a document as a string-of-words, and they have indeed found some success in NLP tasks recently.

On the other hand, when we shred the document and make bags by word, we end up with a vector of weights/counts of those words. The mapping from a document to this vector is many-to-one: all possible orderings of the same words yield the same vector. So deciphering the meaning of the original document (much less resurrecting it!) from this vector is not possible. Nevertheless, this decades-old bag-of-words approach to modeling documents has been the mainstay of NLP tasks.

When the sequence of words is important in determining the class of a document, string-of-words approaches will outshine the bag-of-words ones. We demonstrated this with synthetic documents, where LSTM trounced the bag-of-words approach (naive Bayes working with tf-idf vectors) for classification. But for a real text corpus of movie reviews and binary sentiment classification, we showed that LSTM and SVM (with tf-idf vectors) were comparable in quality, even though the former took much longer.

The objective of this post is to further evaluate "bags vs strings" for a multiclass situation. We will work with the 20-newsgroups text corpus, available through the scikit-learn API, and also look at the impact of using word-embeddings, both pre-trained and custom. We go through some code snippets here, but the complete code to reproduce the results can be downloaded from github.

1. Tokenize the 20news Corpus

This corpus consists of posts made to 20 newsgroups, so they come well labeled. There are over 18,000 posts, more or less evenly distributed across the 20 topics. In the code snippet below we fetch these posts, then clean and tokenize them to get ready for classification.

  • Lines #9 – 10: Tokenization. Remove all punctuation and NLTK stop words. Make sure all words/tokens start with a letter, and only retain words between 3 and 15 characters long.
  • Line #14: Use the scikit-learn API to fetch the posts, making sure to remove the "dead give away" clues as to which topic a given post belongs to.
  • Line #23: Taking note of the number of words in each document helps us choose a reasonable sequence length for LSTM later. The percentile stats on nTokens show that over 92% of the documents have fewer than 200 words in them. A minimal sketch of these steps follows the list.
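The actual snippet (with the line numbers referenced above) is in the script on github; what follows is a minimal sketch of the same steps, assuming the NLTK stop-word corpus is downloaded and using a token pattern that mirrors the rules above.

    import numpy as np
    from nltk.corpus import stopwords
    from nltk.tokenize import RegexpTokenizer
    from sklearn.datasets import fetch_20newsgroups

    nltkStopWords = set(stopwords.words('english'))
    # Tokens must start with a letter and be 3 to 15 characters long
    tokenizer = RegexpTokenizer(r'\b[a-zA-Z][a-zA-Z0-9]{2,14}\b')

    def tokenize(text):
        tokens = [word.lower() for word in tokenizer.tokenize(text)]
        return [word for word in tokens if word not in nltkStopWords]

    # Drop the "dead give away" headers/footers/quotes from each post
    twenty = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'),
                                shuffle=True, random_state=1)
    docs = [tokenize(post) for post in twenty.data]
    labels = twenty.target

    # Word counts per document guide the choice of sequence length for LSTM later
    nTokens = np.array([len(doc) for doc in docs])
    print(np.percentile(nTokens, [50, 80, 92, 99]))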

2. Word-Embeddings

Word-embeddings are short numerical vector representations for words (of a length p that is much, much shorter than the size of the vocabulary nWords). They allow us to reduce the dimensionality of the word-space from the length of the corpus vocabulary (about 107,000 here) to a much shorter length such as the 300 used here. Pre-trained fasttext word vectors are downloaded, and the custom fasttext ones for this 20-news corpus are generated offline via Gensim. In either case, once in hand they are simply read off of the disk.

  • Lines #7 – 8: We have the custom Gensim-generated word vectors in a json file structured as { word : vector, ..}, so we simply read it off as a dictionary.
  • Lines #9 – 22: In the case of pre-trained vectors, we read the downloaded file and process it with some error checking.
  • Lines #26 – 28: Prepare the nWords x 300 embedding matrix, where each row holds the 300-long numerical vector for the corresponding word.

The end result is a matrix where each row is a 300-long vector for a word. The words/rows are ordered as per the integer index in the word_index dictionary – {word:index}. In the case of Keras, the words are ordered by their frequency; in the case of the tf-idf vectorizer, a word gets its index based on its alphabetical order in the vocabulary. Just bookkeeping, nothing complex. A sketch of building the embedding matrix is shown below.
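This sketch assumes the file names seen in the repo's vectors folder (20news-fasttext.json for the Gensim-generated vectors, crawl-300d-2M-subword.vec for the downloaded pre-trained ones); the helper name getEmbeddingMatrix and the exact file handling are illustrative, not the repo's code.

    import json
    import numpy as np

    embedDim = 300

    def getEmbeddingMatrix(word_index, vectorSource):
        # word_index: {word: integer index} from the vectorizer (Section 3.1) or Keras (Section 3.2)
        wordVectors = {}
        if vectorSource == 'custom-fasttext':
            with open('./vectors/20news-fasttext.json') as f:    # Gensim-generated {word: vector} json
                wordVectors = json.load(f)
        elif vectorSource == 'fasttext':
            with open('./vectors/crawl-300d-2M-subword.vec', encoding='utf-8') as f:
                next(f)                                  # first line holds "nWords nDims"
                for line in f:
                    parts = line.rstrip().split(' ')
                    if len(parts) == embedDim + 1:       # skip malformed lines
                        wordVectors[parts[0]] = np.asarray(parts[1:], dtype='float32')

        # One row per index value; row i is the 300-long vector of the word with index i
        embeddingMatrix = np.zeros((max(word_index.values()) + 1, embedDim))
        for word, i in word_index.items():
            vector = wordVectors.get(word)
            if vector is not None:
                embeddingMatrix[i] = vector
        return embeddingMatrix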

3. Pack Bags and Sequences

LSTM works with word sequences as input, while the traditional classifiers work with word bags such as tf-idf vectors. With each document in hand as a list of tokens, we are ready for either.

3.1 Tf-Idf Vectors for SVM

We use the scikit-learn TfidfVectorizer to build the vocabulary (the word_index dict variable in Line #7 below) and the document vectors (Line #8) from the tokens.

Xencoded is a sparse nDocs x nWords matrix. When using word-embeddings, we convert it to a dense nDocs x 300 matrix by multiplying with the embedding matrix computed in Section 2. These shorter, 300-long dense vectors are then what get classified.

    \[\underbrace{Xencoded}_{nDocs \times 300} = \underbrace{Xencoded}_{nDocs \times nWords} \, \cdot \, \underbrace{embeddingMatrix}_{nWords \times 300}\]
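A minimal sketch of this step, assuming docs is the list of token lists from Section 1 and embeddingMatrix comes from the sketch in Section 2; the analyzer override simply tells the vectorizer that the documents are already tokenized.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # The documents are already lists of tokens, so bypass the built-in analyzer
    vectorizer = TfidfVectorizer(analyzer=lambda doc: doc)
    Xencoded = vectorizer.fit_transform(docs)        # sparse nDocs x nWords
    word_index = vectorizer.vocabulary_              # {word: column index}, alphabetical order

    # When using word-embeddings, project onto the dense 300-dimensional space
    Xencoded = Xencoded.dot(embeddingMatrix)         # dense nDocs x 300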

3.2 Sequences for LSTM

The text processor in Keras turns each document into a sequence/string of integers, where the integer value indicates the actual word as per the {word:index} dictionary that the same processing generates. The index values start at 1, skipping 0, which is reserved for padding. We use 200-long sequences, as the stats on the tokens show that over 92% of the documents have fewer than 200 words. In Line #8 in the code below, documents with fewer than 200 words are 'post' padded with the index value 0, which is ignored by the embedding layer (mask_zero=True is set in the definition of the embedding layer in Section 4).
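A sketch of the sequence preparation, again assuming docs is the list of token lists from Section 1 and using the standalone Keras preprocessing API.

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    sequenceLength = 200

    kTokenizer = Tokenizer()
    kTokenizer.fit_on_texts(docs)                    # builds {word: index}, indices starting at 1
    word_index = kTokenizer.word_index
    Xencoded = kTokenizer.texts_to_sequences(docs)   # each document becomes a list of integers
    # Documents shorter than 200 words are 'post' padded with 0, which the embedding layer masks out
    Xencoded = pad_sequences(Xencoded, maxlen=sequenceLength, padding='post')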

4. Models

LSTM is implemented via Keras, while SVM is implemented via scikit-learn. Both work with the same train/test split so the comparison is fair. Twenty percent of the overall corpus (i.e. 3660 documents) is set aside for test, while training is done on the remaining 14636 documents.
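A sketch of the shared split; a stratified 80/20 split is one reasonable way to keep the 20 classes evenly represented in train and test (the actual scripts also fix various random seeds for repeatability).

    from sklearn.model_selection import StratifiedShuffleSplit

    sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
    train_indices, test_indices = next(sss.split(Xencoded, labels))
    X_train, X_test = Xencoded[train_indices], Xencoded[test_indices]
    y_train, y_test = labels[train_indices], labels[test_indices]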

4.1 LSTM

As in the earlier articles in this series, we use the simplest possible LSTM model: an embedding layer, one LSTM layer, and the output layer. When using external word-embeddings, the embedding layer will not be trained, i.e., its weights will be what we read from the disk in Section 2.

Figure 1. A simple LSTM model for multiclass classification

The embedding layer in Figure 1 reduces the number of features from 107196 (the number of unique words in the corpus) to 300. The LSTM layer outputs a 150-long vector that is fed to the output layer for classification. The model itself is defined quite simply below.

  • Lines #4 – 8: The embedding layer is trained only when not using external word-embeddings.
  • Line #10: The dropout fields help prevent overfitting. A sketch of the model definition follows this list.
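This sketch reflects Figure 1 with the unit counts taken from the text (300-wide embedding, 150 LSTM units, 20 output classes); the externalEmbeddings flag, the dropout values, and the optimizer choice are illustrative assumptions, and sequenceLength, word_index, and embeddingMatrix come from the earlier sketches.

    from keras.models import Sequential
    from keras.layers import Embedding, LSTM, Dense

    numWords = len(word_index)                       # ~107196 unique words
    embedDim, lstmUnits, numClasses = 300, 150, 20

    model = Sequential()
    if externalEmbeddings:                           # 'fasttext' or 'custom-fasttext' run
        model.add(Embedding(numWords + 1, embedDim, weights=[embeddingMatrix],
                            input_length=sequenceLength, trainable=False, mask_zero=True))
    else:                                            # 'none': train the embedding weights too
        model.add(Embedding(numWords + 1, embedDim, input_length=sequenceLength,
                            trainable=True, mask_zero=True))
    model.add(LSTM(lstmUnits, dropout=0.2, recurrent_dropout=0.2))   # dropout helps prevent overfitting
    model.add(Dense(numClasses, activation='softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.summary()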

Training is done with early stopping to prevent overtraining (Line #6 in the code below). The final output layer yields a vector that is as long as the number of labels, and the argmax of that vector is the predicted class label.
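A sketch of training and prediction; the patience, batch size, epoch count, and validation split are assumptions, not the script's exact settings.

    from keras.callbacks import EarlyStopping
    import numpy as np

    # Stop training when the validation loss stops improving
    early_stop = EarlyStopping(monitor='val_loss', patience=5, verbose=2)
    model.fit(X_train, y_train, batch_size=32, epochs=50,
              validation_split=0.2, callbacks=[early_stop], verbose=2)

    # The softmax output is a 20-long vector; its argmax is the predicted class label
    predicted = np.argmax(model.predict(X_test), axis=1)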

4.2 SVM

The model for SVM is much less involved, as there are far fewer moving parts and parameters to decide upon. That is always a good thing of course.
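A minimal sketch; a linear-kernel SVM via scikit-learn's LinearSVC is assumed here (any hyper-parameter tuning in the actual script is left out), and X_train would be the tf-idf vectors or their 300-wide dense projections from Section 3.1.

    from sklearn.svm import LinearSVC

    model = LinearSVC(tol=1.0e-6, max_iter=20000)    # works on sparse tf-idf or dense projected vectors
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)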

5. Simulations

The confusion matrix and the F1-scores are what we are interested in. With the predicted labels in hand from either approach, we use the scikit-learn API to compute them.
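A sketch of that evaluation; the post quotes a single F1-score per run, and a weighted average over the 20 classes is one reasonable way to obtain it.

    from sklearn.metrics import classification_report, confusion_matrix, f1_score

    print(confusion_matrix(y_test, predicted))                     # 20 x 20 matrix, rows = true labels
    print(classification_report(y_test, predicted, digits=4))      # per-class precision/recall/F1
    print('Weighted F1:', f1_score(y_test, predicted, average='weighted'))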

While we have gone through some snippets out of order, the complete code (lstm-20news.py for running LSTM and svm-20news.py for running SVM) is on github. As indicated in the earlier articles, various random seeds are initialized for repeatability. The simulations are carried out with the help of a shell script below that loops over the variations we are considering.

5.1 LSTM

The embedding layer should contribute

(107196 + 1) * 300 = 32,159,100 weight parameters

where the extra row in the embedding matrix corresponds to the padding index 0 (a Keras embedding layer carries no bias terms).

This matches the number of non-trainable parameters in Line #11 below for an LSTM run with external word-embeddings.

The run takes over 2 hours, stops due to the early stopping criterion, and obtains an F1-score of 0.73. Figure 2 shows the rate of convergence flattening out a good bit by about 20 epochs or so.

Figure 2. Convergence of the LSTM model with fasttext word-embeddings.

5.2 SVM

SVM has far fewer moving parts, and it finishes much more quickly as well. With fasttext embeddings it works with an 18296 x 300 dense matrix (Line #7 below) and obtains an F1-score of 0.68.

6. Results

We have the results in hand to compare not only bags & sequences for multiclass classification but also the impact of using pre-trained and custom word-embeddings. Figure 3 shows the F1-scores obtained and the time taken in all cases. SVM with direct tf-idf vectors does the best on both quality & performance. Pre-trained word-embeddings help LSTM improve its F1-score. The larger run times for LSTM are expected, and they are in line with what we have seen in the earlier articles in this series.

Figure 3. Quality and performance comparison of bags vs strings for 20-news classification. SVM is clearly the leader in quality and performance. Word-embeddings seem to help LSTM achieve better results.

Figure 4 below compares the best confusion matrices obtained by either approach.

Figure 4. The diagonal dominance is observed in either case. Interestingly both approaches seem to be confused between more or less the same pairs of classes.

7. Conclusions

So what are we to make of the results obtained in this three-part series? For a synthetic text corpus dominated by sequences, word strings beat out word bags handily. For a binary classification task, the score was even. In this multiclass classification task, the scale has tilted towards word bags. Given that the deep learning approaches have so many knobs, one can never be sure that the obtained results cannot be improved by tweaking some (which ones, pray tell me… units/layers/batches/cells/… and by how much too… while you are at it…). So here are some loose assessments for whatever they are worth.

  • For a real text corpus, word bags are tough to beat, especially given the much shorter run times.
  • Word-bag vectors do not really benefit from the use of word-embeddings. We have seen this in earlier articles such as Word Embeddings and Document Vectors: Part 2. Classification.
  • Word-embeddings can improve the quality of results for word-string based approaches.

18 thoughts on "Multiclass Classification with Word Bags and Word Sequences"

  1. Sarwoedy152

    On which environment does this code run smoothly, Linux or Windows? Because I still have problems running the code on both environments, even though I have installed all the dependencies. Please help

    1. Ashok

      Hi Edy,

      I ran it on a linux laptop. Have you cloned the repo from github and followed the instructions there?

      What error messages do you get?

      1. Sarwoedy152

        Of course I have cloned the repo to my local directory in Linux and followed the instructions, but I got a similar error message: “2019-02-16 03:12:49.453982: F tensorflow/core/platform/cpu_feature_guard.cc:37] The TensorFlow library was compiled to use AVX instructions, but these aren’t available on your machine.
        Aborted (core dumped)”. Any idea?

        1. Ashok

          Hi Edy,

          Googling for that error yields this link https://github.com/tensorflow/tensorflow/issues/24548

          where

          >> rootkitchao commented on Dec 24, 2018

          >> The current version of Tensorflow installed via pip uses the AVX instruction set at compile time. This means that your CPU needs to support the AVX instruction set. This instruction set is supported from the second generation of Intel Core CPUs (codenamed SandyBridge). You can compile a Tensorflow from the source that does not use the AVX instruction set. Or find an already compiled one on the internet.

          I believe I am using “version”: “==1.13.1” of tensorflow as per my Pipfile.lock. If you are using a later version try with a lower version perhaps?

          1. Sarwoedy152

            Sure, I read about the error code too, so I will move the deployment to a PC that has a newer Intel processor. Does it need the GPU version of tensorflow?

            I will update the progress in the next 12 hours, thanks very much

  2. sarwoedy152

    I have deployed the binary sentiment classification smoothly, thanks for the guidance. Now I want to deploy the multiclass sentiment classification; I will update again later

    1. sarwoedy152

      Hello, I am confused about what is missing in the module. When I tried to run lstm-20news.py I got the error below. Please kindly give me instructions to fix it. (FYI, in the vectors folder there are 20news-fasttext.json and crawl-300d-2M-subword.vec)

      Using TensorFlw backend.
      Traceback (most recent call last):
      File “lstm-20news.py”, line 27, in
      vectorSource = str(sys.argv[1]) # none, fasttext, custom-fasttext
      IndexError: list index out of range

  3. Ashok

    You are perhaps missing the argument. It should be run with a single argument that is one of “none” or “fasttext” or “custom-fasttext”

    So run it like:

    pipenv run python ./lstm-20news.py none

    for example. See the shell script that loops over the 3 possible args and runs it 3 times…

  4. Sarwoedy152

    Perfect! It is training now. I will try the three possible args one by one and will update as soon as training finishes. Excuse me, but what is the purpose of using the different args? Is it optional to use one of them? Thanks

    1. Ashok

      Well, we are testing the effectiveness of different word-embeddings on the performance of LSTM for this classification exercise. When the arg is ‘none’ we are asking LSTM to train the embed matrix parameters as part of the overall optimization process while fitting the training data. With the other two arguments we are ‘supplying’ the embed matrix parameters from an external file and telling LSTM NOT to train for them. And this external file is the one we read from disk with vectors (fasttext or custom-fasttext) for different words. You can take a look at this series of articles

      http://xplordat.com/2018/10/09/word-embeddings-and-document-vectors-part-2-classification/

      for working with pre-trained or custom word-embeddings.

  5. Sarwoedy152

    Hello, I have tried to deploy both the LSTM and SVM scripts, but I think the accuracy needs to be higher. I got 68-72% accuracy with LSTM and SVM. Is there anything I can do to get better accuracy? Thanks

  6. Ashok

    It depends on the text corpus and the quality of the training data… If you have been able to reproduce the reported results in these blogs, then we know that the basic set up is working as expected. Now, when you replace the text corpus with a different one, you will perhaps need to experiment with modifying the tokenization (what kind of documents are these?). First get SVM ‘without’ any embeddings to do the best job it can. That will be the baseline, and you will understand any limitations posed by your particular text corpus.

  7. Sarwoedy152

    I think I need to change the text corpus to (.txt) documents like in the blog post “Sentiment Analysis with Word Bags and Sequence”. I saw it contains .txt documents in separate folders for negative and positive. I need to classify at least three classes (negative, neutral and positive). Honestly I’m not familiar with Python, so kindly please help.

  8. Ashok

    It is very simple, Edy.

    (1) Just make a 3rd folder by the name ‘neutral’ under both ‘test’ & ‘train’ folders.

    (2) Place your train & test neutral documents in the corresponding folder.

    (3) Once done, the only change you need should be these 2 lines (use the lines below and replace what is there in the file)

    X, labels, labelToName = [], [], { 0 : 'neg', 1 : 'pos', 2 : 'neutral' }

    for classIndex, directory in enumerate(['neg', 'pos', 'neutral']):

    in the function 'getMovies'

    Should work
