That is welcome news after our ho-hum results for text classification when using word-embeddings. In the context of classification we concluded that keeping it simple with Naive Bayes and tf-idf vectors is a great starting point. While we cannot generalize, all the extra work put into constructing document+word vectors did not yield commensurate gains in classification quality. In this post we pick up on the foundations we laid earlier for clustering text, specifically the two posts – Want Clusters? How Many Will You Have? and Clustering Text with Transformed Document Vectors. We choose the 20-news and movie reviews text corpora we are familiar with from the classification exercise, and evaluate the different order reducing transformations for their clustering effectiveness. The code to reproduce these results can be downloaded from github.
Why should order reducing transformations help with clustering?
Before we plunge into the details, let us stop and consider whether there is any reason to expect some benefit from applying word-embeddings – besides, of course, the obvious benefit of compute performance, which is useless if we do not get good quality clusters!
The document vectors arising from the VSM model are long and sparse. Documents in a cluster share only a small fraction of their nonzero coordinates, so they are nearly mutually orthogonal and their average – the cluster centroid – unfortunately ends up close to the origin. That is, all document clusters pretty much have a common centroid! Plus, since we normalize the document vectors to have unit length, every document in a cluster would be at about the same distance from this common centroid. We have a situation where the intercluster distances are close to zero and intracluster distances are close to unity! Not a good situation to work with for demarcating the clusters.
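Here is a minimal numerical sketch of that claim, with random sparse unit vectors standing in for tf-idf document vectors. The dimensions and counts are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, docs_per_cluster, n_clusters, nnz = 20_000, 300, 3, 100

def random_sparse_unit_vectors(count):
    # Nonnegative vectors with ~nnz nonzero entries each, normalized to unit length
    vecs = np.zeros((count, dim))
    for v in vecs:
        idx = rng.choice(dim, size=nnz, replace=False)
        v[idx] = rng.random(nnz)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

centroids = []
for k in range(n_clusters):
    docs = random_sparse_unit_vectors(docs_per_cluster)
    c = docs.mean(axis=0)
    centroids.append(c)
    # every document sits roughly a unit distance from its own centroid
    print(f"cluster {k}: median intracluster distance =",
          round(float(np.median(np.linalg.norm(docs - c, axis=1))), 3))

# the centroids themselves are all close to the origin, and hence to each other
print("centroid norms:", [round(float(np.linalg.norm(c)), 3) for c in centroids])
print("intercluster distance (0 vs 1):",
      round(float(np.linalg.norm(centroids[0] - centroids[1])), 3))
```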
If we can reduce the order of the document vectors without destroying their content/meaning in the process, we can certainly expect better quality results. That precisely is the point of this post and here is the plan.
- Take documents in known groups/clusters and mix them up
- Apply an order reducing transformation and build document vectors
- Cluster the document vectors and see if we have recovered the original groups intact
- Evaluate different transformations in their ability to recover pure & intact document clusters
1. Document vectors for clustering
The prep work for building document vectors from the text corpus with/without word-embeddings is already done in the earlier post – Word Embeddings and Document Vectors: Part 2. Classification. We have the tokenized 20-news and movie-reviews text corpus in an elasticsearch index. The pre-trained and custom word-vectors built from different algorithms (FastText, Word2Vec, and GloVe) are in the index as well. Applying K-Means to the reduced order document vectors is straightforward.
- Get the (stopped tokens) sparse n-long document vectors from the elasticsearch index. n is the size of the vocabulary of the text corpus. Convert these n-long sparse vectors to dense p-long vectors by applying word-embeddings.
- Apply K-Means clustering (with K = 3 for twenty-news, and K = 2 for movie reviews) and find out how pure the obtained clusters are. A minimal sketch of these two steps follows the list.
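The sketch below shows the gist of those two steps, assuming a list of raw texts `docs` and skipping the elasticsearch index for brevity; custom Word2Vec vectors built with Gensim stand in for the various embeddings here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from gensim.models import Word2Vec

# n-long sparse document vectors (n = vocabulary size of the stopped corpus)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()

# Custom p = 300 word vectors trained on the same corpus with Gensim
w2v = Word2Vec([doc.split() for doc in docs], vector_size=300, min_count=2)
W = np.vstack([w2v.wv[w] if w in w2v.wv else np.zeros(300) for w in vocab])  # n x p

# Dense p-long document vectors, normalized to unit length
Z = X @ W
Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)

# K-Means with the known number of clusters (K = 3 for the 20-news subset)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)
```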
The following diagram illustrates the mechanics. The setup is similar to the one we had for classification, except that we are clustering the vectors here. The earlier article on classification has a detailed description of each of the steps in the diagram, and the code is on its github as well.
2. Distances in transformed spaces
To illustrate the typical effect of an order reducing transformation on intra- & intercluster distances, let us pick the ‘alt.atheism’ group from the 20-news corpus and compute the following (a short sketch of the computation follows the list).
- Intercluster Distances: The centroids of each of the 20 groups, and the 19 distances from the centroid of ‘alt.atheism’ to the centroids of the other 19 groups
- Intracluster Distances: The distance of each of the 799 documents in the alt.atheism group from its centroid.
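A minimal sketch of that computation, assuming a dense array `Z` of unit-length document vectors (in either the raw or the transformed space) and an array `y` of group labels:

```python
import numpy as np

def distance_distributions(Z, y, target='alt.atheism'):
    # Centroid of every group in this space
    centroids = {g: Z[y == g].mean(axis=0) for g in np.unique(y)}
    # Intracluster: one distance per document in the target group
    intra = np.linalg.norm(Z[y == target] - centroids[target], axis=1)
    # Intercluster: one distance per other group's centroid
    inter = np.array([np.linalg.norm(centroids[target] - c)
                      for g, c in centroids.items() if g != target])
    return intra, inter
```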
The box-whisker plots (whiskers at the 5th & 95th percentiles) in Figure 2 show the distribution of distances before & after the transformation. With the sparse and long (n = 44870, the size of the stopped vocabulary for 20-news) raw document vectors, the distances within the cluster are all around 1.0 as expected, and the intercluster distances are close to 0.1. With the custom fasttext word-embeddings (with p = 300, i.e. the transformed vectors have a length of 300) we get a favorable distribution of distances: the cluster itself got crunched (median intracluster distance decreased from 1.0 to 0.27) while the different clusters got pushed apart (median distance from other clusters increased from about 0.1 to 0.28).
3. Evaluating the transformations
We know from the previous article Clustering Text with Transformed Document Vectors that,
- we need to normalize the document vectors to have unit length, and
- distances between transformed vectors cannot be directly compared across different transformations, and
- the ratio of intercluster distance to intracluster distance in each space can be compared for some clues as to the usefulness of the transformation.
Further, we have outlined the following procedure for computing this ratio (a short sketch follows the list).
1. Compute the centroid of each cluster, and the distance of each document in the cluster from this centroid. Find the median of these distances.
2. Take the average of the medians obtained in step 1 over all clusters. This is a ‘representative’ intracluster distance A for this space.
3. Compute the average distance B between the centroids.
4. The ratio B/A is what we can compare across transformations.
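A minimal sketch of steps 1-4, again assuming dense unit-length document vectors `Z` and known group labels `y`:

```python
import numpy as np
from itertools import combinations

def inter_intra_ratio(Z, y):
    groups = np.unique(y)
    centroids = [Z[y == g].mean(axis=0) for g in groups]
    # Steps 1 & 2: representative intracluster distance A
    medians = [np.median(np.linalg.norm(Z[y == g] - c, axis=1))
               for g, c in zip(groups, centroids)]
    A = np.mean(medians)
    # Step 3: average distance B between the cluster centroids
    B = np.mean([np.linalg.norm(ci - cj) for ci, cj in combinations(centroids, 2)])
    # Step 4: the ratio we compare across transformations
    return B / A
```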
For better clusterability we want B/A to be large. That is, the transformation that maximizes B/A should do better at clustering. Figure 3 below gives us a glimpse of what to expect. The documents in all cases have been vectorized with Scikit’s tf-idf vectorizer using stopped tokens (but no stemming). Different order reducing transformations are then applied to these vectors. We choose p as 300, i.e. the transformed document vectors have a length of 300 in all cases. The attempted transformations include the following (a short loading sketch follows the list).
- Singular value decomposition
- GloVe (with glove.840B.300d)
- FastText (with crawl-300d-2M-subword.vec)
- Word2Vec (with GoogleNews-vectors-negative300.bin)
- Custom vectors with FastText (using Gensim)
- Custom vectors with Word2Vec (using Gensim)
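As a rough sketch (not the repo’s exact code), the p = 300 representations could be obtained along these lines. Here `X` is the tf-idf matrix and `docs` the raw texts from the earlier sketch, and the GloVe filename is hypothetical – the published glove.840B.300d file would first need converting to word2vec format (e.g. with gensim’s glove2word2vec script).

```python
from sklearn.decomposition import TruncatedSVD
from gensim.models import KeyedVectors, Word2Vec, FastText

# 1. SVD applied directly to the tf-idf matrix
X_svd = TruncatedSVD(n_components=300).fit_transform(X)

# 2-4. Pre-trained embeddings (GloVe after conversion to word2vec format)
glove   = KeyedVectors.load_word2vec_format('glove.840B.300d.w2v.txt')
ft_pre  = KeyedVectors.load_word2vec_format('crawl-300d-2M-subword.vec')
w2v_pre = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# 5-6. Custom embeddings trained on the corpus itself with Gensim
tokens     = [doc.split() for doc in docs]
ft_custom  = FastText(tokens, vector_size=300, min_count=2)
w2v_custom = Word2Vec(tokens, vector_size=300, min_count=2)

# The word-vector variants are then folded into p-long document vectors
# via the X @ W projection sketched earlier; X_svd is already p-long.
```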
The document vectors obtained with each transformation are processed as per steps 1-4 to compute the intercluster/intracluster distance ratio B/A. Figure 3A shows this ratio for 20-news document vectors considering all the 20 groups/clusters. Figure 3B is for the movie reviews dataset.
The main takeaway from Figure 3 is that order reducing transformations and in particular the custom word-embeddings seem to yield large values for B/A and hence can help improve clusterability. Time to verify if this indeed is the case.
4. Clustering with K-Means
Now we get down to the actual clustering task outlined in Figure 1 in order to verify what Figure 3 is indicating. We will employ the same list of transformations and check how pure the obtained clusters are. That is, in any obtained cluster we hope to find articles of a single group/class. Our position is that those transformations with larger values for B/A should yield purer clusters.
We give the known number of clusters to the K-Means simulation in every case. Given that we are helping K-Means here by supplying the exact K, we hope that it would at least be able to separate out the articles by group and place them in distinct clusters. But unfortunately that does not always happen, and in some cases the articles get lopsidedly bunched up in just one or two clusters. That is a failed clustering exercise from our viewpoint.
If the clustering task is successful at all, we should end up with clusters each having a dominant fraction of articles from just one of the groups, so we can identify that cluster with that group.
The ‘purity’ of an obtained cluster is then measured by the number of articles that do not belong in that cluster/group.
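A minimal sketch of this purity measure, assuming arrays `true` (the known groups) and `pred` (the K-Means cluster assignments):

```python
import numpy as np

def misplaced_counts(true, pred):
    # Identify each cluster with its dominant group and count the articles
    # in it that belong to some other group
    counts = {}
    for c in np.unique(pred):
        members = true[pred == c]
        groups, sizes = np.unique(members, return_counts=True)
        counts[c] = (groups[sizes.argmax()], int(len(members) - sizes.max()))
    return counts   # {cluster: (identified group, misplaced article count)}
```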
4.1 Twenty-news articles
In the case of the 20-news dataset we pick articles from 3 groups for this exercise – just so we can plot the results easily. Note however that the results in Figure 3A are obtained by considering all 20 groups/clusters. A word-cloud plot gives a bird’s-eye view of the content we are dealing with.
Figure 5A below shows that the fewest articles were misplaced overall when using custom word-embeddings, whereas the raw document vectors fared the worst. Document vectors with custom word2vec embeddings yield the purest clusters, each identifying with a specific group. The performance of the different transformations matches well with what Figure 3A has indicated.
4.2 Movie reviews
The movie-reviews dataset has only 2 classes, so it is easy enough to plot, and we consider all the documents. Here is a word-cloud of the two groups once again.
Figure 7A identifies only two successful transformations, and they coincide with the two that have the largest B/A ratio in Figure 3B. Clearly, as the word-clouds in Figure 6 have indicated, the movie reviews are more difficult to cluster. Overall, compared to 20-news, we can make the following observations.
- Only 2 out of 7 (compared to 6 out of 7) attempted transformations succeeded in the clustering task. The others were quite lopsided, with just one of the clusters ending up with most of the documents.
- Even in these two successful transformations, the purity/quality of the obtained clusters is poorer, with over 30% of the documents misclustered.
5. Conclusions
With that we wrap up another post. We have shown here that order reducing transformations can help with clustering documents. Among those transformations, the custom word-embeddings seem to have an edge – hence they made their way into the title. Why custom word-embeddings have done better than the other order reducing transformations, and whether that conclusion has legs beyond the text repositories studied here, are questions open to further study.