That is welcome news after our ho-hum results for text classification when using word-embeddings. In the context of classification we concluded that keeping it simple with Naive Bayes and tf-idf vectors is a great starting point. While we cannot generalize, all the extra work put into constructing document+word vectors did not yield commensurate gains in classification quality. In this post we pick up on the foundations we laid earlier for clustering text, specifically the two posts – Want Clusters? How Many Will You Have? and Clustering Text with Transformed Document Vectors. We choose the 20-news and movie reviews text corpora we are familiar with from the classification exercise, and evaluate the different order reducing transformations for their clustering effectiveness. The code to reproduce these results can be downloaded from github.
Why should order reducing transformations help with clustering?
Before we plunge into the details, let us stop and consider whether there is any reason to expect some benefit from applying word-embeddings – besides, of course, the obvious benefit of compute performance, which is useless if we do not get good quality clusters!
The document vectors arising from the VSM model are long and sparse. Documents in a cluster share only a small fraction of their nonzero coordinates, so they are nearly mutually orthogonal and their average – the cluster centroid – unfortunately ends up close to the origin. That is, all document clusters pretty much have a common centroid! Plus, since we normalize the document vectors to have unit length, every document in a cluster would be at about the same distance from this common centroid. We have a situation where the intercluster distances are close to zero and intracluster distances are close to unity! Not a good situation to work with for demarcating the clusters.
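Here is a minimal numerical sketch of that claim, with random sparse unit vectors standing in for tf-idf document vectors. The dimensions and counts are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, docs_per_cluster, n_clusters, nnz = 20_000, 300, 3, 100

def random_sparse_unit_vectors(count):
    # Nonnegative vectors with ~nnz nonzero entries each, normalized to unit length
    vecs = np.zeros((count, dim))
    for v in vecs:
        idx = rng.choice(dim, size=nnz, replace=False)
        v[idx] = rng.random(nnz)
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

centroids = []
for k in range(n_clusters):
    docs = random_sparse_unit_vectors(docs_per_cluster)
    c = docs.mean(axis=0)
    centroids.append(c)
    # every document sits roughly a unit distance from its own centroid
    print(f"cluster {k}: median intracluster distance =",
          round(float(np.median(np.linalg.norm(docs - c, axis=1))), 3))

# the centroids themselves are all close to the origin, and hence to each other
print("centroid norms:", [round(float(np.linalg.norm(c)), 3) for c in centroids])
print("intercluster distance (0 vs 1):",
      round(float(np.linalg.norm(centroids[0] - centroids[1])), 3))
```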
If we can reduce the order of the document vectors without destroying their content/meaning in the process, we can certainly expect better quality results. That precisely is the point of this post and here is the plan.
- Take documents in known groups/clusters and mix them up
- Apply an order reducing transformation and build document vectors
- Cluster the document vectors and see if we have recovered the original groups intact
- Evaluate different transformations in their ability to recover pure & intact document clusters
1. Document vectors for clustering
The prep work for building document vectors from the text corpus with/without word-embeddings is already done in the earlier post – Word Embeddings and Document Vectors: Part 2. Classification. We have the tokenized 20-news and movie-reviews text corpus in an elasticsearch index. The pre-trained and custom word-vectors built from different algorithms (FastText, Word2Vec, and GloVe) are in the index as well. Applying K-Means to the reduced order document vectors is straightforward.
- Get the (stopped tokens) sparse n-long document vectors from the elasticsearch index. n is the size of the vocabulary of the text corpus. Convert these n-long sparse vectors to dense p-long vectors by applying word-embeddings.
- Apply K-Means clustering (with K = 3 for twenty-news, and K = 2 for movie reviews) and find out how pure the obtained clusters are. A minimal sketch of these two steps follows the list.
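The sketch below shows the gist of those two steps, assuming a list of raw texts `docs` and skipping the elasticsearch index for brevity; custom Word2Vec vectors built with Gensim stand in for the various embeddings here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from gensim.models import Word2Vec

# n-long sparse document vectors (n = vocabulary size of the stopped corpus)
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(docs)
vocab = vectorizer.get_feature_names_out()

# Custom p = 300 word vectors trained on the same corpus with Gensim
w2v = Word2Vec([doc.split() for doc in docs], vector_size=300, min_count=2)
W = np.vstack([w2v.wv[w] if w in w2v.wv else np.zeros(300) for w in vocab])  # n x p

# Dense p-long document vectors, normalized to unit length
Z = X @ W
Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)

# K-Means with the known number of clusters (K = 3 for the 20-news subset)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Z)
```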
The following diagram illustrates the mechanics. The setup is similar to the one we had for classification, except that we are clustering the vectors here. The earlier article on classification has a detailed description of each of the steps in the diagram, and the code is on its github as well.
2. Distances in transformed spaces
To illustrate the typical effect of an order reducing transformation on intra- & intercluster distances, let us pick the ‘alt.atheism’ group from the 20-news corpus and compute the following (a short sketch of the computation follows the list).
- Intercluster Distances: The centroids of each of the 20 groups, and the 19 distances from the centroid of ‘alt.atheism’ to the centroids of the other 19 groups
- Intracluster Distances: The distance of each of the 799 documents in the alt.atheism group from its centroid.
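A minimal sketch of that computation, assuming a dense array `Z` of unit-length document vectors (in either the raw or the transformed space) and an array `y` of group labels:

```python
import numpy as np

def distance_distributions(Z, y, target='alt.atheism'):
    # Centroid of every group in this space
    centroids = {g: Z[y == g].mean(axis=0) for g in np.unique(y)}
    # Intracluster: one distance per document in the target group
    intra = np.linalg.norm(Z[y == target] - centroids[target], axis=1)
    # Intercluster: one distance per other group's centroid
    inter = np.array([np.linalg.norm(centroids[target] - c)
                      for g, c in centroids.items() if g != target])
    return intra, inter
```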
The box-whisker plots (whiskers at the 5th & 95th percentiles) in Figure 2 show the distribution of distances before & after the transformation. With the sparse and long (n = 44870, the size of the stopped vocabulary for 20-news) raw document vectors, the distances within the cluster are all around 1.0 as expected, and the intercluster distances are close to 0.1. With the custom fasttext word-embeddings (with p = 300, i.e. the transformed vectors have a length of 300) we get a favorable distribution of distances: the cluster itself got crunched (median intracluster distance decreased from 1.0 to 0.27) while the different clusters got pushed apart (median distance from other clusters increased from about 0.1 to 0.28).
3. Evaluating the transformations
We know from the previous article Clustering Text with Transformed Document Vectors that,
- we need to normalize the document vectors to have unit length, and
- distances between transformed vectors cannot be directly compared across different transformations, and
- the ratio of intercluster distance to intracluster distance in each space can be compared for some clues as to the usefulness of the transformation.
Further, we have outlined the following procedure for computing this ratio (a short sketch follows the list).
1. Compute the centroid of each cluster, and the distance of each document in the cluster from this centroid. Find the median of these distances.
2. Take the average of the medians obtained in step 1 over all clusters. This is a ‘representative’ intracluster distance A for this space.
3. Compute the average distance B between the centroids.
4. The ratio B/A is what we can compare across transformations.
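A minimal sketch of steps 1-4, again assuming dense unit-length document vectors `Z` and known group labels `y`:

```python
import numpy as np
from itertools import combinations

def inter_intra_ratio(Z, y):
    groups = np.unique(y)
    centroids = [Z[y == g].mean(axis=0) for g in groups]
    # Steps 1 & 2: representative intracluster distance A
    medians = [np.median(np.linalg.norm(Z[y == g] - c, axis=1))
               for g, c in zip(groups, centroids)]
    A = np.mean(medians)
    # Step 3: average distance B between the cluster centroids
    B = np.mean([np.linalg.norm(ci - cj) for ci, cj in combinations(centroids, 2)])
    # Step 4: the ratio we compare across transformations
    return B / A
```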
For better clusterability we want B/A to be large. That is, the transformation that maximizes B/A should do better at clustering. Figure 3 below gives us a glimpse of what to expect. The documents in all cases have been vectorized with Scikit’s tf-idf vectorizer using stopped tokens (but no stemming). Different order reducing transformations are then applied to these vectors. We choose p as 300, i.e. the transformed document vectors have a length of 300 in all cases. The attempted transformations include the following (a short loading sketch follows the list).
- Singular value decomposition
- GloVe (with glove.840B.300d)
- FastText (with crawl-300d-2M-subword.vec)
- Word2Vec (with GoogleNews-vectors-negative300.bin)
- Custom vectors with FastText (using Gensim)
- Custom vectors with Word2Vec (using Gensim)
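As a rough sketch (not the repo’s exact code), the p = 300 representations could be obtained along these lines. Here `X` is the tf-idf matrix and `docs` the raw texts from the earlier sketch, and the GloVe filename is hypothetical – the published glove.840B.300d file would first need converting to word2vec format (e.g. with gensim’s glove2word2vec script).

```python
from sklearn.decomposition import TruncatedSVD
from gensim.models import KeyedVectors, Word2Vec, FastText

# 1. SVD applied directly to the tf-idf matrix
X_svd = TruncatedSVD(n_components=300).fit_transform(X)

# 2-4. Pre-trained embeddings (GloVe after conversion to word2vec format)
glove   = KeyedVectors.load_word2vec_format('glove.840B.300d.w2v.txt')
ft_pre  = KeyedVectors.load_word2vec_format('crawl-300d-2M-subword.vec')
w2v_pre = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# 5-6. Custom embeddings trained on the corpus itself with Gensim
tokens     = [doc.split() for doc in docs]
ft_custom  = FastText(tokens, vector_size=300, min_count=2)
w2v_custom = Word2Vec(tokens, vector_size=300, min_count=2)

# The word-vector variants are then folded into p-long document vectors
# via the X @ W projection sketched earlier; X_svd is already p-long.
```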
The document vectors obtained with each transformation are processed as per steps 1-4 to compute the intercluster/intracluster distance ratio B/A. Figure 3A shows this ratio for 20-news document vectors considering all the 20 groups/clusters. Figure 3B is for the movie reviews dataset.
The main takeaway from Figure 3 is that order reducing transformations and in particular the custom word-embeddings seem to yield large values for B/A and hence can help improve clusterability. Time to verify if this indeed is the case.
4. Clustering with K-Means
Now we get down to the actual clustering task outlined in Figure 1 in order to verify what Figure 3 is indicating. We will employ the same list of transformations and check how pure the obtained clusters are. That is, in any obtained cluster we hope to find articles of a single group/class. Our position is that those transformations with larger values for B/A should yield purer clusters.
We give the known number of clusters to the K-Means simulation in every case. Given that we are helping K-Means here by supplying the exact K, we hope that it would at least be able to separate out the articles by group and place them in distinct clusters. But unfortunately that does not always happen, and in some cases the articles get lopsidedly bunched up in just one or two clusters. That is a failed clustering exercise from our viewpoint.
If the clustering task is successful at all, we should end up with clusters each having a dominant fraction of articles from just one of the groups, so we can identify that cluster with that group.
The ‘purity’ of an obtained cluster is then measured by the number of articles that do not belong in that cluster/group.
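A minimal sketch of this purity measure, assuming arrays `true` (the known groups) and `pred` (the K-Means cluster assignments):

```python
import numpy as np

def misplaced_counts(true, pred):
    # Identify each cluster with its dominant group and count the articles
    # in it that belong to some other group
    counts = {}
    for c in np.unique(pred):
        members = true[pred == c]
        groups, sizes = np.unique(members, return_counts=True)
        counts[c] = (groups[sizes.argmax()], int(len(members) - sizes.max()))
    return counts   # {cluster: (identified group, misplaced article count)}
```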
4.1 Twenty-news articles
In the case of the 20-news dataset we pick articles from 3 groups for this exercise – just so we can plot the results easily. Note however that the results in Figure 3A are obtained by considering all 20 groups/clusters. A word-cloud plot gives a bird’s-eye view of the content we are dealing with.
Figure 5A below shows that the fewest articles were misplaced overall when using custom word-embeddings, whereas the raw document vectors fared the worst. Document vectors with custom word2vec embeddings yield the purest clusters, each identifying with a specific group. The performance of the different transformations matches well with what Figure 3A has indicated.
4.2 Movie reviews
The movie-reviews dataset has only 2 classes, so it is easy enough to plot, and we consider all the documents. Here is a word-cloud of the two groups once again.
Figure 7A identifies only two successful transformations, and they coincide with the two that have the largest B/A ratio in Figure 3B. Clearly, as the word-clouds in Figure 6 have indicated, the movie reviews are more difficult to cluster. Overall, compared to 20-news, we can make the following observations.
- Only 2 out of 7 (compared to 6 out of 7) attempted transformations succeeded in the clustering task. The others were quite lopsided, with just one of the clusters ending up with most of the documents.
- Even in these two successful transformations, the purity/quality of the obtained clusters is poorer, with over 30% of the documents misclustered.
5. Conclusions
With that we wrap up another post. We have shown here that order reducing transformations can help with clustering documents. Among those transformations, the custom word-embeddings seem to have an edge – hence they made their way into the title. Why custom word-embeddings have done better than the other order reducing transformations, and whether that conclusion has legs beyond the text repositories studied here, are questions open to further study.