The words that are significant to a class can be used to improve the precision-recall trade-off in classification. Using the top significant terms as the vocabulary for a classifier yields better results with a much smaller model when predicting MIMIC-III CCU readmissions from discharge notes…
And it is tougher (sorry Yogi!) when the target classes to predict have widely varying supports.
But that happens often with real-world datasets. A case in point is predicting a near-future CCU readmission of a patient from a discharge note. Only a small fraction of patients get readmitted to the CCU within 30 days of a discharge. Our analysis of the MIMIC-III dataset in the previous post showed that over 93% of the patients did not require readmission. That is definitely good news. But if you are in the prediction business you have a harder problem. You want to identify as many as possible of the few cases that will actually be readmitted, out of a sea of cases that will not. And, of course, you want to avoid falsely tagging cases as future readmissions.
What we have is a classic precision-vs-recall problem, familiar to all of us in the information retrieval business. Consider the two extremes (a small numeric sketch follows the list):
- Ignore precision and achieve 100% recall. Simply label all notes as future readmissions. FN = 0
- Ignore recall and achieve 100% precision. Simply label all notes as future no-readmissions. FP = 0
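To make the two extremes concrete, here is a minimal sketch (using a tiny made-up label vector, not the MIMIC-III data) of what they do to precision and recall, computed with scikit-learn:

import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical, heavily skewed truth: 1 readmission among 20 discharges
y_true = np.array([0] * 19 + [1])

y_all_readmit = np.ones_like(y_true)      # extreme 1: label every note as a future readmission
y_no_readmit = np.zeros_like(y_true)      # extreme 2: label every note as a no-readmission

print(precision_score(y_true, y_all_readmit), recall_score(y_true, y_all_readmit))                    # 0.05  1.0
print(precision_score(y_true, y_no_readmit, zero_division=1), recall_score(y_true, y_no_readmit))     # 1.0 (by convention)  0.0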
1. Significant terms
Doing justice to both precision and recall is a challenge and that is where significant terms can help. In the previous post Predicting ICU Readmission from Discharge Notes: Significant Terms we went over the background on significant terms, what they are and how to extract them from a text corpus using Elasticsearch. The gist of that post was the following.
- The most frequent terms in a discharge note showed little difference, whether that note experienced a readmission or not
- The long tail of rare terms in the discharge notes was plagued by typos, yielding little discriminative capacity between the two classes
- The terms that are significant for each class were markedly different, potentially offering a hook to our classification task
So we proceed here with the third item above and look at different ways we can use the significant terms for the classification task. First, let us define our train and test data from the MIMIC-III dataset.
MIMIC-III Data
In the previous post we went over the MIMIC-III dataset and prepared an Elasticsearch index.
Each document is a discharge note with a label readmit that is either 0 (this discharge did not lead to a readmission within 30 days) or 1 (the patient was readmitted to the CCU within 30 days of this discharge). We had 40873 documents with readmit=0 and 2892 documents with readmit=1. This makes it harder to get the predictions right for the minority class readmit=1. We split the data into train and test sets while stratifying on the readmit flag.
from sklearn.model_selection import train_test_split

train_docs, test_docs, train_labels, test_labels, train_ids, test_ids = train_test_split(
    docs, labels, ids, test_size=0.20, random_state=0, stratify=labels)
We end up with the following distribution for train and test. Only about 6.6% of the documents have readmit=1 in either train or test sets. We build models using the training set and predict how well we do for the minority readmit class.
# Of Train / Test : 35012 / 8753
# Of Train NoReadmit / Readmit: 32698 / 2314
# Of Test NoReadmit / Readmit: 8175 / 578
Skewed as it is, we use the entire training set for building the models. Undersampling the majority no-readmit class or oversampling the minority readmit class with SMOTE have their own issues that we want to steer clear of. We do, however, account for the skew later by attaching a higher weight to the minority class when classifying, for example with logistic regression…
The following code snippet supplies the IDs of the training set for each class and obtains the significant terms using Elasticsearch.
def findSignificantTerms (ids, label):
    body = {
        "query": {"bool": {"must": [{"term": {"readmit": label}}, {"ids": {"values": ids}}]}},
        "aggregations": {
            "driver_words": {
                "significant_terms": {
                    "field": "TOKENS", "size": max_sig_terms, "chi_square": {},
                    "background_filter": {"terms": {"readmit": [0, 1]}}
                }
            }
        },
        "size": 0
    }
    response = client.search(index=index, body=body, request_timeout=6000)
    buckets = response['aggregations']['driver_words']['buckets']
    max_score = buckets[0]['score']
    words = [bucket['key'] for bucket in buckets]
    if boosting == 'yes':
        boosts = [bucket['score'] / max_score for bucket in buckets]   # normalized significance score as a boost
    else:
        boosts = [1.0] * len(words)
    return words, boosts
The chi_square heuristic is used above, but there are alternatives such as jlh. The end result is a set of words for each class, with a weight indicating each word's significance for that class.
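Switching to a different heuristic only changes the scoring entry inside the significant_terms aggregation. A minimal sketch, assuming a sig_method knob of our own (not in the original snippet) to pick among the heuristics Elasticsearch offers:

sig_method = "jlh"    # or "chi_square", "gnd", "mutual_information", "percentage"

aggregation = {
    "driver_words": {
        "significant_terms": {
            "field": "TOKENS",
            "size": max_sig_terms,
            sig_method: {},                                        # the chosen significance heuristic
            "background_filter": {"terms": {"readmit": [0, 1]}}
        }
    }
}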
2. Direct Classification with Significant Terms
With the significant terms in hand, we can score a discharge note simply based on the presence of these terms. We get a total count of significant terms for a class that are present in a test discharge note. This is the score for this note and class, and we place the note in the class with the highest score. We can further normalize (linear or softmax) the scores over the labels for each note and treat them as probabilities. That will come in handy to compute metrics such as the areas under precision-recall & ROC curves. Here is a snippet of code that
- uses the significant terms as the vocabulary and the CountVectorizer to turn the test discharge notes into vectors
- computes a score for each note by label, and normalizes them for predictive purposes
def scoreBySigTerms (test_docs, sig_words, n_sig_terms):
    scoresByLabel = {}
    for label in [0, 1]:
        useSigWords = sig_words[label][0:n_sig_terms]
        vectorizer = CountVectorizer(analyzer=lambda x: x, min_df=1, vocabulary=useSigWords)
        test_doc_vectors = vectorizer.transform(test_docs)
        a = np.sum(test_doc_vectors, axis=1)          # total count of this label's significant terms in each note
        b = []
        for i in range(len(a)):
            b.append(a[i, 0])
        scoresByLabel[label] = b
    predicted_labels, probabilities = [], np.zeros((len(test_docs), 2))
    for i, testDoc in enumerate(test_docs):
        docScores = np.array([scoresByLabel[label][i] for label in [0, 1]])
        probabilities[i] = docScores / (np.sum(docScores) + 1.0e-16)   # Linear normalization to 1. Can do softmax...
        predicted_label = np.argmax(probabilities[i])
        predicted_labels.append(predicted_label)
    return predicted_labels, probabilities
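Here is a sketch of how the function above might be driven to produce the numbers behind Figure 1. The sig_words dictionary and test_labels are the structures built earlier; the scikit-learn metric calls are our own addition.

from sklearn.metrics import precision_score, recall_score, average_precision_score

predicted_labels, probabilities = scoreBySigTerms(test_docs, sig_words, n_sig_terms=500)

precision = precision_score(test_labels, predicted_labels)           # fraction of predicted readmits that are true readmits
recall = recall_score(test_labels, predicted_labels)                 # fraction of true readmits that were caught
pr_auc = average_precision_score(test_labels, probabilities[:, 1])   # area under the precision-recall curve
print(precision, recall, pr_auc)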
The number of significant terms considered is varied to see its impact on the results. Figure 1 below shows the precision and recall obtained for predicting the readmission class (readmit = 1).
The key takeaways here are as follows.
- Recall increases and precision decreases as we consider more and more significant terms. Makes sense of course.
- A near-total recall of 99% is obtained with 500-1000 significant terms, with precision at about 7%. Only one discharge note out of the 578 true readmission cases was misclassified as a no-readmission case.
- Both the chi_square and jlh methods for extracting the significant terms yield similar results with a slight edge held by chi_square. We will stick with chi_square for the remainder of this post.
3. BoW with Significant Terms as the Vocabulary
We can of course use any old classifier such as logistic regression on the labeled discharge notes and build a predictive model. CountVectorizer is used to build a vocabulary and document vectors for classification. One could try the TfidfVectorizer, but it muddies things up for our purposes.
The significant terms are extracted from the corpus based on relative count measures across the classes, and each term gets a score as to its importance for the class. Tf-Idf brings in its own weights that are corpus wide and irrespective of class…
The corpus vocabulary in any case is almost always quite large. Even with the purging we employed, we got over 124000 terms from the MIMIC-III text corpus. We can limit that number, of course. In the code snippet below, only the top n_features most frequent words are considered, so the document vectors will be n_features long.
vectorizer = CountVectorizer(analyzer=lambda x: x, min_df=1,
                             max_features=n_features).fit(train_docs + test_docs)
But what if we used only the significant words and NOT all the words? And just the top 500, or even 100, of those at that? We have seen earlier that the significant terms for one class are quite different from those for the other. In fact, for a binary situation there would be zero overlap.
The document vectors built from just the significant terms for the readmit class may be sufficient to provide enough discriminative capacity…
That is the thesis anyway. If it works, we would have a vastly smaller model with precision and recall hopefully comparable to what the full-blown long vectors using the entire vocabulary would give. Providing a custom vocabulary to CountVectorizer is straightforward. We already have the list of significant terms for the readmit class in order of importance. The top n_features words from that list are used as the vocabulary in the snippet below.
vectorizer = CountVectorizer(analyzer=lambda x: x, min_df=1,
                             vocabulary=significant_terms_for_class_readmit_1[0:n_features]).fit(train_docs + test_docs)
Building and running the logistic regression model with either vocabulary is straightforward. We simply supply a different vectorizer to the function below. Also, we make sure to apply class_weight='balanced' so that the dominant no-readmit class gets a lower weight than the smaller readmit class. We do this because we work with the entire training set, which is heavily skewed towards the no-readmit class. It helps level the playing field between the classes by encouraging the classifier to value minority class predictions more.
def runModel (train_docs, train_labels, test_docs, test_labels, vectorizer):
    model = LogisticRegression(tol=1.0e-6, random_state=0, max_iter=20000, class_weight='balanced')
    train_X = vectorizer.transform(train_docs)
    test_X = vectorizer.transform(test_docs)
    model.fit(train_X, train_labels)
    predicted_labels = model.predict(test_X)
    scores = model.predict_proba(test_X)
    return findMetrics(scores, predicted_labels, test_labels)
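A sketch of how a run with each vocabulary could be wired up, reusing the two vectorizers from the snippets above; n_features, the document lists, and significant_terms_for_class_readmit_1 are assumed to be in scope.

from sklearn.feature_extraction.text import CountVectorizer

n_features = 100

# Vocabulary 1: the n_features most frequent words in the corpus
freq_vectorizer = CountVectorizer(analyzer=lambda x: x, min_df=1,
                                  max_features=n_features).fit(train_docs + test_docs)

# Vocabulary 2: the top n_features significant terms for the readmit class
sig_vectorizer = CountVectorizer(analyzer=lambda x: x, min_df=1,
                                 vocabulary=significant_terms_for_class_readmit_1[0:n_features]).fit(train_docs + test_docs)

for name, vectorizer in [("high frequency words", freq_vectorizer), ("significant terms", sig_vectorizer)]:
    metrics = runModel(train_docs, train_labels, test_docs, test_labels, vectorizer)
    print(name, metrics)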
Figure 2 shows the results for recall and precision with these vocabularies as a function of the number of words used.
Certainly an interesting graphic.
- When you limit the number of features, the CountVectorizer picks the top high frequency words. That favors recall over precision as we know. Increasing the number of words used improves the recall – up to a point! When all the words are used, recall drops precipitously with a small increase in precision. Makes sense.
- More even and consistent results are obtained when significant terms define the vocabulary. More interestingly, it looks like we could use just the top 25 significant words for the readmit class in conjunction with CountVectorizer and logistic regression to get both higher precision and recall.
4. Precision-Recall and ROC Areas
The predictions for precision and recall in Figures 1 and 2 are based on a threshold probability of 0.5. That is, when the note’s normalized score (that sums to 1.0 over the two labels) for the readmit class is greater than 0.5, then the prediction for that note is readmission. There is of course some freedom in choosing that threshold.
- If you want to be certain (high precision!) that you are making the right decision then you want this threshold probability to be high
- If you cannot afford to miss (high recall!) a potential readmission possibility then you want this threshold probability to be low.
Once again, it bears repeating that the predictions in the earlier figures are based on the neutral threshold of 0.5. Choose a low enough threshold probability and any classifier can obtain 99% recall.
But the 99% recall using top significant terms in Figure 1 was obtained with the neutral threshold of 0.5.
One way to evaluate a classifier is to see how sensitive it is to this threshold probability. This is conveniently captured by the precision-recall and ROC curves and the areas they enclose. Figure 3 shows the precision-recall and ROC curves when using the top 100 terms (significant terms or corpus vocabulary).
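As a sketch of how those curves and areas can be computed: scores below is the predict_proba output obtained inside runModel, and the scikit-learn calls are our own assumption about the plumbing.

from sklearn.metrics import precision_recall_curve, roc_curve, auc

readmit_probs = scores[:, 1]        # predicted probability of readmit = 1 for each test note

precision, recall, pr_thresholds = precision_recall_curve(test_labels, readmit_probs)
fpr, tpr, roc_thresholds = roc_curve(test_labels, readmit_probs)

pr_area = auc(recall, precision)    # area under the precision-recall curve
roc_area = auc(fpr, tpr)            # area under the ROC curve
print(pr_area, roc_area)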
Figure 4 shows the area under curve (AUC) for precision-recall & ROC curves as a function of the vocabulary size employed by the different methods we have tried in this post.
Using logistic regression with the significant terms as the vocabulary does better than using the high frequency words. It is also better than the simple count based classifier – at least as far as the precision-recall area goes.
But we know that the count-based method was better when recall of the minority class is paramount. That may well be the case for predicting something like a CCU readmission event. It is your call as to how much you value recall!
5. Conclusions
Getting the predictions right for the minority class is tough. Here are some loose conclusions from this series of articles.
- Identifying the words that are special/significant for a class is useful in the precision-recall dance.
- These words serve as a window into the nature of the content for that class, helping explain why a document has been labeled as such by the classifier (Explainable AI?)
- Using the significant words as the vocabulary to build document vectors seems to hold some promise for optimizing precision and recall. Besides, you only need a few of them, resulting in much smaller models.
- Tf-Idf vectorization in conjunction with significant terms as the vocabulary needs further analysis as to what it exactly does.
Very interesting read. A couple of suggestions:
1) Did you build the index using shingles? Two-word shingles convey more information than single words.
2) Take a look at the “significant text” aggregation. It uses the significant terms algorithms but with better data collection for text, e.g. it can trim duplicate passages that might otherwise skew the stats.
Thank you for your comments, Mark.
You are right about shingles. A colleague of mine suggested we try the same as well. The hope for improved predictive power with shingles is clear. An abundance of a phrase such as “high blood pressure” carries much more meaning and weight for the classification of a discharge note in a clinical context, as opposed to an abundance of the individual terms “blood”, “pressure”, etc., which could be equally abundant no matter which class a discharge note belongs to. Having said that, I have not tried it – yet. It is on the list of things to do.
As for “significant text” – yes, I have tried it briefly. I do not think the MIMIC-III text corpus has duplicate text passages; I believe it has been carefully prepared by researchers in this area. I was also facing performance issues on my laptop, given the size of the text and the on-the-fly tokenization that aggregation requires.