Querying with high-frequency terms improves recall; querying with rare terms improves precision. Significant terms balance both while offering some discriminative capacity among the latent classes the retrieved documents may belong to. The MIMIC-III dataset is studied here in the context of predicting patient readmission from discharge notes, with Elasticsearch driving the significance measures…
Knowing which patients are likely to have repeat Critical Care Unit (CCU) admissions is useful. The patient and the support team can be extra vigilant and risk-averse to minimize such an outcome. Any CCU admission generates a good amount of data on the patient’s vitals, diagnosis, treatment offered, lab results etc. A text summary of all that is generally captured in the discharge notes, in addition to the prognosis and future care plan. Can we use all this information to help with our prediction task?
A variety of traditional classifiers (e.g. Logistic Regression) and the newer, fancier BERT-based models (e.g. ClinicalBERT) have been used for the same task in the past with some success. So no claim is being made here for groundbreaking work – but we do approach the task a bit differently, and of course we compare how it does against the others to see what the benefits may be. Here is an outline of the post.
- Go over Significant Terms and see why they could be useful
- Briefly describe the publicly available MIMIC-III dataset, specifically the CCU admissions and discharge notes thereof that we are interested in
- Build an Elastic search index where each document is a discharge note with a label indicating whether the same patient was readmitted (readmit = 1) or not (readmit = 0) within 30 days after that discharge
- Extract the significant terms for each of the two classes
- Use these significant terms to score a test set of discharge notes and see how well we do. Due to length concerns we will take up this item in the next post.
We will go over some code snippets here for turning the MIMIC-III csv data into an Elasticsearch index, but the MIMIC-III dataset itself needs to be obtained separately.
1. Significant Terms
In any text classification task, it is very useful to know which words may be relatively more prevalent in one class as opposed to all the other classes. These terms can potentially be used as the defining characteristic of the class, thereby providing a hook for classification. These are the significant terms. They need not be either the most frequent terms or the rarest terms in that class; in most cases they are not. To the extent that such terms exist, we can define them for our purpose here as:
the terms that show a significant change in frequency measured between all discharge notes vs the discharge notes specific to a class
(paraphrased from elastic.co)
Searching by rare terms gives good precision, but recall will be poor. High-frequency terms cast a wide net and improve recall, but precision suffers. Significant terms attempt to strike a balance between the two. When the classes are built from essentially the same vocabulary (like the patient discharge notes here), the significant terms can be better overall at identifying the matching documents, given the precision/recall tradeoff.
Please see Yang and Pedersen, “A Comparative Study on Feature Selection in Text Categorization”, for a detailed background, analysis and methods for extracting these significant terms from a text corpus spanning multiple classes. Elasticsearch provides an out-of-the-box implementation of these approaches (and some others) on indexed text. In its default implementation we get the significant terms ordered by the so-called jlh score, computed for each term as follows.
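Per the Elasticsearch documentation, the jlh score of a term is the product of the absolute and the relative change in its document frequency between the background set and the foreground set:

\[
\text{jlh} \;=\; \left( \frac{n_{fg}}{N_{fg}} - \frac{n_{bg}}{N_{bg}} \right) \times \frac{n_{fg}/N_{fg}}{n_{bg}/N_{bg}} \qquad \text{(Equation 1)}
\]

where n_fg is the number of foreground documents containing the term, N_fg is the size of the foreground set, and n_bg, N_bg are the corresponding counts over the background set. The absolute change favors common terms while the relative change favors rare ones, so their product strikes the balance discussed above.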
With these terms in hand for each class we can potentially score a discharge note for class probability and determine which class that note should fall into. That is the basic idea.
2. MIMIC-III Dataset
The abstract from the paper states:
MIMIC-III (‘Medical Information Mart for Intensive Care’) is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.
Johnson et al., 2016
The data is made available as a set of csv files. For our purposes we are interested in “ADMISSIONS.CSV” and “NOTEEVENTS.CSV”. The ADMISSIONS file allows us to figure out whether a particular admission event is actually a readmission event for the same patient (within a certain period of course, taken to be 30 days here). The useful columns of “ADMISSIONS.CSV” for our purposes are:

```
"ROW_ID", "SUBJECT_ID", "HADM_ID", "ADMITTIME", "DISCHTIME", "DEATHTIME", "ADMISSION_TYPE", ..., "HOSPITAL_EXPIRE_FLAG", ...
```
A few things to note while processing the ADMISSIONS file (sketched in code after this list) are:
- DEATHTIME being non-null or HOSPITAL_EXPIRE_FLAG equal to 1 indicates that the admission ended in the patient’s death. ADMISSION_TYPE can be one of [“EMERGENCY”, “ELECTIVE”, “URGENT”, “NEWBORN”]
- If an admission event ends in the death of the patient, that patient will naturally never have a later readmission event. So such admission events should NOT be counted as no-readmission events.
- If an admission event is ELECTIVE (i.e. planned and initiated by the patient), it is NOT counted as a readmission event even if it falls within 30 days of a previous discharge
- All admission events for NEWBORN are ignored, as newborns are apparently routinely moved in and out of the CCU for a variety of reasons
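Here is a minimal pandas sketch of these rules. It is illustrative only: the post applies the same rules while building the Elasticsearch index, and the variable names here are made up.

```python
import pandas as pd

adm = pd.read_csv('ADMISSIONS.csv', parse_dates=['ADMITTIME', 'DISCHTIME', 'DEATHTIME'])

adm = adm[adm['ADMISSION_TYPE'] != 'NEWBORN']    # ignore all NEWBORN admissions
died = (adm['HOSPITAL_EXPIRE_FLAG'] == 1) | adm['DEATHTIME'].notna()
candidates = adm[~died]    # admissions ending in death cannot be no-readmission events
# ELECTIVE admissions stay in: they never get readmit = 1 themselves,
# but their discharge times anchor later readmission checks
```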
The NOTEEVENTS file has just a few columns we are interested in:

```
"ROW_ID", "SUBJECT_ID", "HADM_ID", ..., "CATEGORY", ..., "TEXT"
```
- The “TEXT” column is the one we use when that row’s “CATEGORY” column says “Discharge summary”.
- SUBJECT_ID and HADM_ID allow us to join a row with an admission event in ADMISSIONS.
- While there is at least one row in NOTEEVENTS for each row in ADMISSIONS, that row may not have ‘Discharge summary’ as its CATEGORY. Those admission events from the ADMISSIONS file naturally will not be part of our analysis.
- When there is a discharge summary, there is usually one per admission event. But some admission events do have two discharge summaries for whatever reason. We concatenate them here (see the sketch after this list).
- In a handful of cases, the discharge notes indicate that the patient expired while the ADMISSIONS file for the same event (SUBJECT_ID and HADM_ID combination) does not. Not sure why, but to be safe we do not consider these.
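A hedged pandas sketch of pulling one (possibly concatenated) discharge summary per admission; again illustrative, as the post does this while indexing:

```python
import pandas as pd

notes = pd.read_csv('NOTEEVENTS.csv', usecols=['SUBJECT_ID', 'HADM_ID', 'CATEGORY', 'TEXT'])
summaries = notes[notes['CATEGORY'] == 'Discharge summary']
# concatenate when an admission has more than one discharge summary
summaries = (summaries.groupby(['SUBJECT_ID', 'HADM_ID'])['TEXT']
                      .apply(' '.join)
                      .reset_index())
```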
3. Elasticsearch Index
We index the content in ADMISSIONS and NOTEEVENTS files into a single Elasticsearch index. Each document in the index will
- correspond to an admission event (SUBJECT_ID and HADM_ID combo)
- have the associated discharge summary (concatenated if more than one)
- have the readmit label computed as per the discussion above.
- readmit = 1 This admission is a readmission event where the patient has been readmitted within 30 days of an earlier discharge
- readmit = 0 Not a readmission event
We pick just the columns we need from each of the two files. The data model for the index is straightforward.
3.1 The Model
The data model is essentially one-to-one from column to field, with the right type (keyword/text/date etc.) for the columns we pick from each of the two files, which are programmatically joined. Every row in these files has SUBJECT_ID and HADM_ID, giving us a natural join condition while we enforce one discharge summary per admission (concatenated when the NOTEEVENTS file has more than one). SUBJECT_ID and HADM_ID together form the document ID for our index. And each document gets the computed readmit label – 0 or 1. This sets us up nicely for whatever classifier we want to throw at this data.
The dates in the ADMISSIONS file look like 2196-04-09 12:26:00 (way in the future!), hence the date format in the mapping seen below. The date values may seem weird, but they have been shifted in time from their actual values – PHI & HIPAA at work. Here is our final mapping for the index. The fields TOKENS and CONTENT come from the NOTEEVENTS file.
```json
{
  "properties": {
    "ROW_ID":               { "type": "keyword" },
    "SUBJECT_ID":           { "type": "keyword" },
    "HADM_ID":              { "type": "keyword" },
    "ADMITTIME":            { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
    "DISCHTIME":            { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
    "DEATHTIME":            { "type": "date", "format": "yyyy-MM-dd HH:mm:ss" },
    "ADMISSION_TYPE":       { "type": "keyword" },
    "HOSPITAL_EXPIRE_FLAG": { "type": "integer" },
    "CONTENT":              { "type": "text" },
    "TOKENS":               { "type": "keyword" },
    "readmit":              { "type": "integer" }
  }
}
```
The reason for carrying the array of TOKENS in the index is that we want to control the tokenization closely, as opposed to letting Elasticsearch do it for us. So we tokenize the discharge summary in NOTEEVENTS with NLTK, with optional lemmatization. CONTENT is simply a concatenation of the TOKENS; it is there just so we can see and query the discharge notes as text.
```python
import string
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

nltk_stopw = set(stopwords.words('english'))

def clean(text):
    # no punctuation & starts with a letter & between 3-15 characters in length
    tokens = [word.strip(string.punctuation) for word in
              RegexpTokenizer(r'\b[a-zA-Z][a-zA-Z0-9]{2,14}\b').tokenize(text)]
    tokens = [f.lower() for f in tokens if f and f.lower() not in nltk_stopw]
    tokens = [f for f in tokens if f not in more_stop_words]
    # tokens = [lemmatizer.lemmatize(w) for w in tokens]
    return tokens, ' '.join(tokens).strip()    # TOKENS & CONTENT in the model
```
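For example, on a made-up fragment of a note:

```python
tokens, content = clean('Patient readmitted with acute CHF exacerbation. Lasix 40mg PO daily.')
print(tokens)   # roughly: ['patient', 'readmitted', 'acute', 'chf', 'exacerbation', 'lasix', 'daily']
```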
The discharge notes contain several mask words that replaced PHI in the summaries, as per HIPAA compliance. We have no use for them here, and so remove some of those in addition to the NLTK stopwords.
```python
more_stop_words = ["sig", "day", "name", "date", "namepattern",
                   "first", "last", "firstname", "lastname"]
for i in range(20):
    more_stop_words.append('hospital' + str(i))
    more_stop_words.append('name' + str(i))
    more_stop_words.append('namepattern' + str(i))
```
3.2 The Index
Indexing takes a few minutes to run on my laptop. One thing we need to do is increase the number of aggregation buckets Elasticsearch can build. We update the cluster settings so this number is greater than the number of patients we have.
```
PUT _cluster/settings
{
  "transient": {
    "search.max_buckets": 100000
  }
}
```
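The same setting can be applied from Python; a sketch with the standard client:

```python
client.cluster.put_settings(body={'transient': {'search.max_buckets': 100000}})
```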
First, the ADMISSIONS are all indexed (except the NEWBORN ones) as they are read in from the file, with readmit = 0 as the default. These documents are then updated with the discharge summary (TOKENS and CONTENT) as the NOTEEVENTS file is processed. Most of this is carried out as bulk indexing operations with the following flow. The time.sleep() calls are added to ensure that the indexed documents are retrievable with a search query as the processing proceeds.
```python
print('Reading in admissions...')
bulk(client=client, actions=indexAdmissions(), chunk_size=1000, request_timeout=request_timeout)
time.sleep(10)

print('Patient readmissions within 30 days...')
readmissionIds = findReAdmissions()
bulk(client=client, actions=indexReAdmissions(readmissionIds), chunk_size=1000, request_timeout=request_timeout)

print('Patient death event admissions should not count as no-readmission events...')
removeDeaths()
time.sleep(10)

print('Reading in notes...')
getNotes()
bulk(client=client, actions=indexNotes(), chunk_size=1000, request_timeout=request_timeout)
time.sleep(10)

print("'expired' as a term in the notes is actually a death event... sometimes it is not reflected in ADMISSIONS.csv")
removeExpired()
time.sleep(10)

print('Some admissions had empty discharge summaries... we cannot really use them for our text analysis')
removeEmptyNotes()
```
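The generators handed to bulk() yield standard action dicts. The full code is not shown here, but a minimal sketch of what indexAdmissions() might look like (column handling simplified):

```python
import csv

def indexAdmissions():
    # one bulk 'index' action per non-NEWBORN admission, defaulting readmit to 0
    with open('ADMISSIONS.csv') as f:
        for row in csv.DictReader(f):
            if row['ADMISSION_TYPE'] == 'NEWBORN':
                continue
            doc = {k: row[k] for k in ['ROW_ID', 'SUBJECT_ID', 'HADM_ID', 'ADMITTIME',
                                       'DISCHTIME', 'ADMISSION_TYPE', 'HOSPITAL_EXPIRE_FLAG']}
            doc['readmit'] = 0
            yield {'_op_type': 'index', '_index': index,
                   '_id': row['SUBJECT_ID'] + '_' + row['HADM_ID'], '_source': doc}
```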
The key part of the above is computing the readmit label. The code snippet below queries the index to get the admissions for each patient, ordered by ADMITTIME. An ELECTIVE admission never triggers readmit = 1 on the preceding discharge, but it can lead to a future readmission event, so we still need to process it. We also exclude those admissions that resulted from in-hospital-expiration events. The difference between the current ADMITTIME and the immediately previous DISCHTIME should be less than 30 days to qualify as a readmission.
```python
from datetime import datetime

nDays = 30    # the readmission window

def findReAdmissions():
    readmissionIds = []
    patients = getPatients()
    for patient in patients:
        admissions = getAllAdmissions(patient)    # ordered by ADMITTIME
        prev_admission = admissions[0]
        for admission in admissions[1:]:
            if admission['ADMISSION_TYPE'] != 'ELECTIVE':
                timeDiff = (datetime.strptime(admission['ADMITTIME'], '%Y-%m-%d %H:%M:%S') -
                            datetime.strptime(prev_admission['DISCHTIME'], '%Y-%m-%d %H:%M:%S')).days
                if (timeDiff < nDays) and (prev_admission['HOSPITAL_EXPIRE_FLAG'] != "1"):
                    # the label goes on the earlier admission: its discharge was followed by a readmission
                    readmissionIds.append(prev_admission['SUBJECT_ID'] + '_' + prev_admission['HADM_ID'])
            prev_admission = admission
    return readmissionIds
```
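indexReAdmissions() then turns these IDs into bulk update actions; a sketch:

```python
def indexReAdmissions(readmissionIds):
    # one bulk 'update' action per readmission id, flipping readmit to 1
    for docId in readmissionIds:
        yield {'_op_type': 'update', '_index': index, '_id': docId,
               'doc': {'readmit': 1}}
```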
We have to remember to deal with those in-hospital-expired events that have been labeled with the default readmit = 0. The following snippet, removeDeaths(), updates them so that readmit = -1. They stay in the index, but we filter them out in the analysis.
```python
def removeDeaths():
    body = {"bool": {"must": [{"match": {"HOSPITAL_EXPIRE_FLAG": 1}},
                              {"match": {"readmit": 0}}]}}
    update_by_query = {"script": {"source": "ctx._source.readmit=-1", "lang": "painless"},
                       "query": body}
    client.update_by_query(index=index, body=update_by_query, timeout='2m')
```
The function removeExpired() sets the readmit label to -1 for the handful of cases where the discharge notes have the term “expired” in them, but for whatever reason HOSPITAL_EXPIRE_FLAG was not set to 1 in the data file.
```python
def removeExpired():
    body = {"bool": {"must": {"term": {"CONTENT": "expired"}}}}
    update_by_query = {"script": {"source": "ctx._source.readmit=-1", "lang": "painless"},
                       "query": body}
    client.update_by_query(index=index, body=update_by_query, timeout='2m')
```
Finally, the function removeEmptyNotes() sets the readmit label to -1 for all those admission events without a discharge summary (their notes belong to a different CATEGORY, as discussed in the previous section).
```python
def removeEmptyNotes():
    body = {"bool": {"must_not": {"exists": {"field": "TOKENS"}}}}
    update_by_query = {"script": {"source": "ctx._source.readmit=-1", "lang": "painless"},
                       "query": body}
    client.update_by_query(index=index, body=update_by_query, timeout='2m')
```
I guess we could have combined these three remove functions into one and saved some typing… but they do the job, so let us move on.
3.3 The Data
Running the indexer prints the progression of the distribution of the computed labels on the notes.
The classes are quite unbalanced, with readmissions being only 6.6% of the total 43765 admissions considered (2892 readmit = 1 vs 40873 readmit = 0). We handle this by sub-sampling the larger class in our classification exercise in the next post.
4. Significant Terms
With the clean tokens in the index we are ready to extract the significant terms in each class. First, let us illustrate what we said in Section 1 about the jlh score by actually computing it for one term.
4.1 The jlh score
Our background set is all the discharge notes: 40873 + 2892 = 43765. The foreground set is the notes in one of the two classes, so we get a set of significant terms for each class. The query is simple. For the readmit = 1 class, the following would be the query. The foreground set of notes is from the readmission events, and the background_filter clause includes both readmit = 0 and readmit = 1 (while excluding the readmit = -1 documents that we still have in the index!). Exactly what we want.
```json
{
  "query": { "term": { "readmit": 1 } },
  "aggregations": {
    "driver_words": {
      "significant_terms": {
        "field": "TOKENS",
        "size": 100,
        "background_filter": {
          "terms": { "readmit": [0, 1] }
        }
      }
    }
  },
  "size": 0
}
```
Running the above yields the top 100 significant terms for the class readmit = 1, ordered by the jlh score. Let us look at the top term and confirm that the score jibes with our formula in Equation 1.
```json
"aggregations": {
  "driver_words": {
    "doc_count": 2892,
    "bg_count": 43765,
    "buckets": [
      {
        "key": "failure",
        "doc_count": 1549,
        "score": 0.2392225894630799,
        "bg_count": 16204
      },
      ...
```
The jlh score we compute by hand matches the score in the response.
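Here is the arithmetic in Python, using the numbers from the response above:

```python
# verify the jlh score for the top term "failure" using Equation 1
n_fg, N_fg = 1549, 2892      # readmit=1 notes with "failure" / all readmit=1 notes
n_bg, N_bg = 16204, 43765    # all notes with "failure" / all notes
p_fg, p_bg = n_fg / N_fg, n_bg / N_bg
jlh = (p_fg - p_bg) * (p_fg / p_bg)
print(jlh)                   # 0.23922... matching the "score" in the response
```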
4.2 Frequency vs Significance
It is instructive to look at the top 10 terms for the two classes (readmit = 0 and 1), both frequency-wise and significance-wise, just to drive home the point that raw frequency is not very helpful in characterizing a class…
Even better, for a bird’s-eye view we can look at the word cloud plots for frequent vs significant terms in either class, side by side.
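The word clouds are straightforward to generate; a sketch assuming the wordcloud package, where freqs is a {term: weight} dict built from either the raw counts or the jlh scores:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(background_color='white').generate_from_frequencies(freqs)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
```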
4.3 Rarity vs Significance
One might think that rare terms (by class) would do better than high-frequency terms at identifying the classes. Unfortunately, in any real text corpus the rare terms are plagued by spelling errors, with many such words having a count of 1 (the long tail), thereby yielding no value. For the sake of completeness, here are the rarest terms (with their count frequency) by class in the MIMIC-III dataset.
Almost all of the rare terms seem to be misspelled words. Using a minimum threshold frequency (> 1) to qualify as a rare word may improve things, but your mileage may vary.
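For reference, the rarest terms can be pulled with an ordinary terms aggregation ordered by ascending document count; a sketch (the size is illustrative, and note that ascending count ordering is approximate in Elasticsearch, which also offers a dedicated rare_terms aggregation for this purpose):

```python
rareQuery = {
    'size': 0,
    'query': {'term': {'readmit': 1}},
    'aggregations': {
        'rare_words': {
            'terms': {'field': 'TOKENS', 'size': 10, 'order': {'_count': 'asc'}}
        }
    }
}
response = client.search(index=index, body=rareQuery)
```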
5. Conclusions
In this post we have:
- Described the usefulness of significant terms for information retrieval
- Described the MIMIC-III dataset and indexed it in Elasticsearch for the purpose of predicting a potential future ICU readmission event (or not) using the discharge note from the current admission
- Used Elasticsearch to filter for significant terms in the notes for each class
- Showed that the significant terms seem to better distinguish between the two classes, while the high-frequency terms and the rare terms do not
This lays the groundwork for the subsequent post on how we can utilize these significant terms by class in a predictive effort for a test set of discharge notes.