Category Archives: Elasticsearch

Have Unbalanced Classes? Try Significant Terms

The words that are significant to a class can be used to improve the precision-recall trade-off in classification. Using the top significant terms as the vocabulary to drive a classifier yields improved results with a much smaller model for predicting MIMIC-III CCU readmissions from discharge notes.

Predicting ICU Readmission from Discharge Notes: Significant Terms

Querying with high-frequency terms improves recall, while querying with rare terms improves precision. The significant terms balance both while offering some discriminative capacity among the latent classes the retrieved documents may belong to. The MIMIC-III dataset is studied here in the context of predicting patient readmission from the discharge notes, with Elasticsearch driving the significance measures…
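As a rough illustration of how Elasticsearch can supply such significance measures, the sketch below builds the request body for a `significant_terms` aggregation, with one class as the foreground set. The field names `readmitted` and `note_tokens` are hypothetical and not taken from the post; `note_tokens` is assumed to be a keyword-type field (for analyzed text, the related `significant_text` aggregation may be the better fit). No cluster is contacted here; the body would be passed to a search call.

```python
import json


def significant_terms_request(label_field, label_value, terms_field, size=50):
    """Build a search body asking for terms over-represented in one class."""
    return {
        "size": 0,  # only the aggregation is needed, not the matching hits
        # Foreground set: documents belonging to the class of interest.
        "query": {"term": {label_field: label_value}},
        "aggs": {
            "class_vocabulary": {
                # Terms scored by how unusual their frequency is in the
                # foreground set compared with the whole index.
                "significant_terms": {"field": terms_field, "size": size}
            }
        },
    }


# Hypothetical field names, for illustration only.
body = significant_terms_request("readmitted", True, "note_tokens", size=100)
print(json.dumps(body, indent=2))
```

The terms returned under `class_vocabulary` could then serve as a reduced vocabulary for a downstream classifier.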

Semantics at Scale: BERT + Elasticsearch

Semantic search at scale is made possible with the advent of tools like BERT, bert-as-service, and of course support for dense vector manipulations in Elasticsearch. While the degree may vary depending on the use case, the search results can certainly benefit from augmenting the keyword-based results with the semantic ones…
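One way to combine keyword and semantic matching is a `script_score` query that rescores keyword hits by cosine similarity against a stored dense vector. The sketch below only builds the request body. The field names `note_embedding` and `text` are hypothetical, the query vector is assumed to come from an encoder such as BERT, and the `cosineSimilarity(params.query_vector, 'field')` script form is assumed to be available (Elasticsearch 7.6 or later).

```python
def semantic_query(query_vector, vector_field="note_embedding", keyword_query=None):
    """Build a script_score body that rescores hits by cosine similarity."""
    # Fall back to scoring every document when no keyword filter is given.
    base = keyword_query if keyword_query is not None else {"match_all": {}}
    return {
        "query": {
            "script_score": {
                "query": base,
                "script": {
                    # Adding 1.0 keeps the score non-negative, as Elasticsearch requires.
                    "source": f"cosineSimilarity(params.query_vector, '{vector_field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        }
    }


# A toy 3-dimensional vector; a real sentence embedding would be much longer.
body = semantic_query([0.1, 0.2, 0.3], keyword_query={"match": {"text": "readmission"}})
```

The resulting `body` would be sent as the search request against an index whose mapping declares `note_embedding` as a `dense_vector` field of matching dimension.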

Word Embeddings and Document Vectors: Part 2. Classification

In the previous post Word Embeddings and Document Vectors: Part 1. Similarity we laid the groundwork for using bag-of-words based document vectors in conjunction with word embeddings (pre-trained or custom-trained) for computing document similarity, as a precursor to classification. It seemed that document+word vectors were better at picking up on similarities…

Kafka Streams – Catching Data in the Act. 3: The Mechanics.

In the previous post we designed the experiment, simulated different operational states and confirmed that the results were as expected – more or less. Here we go over the implementation and a few relevant code snippets before wrapping up this series of posts. As usual the package is available for download…

Kafka Streams – Catching Data in the Act. 2: Steady and Unsteady States

I was on vacation with my son at Yosemite over the spring break this past weekend. The early part of the trip was washed out by rain: they closed the park and we were cooped up in the lodge waiting it out. But we had a patio view of…

ELK Clusters on AWS with Ansible

In the previous post we built a virtual ELK cluster with Vagrant and Ansible, where the individual VMs comprising the cluster were carved out of a single host. While that allowed for self-contained development & testing of all the necessary artifacts, it is not a real-world scenario…

ELK Stack with Vagrant and Ansible

It has been a while unfortunately since I sat down for some writing on this blog. But the writing bug is persistent – once hooked, you got to write. Serious writing requires original research, data collection/analysis etc… so can take a good bit of time depending on the topic. I had…

Virtual Clusters with Vagrant & Virtualbox

We take a break from the H-1B analysis and set the stage here for future posts that require us to work in environments with distributed compute & storage. A simple way to simulate them is with Virtualbox as the provider of VMs (‘Virtual Machines’) & Vagrant as the front-end…