Semantic search at scale is made possible with the advent of tools like BERT, bert-as-service, and of course support for dense vector manipulations in Elasticsearch. While the degree may vary by use case, search results can certainly benefit from augmenting the keyword based results with semantic ones…
Keyword based search across text repositories is a known art. The Lucene library and tools like Elasticsearch excel at lightning fast retrieval of matching documents for a given query. The search results are driven by terms/tokens and the tf-idf metrics around them. Generally speaking, documents that do not share any terms with the query will not be part of the result set. That is by design in keyword based search, but it clearly excludes a lot of otherwise relevant documents simply because they share no keywords with the query. Careful use of synonyms and stemming can help increase the recall. But since when has a synonym meant exact equality? We may think sunny is a synonym of bright. There is a bright moon, but never a sunny moon. At least not on our planet! Plus, a rare synonym applied to expand the query can push an otherwise poor result to the top. And stemming? Let us not even talk about it: a runner is not the same thing as a run, even though both reduce to the same stem! Clearly, all these machinations around keywords cannot really address the semantics in text.
Approaches such as Latent Semantic Analysis (LSA) have been used in the past to include semantically related documents in the search results. But the application of Singular Value Decomposition (SVD) on the term-document matrix built from millions of documents distributed on a cluster of nodes is non-trivial. Semantic search based on SVD at the scale and throughput Elasticsearch deals with is impractical. So where does that leave us if we want to enable semantics with Elasticsearch?
The recently popular fixed size numerical vectors for words can help. Word embeddings, combined with say a bag-of-words approach, can turn a sentence or a document into a short dense numerical vector. We have gone over this at length in previous articles. Embeddings obtained from language models like BERT are also context sensitive, unlike one-hot word vectors or fastText embeddings. That is, we get different sentence vectors with BERT for "eat to live" vs "live to eat", allowing us to distinguish between them. The key to enabling semantic search at scale is then in integrating these vectors with Elasticsearch.
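As a quick illustration, here is a minimal sketch (assuming a bert-as-service instance is running with the uncased base model) that encodes the two phrases and compares their vectors; a cosine similarity below 1.0 confirms the two phrases get different vectors:

from bert_serving.client import BertClient
import numpy as np

bc = BertClient()

# encode the two phrases; each comes back as a 768-dim vector for the base model
v1, v2 = bc.encode(["eat to live", "live to eat"])

# cosine similarity of the two sentence vectors; a value below 1.0 means
# the two phrases indeed map to different vectors
cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(round(float(cosine), 3))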
Fortunately, the current versions (7.3+) of Elasticsearch support a dense_vector field along with relevancy metrics such as cosine similarity and Euclidean distance that can be computed over it via a script_score. Exactly what we need: we can rank the documents in the index by these metrics computed between their dense vectors and the dense vector representation of the query. The lightning fast speed of Elasticsearch, applied to millions of dense vectors distributed across a cluster of nodes. That is basically the gist of this post. Let us get with it.
1. BERT as a Broker
The architecture could not be simpler. The pieces are all there in open source and all we have to do is to put them together. We use bert-as-service to get dense vector representations of the documents and the queries. The indexing and search requests are brokered through the BERT server that generates the dense vector for the supplied document or query text.
Here is a simple configuration that defines an index with a sentence (a short quote in our case) and its numerical vector as the only fields. The vector is 768 dimensions long, as per the uncased base BERT model (uncased_L-12_H-768_A-12).
{ "settings": { "index": { "number_of_shards": "1", "number_of_replicas": "0" } }, "mappings" : { "properties": { "quote" : { "type": "text" }, "vector" : { "type": "dense_vector", "dims" : 768 } } } } } |
The quotes are read from a file, the dense vector is computed by calling bert-as-service, and indexed into Elasticsearch in bulk.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import numpy as np
from bert_serving.client import BertClient

bc = BertClient()
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

def getQuotes():
    f = open('./quotes.txt', 'r')
    for line in f:
        quote = line.strip().lower()
        if (len(quote.split()) < 510):    # 510 is the max
            vector = bc.encode([quote])[0].tolist()
            yield {
                "_index": 'quotes',
                "quote": quote,
                "vector": vector
            }

bulk(client=es, actions=getQuotes(), chunk_size=1000, request_timeout=120)
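Once the bulk load completes, a quick sanity check along these lines (a sketch, reusing the es client and the quotes index from above) confirms the documents made it in:

# refresh so the just-indexed documents are searchable, then count them
es.indices.refresh(index='quotes')
print(es.count(index='quotes')['count'])

# pull back one document to eyeball the indexed fields
sample = es.search(index='quotes', body={"size": 1, "_source": ["quote"]})
print(sample['hits']['hits'][0]['_source']['quote'])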
2. Relevancy Scoring with Dense Vectors
Elasticsearch employs Lucene’s practical scoring function for traditional keyword based search. It is not applicable here as we work with numerical vectors. We can override the default with any custom scoring function around the dense vectors, but for efficiency reasons it is better to use Elasticsearch’s predefined functions such as cosine similarity or the L1 and L2 norms. The relevancy order of the search results will certainly vary somewhat based on which metric is used. It is not clear that one of them is always better than the others, but cosine similarity seemed to do fine in my tests. Let us look at a quick example. Consider the following three sentences in a file sentences.txt.
- children playing in park
- kids running on the grass
- traffic is bad today
Clearly the first two sentences are similar. And they are both dissimilar to the third. We can readily compute the vectors for each sentence and compute different metrics. Running bert_sentence_similarity.py with:
pipenv run python ./bert_sentence_similarity.py sentences.txt |
we get:
1 & 2 1 & 2 2 & 3 Cosine Similarity: 0.852 0.677 0.69 Inv. Manhattan Distance (L1): 0.083 0.056 0.058 Inv. Euclidean Distance (L2): 1.839 1.243 1.27 |
All the metrics did the right thing here by yielding the highest score for the 1-2 pair. For the remainder of the post we will stick with cosine similarity of the BERT query & sentence dense vectors as the relevancy score to use with Elasticsearch. The order of the top hits varies some if we choose L1 or L2 but our task here is to compare BERT powered search against the traditional keyword based search.
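The script itself is not listed here, but a minimal sketch of how these numbers can be computed is below. It assumes bert-as-service is running; the exact definitions of the inverse distances (simple reciprocals of the L1 and L2 distances) are an assumption:

# bert_sentence_similarity.py (sketch)
import sys
import itertools
import numpy as np
from bert_serving.client import BertClient

bc = BertClient()

# read the sentences, one per line, from the file given on the command line
with open(sys.argv[1]) as f:
    sentences = [line.strip().lower() for line in f if line.strip()]

vectors = bc.encode(sentences)

# pairwise metrics over the sentence vectors
for (i, a), (j, b) in itertools.combinations(enumerate(vectors), 2):
    cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    inv_l1 = 1.0 / np.sum(np.abs(a - b))      # inverse Manhattan distance
    inv_l2 = 1.0 / np.linalg.norm(a - b)      # inverse Euclidean distance
    print(i + 1, '&', j + 1, round(float(cosine), 3), round(float(inv_l1), 3), round(float(inv_l2), 3))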
3. An Index of Quotes
To test how well this scheme is going to work for us, we prepare a rather large index made up of quotes from various people. The index is similar to the one in Section 1. The quotes are no more than 50 words in length. The index has several thousand quotes, with some near duplicates for sure. Here is a sample:
Once the game is over, the king and the pawn go back in the same box
He uses statistics as a drunken man uses lamp posts for support rather than for illumination
You should always go to other people’s funerals; otherwise they won’t go to yours.
Intelligent life on other planets? I’m not even sure there is on earth!
We know that many different quotes convey similar meanings. We query the index with a quote (or a paraphrased version of it) and examine the quality of the top results. We want the top hits to be the ones most similar to the query quote, as we understand them. We can do this directly with the More Like This (MLT) query that Elasticsearch offers, and of course with cosine similarity on our BERT derived vectors as shown in Figure 1. The task is to evaluate whether the BERT vectors have meaningfully enhanced the quality of the results.
Querying Elasticsearch
For the MLT query we override some defaults so as not to exclude any terms in the query or the documents. For the script_score query we use for semantic search, we get the dense query vector from bert-as-service. We start it up with:
bert-serving-start -model_dir $BERT_BASE_DIR -max_seq_len=52 -num_worker=1 |
where BERT_BASE_DIR points to the directory where uncased_L-12_H-768_A-12 resides on disk. Here is a snippet of code to query the same index in these two different ways.
from bert_serving.client import BertClient
bc = BertClient()

from elasticsearch import Elasticsearch
client = Elasticsearch([{'host': 'localhost', 'port': 9200}])

def findRelevantHits(inQuiry):
    inQuiry_vector = bc.encode([inQuiry])[0].tolist()
    queries = {
        'bert': {
            "script_score": {
                "query": { "match_all": {} },
                "script": {
                    "source": "cosineSimilarity(params.inQuiry_vector, doc['vector']) + 1.0",
                    "params": { "inQuiry_vector": inQuiry_vector }
                }
            }
        },
        'mlt': {
            "more_like_this": {
                "fields": ["quote"],
                "like": inQuiry,
                "min_term_freq": 1,
                "max_query_terms": 50,
                "min_doc_freq": 1
            }
        }
    }
    result = {'bert': [], 'mlt': []}
    for metric, query in queries.items():
        body = { "query": query, "size": 10, "_source": ["quote"] }
        response = client.search(index='quotes', body=body, request_timeout=120)
        result[metric] = [a['_source']['quote'] for a in response['hits']['hits']]
    return result

inQuiry = "Most folks are about as happy as they make up their minds to be"
result = findRelevantHits(inQuiry.strip().lower())
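The two lists returned by findRelevantHits can then be dumped for the side-by-side comparisons in the next section. A small usage sketch (the formatting here is just for illustration):

# print the two top-10 lists one after the other for comparison
for metric in ['bert', 'mlt']:
    print('\n--- top hits ({}) ---'.format(metric))
    for rank, quote in enumerate(result[metric], start=1):
        print(rank, quote)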
4. Results
With the apparatus ready, all that is left to do is to run some sample queries/quotes through and see if BERT powered Elasticsearch is able to return more meaningful results than those based solely on keyword abundance. We pick a few quotes and compare the top 10 BERT & MLT results side-by-side. Each result is scored based on the quality of the match (red 1, blue 0.5, default 0), subjective for sure. Let us start with the first one.
Holding on to anger is like grasping a hot coal with the intent of throwing it at someone else – you are the one who gets burned
Buddha
The top hit for MLT is totally irrelevant, as it has been hijacked by the term “someone”. Its 8th hit was misled by “coal”, a potentially rare term in a repository of quotes. But MLT does get its 6th hit, which BERT missed. Overall we see that BERT is able to pull out quotes that mean the same as the query while using different words than the query (see 4, 6 and 7). Let us look at another one.
Most folks are about as happy as they make up their minds to be
abraham lincoln
The problem for MLT in this case is, unfortunately, the choice of words used in the query. The top hit for BERT nails it, but MLT missed it because it was looking for the terms “minds” or “make” but saw “mind” and “makes”. MLT’s own top hit got totally taken in by the phrase “make up their minds”, a complete match with the query. Applying a stemmer may have helped MLT catch BERT’s top hit. For its second hit, MLT may have gotten side-tracked by a possibly rare word like “folks” and, of all things, the phrase “make up”, as this quote is certainly not about dressing up!
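For instance, the built-in english analyzer of Elasticsearch stems terms, so “minds”/“mind” and “makes”/“make” would reduce to the same stems at index and query time. A hypothetical mapping tweak along these lines (the index name quotes_stemmed is made up; this was not used in the tests above):

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# hypothetical alternative index where the quote field is stemmed
stemmed_body = {
    "mappings": {
        "properties": {
            "quote": {"type": "text", "analyzer": "english"},
            "vector": {"type": "dense_vector", "dims": 768}
        }
    }
}
es.indices.create(index='quotes_stemmed', body=stemmed_body)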
But based on the above two examples, it would be premature to think that BERT has dealt a death blow to traditional keyword based search. Take this final example.
fortune sides with him who dares
virgil
The word “dares” is likely rare in the repo, and MLT does very well by sticking to it, and perhaps to the overall phrase “who dares”. BERT, on the other hand, thinks (logically, we might add) that “dares” is related to passion, disaster, etc., and finds very different matches. That is generally the problem with overthinking, someone might say; an unkind cut against BERT.
5. Conclusions
We have shown that we can obtain semantic search results at scale with Elasticsearch. This is made possible with the advent of tools like BERT, bert-as-service, and of course the support for dense vector manipulations in Elasticsearch. The quality of the semantic search results will depend on the nature of the documents in the index and whether semantics are important in those documents. Semantics is important in most free flowing text and speech, unless perhaps you are talking equations! That takes us to this final quote before we close.
One reason math texts are so abstruse and technical is because of all the specifications and conditions that have to be put on theorems to keep them out of crevasses. In this sense, they’re like legal documents, and often about as much fun to read.
david foster wallace
So there may be some dry text out there with no semantics or a play on words, but few would ever want to read it, much less search for it. So there we go: any search worthy text will have semantically related documents.
To summarize:
- We have seen that Elasticsearch, augmented with BERT vectors, can pull out semantically related results from a document repo
- Sticking to the actual keywords is sometimes the right thing to do instead of trying to find relations as BERT does with its vectors. Keyword based search has the upper hand in such cases.
- With the keyword based results, we can readily explain why a result rose to the top. There is even an explain api to dig through the details of how the relevance score was obtained (see the sketch after this list). This is a clear plus for the keyword based approach.
- With BERT driven semantic results, there is little insight into why a result came out on top or did not make it. It is the dense vectors that determine the relevancy score. Who knows what all went into coming up with those dense vectors… besides, BERT base has 110 million parameters, for crying out loud!
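Here is a small sketch of using the explain api with the MLT query from earlier: setting explain to true in the search body makes Elasticsearch return a term-by-term breakdown of each hit’s relevance score (query text reused from the last example):

from elasticsearch import Elasticsearch

client = Elasticsearch([{'host': 'localhost', 'port': 9200}])

body = {
    "explain": True,        # ask for a scoring breakdown for every hit
    "size": 3,
    "query": {
        "more_like_this": {
            "fields": ["quote"],
            "like": "fortune sides with him who dares",
            "min_term_freq": 1,
            "min_doc_freq": 1
        }
    }
}

response = client.search(index='quotes', body=body)
for hit in response['hits']['hits']:
    print(hit['_source']['quote'])
    print(hit['_explanation'])    # how the tf-idf/BM25 score was put together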
So are we ready to replace the traditional keyword based results with these semantic ones? No. But there is perhaps no harm in augmenting them and letting the end users vote!