The WordCloud module generates the image, the Elasticsearch module talks to the index to run queries and fetch results, and Matplotlib manages the layout.
from elasticsearch import Elasticsearch
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
# Font size shared by every plot title (a module-level name needs no 'global' statement)
titleFontSize = 18

def plotWordCloud(word_freqs, title):
    # Render the word frequencies as a word cloud under the given title
    fig = plt.figure(figsize=(10, 10), dpi=720)
    wordcloud = WordCloud(max_font_size=40, relative_scaling=1.0).fit_words(word_freqs)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(title, fontsize=titleFontSize)
    plt.show()
Connect to the Elasticsearch server and get a handle 'es'.
es = Elasticsearch([{'host':'localhost','port':9210}])
The query 'q_cities' below instructs Elasticsearch to return the H-1B employer cities in decreasing order of their counts.
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [ { "term": { "APPROVAL_STATUS_S": "certified" } } ]
        }
      }
    }
  },
  "aggs": {
    "CITY": {
      "terms": { "field": "CITY_S", "size": 0, "order": { "_count" : "desc" } }
    }
  }
}
q_cities = '{"size" : 0, "query": { "filtered": { "filter": { "bool": { "must": [ { "term": { "APPROVAL_STATUS_S": "certified" } } ] } } } }, "aggs": { "CITY": { "terms": { "field": "CITY_S", "size": 0, "order": { "_count" : "desc" } } } }}'
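A hand-written JSON string like the one above is easy to break with a stray brace. As a minimal sketch, the same request body can be assembled as a Python dictionary and serialized with json.dumps. The helper name build_terms_query and its parameters are hypothetical, introduced only for illustration; the query structure itself is copied from 'q_cities'.

```python
import json

def build_terms_query(status, field, agg_name):
    """Hypothetical helper: same body as q_cities, built as a dict."""
    return {
        "size": 0,  # no individual hits, only the aggregation
        "query": {
            "filtered": {
                "filter": {
                    "bool": {
                        "must": [{"term": {"APPROVAL_STATUS_S": status}}]
                    }
                }
            }
        },
        "aggs": {
            agg_name: {
                # "size": 0 and the "_count" ordering mirror the original query
                "terms": {"field": field, "size": 0, "order": {"_count": "desc"}}
            }
        },
    }

q_cities_dict = build_terms_query("certified", "CITY_S", "CITY")
q_cities_json = json.dumps(q_cities_dict)
print(q_cities_json)
```

The resulting string q_cities_json carries the same content as 'q_cities', so it could be passed as the request body in the same way.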
Send the query to the server using our handle. The response 'cities' is a Python dictionary, essentially a one-to-one mapping of the raw JSON response.
cities = es.search(index=['h1b'],doc_type=['case'],body=q_cities)
The raw response looks like:
{
  "took" : 310, "timed_out" : false,
  "_shards" : { "total" : 3, "successful" : 3, "failed" : 0 },
  "hits" : { "total" : 9029125, "max_score" : 0.0, "hits" : [ ] },
  "aggregations" : {
    "CITY" : {
      "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0,
      "buckets" : [
        { "key" : "new york", "doc_count" : 535311 },
        { "key" : "houston", "doc_count" : 221833 },
        { "key" : "chicago", "doc_count" : 170667 },
        { "key" : "atlanta", "doc_count" : 159079 },
        { "key" : "san jose", "doc_count" : 157382 },
        { "key" : "san francisco", "doc_count" : 138776 },
        ...
The Elasticsearch module turns this into a dictionary, so the top city 'new york' can be referenced as cities['aggregations']['CITY']['buckets'][0]['key']. We can therefore pull out the list of [city name, count] pairs with a one-line list comprehension:
word_freqs = [[row['key'],row['doc_count']] for row in cities['aggregations']['CITY']['buckets']]
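To see what the comprehension produces without a running server, here is a small sketch that applies it to a hand-built dictionary holding the first three buckets from the raw response above. The names sample_response and buckets_to_pairs are hypothetical. The final dict() conversion is included on the assumption that you are using a recent wordcloud release, whose documentation describes the frequencies argument as a dict of word to frequency rather than a list of pairs.

```python
# Hand-built stand-in for 'cities', trimmed to the first three buckets shown above
sample_response = {
    "aggregations": {
        "CITY": {
            "buckets": [
                {"key": "new york", "doc_count": 535311},
                {"key": "houston", "doc_count": 221833},
                {"key": "chicago", "doc_count": 170667},
            ]
        }
    }
}

def buckets_to_pairs(response, agg_name):
    """Hypothetical helper: return [[key, doc_count], ...] for one terms aggregation."""
    return [[row["key"], row["doc_count"]]
            for row in response["aggregations"][agg_name]["buckets"]]

pairs = buckets_to_pairs(sample_response, "CITY")

# Each inner list has exactly two items, so dict() maps city -> count
freq_dict = dict(pairs)

print(pairs[0])              # ['new york', 535311]
print(freq_dict["houston"])  # 221833
```

If fit_words rejects the list form in your environment, passing the dictionary instead is the likely fix.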
Then pass the list to plotWordCloud, which hands it to 'WordCloud' to generate the image.
plotWordCloud(word_freqs, title='Top H1B Cities')