The WordCloud module generates the image, the Elasticsearch module talks to the index to run queries and fetch results, and Matplotlib handles the layout.

In [1]:
from elasticsearch import Elasticsearch
from wordcloud import WordCloud
import matplotlib.pyplot as plt

%matplotlib inline

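# Font size for plot titles, shared by every plot in this notebook.
# (A 'global' statement is not needed at module level.)
titleFontSize = 18

def plotWordCloud(word_freqs, title):
    """Render word/frequency pairs as a word cloud under the given title."""
    # 10x10 inch figure at 720 dpi: sharp, but slow to render
    fig = plt.figure(figsize=(10, 10), dpi=720)

    # relative_scaling=1.0 sizes each word in proportion to its frequency
    wordcloud = WordCloud(max_font_size=40, relative_scaling=1.0).fit_words(word_freqs)

    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(title, fontsize=titleFontSize)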
    plt.show()    
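
A note on the input format: the function above passes 'fit_words' a list of [word, count] pairs. More recent releases of the wordcloud package document this argument as a dict mapping each word to its frequency, so on a newer install the pairs may need converting first. A minimal sketch of that conversion, where 'to_frequency_dict' and the sample values are illustrative names and not part of the original notebook:

```python
def to_frequency_dict(pairs):
    """Convert [[word, count], ...] into {word: count}.

    Counts for a word that appears more than once are summed.
    """
    freqs = {}
    for word, count in pairs:
        freqs[word] = freqs.get(word, 0) + count
    return freqs


# Illustrative input in the same shape the notebook builds later on
sample_pairs = [["new york", 535311], ["houston", 221833], ["chicago", 170667]]
sample_freqs = to_frequency_dict(sample_pairs)
```

The resulting dict could then be handed to plotWordCloud in place of the list, assuming the installed wordcloud version expects a dict.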

Connect to the Elasticsearch server and get a client handle, 'es'.

In [4]:
es = Elasticsearch([{'host':'localhost','port':9210}])

The query 'q_cities' below instructs Elasticsearch to return the employer cities of certified H-1B applications, in decreasing order of their counts.

{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [ { "term": { "APPROVAL_STATUS_S": "certified" } } ]
        }
      }
    }
  },
  "aggs": {
    "CITY": {
      "terms": { "field": "CITY_S", "size": 0, "order": { "_count" : "desc" } }
    }
  }
}

In [5]:
q_cities = '{"size" : 0, "query": { "filtered": { "filter": { "bool": { "must": [ { "term": { "APPROVAL_STATUS_S": "certified" } } ] } } } }, "aggs": { "CITY": { "terms": { "field": "CITY_S", "size": 0, "order": { "_count" : "desc" } } } }}'
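
Writing the query as one long string makes it easy to drop a brace. As an alternative sketch, the same query can be built as a Python dict and serialized with 'json.dumps'; the names 'query' and 'q_cities_from_dict' are illustrative only. Note that the 'filtered' query and a terms aggregation 'size' of 0 (meaning "all buckets") belong to older Elasticsearch releases and, as far as I know, are not accepted from version 5.0 onward.

```python
import json

# Same structure as the q_cities string, expressed as nested dicts
query = {
    # Top-level size 0: return no individual hits, only the aggregation
    "size": 0,
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [{"term": {"APPROVAL_STATUS_S": "certified"}}]
                }
            }
        }
    },
    "aggs": {
        "CITY": {
            "terms": {
                "field": "CITY_S",
                "size": 0,  # older-release convention for "return every bucket"
                "order": {"_count": "desc"},
            }
        }
    },
}

# Serialize to a JSON string that can be sent as the request body
q_cities_from_dict = json.dumps(query)
```

The serialized string parses back to the same structure as the hand-written one, so either form should work as the request body.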

Send the query to the server using our handle. The response 'cities' is a Python dictionary that maps one-to-one onto the raw JSON response.

In [6]:
cities = es.search(index=['h1b'],doc_type=['case'],body=q_cities)

The raw response looks like:

{
"took" : 310,"timed_out" : false,
"_shards" : { "total" : 3, "successful" : 3, "failed" : 0 },
"hits" : { "total" : 9029125, "max_score" : 0.0, "hits" : [ ] },
"aggregations" : {
"CITY" : {
"doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 0,
"buckets" : [
{ "key" : "new york", "doc_count" : 535311 },
{ "key" : "houston", "doc_count" : 221833 },
{ "key" : "chicago", "doc_count" : 170667 },
{ "key" : "atlanta", "doc_count" : 159079 },
{ "key" : "san jose", "doc_count" : 157382 },
{ "key" : "san francisco", "doc_count" : 138776 },
...

The Elasticsearch module turns this into a dictionary, so the top city 'new york' can be referenced as cities['aggregations']['CITY']['buckets'][0]['key']. We can therefore pull out the list of [city name, count] pairs with a one-line list comprehension.

In [7]:
word_freqs = [[row['key'],row['doc_count']] for row in cities['aggregations']['CITY']['buckets']]
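
To try the extraction without a running server, the same comprehension can be run against a hand-built dictionary. The sketch below uses an abbreviated copy of the response shown above (first three buckets only); 'sample_response' and 'top_buckets' are illustrative names, and the optional limit is an addition that the notebook itself does not use.

```python
# Abbreviated stand-in for the real response: first three buckets only
sample_response = {
    "aggregations": {
        "CITY": {
            "buckets": [
                {"key": "new york", "doc_count": 535311},
                {"key": "houston", "doc_count": 221833},
                {"key": "chicago", "doc_count": 170667},
            ]
        }
    }
}


def top_buckets(response, agg_name, limit=None):
    """Return [[key, doc_count], ...] for one terms aggregation.

    Returns an empty list if the aggregation is missing; 'limit'
    optionally keeps only the first few buckets.
    """
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    pairs = [[row["key"], row["doc_count"]] for row in buckets]
    return pairs if limit is None else pairs[:limit]


top_two = top_buckets(sample_response, "CITY", limit=2)
```

Because the buckets already arrive sorted by count, slicing the list is enough to keep only the largest cities.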

Then send it over to 'WordCloud' to generate an image.

In [9]:
plotWordCloud (word_freqs,title='Top H1B Cities')
/opt/software/anaconda3/lib/python3.4/site-packages/PIL/ImageDraw.py:104: UserWarning: setfont() is deprecated. Please set the attribute directly instead.
  "Please set the attribute directly instead.")