H-1B. The Mechanics


The previous post in this series detailed what we would like to find out from the published data around the H-1B program. Here we go over setting up the apparatus, so to speak (using only open source tools), that we will use in future posts to get some insights. Let us get to it.


1. The Approach

In a nutshell, the approach taken here is to build a search index around the data, fire our questions (queries) at it, and process the answers (responses). Could one use a DB instead of a search index? Sure, but note that the data here has no relations, so it does not warrant a traditional relational DB like MySQL. Each csv record is independent and gives the complete information about a single H-1B application. One could very well use a document DB such as Mongo. But over the years I have grown partial to search tools that offer the ability to efficiently query structured & unstructured data under a single roof, especially when the data in question is non-relational. As for the search tool, either Solr or Elasticsearch would do fine, and we go with Elastic here. Finally, a generic approach to getting answers to some of our questions would likely use this search index as the backend, letting users operate a (friendly!) front-end web app, with an API layer mediating between them. That is all too much work for this expository blog :), and my preference is to use Python and its interactive notebook framework. Here is the complete tooling.

  • Ubuntu 15.04 64-bit on my laptop with an 8-core i7 CPU at 2.2GHz and 16GB RAM
  • Elasticsearch 1.7.2
  • Java 1.8 (for preparing the search index)
  • Python 3.4 (for querying & graphing)

2. The Data

The source files open up fine with spreadsheet programs, or even script-based processing. But as indicated in the previous post, the data needed to be scrubbed before feeding it to Elasticsearch.

  • The data from the earlier years (prior to 2008) was reported in text/csv files that needed to be fixed a bit to remove non-matching quotes etc… They switched to the ‘xlsx’ format for the more recent years, and those seemed structurally sound and converted well to csv. There were 14 csv files in all, one for each year, with a raw total count of over 5.1 million records.
  • The list of columns, their order & their names varied some over the years. So the column headers had to be renamed (and moved around) to mean the same thing across all files, as per the Elastic field mapping described in Section 3 below.
  • The column ‘LCA_CASE_NUMBER_S’ was supposed to be the unique identifier for a record, but there were some repeats, in which case only the first occurrence was processed. Erring on the cautious side, I rejected about 74,000 records. Some other incomplete & badly formatted records were rejected as well. Not painful actually, because Elastic does the heavy lifting by raising exceptions at indexing time.
  • While each record mostly corresponded to a single H-1B visa application, it was not always so. Apparently one could request multiple candidate visas in a single application. Some records seemed to have typos for this number, putting it at several thousands (probably a cell got shifted, as the previous cell was about the wage!). Needless to say, those records were ignored. Taking all this into account, and after removing the duplicates, I ended up with over 9.5 million visa applications in the index.
  • We needed the sponsor names to be given the same way in all records, as we will be aggregating on them in the analysis. But alas, people make up their own names all the time! For example the outsourcing company ‘Infosys’ has filed H-1B applications as ‘infosys limited’, ‘infosys solutions’, ‘infosys Corp’, ‘infosys technologies limited’ etc… Treating each of these as a different company would give erroneously smaller numbers in the aggregates, as they are all the same company. Issues such as these led to by far the most intrusive/iterative scrubbing of the data before sending it for indexing. Lowercasing, and removing punctuation, prepositions and common terms like ‘ltd’/‘inc’/‘llc’ etc… helped get things going. Splitting the result into multiple words and picking the first two for the name of the sponsor worked in several cases. In cases like ‘Infosys’ the first word is what we need, while a few cases needed the first 3 or even 4 words to correctly identify the name. Needless to say it was an iterative and somewhat painstaking process. Some cases required explicit assignment, as sketched after this list. I was tempted to try an entity-recognition approach, but that would likely have issues given the nature of the data for this field. This whole business of extracting a consistent sponsor (and to some extent city) name out of variable text patterns introduces some error into the analysis, no doubt about it. But the focus of the analysis is for the most part on the top 100 or so names, where these errors get drowned out in the volume.
  • The job category & position title were deciphered from the supplied job title in the csv. For example the category ‘software’ is assigned by looking for the presence of any of terms like {"javascript", "java", "python", "windows", "programmer", "oracle", "computer", "developer", "database", "linux", "unix", "web", "administrator", "architect"}. Likewise the category ‘health’ is assigned if any of {"biolo", "clinic", "physician", "therapist", "neurologist", "medical", "hospital"} are found. Going from the most specific category to the least, we assign the first category that matches and move on. A small sketch of this lookup follows the list as well.
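
The sponsor-name clean-up itself happened in the Java code that prepares the index (described in the next section), so what follows is only a minimal Python sketch of the idea; the stop-term list and the explicit overrides here are illustrative assumptions, not the actual rules that were used.

    import re

    # Illustrative stop terms; the real list was built up iteratively.
    STOP_TERMS = {"inc", "ltd", "llc", "corp", "corporation", "limited",
                  "technologies", "solutions", "of", "the", "and"}

    # Hypothetical explicit overrides for names that resist the generic rules.
    OVERRIDES = {"tata": "tata consultancy services"}

    def normalize_sponsor(raw_name, keep_words=2):
        """Lowercase, strip punctuation & stop terms, keep the first few words."""
        cleaned = re.sub(r"[^a-z0-9 ]", " ", raw_name.lower())
        words = [w for w in cleaned.split() if w not in STOP_TERMS]
        candidate = " ".join(words[:keep_words])
        for prefix, canonical in OVERRIDES.items():
            if candidate.startswith(prefix):
                return canonical
        return candidate

    # 'Infosys Technologies Limited', 'infosys Corp' and 'infosys solutions'
    # all collapse to 'infosys' under this scheme.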
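
Likewise, the category assignment amounts to a keyword lookup over the job title, checked from the most specific category to the least. A sketch, using only the two term lists quoted above (the real mapping had more categories, and their exact order here is an assumption):

    # Categories checked in order, most specific first; first match wins.
    CATEGORY_TERMS = [
        ("software", ["javascript", "java", "python", "windows", "programmer",
                      "oracle", "computer", "developer", "database", "linux",
                      "unix", "web", "administrator", "architect"]),
        ("health",   ["biolo", "clinic", "physician", "therapist",
                      "neurologist", "medical", "hospital"]),
    ]

    def job_category(job_title, default="other"):
        """Return the first category whose terms appear in the job title."""
        title = job_title.lower()
        for category, terms in CATEGORY_TERMS:
            if any(term in title for term in terms):
                return category
        return default

    # job_category('Senior Java Developer')  -> 'software'
    # job_category('Clinical Therapist')     -> 'health'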

3. Elasticsearch

Installing elastic is quite straightforward. I started up a two-node cluster (a node is a running instance of Elasticsearch), each node with 2g of heap. Running the script below sets up the index ‘h1b’ with 3 shards (a ‘shard’ is just a Lucene index, and the complete h1b index here is distributed across 3 such Lucene indices) & one replica. The field mapping for the type ‘case’ is kept generic, with suffixes indicating the type.

where the ‘h1b.json’ file contains the settings (shards & replica) and the field mapping for the type ‘case’.
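
Neither the setup script nor the actual ‘h1b.json’ is reproduced in this post. Just to give a feel for what such a request looks like against Elasticsearch 1.x, here is a rough equivalent via the Python client; apart from ‘LCA_CASE_NUMBER_S’, the field names are made-up placeholders, not the real mapping.

    from elasticsearch import Elasticsearch

    es = Elasticsearch(['localhost:9210'])   # port as in the cluster-stats URL below

    # 3 shards, 1 replica, and a generic type-suffixed mapping for the type 'case'.
    # The field names below (other than LCA_CASE_NUMBER_S) are placeholders.
    body = {
        "settings": {"number_of_shards": 3, "number_of_replicas": 1},
        "mappings": {
            "case": {
                "properties": {
                    "LCA_CASE_NUMBER_S":        {"type": "string", "index": "not_analyzed"},
                    "LCA_CASE_EMPLOYER_NAME_S": {"type": "string", "index": "not_analyzed"},
                    "STATUS_S":                 {"type": "string", "index": "not_analyzed"},
                    "YEAR_DT":                  {"type": "date"},
                    "WAGE_RATE_F":              {"type": "float"},
                }
            }
        }
    }
    es.indices.create(index='h1b', body=body)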

Indexing

A Java program processed the csv files into a List<HashMap<String, Object>> where each csv record is a HashMap.

The method ‘getDoc’ processes each record and returns a HashMap<String, Object>. The entire list of these HashMap objects is loaded into the index via the bulk API.
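
That Java program is not listed here. For illustration only, the same bulk-loading idea expressed with the Python client’s helpers would run along these lines (the file name is hypothetical, and the csv headers are assumed to have already been renamed to the mapping’s field names):

    import csv
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch(['localhost:9210'])

    def actions_from_csv(path):
        """Yield one bulk action per csv record, mirroring the getDoc() idea."""
        with open(path, newline='') as f:
            for row in csv.DictReader(f):
                yield {
                    "_index": "h1b",
                    "_type": "case",
                    "_id": row.get("LCA_CASE_NUMBER_S"),  # case number doubles as doc id
                    "_source": row,
                }

    # hypothetical file name; the real data set spanned 14 yearly csv files
    ok, errors = bulk(es, actions_from_csv("h1b_2015.csv"), raise_on_error=False)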

Upon the completion of indexing, a cluster stats check via http://localhost:9210/_cluster/stats?human&pretty shows
Cluster stats

So we have a working index with over 9.5 million documents that can now be queried to get some answers to the questions we started with. Elastic allows for pretty sophisticated & detailed queries to be fired at the index. The json query structure, while a bit verbose, more than makes up for it by being highly readable. For example, to find the top H-1B sponsors by year we would fire a query like:

Sample query

This query instructs Elastic to find the companies, ordering them by the number of ‘approved’ applications, and further divides each count by year. Put the above query in a file (say ‘company_by_year.json’) and fire it at the index’s _search endpoint via ‘curl’

to get

Sample response

The response indicates that there were 9,029,125 certified applications in all over these years. The sponsor ‘Infosys’ led the pack with 581,886 applications (28,578 of them in 2015, 25,500 in 2014, and so on…), followed by ‘Wipro’ and all the universities/schools. A bar/pie chart would be a lot easier to appreciate, of course, than wading through these numbers. This is where we turn to Python and its excellent set of helper modules to make quick work of querying, processing & graphing the results, all in one place.

4. Querying & Graphing with Python

The workhorse modules used here are ‘matplotlib’ for graphics & ‘elasticsearch’ for talking to our index. The interactive nature of IPython notebooks makes it fun to quickly experiment with queries and see the updated plots right there. We take a look at what is involved in firing the sample query above and showing a bar chart. We start up a new Python notebook with ‘ipython notebook’ and import all the needed modules. Shown below is a sample session to generate a wordcloud of the top destination cities for H-1B. The trick for embedding a Python notebook session in a WordPress post was adopted from the article IPython Notebooks in WordPress.
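
That notebook session itself is not embedded here. As a rough sketch of the query-and-plot loop it follows (using the companies-by-year query from Section 3 and a plain bar chart rather than the wordcloud; the field names and the ‘CERTIFIED’ status value are the same assumptions carried over from the mapping sketch above):

    import matplotlib.pyplot as plt
    from elasticsearch import Elasticsearch

    es = Elasticsearch(['localhost:9210'])

    # Top sponsors by count of certified applications, subdivided by year.
    query = {
        "size": 0,                                        # aggregations only, no hits
        "query": {"term": {"STATUS_S": "CERTIFIED"}},     # assumed status value
        "aggs": {
            "companies": {
                "terms": {"field": "LCA_CASE_EMPLOYER_NAME_S", "size": 10},
                "aggs": {"by_year": {"date_histogram": {"field": "YEAR_DT",
                                                        "interval": "year"}}},
            }
        },
    }
    response = es.search(index='h1b', body=query)

    buckets = response['aggregations']['companies']['buckets']
    names = [b['key'] for b in buckets]          # per-year counts sit under b['by_year']
    counts = [b['doc_count'] for b in buckets]

    plt.figure(figsize=(10, 4))
    plt.bar(range(len(names)), counts)
    plt.xticks(range(len(names)), names, rotation=45, ha='right')
    plt.ylabel('Certified applications')
    plt.title('Top H-1B sponsors')
    plt.tight_layout()
    plt.show()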

Armed with this mechanism that can answer our questions, and a front-end to process & put up pretty graphics, we are ready for the analysis. That is what we will do in the next post, H-1B. The Analysis – Part 1.

Thanks for reading & feel free to leave a comment if you liked it (or not!)
