Quote Mechanics – It is the ‘Data’, Stupid!

Ready to write again after an extended break over the holidays, we pick up where we left off in 2015 with our unfinished quotes… The objective for this post is to assemble the data we need to analyze the nature of quotes, at least in a dry statistical sense to begin with. Whether this effort can lead to deep insights into the quote extraction/creation processes is not clear now, but we have ideas and so we plunge ahead, remembering the ancient Chinese philosopher Lao Tzu, who wrote – “The journey of a thousand miles begins with a single step”.

1. The Data

“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay!” – The Adventure of the Copper Beeches

Data is very much the need here as we start out on this machine learning exercise. We do not know upfront how much data we will need to train our model to recommend quotes or, more difficult, to flag a sentence as a potential quote. The more data the better of course – as Datawocky writes, “More data usually beats better algorithms”. Assembling the raw material is our first order of business, as my puny collection of a few hundred quotes will not cut it, even if the quality is arguably good. Fortunately there are many sources – quotation books and quotation web sites, for a start. Biography books are chock full of quotes. Dictionaries, thesauri, newspaper articles, wiki quotes, blogs, etc… feature a great many quotes as well. The nature and amount of effort required to pull out quotes for our use depends on the source, of course.

Tooling-wise, in this exercise we employ:

  • Python (3.4), Perl (5.20), Shell scripts, & Java (1.8) programs for quote collection, clean-up, querying, and graphics
  • MongoDB (3.2) as a raw repo for the quotes data, as we want schema-free flexibility with data collected from disparate sources. Plus, my personal preference is to work with JSON as the medium of data exchange across tools/applications, and Mongo plays well with that.
  • Apache SOLR (5.4) for quick keyword-based search & simple text analytics. Plus, it plays well with JSON.

2. Books

While printed books are good to curl up with, it is a lot of work to get that text into an analytical framework. Either you type out each quote or scan the pages into a digitized form such as a pdf and then extract the text out of it. Luckily some of the books are already available in a digital form (e.g. The Great Book of Best Quotes Of All Time, Words of Wisdom, etc…), saving us the first step. Converting those files into text and extracting the quotes takes some work, but it is doable.

Several free tools exist for extracting text from pdf files. I tried a bunch and settled on a couple. Basically you want a tool that preserves the structure and ‘reading order’ as much as possible, so we can script a parser to pull out the information we want. For the most part the Apache PDFBox library succeeded where others either failed or produced text files that were too cumbersome to parse. It is as simple as running:
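```
# with the standalone PDFBox app jar (the version/path here is illustrative)
java -jar pdfbox-app-1.8.10.jar ExtractText quotes.pdf quotes.txt
```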

to generate a text file ‘quotes.txt’ from ‘quotes.pdf’. In a couple of cases I used ‘pdftotext’ ( pdftotext quotes.pdf ), a tool that comes bundled with most Linux distros. With the text in hand, it is all regex & scripting to separate out the quote+author pairs. Massaging the generated text a bit can help simplify the regex feats needed. I used a fair mix of Python & Perl for scripting here, but any tool will do.

3. Web

I contacted a number of quotation sites on the internet about using some of the quotes they have collected for this analytic work. The response was uniformly magnanimous. A lot of work goes into collecting the quotes, curating them, and putting up a pleasing front-end – so I am quite grateful. Some websites such as Goodreads and Quotes.net provide APIs to tap into their vast collections of quotes. The very nice folks at Values.com provided me with an export of their database of quotes. Wikiquote has a number of quotes one can work with as well. For an excellent compilation of quotes by famous scientists, one can take a look at todayinsci. When the site owners permit, one can crawl sites – ‘nicely’ (!) – and process the html content to extract the quotes. A variety of free & commercial tools exist for extracting text from web pages. When the url patterns of the pages are known (no crawling needed), it is straightforward to script one yourself, first to fetch the html content and then to parse it. The scripts to fetch will need to account for whether the pages are server-driven (page content composed on the server) or otherwise.

3.1 Server-driven pages

For example, a Python script – call it fetch.py – can fetch & save every page like http://fetchme.com/quotes/a.html, http://fetchme.com/quotes/b.html, … while sleeping for a minute or two before each retrieval (that is being nice!).
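A minimal sketch of such a script (error handling omitted):

```
# fetch.py - fetch & save pages of the form http://fetchme.com/quotes/<x>.html
import random
import string
import time
import urllib.request

for page in string.ascii_lowercase:                  # a.html, b.html, ...
    time.sleep(random.randint(60, 120))              # 1-2 minutes: being nice!
    url = 'http://fetchme.com/quotes/{}.html'.format(page)
    with urllib.request.urlopen(url) as response:
        content = response.read()
    with open('{}.html'.format(page), 'wb') as f:    # save locally to parse later
        f.write(content)
```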

3.2 Client-driven pages

Server-driven pages are still mostly the case in early 2016, though with the growing popularity of Single Page Application (SPA) frameworks like AngularJS, one is likely to find pages where Javascript on the client/browser does the heavy lifting, pulling the content together from the server via AJAX calls based on user interactions on the page. In these cases one needs something like PhantomJS, a headless browser that can run in the terminal. The ‘selenium’ module for Python can be used to modify fetch.py above in these cases.
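The retrieval step would change along these lines (a sketch; assumes the phantomjs binary is installed and on the PATH):

```
from selenium import webdriver

driver = webdriver.PhantomJS()          # headless browser, no display needed
driver.get('http://fetchme.com/quotes/a.html')
content = driver.page_source            # the DOM after the Javascript has run
with open('a.html', 'w') as f:
    f.write(content)
driver.quit()
```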

3.3 Parsing

Depending on the structure of the html, and how embedded our information is in it, the details of text extraction can vary, but libraries such as Java’s ‘Jsoup’ and Python’s ‘Beautiful Soup’ and ‘lxml’ (etree, lxml.html) make this just a matter of writing the right xpath or selector expressions to pull out the relevant pieces of text. Take, for example, a page where the quotes are placed two per row in a table.

The Python snippet for extracting the quote & author would be along these lines (the table markup & class names below are illustrative):
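```
# a sketch with lxml.html: two quotes per table row, each cell holding
# the quote text and its author
import json
from lxml import html

page = html.fromstring('''
<table>
  <tr>
    <td><p class="q">The unexamined life is not worth living.</p>
        <p class="a">Socrates</p></td>
    <td><p class="q">Well begun is half done.</p>
        <p class="a">Aristotle</p></td>
  </tr>
</table>
''')

quotes = []
for cell in page.xpath('//table//td'):                # two cells per row
    quote = cell.xpath('.//p[@class="q"]/text()')[0].strip()
    author = cell.xpath('.//p[@class="a"]/text()')[0].strip()
    quotes.append({'quote': quote, 'author': author})

with open('quotes.json', 'w') as f:                   # one json object per quote
    json.dump(quotes, f, indent=2)
```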

The same can be done in Java with ‘Jsoup’ – near-identical logic, with Jsoup’s css selects in place of xpath.

Running either version would generate the ‘quotes.json’ file – with the illustrative table above, something like:
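```
[
  {"quote": "The unexamined life is not worth living.", "author": "Socrates"},
  {"quote": "Well begun is half done.", "author": "Aristotle"}
]
```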

4. Data staging

The staging exercise (a euphemism for data clean-up drudgery!) attempts to resolve the data quality issues bound to be present when dealing with information from multiple sources. For example, the name of a quote’s author is presented in many different ways across sources, and even within the same source. The author “mahatma gandhi” is variously referred to as “mohandas gandhi”, “mohandas karamchand gandhi”, “gandhi mahatma”, and simply “gandhi”, of course. Taking the common denominator “gandhi” as the author in all these cases will not work because there are ‘other gandhis’ who are not the ‘mahatma gandhi’! Besides including a new ‘if conditional’ in my scripts for each issue I ‘see’, I have not found a reliable approach that would automate this for me. So much for machine learning, I guess!
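In script form, this hand-curation amounts to little more than a growing alias map – a sketch with the variants above:

```
# map observed author-name variants to a canonical form; every new issue
# 'seen' in the data adds an entry
ALIASES = {
    'mohandas gandhi':            'mahatma gandhi',
    'mohandas karamchand gandhi': 'mahatma gandhi',
    'gandhi mahatma':             'mahatma gandhi',
    # bare 'gandhi' is deliberately not mapped - it may be an 'other gandhi'!
}

def normalize_author(name):
    key = ' '.join(name.lower().split())    # lowercase, collapse whitespace
    return ALIASES.get(key, key)
```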

But it is not all bad news. This exercise has unexpectedly had some benefits. When you pool together quotes from multiple sources you would expect duplicates & even conflicts (i.e. the same quote attributed to different authors). But such dupes & conflicts within a single source are likely an oversight, and this exercise allowed us to catch some of those.

5. Mongo and SOLR

While the examples above focused on extracting only two fields, ‘quote’ & ‘author’, we can add any other field that we think will add value for the analysis. For example, the number of words (or characters) in a quote could be a useful metric to have. We would expect longer quotes to be less popular than shorter ones (people’s attention spans are getting shorter all the time!), but perhaps not too short… It would be fun to see the distribution of quote length in terms of the number of words/characters, even if we do not have hard data on popularity.

Given that we may add new fields as the analytics drive us, it is good not to have to define a schema upfront – another reason Mongo works great for us here. In order to save the quotes with all their fields to Mongo, we start up the mongodb daemon with:
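```
# the dbpath is illustrative
mongod --dbpath ./quotes_db
```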

The data volume (& of course the query volume) is small, so we can do this all on a single host without shards or replicas. We import the json quotes into this db with:
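```
# db & collection names are illustrative; quotes.json holds a json array
mongoimport --db quotesdb --collection quotes --jsonArray --file quotes.json
```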

Likewise we start up a SOLR instance and set up a ‘quotes’ core.
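With the scripts bundled with SOLR 5.x, that is:

```
bin/solr start
bin/solr create -c quotes
```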

The configuration file ‘schema.xml’ is modified to define ‘quote’, ‘author’, and any other keys we have defined for the quotes in ‘quotes.json’.
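A sketch of what those field definitions might look like (the ‘author_t’/copyField route to partial-name search and the ‘quoteWords’ count field are illustrative choices):

```
<field name="quote"      type="text_en" indexed="true" stored="true" termVectors="true"/>
<field name="author"     type="string"  indexed="true" stored="true"/>
<field name="author_t"   type="text_en" indexed="true" stored="false"/>
<copyField source="author" dest="author_t"/>
<field name="quoteWords" type="int"     indexed="true" stored="true"/>
```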

Setting ‘termVectors’ to true for ‘quote’ allows us to find ‘similar’ quotes. Indexing the author field as a ‘string’ allows for faceting on exact match. Indexing the same field as regular english text (‘text_en’) allows for searching by partial names – like ‘einstein’ instead of ‘albert einstein’. Configuring the SOLR instance for optimal query flexibility is more involved… and we are only scratching the surface here. For the moment we simply assume that we have a well-configured quotes index that we can query at will. We put our quotes into this index by running:
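```
# the post tool bundled with SOLR 5.x
bin/post -c quotes quotes.json
```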

That is it for creating a searchable index of quotes, as SOLR plays well with JSON. Before wrapping up this post, we will put our index to quick example use by asking it for the distribution of the number of words our quotes contain. The basic query would be q=*:*&stats=true&stats.field={!mean=true percentiles='50,95'}quoteWords&f.quoteWords.facet.range.gap=5&fq=quoteWords:[2 TO *], which aggregates over all the quotes with 2 or more words and finds the mean, median, and 95th percentile of the number of words. The following Python code snippet fires the query and processes the response (and plots it up as well using matplotlib – not shown here).
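A sketch of that snippet (default local SOLR host/port; the facet range bounds are illustrative):

```
# query the 'quotes' core for the word-count distribution
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    'q': '*:*',
    'fq': 'quoteWords:[2 TO *]',                 # quotes with 2 or more words
    'rows': 0,                                   # stats & facets only, no docs
    'wt': 'json',
    'stats': 'true',
    'stats.field': "{!mean=true percentiles='50,95'}quoteWords",
    'facet': 'true',
    'facet.range': 'quoteWords',
    'f.quoteWords.facet.range.start': 0,
    'f.quoteWords.facet.range.end': 500,
    'f.quoteWords.facet.range.gap': 5,
})
url = 'http://localhost:8983/solr/quotes/select?' + params
with urllib.request.urlopen(url) as response:
    result = json.loads(response.read().decode('utf-8'))

stats = result['stats']['stats_fields']['quoteWords']
print('mean:', stats['mean'])
print('percentiles (50th, 95th):', stats['percentiles'])

# bucket counts for the plot: alternating [bucket_start, count, ...]
counts = result['facet_counts']['facet_ranges']['quoteWords']['counts']
```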

The insight that more than half of our thousands of quotes have fewer than 25 words is not revolutionary (it makes sense, actually), but it is useful – why? Well, the number of words could well be a parameter influencing the overall probability of a sentence being a quote – why not, we have evidence now, correct? And machine learning is always about probabilities anyway. It is unfortunate to see our beloved quotes reduced to numbers, but hopefully from those numbers will arise the means of extracting and even generating new quotes…

[Figure: word_distribution – histogram of the number of words per quote]

With that, we conclude this long post. In the upcoming posts we will put this machinery and more to work to make some headway into understanding ‘Quote Mechanics’. Hopefully it will not be as complex as the other mechanics – the ‘quantum’ type – which our mystery author declared pretty much un-understandable…

“I think I can safely say that nobody understands quantum mechanics” – Richard Feynman
