Who does not love a good quote? I have always been a fan myself and have collected a bunch over the years. Each morning as I drive the kids to school, a quote or more spills out as a matter of course. So much so that they started calling me a quote-monster. Back in the days when we had ‘real’ books I would jot down any piece of text that I happened to like. That was the hard way, but you only did it if you ‘really’ liked it, so the quality tended to be good. You could buy a book of quotes if you wanted. These days, quotes are everywhere: quotes that scroll away if you do not read quickly, quotes that fade away only to be replaced by a new one, and even quotes that flash at you… If that is not enough, you can subscribe to daily quotes via e-mail or have RSS feeds integrated directly into your desktop/webpage… There are a ton of smartphone apps as well, bringing quotes to your phone. You can have your phone greet you with a random favorite quote as you turn it on. And then there are quotes by occasion, mood, author, subject… take your pick. There are quotes galore.
That brings up several questions. Are people discovering & adding new ones to the body of quotes, or are they mostly recycling the old ones via new media? What exactly makes a ‘piece of text’ a ‘quote’ anyway? Whatever it is, I very much doubt that ‘quote discovery’ can be automated.
Quote Extraction
Can you throw a chapter or two from ‘A Tale of Two Cities’ at a computer and have it extract what we can agree to be meaningful quotes? What would be the characteristics of a sentence that would allow it to pass as a quote? Because that’s what we would need to tell the computer to look for. But even with his positronic brain, Lt. Commander Data of Star Trek had unending trouble understanding the vagaries of human speech and getting a joke, much less telling one. Here is one where he struggles with irony.
Commander William T. Riker: Charming woman!
Lt. Commander Data: [voice-over] The tone of Commander Riker’s voice makes me suspect that he is not serious about finding Ambassador T’Pel charming. My experience suggests that in fact he may mean the exact opposite of what he says. Irony is a form of expression I have not yet been able to master.
Or what can a script make out of these lines from Ozymandias, by P. B. Shelley?
‘My name is Ozymandias, king of kings:
Look on my works, ye Mighty, and despair!’
Nothing beside remains. Round the decay
Of that colossal wreck, boundless and bare
The lone and level sands stretch far away.
The essence of a quote often is not in the words used but rather in the words not used, in what is left unsaid. It is often a play on words as well. The context of the quote arises in readers’ heads as they read it. Those who get the quote are able to imagine for themselves what is unsaid. I do not think any amount of data/algorithmic training can turn our present-day computers into quote-generating machines any time soon, but that is just my opinion. Take for example the quote attributed to Abraham Lincoln (there is some controversy as to who actually said it, but let us leave that aside for now):
Give me six hours to chop down a tree, and I will spend the first four sharpening the ax
The quote is about ‘preparation’ being the key to success. Parse the words whichever way I might, I do not see a way to teach the computer to reach this conclusion and extract it as a quote. No amount of stemming, synonym expansion, part-of-speech tagging or entity extraction is any help! The computer will likely interpret this as instructions on how to cut down trees. Even if I tell the computer that Abraham Lincoln wrote it and somehow train the model with Lincoln’s biography, I doubt it will do any better. Maybe the ‘Deep Learning’ techniques coming into vogue these days have something to offer here, but for traditional machine learning, automatic extraction of quotes from text & speech seems very difficult to get right. Other than speed-reading, perhaps nothing can speed up quote extraction. Don’t tell that to Woody Allen though.
I took a speed-reading course and read War and Peace in twenty minutes. It involves Russia.
Quote Classification
How about ‘tagging’ quotes by topics? That is, we have some quotes at our disposal and we want to classify them into buckets or make clusters of ‘similar’ quotes, so people can navigate and find them easily. Can this be automated? It seems easier, at least superficially, i.e. if you do not care much for strict quality. As quotes are usually short pieces of text, a computer cannot generate a reliable ‘context’ out of that text. But it can whittle the text down to a few words and assign those words as ‘topics’. Sometimes that is all that may be needed. Take the same quote as above. The important words in the quote are something like “tree, chop, spend, sharp, ax”. The quote is about none of these really, but we can go ahead and bucket it under all of these so a user looking for quotes about ‘trees’ can find this one as well! The target applications are quite tolerant of errors here, so findability is not affected. One website classifies this under ‘Preparation’, another one under ‘Axe, Sharpening, Tree’, and yet another under ‘Preparation, Planning’. Clearly we know who is doing automatic classification vs. manual!
Take another one, a funny one this time, attributed to the master, Yogi Berra. Again there is some controversy as to who actually said it. The Quote Investigator site does a great job of digging through many of these controversies around popular quotes. If you like quotes and want to find out how they have come about, it is a good resource.
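The word-whittling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list is hand-picked for this one quote (a real system would use a much fuller list), the function name naive_tags is made up, and no stemming is done, so ‘sharpening’ stays as it is rather than becoming ‘sharp’.

```python
import re

# Hypothetical, hand-picked stopword list; chosen only for this illustration.
STOPWORDS = {
    "a", "an", "and", "the", "i", "me", "to", "will", "give", "first",
    "down", "six", "four", "hours",
}

def naive_tags(quote):
    """Return the non-stopword tokens of a quote, lowercased, in order, without repeats."""
    tokens = re.findall(r"[a-z]+", quote.lower())
    tags = []
    for token in tokens:
        if token not in STOPWORDS and token not in tags:
            tags.append(token)
    return tags

quote = ("Give me six hours to chop down a tree, "
         "and I will spend the first four sharpening the ax")
print(naive_tags(quote))  # ['chop', 'tree', 'spend', 'sharpening', 'ax']
```

Note what is missing from the output: ‘preparation’ appears nowhere in the quote, so no amount of token filtering will ever produce it as a tag.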
You better cut the pizza in four pieces because I’m not hungry enough to eat six.
Some web sites have classified this under ‘food, eat, four’! But classification by simple word tokenization is not always bad. Take this one for example by Martha Washington:
I am determined to be cheerful and happy in whatever situation I may find myself. For I have learned that the greater part of our misery or unhappiness is determined not by our circumstance but by our disposition.
Profound as it is, this one can be correctly placed by just parsing the words. Even though they miss the key tag ‘equanimity’, the web sites I have looked at bin this variously under ‘happiness, experience, misery, circumstance’, which are all fine. So we cannot say that profound quotes are difficult to classify.
There are some quotes that ‘can’ be reasonably classified by a computer but can really benefit from human intervention. Take this Buddhist quote for example:
Holding on to anger is like grasping a hot coal with the intent of throwing it at someone else; you are the one who gets burned.
While ‘anger’ is a reasonable bucket, the quote is definitely not about ‘coal’! Plus it would really help to put this in the ‘forgiveness’ bucket, as that is the true intent here, but that would have to be done manually.
If the true intent of the quote can be described by using the same/similar set of words making up the actual quote, an automated classification will likely work. But I would still vote for manual classification, as classifying by hand is much less onerous than, for example, extracting by hand, and it improves the quality a good bit.
A related task is clustering of the quotes, where ‘similar’ quotes are placed in a pile, forming a cluster. Clusters can be automatically named by the algorithm using the chief differentiating markers/features of the cluster that it has identified. The number of such clusters, and the similarity measures employed to extract features, can be tuned to some extent. The more clusters you want, the more nuanced/sensitive your algorithm has to be in detecting the textual features. Given the nature of quotes, the very same issues that plagued the classification task will cause problems here.
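To make the clustering idea concrete, here is a rough sketch of one very simple approach: a single greedy pass that puts a quote into the first cluster whose seed quote shares enough words with it, and names each cluster by its most frequent words. The stopword list, the 0.3 threshold and all the function names are assumptions of mine for illustration; real clustering algorithms are considerably more careful.

```python
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"the", "of", "our", "or", "on", "and", "not", "i", "ve", "m",
             "to", "in", "that", "from", "you"}

def content_words(text):
    """Set of lowercased words in the text, minus stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def jaccard(a, b):
    """Shared words divided by all words; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_quotes(quotes, threshold=0.3):
    """Greedy one-pass clustering: compare each quote to each cluster's first member."""
    clusters = []
    for index, quote in enumerate(quotes):
        words = content_words(quote)
        for cluster in clusters:
            if jaccard(words, cluster["seed"]) >= threshold:
                cluster["members"].append(index)
                cluster["counts"].update(words)
                break
        else:
            clusters.append({"seed": words, "members": [index],
                             "counts": Counter(words)})
    return clusters

def cluster_name(cluster, size=2):
    """Name a cluster by its most frequent words (ties broken alphabetically)."""
    ranked = sorted(cluster["counts"].items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:size]]

quotes = [
    "The greater part of our happiness or misery depends on our dispositions and not our circumstances.",
    "I’ve learned from experience that the greater part of our happiness or misery depends on our dispositions and not on our circumstances.",
    "You better cut the pizza in four pieces because I’m not hungry enough to eat six.",
]
clusters = cluster_quotes(quotes)
for cluster in clusters:
    print(cluster["members"], cluster_name(cluster))
```

The two renditions of the disposition quote land in one pile and the pizza quote in another, but the automatic name for the first pile is just whichever shared words sort first, which is exactly the weakness discussed above.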
Quote Deduplication
So if computers can’t extract and cannot really classify/cluster quotes, is there anything else they could help us with here (the question is rhetorical of course)? How about finding duplicates or near duplicates so we have a well organized set of quotes? Most of the time we use quotes/jokes spontaneously in our conversations with others, and each time we tell them a bit differently. It is unavoidable. If a machine were transcribing the quote, it would file a new quote each time it heard it. While researching for this article, I found a web site that listed near duplicates of Martha’s quote four times on the same page! Here are the three other renditions of the quote we saw earlier:
The greater part of our happiness or misery depends on our dispositions and not on our circumstances. We carry the seeds of the one or the other about with us in our minds wherever we go.
The greater part of our happiness or misery depends on our dispositions and not our circumstances.
I’ve learned from experience that the greater part of our happiness or misery depends on our dispositions and not on our circumstances.
Anyone reading these would know that these are all just one quote really. A computer may decide the first dupe above to be a new quote because of the brand new sentence. But the last two are nearly the same and are more or less contained within the first two. Tokenizing the text, building document vectors for each of the quotes and comparing those vectors will point that out. So all is not lost perhaps with respect to computers & quotes, and we may still be able to automate getting rid of duplicates.
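As a minimal sketch of the document-vector idea, the snippet below turns each quote into a bag-of-words count vector and compares vectors by cosine similarity. Cosine similarity is my choice of measure here, not the only option, and the function names are made up for the example.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: lowercased word -> number of occurrences."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity of two count vectors; 0.0 if either is empty."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity(x, y):
    return cosine(vectorize(x), vectorize(y))

short = "The greater part of our happiness or misery depends on our dispositions and not our circumstances."
learned = "I’ve learned from experience that the greater part of our happiness or misery depends on our dispositions and not on our circumstances."
pizza = "You better cut the pizza in four pieces because I’m not hungry enough to eat six."

print(round(similarity(short, learned), 2))  # high: near duplicates
print(round(similarity(short, pizza), 2))    # low: unrelated quotes
```

The two renditions score close to 1 while the pizza quote scores close to 0. Where exactly to draw the cut-off for calling something a duplicate is a judgment call, and the rendition with the extra sentence will score lower than the other two against each other.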
Quote Predictors & Recommenders
How about recommending quotes? Can we recommend a quote to a user based on what she/he may have already liked? If we base it strictly on the quote text we will run into the same issues as with classification. To improve our recommendation engine, we can try things like:
- augmenting our user profile with information such as their favorite authors (which we can get from their current likes), books, subjects, hobbies, etc.
- clustering users based on their likes, and serving up recommendations from a pool of liked quotes of all these users
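The second idea, pooling the likes of similar users, can be sketched as follows. Everything here is hypothetical: the user names, the quote identifiers, and the choice of Jaccard overlap between like-sets as the measure of how similar two users are.

```python
def jaccard(a, b):
    """Overlap between two sets of liked quotes, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, likes, top_n=3):
    """Suggest quotes liked by similar users that this user has not liked yet."""
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        sim = jaccard(mine, theirs)
        if sim == 0:
            continue  # nothing in common, so ignore this user's likes
        for quote_id in theirs - mine:
            scores[quote_id] = scores.get(quote_id, 0.0) + sim
    # Highest score first; ties broken by quote id for a stable order.
    return sorted(scores, key=lambda q: (-scores[q], q))[:top_n]

# Made-up like data: user -> set of quote ids they liked.
likes = {
    "asha": {"lincoln-ax", "washington-disposition"},
    "ben": {"lincoln-ax", "washington-disposition", "buddha-anger"},
    "cara": {"berra-pizza", "allen-speed-reading"},
    "dev": {"lincoln-ax", "berra-pizza"},
}
print(recommend("asha", likes))  # ['buddha-anger', 'berra-pizza']
```

Note that this never looks at the quote text at all, which is how it sidesteps the classification problems: it only needs to know who liked what.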
Can we predict if a quote is going to be popular? What makes a quote popular anyway? There are a lot of web sites featuring ‘best’ quotes, ‘famous’ quotes, ‘great’ quotes, ‘best quotes ever’, ‘greatest quotes’, ‘greatest quotes of all time’ … Some sites even show a count of likes against each quote, so there is likely hard data on these ‘likes’. Will a textual analysis of these quotes coupled with their like counts give us a way to understand what makes a quote ‘popular’? Also, is there a relationship between famous people & popular quotes? Do quotes from famous people tend to become popular, or is coming up with witty quips the road to becoming popular? I guess we have more questions than answers, but we will not worry, as others have already said…
Life is too short to be wasted on finding answers… Enjoy the questions
I would rather have questions that can’t be answered than answers which cannot be questioned
It is left as an exercise to the reader to identify these two. Great if you can get both, but everyone should be able to get at least one.