Etymologists’ unlikely friend


I’ve often written about how, when it comes to translation, nothing beats a professional, qualified human being.  There may be some advantages to using translation apps and bots, or amateurs, but you run a pretty solid risk of ending up with not-so-great results.  Still, like anything in life, there are exceptions.  Take “translating” old manuscripts and books to e-books.

I say “translating” because it turns out that it’s more than just using technology to convert text from one format to another: the digitized version of a document has to be carefully read through and compared to the original, to be sure that any irregularities on the page didn’t turn into typos.  Not all printed words are legible.  For example, there could be typeface issues, faded ink, or marks and stains on the paper.  You might think that what’s needed here is a meticulous proofreader.  But asking just one person what they think a damaged bit of print says can be problematic – after all, we don’t all see the same things.

Even two people could be wrong. That’s why multiple readers are definitely the best option when it comes to deciphering hard-to-read text.  But it’s not easy to find a large group of people willing to proofread old manuscripts (especially not for free).  Luckily, someone’s discovered a solution – and who he is and how it works are probably going to surprise you.

You know those words in weird-looking letters that you often have to type in to prove you’re not a spambot, when you’re doing something online?  That program is called CAPTCHA, and Luis von Ahn, who helped create it, realized a few years ago that people were essentially wasting their time typing random words. So he partnered with The New York Times and the Internet Archive and reconfigured the system into something called reCAPTCHA.

With reCAPTCHA, one of the two words you type in is really an illegible word from an old book or newspaper. The same word will be shown to numerous other people, to see if everyone agrees on what it is, and, if that’s the case, it will be input into the digitized version of the document it came from.
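
For the curious, here is a rough idea of what that “does everyone agree?” step could look like in code. This is only a minimal sketch, not reCAPTCHA’s actual implementation: the function name `consensus`, the minimum of three readers, and the 75% agreement threshold are all assumptions made up for illustration.

```python
from collections import Counter


def consensus(answers, min_votes=3, min_agreement=0.75):
    """Return the reading most people agree on, or None if there is no agreement.

    Hypothetical helper: the thresholds are illustrative assumptions,
    not values taken from reCAPTCHA.
    """
    # Normalize so that "Morning " and "morning" count as the same answer.
    cleaned = [a.strip().lower() for a in answers if a.strip()]
    if len(cleaned) < min_votes:
        return None  # not enough readers have seen the word yet
    word, votes = Counter(cleaned).most_common(1)[0]
    # Accept the most popular reading only if enough readers chose it.
    if votes / len(cleaned) >= min_agreement:
        return word
    return None


print(consensus(["morning", "Morning ", "morning", "mourning"]))
print(consensus(["cat", "eat", "oat"]))
```

In this sketch, three of the four readers in the first call typed “morning”, so that reading would be accepted; in the second call nobody agrees, so the word would stay undecided and could be shown to more people.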

There are some errors – for example, at times, an ink blot has shown up, instead of a word.  But ultimately, the project has significantly advanced the digitization of old books and the New York Times’ newspaper archives.

You might be skeptical about the method, which has already allowed billions of words to be deciphered.  But hopefully, von Ahn’s claim that the system has been proven to be over 99% accurate will reassure you, at least a little.

(Inadvertently) giving your opinion about an illegible word so that e-books and e-documents will be typo-free isn’t just a nice thing to do for bibliophiles and digital newspaper junkies.  Old print material often contains hard-to-find historical details.  And it can be an invaluable resource for linguists trying to untangle the etymologies of words with mysterious pasts: sometimes the sources you’d expect, like dictionaries and scholarly tomes, don’t have the answers, but things you might not consider, like old dime novels, advertisements, and letters to the editor, can reveal how words have evolved, and even where they originated.  So the next time you type in reCAPTCHA letters, remember it’s not just an annoying part of internet life: you’re contributing to a better understanding of our language!

by Alysa Salzberg

