Active OCR: Tightening the Loop in Human Computing for OCR Correction will develop a proof-of-concept application that experiments with active learning and other iterative techniques for correcting eighteenth-century texts, primarily but not exclusively written in English. The texts are drawn from the HathiTrust Digital Library and from the 2,231 ECCO text transcriptions released into the public domain by Gale and distributed by the Text Creation Partnership (TCP) and 18thConnect.
In an application based on active learning or a similarly iterative approach, the user could identify dozens or hundreds of difficult characters that appear in texts from the same period, and the system would use this new knowledge to improve optical character recognition (OCR) across the entire corpus.
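The loop described above can be illustrated with a minimal sketch. This is not the project's implementation: it assumes, hypothetically, that glyph images have already been grouped into clusters of visually similar shapes, and it stands in for active learning with simple uncertainty sampling followed by label propagation, with no model being retrained. The names Glyph, cluster_id, select_queries and apply_corrections are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    cluster_id: int    # hypothetical ID grouping visually similar glyph images
    guess: str         # the OCR engine's current guess for this glyph
    confidence: float  # the engine's confidence in that guess, from 0 to 1


def select_queries(glyphs, budget):
    """Uncertainty sampling: return the least-confident glyph of each cluster,
    ordered from least to most confident, limited to `budget` clusters."""
    least = {}
    for g in glyphs:
        current = least.get(g.cluster_id)
        if current is None or g.confidence < current.confidence:
            least[g.cluster_id] = g
    ranked = sorted(least.values(), key=lambda g: g.confidence)
    return ranked[:budget]


def apply_corrections(glyphs, corrections):
    """Propagate each user-supplied label to every glyph in the same cluster.
    Returns the number of glyphs whose guess actually changed."""
    changed = 0
    for g in glyphs:
        label = corrections.get(g.cluster_id)
        if label is None:
            continue
        if g.guess != label:
            changed += 1
        g.guess = label
        g.confidence = 1.0  # assumption: a human label is treated as certain
    return changed


# Toy corpus: cluster 0 is a long s misread as "f"; cluster 2 is an "e" misread as "c".
corpus = [
    Glyph(0, "f", 0.41),
    Glyph(0, "f", 0.58),
    Glyph(1, "e", 0.97),
    Glyph(2, "c", 0.52),
    Glyph(0, "f", 0.47),
]

queries = select_queries(corpus, budget=2)             # glyphs shown to the user
user_labels = {0: "ſ", 2: "e"}                         # the user's answers, by cluster
changed = apply_corrections(corpus, user_labels)       # one answer fixes many glyphs
```

In this toy run, two questions put to the user correct four glyphs, which is the kind of leverage an iterative approach aims for: each correction improves the system's output across the corpus and not just the one character the user looked at.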
A portion of our funded efforts will focus on the need to incentivize engagement in tasks of this type, whether they are traditionally crowdsourced or carried out through a more active, iterative process like the one we propose. As we develop an OCR correction system designed to let users improve the system and not just the text itself, we intend to examine how exploring users' preferences can improve their engagement with corpora of materials.
The technological infrastructure for this project has been supported in part by a generous grant from Amazon Web Services.