We worked with a manufacturing firm that, to address a regulatory compliance issue, needed to search a library of PDF and image documents and identify every document that contained a given set of terms.
The key complications were that the raster images were in an unsearchable format, they contained a mix of text and diagrams, the font was broken and ill-defined in places, and the text typically ran in multiple directions. The company’s first solution was to have a team search manually through thousands of documents.
We built an OCR (optical character recognition) system to extract the text from the images. We then built a searchable database that linked the extracted text to its source document.
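The search step can be illustrated with a minimal sketch. It is not the system we delivered: it assumes the OCR stage has already produced plain text for each document, and it stands in for the database with an in-memory inverted index. The function names, the `require_all` flag, and the sample file names and text are all hypothetical, and only single-word terms are handled.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(extracted):
    """Map each token to the set of document ids whose text contains it.

    `extracted` is assumed to be a dict of {document id: OCR output text}.
    """
    index = defaultdict(set)
    for doc_id, text in extracted.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, terms, require_all=True):
    """Return ids of documents containing all (or any) of the given terms."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return set()
    combine = set.intersection if require_all else set.union
    return combine(*postings)


# Hypothetical OCR output, keyed by file name.
extracted = {
    "drawing_001.png": "Valve housing, lead solder joint",
    "spec_002.pdf": "Lead-free solder specification",
    "wiring_003.pdf": "Wiring diagram, panel B",
}
index = build_index(extracted)
matches = search(index, ["lead", "solder"])
```

Matching on lowercased tokens keeps the lookup tolerant of capitalization and punctuation differences in the OCR output. Passing `require_all=False` returns documents that contain any of the terms instead of all of them.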
With this in place, the firm could run the term searches automatically rather than manually. At a later date, they deployed our OCR solution more generally across other sets of legacy documents.