Searchable PDFs and TIFFs with OCR text

Almost all cases now contain some form of electronic data. Even if most of your case involves paper documents, the paper is scanned and the document productions are actually electronic image files (TIFFs or PDFs) on a disk. For this example, we’ll use a scenario where our images come from scanned documents.

Regardless of the image format, you must OCR image files for them to be searchable. PDF images handle the resulting OCR text a little differently than TIFF images do. A PDF image embeds the OCR text within itself, as a hidden layer behind the page image. The PDF software and the OCR software work together to align the OCR text directly behind the words in the image, so when you use the search function in Adobe, the search hit is highlighted on the image. As a user, all you see is the page image, while the software sees both the image and the OCR text aligned behind it. Many clients like this method because it is typically inexpensive and easy to use. It does have limits, however, and does not work efficiently with large collections.

This is where TIFF images come in. A paper document scanned to TIFF format also has to go through the OCR process to be searchable. When a TIFF image is OCRed, a separate text file (.txt) is created and used in the search process. So let’s recap real quick…

  • A searchable PDF is only one file (the image with the OCR text embedded in it).
  • A TIFF image that was OCRed has two files (the actual TIFF image and a text file).

Technically, a TIFF image is not really searchable; it is the text file produced by the OCR process that is searchable. After OCR, each text file corresponds to the TIFF image the text came from, and when a search hit occurs in the text file, the matching TIFF image is shown with the hit. This cannot be done in Adobe; it requires more advanced document management software. In addition to our online review platform, ImageDepot, the two most common document management packages are Summation and Concordance. We consult with our clients to help them choose the right solution for their case.
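
For readers who like to see the idea in miniature, here is a rough sketch of how a hit in a text file can point back to its image. It is not how ImageDepot, Summation, or Concordance work internally. It assumes, purely for illustration, that each text file shares its base name with its image (for example DOC0001.txt alongside DOC0001.tif); the function name search_ocr_text and that naming scheme are hypothetical.

```python
from pathlib import Path


def search_ocr_text(folder, term):
    """Return the names of TIFF images whose companion OCR text contains `term`.

    Assumption (hypothetical naming scheme): DOC0001.txt holds the OCR text
    for DOC0001.tif, and both live in the same folder.
    """
    needle = term.lower()
    hits = []
    for txt_path in sorted(Path(folder).glob("*.txt")):
        text = txt_path.read_text(encoding="utf-8", errors="ignore")
        if needle in text.lower():
            # The search ran against the text file, but the reviewer
            # wants to see the image the text came from.
            hits.append(txt_path.with_suffix(".tif").name)
    return hits
```

Calling search_ocr_text("production_01", "lease") would list the image name for every text file in that folder that mentions "lease", in any capitalization. The folder name here is only an example.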

These options may seem cumbersome, but for large document collections they are the way to go. You can combine multiple search criteria and use conditional “AND/OR” searches to cull down your collection further. Understanding how OCR works can save hundreds of review hours for you and your client.
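
To make the “AND/OR” idea concrete, here is a minimal sketch of conditional searching over OCR text. It is a simplified illustration, not the search syntax of any particular review platform; the function names matches and cull, and the sample documents, are made up for this example.

```python
def matches(text, all_of=(), any_of=()):
    """Case-insensitive test of one document's OCR text.

    all_of: every term must appear (an AND search).
    any_of: at least one term must appear (an OR search); ignored if empty.
    """
    lowered = text.lower()
    has_all = all(term.lower() in lowered for term in all_of)
    has_any = not any_of or any(term.lower() in lowered for term in any_of)
    return has_all and has_any


def cull(documents, all_of=(), any_of=()):
    """documents maps an image name to its OCR text; return the matching names."""
    return [name for name, text in documents.items()
            if matches(text, all_of, any_of)]


# Hypothetical sample collection: image name -> OCR text.
documents = {
    "DOC0001.tif": "Lease agreement between Smith and Jones",
    "DOC0002.tif": "Invoice from Smith",
    "DOC0003.tif": "Lease renewal notice",
}

and_hits = cull(documents, all_of=["lease", "smith"])     # lease AND smith
or_hits = cull(documents, any_of=["invoice", "renewal"])  # invoice OR renewal
```

With the sample collection above, the AND search narrows three documents down to DOC0001.tif, while the OR search returns DOC0002.tif and DOC0003.tif. Combining both kinds of criteria is what lets a reviewer cut a large collection down to a manageable set.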

Jason Lopez
Imaging Department