What causes PDFTextStream to emit empty text extracts?

There are some (usually rare) situations where extracting text from PDF documents is not possible:

  1. PDF documents whose pages are simply a series of images. This is most common with documents that have been scanned from paper originals and have not had a text "layer" added by an optical character recognition (OCR) process.
  2. In some very rare cases (less than 0.1% of all PDF documents in our testing), a PDF document may contain text that uses a font that "draws" glyphs using images, rather than referring to an actual character. In these cases, PDFTextStream will yield either empty text extracts, or extracts that are "junk" -- a series of nonsensical characters.

In principle, both issues could be solved by embedding an OCR process within PDFTextStream. This may be added in a future release.
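Until then, it can help to flag affected documents in your own code so they can be routed to an external OCR step or to manual review. The sketch below is a rough heuristic, not part of the PDFTextStream API: the `ExtractCheck` class, its `Kind` values, and the 60% threshold are all hypothetical names and numbers chosen for illustration. It treats an extract as empty when it contains no visible characters, and as suspect when most of its visible characters are neither letters nor digits.

```java
// Hypothetical helper, NOT part of the PDFTextStream API.
// It only inspects a String that you have already extracted.
final class ExtractCheck {
    enum Kind { EMPTY, SUSPECT, OK }

    // Assumed threshold: flag the text when fewer than 60% of its
    // non-whitespace characters are letters or digits.
    static final double MIN_ALNUM_RATIO = 0.6;

    static Kind classify(String extracted) {
        if (extracted == null || extracted.trim().isEmpty()) {
            return Kind.EMPTY;
        }
        int visible = 0; // non-whitespace code points
        int alnum = 0;   // letters or digits among them
        for (int i = 0; i < extracted.length(); ) {
            int cp = extracted.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue;
            }
            visible++;
            if (Character.isLetterOrDigit(cp)) {
                alnum++;
            }
        }
        if (visible == 0) {
            // Only Unicode whitespace that trim() does not strip.
            return Kind.EMPTY;
        }
        double ratio = (double) alnum / visible;
        return ratio < MIN_ALNUM_RATIO ? Kind.SUSPECT : Kind.OK;
    }
}
```

Pass in whatever string your extraction call produced. A `SUSPECT` result is only a hint: junk made up of ordinary letters will pass this check, and punctuation-heavy but valid text (tables of symbols, for example) may be flagged, so the threshold likely needs tuning against your own documents.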

PDFTextStream Bug

Finally, it is possible that you have stumbled across a bug in PDFTextStream. The easiest way to test for this is to copy and paste text from the PDF file in question using Adobe Acrobat. If the pasted text looks correct, but PDFTextStream delivers no text or delivers incorrect text, then you have likely discovered a bug in PDFTextStream. In this case, please open a support ticket with us, and we'll begin working to resolve the problem straight away.