
Normalized Extracted Text Layout

The formatting of the text extracts provided by PDFTextStream is defined by the OutputHandler that is used to "collect" the PDF text events generated by PDFTextStream.

The default OutputHandler, OutputTarget, is optimized for performance and use in semantically-sensitive environments: search, indexing, summarization, etc. In these kinds of environments, maintaining the spacing of text elements (including table columns and such) is mostly unnecessary.
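To illustrate what is lost when spacing is normalized, here is a minimal, self-contained sketch. It does not use PDFTextStream at all: `LayoutSketch`, `TextRun`, `normalized`, and `visual` are hypothetical names invented for this illustration, and the sketch assumes each run of text already carries a starting column. It only contrasts the two styles of output; it is not how OutputTarget or any other OutputHandler is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

public class LayoutSketch {
    /** Hypothetical stand-in for one positioned run of text on a line; not a PDFTextStream type. */
    static final class TextRun {
        final int column;   // starting column in character cells (assumed to be known already)
        final String text;

        TextRun(int column, String text) {
            this.column = column;
            this.text = text;
        }
    }

    /** Normalized style: runs are joined by a single space and their positions are discarded. */
    static String normalized(List<TextRun> line) {
        StringBuilder sb = new StringBuilder();
        for (TextRun run : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(run.text);
        }
        return sb.toString();
    }

    /** Layout-preserving style: pad with spaces so each run starts at its own column. */
    static String visual(List<TextRun> line) {
        StringBuilder sb = new StringBuilder();
        for (TextRun run : line) {
            // If earlier text already reaches this column, still keep one separating space.
            if (sb.length() > 0 && sb.length() >= run.column) {
                sb.append(' ');
            }
            while (sb.length() < run.column) {
                sb.append(' ');
            }
            sb.append(run.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<TextRun> row = new ArrayList<>();
        row.add(new TextRun(0, "Item"));
        row.add(new TextRun(12, "Qty"));
        row.add(new TextRun(20, "Price"));

        System.out.println(normalized(row)); // columns collapse to single spaces
        System.out.println(visual(row));     // columns stay aligned at 0, 12, and 20
    }
}
```

For search or indexing, the first form is all that is needed. Once a reader has to see table columns line up, only the second form is usable.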

Solution

There are a number of alternate OutputHandler implementations included with PDFTextStream. For maintaining the layout of each page of PDF content so that the text extracts look as much like the original PDF as possible, use VisualOutputTarget.

Making this change is simple in most cases: just replace all of your references to OutputTarget with VisualOutputTarget.

If you are using the java.io.Reader interface provided by PDFTextStream, you will need to switch to using a VisualOutputTarget and the accompanying .pipe(OutputHandler) methods (found on PDFTextStream, Page, and Block objects). We may make it possible to specify a different default OutputHandler in a future PDFTextStream release, which would let you continue using the java.io.Reader interface while still using alternate OutputHandler implementations.
