Why PDFTextStream?

The Complexity of PDF Content Extraction

PDF is fundamentally a distribution and presentation platform, not one designed for content or data interchange. As a result, the need to extract content from PDF documents raises some unique and highly complex problems -- challenges not present in most other content extraction or conversion situations.

Properly Ordered Content Extracts

PDF documents often encode text content out of order -- for example, encoding lines at the bottom of a page before lines at the top, or interleaving the columns of a multi-column page. Because most libraries extract text in whatever order each PDF document provides, their extracts often contain mis-ordered text. This can wreak havoc on downstream applications, such as content management systems, search engines, and data mining processes.

PDFTextStream is the only PDF text extraction API that uses its own OCR-like process to properly order text extracts. The result is that PDFTextStream produces the most accurate PDF text extracts available in the market today.
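To make the ordering problem concrete, here is a minimal Python sketch of position-based reordering. It is a generic illustration only, not PDFTextStream's actual algorithm or API: the `Fragment` type, the `order_fragments` function, and the `line_tolerance` parameter are all hypothetical names, and the sketch assumes each text fragment arrives with page coordinates in the usual PDF convention (origin at the bottom-left, so a larger y is higher on the page).

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment with its position on the page, in points."""
    text: str
    x: float  # left edge
    y: float  # baseline; larger y is higher on the page


def order_fragments(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then read the lines
    top-to-bottom and the fragments within each line left-to-right."""
    lines = []  # each entry is [baseline_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Fragments in "encoding order": the bottom line comes first, and the
# two words of the top line are swapped.
encoded = [
    Fragment("footer", 72, 100),
    Fragment("World", 120, 700),
    Fragment("Hello", 72, 700.5),
]
ordered = order_fragments(encoded)
```

A sort this simple handles only single-column pages; multi-column or rotated text needs real layout analysis, which is where the difficulty described above comes from.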

CJK Character Encoding

The PDF format also makes it difficult to accurately interpret font and character encoding information, particularly for CJK (Chinese, Japanese, and Korean) text. In today's global environment, this can be a significant obstacle for organizations that work with Asian-language content.

PDFTextStream is the only PDF content extraction library to include full CJK character encoding support, and it handles nearly all other character encodings with ease as well. Even better, PDFTextStream properly handles vertically oriented CJK content, again thanks to its OCR-like page layout engine. This attention to detail is what makes PDFTextStream the best PDF extraction solution for the global market.

Enabling Conversion of Unstructured Content into Data

For better or worse, data is often distributed as PDF files -- financial tables, inventory figures, and reports of all kinds find their way into PDF format. It is often necessary to recover these critical pieces of data from the unstructured form they take in PDF files.

PDFTextStream includes a number of exclusive, proprietary tools that make the content-to-data conversion possible. It automatically and intelligently recognizes tabular data within PDF documents, and provides a straightforward API (application programming interface) for accessing that data in a structured way.

PDFTextStream is also the only library that has the extraction fidelity and output algorithms necessary to preserve the visual layout of a page when extracted to text. This makes it easy to plug PDFTextStream's extracts into many existing text analysis tools and processes.
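As a rough illustration of what layout-preserving text output means, the Python sketch below maps positioned fragments onto a fixed-width character grid so that columns stay aligned in plain text. It is not PDFTextStream's output algorithm: `render_layout`, `char_width`, and `line_height` are hypothetical names, and the sketch assumes a single uniform character width and line height, which real documents rarely have.

```python
def render_layout(fragments, char_width=6.0, line_height=12.0):
    """Render (text, x, y) tuples as plain text that keeps column alignment.

    Coordinates are in points with the origin at the bottom-left of the
    page, so rows with a larger y are emitted first.
    """
    rows = {}
    for text, x, y in fragments:
        row = round(y / line_height)   # which text row the fragment falls on
        col = round(x / char_width)    # which character column it starts at
        rows.setdefault(row, []).append((col, text))

    out = []
    for row in sorted(rows, reverse=True):
        line = ""
        for col, text in sorted(rows[row]):
            # Pad to the target column, but always leave at least one
            # space after earlier text so neighbouring words never merge.
            pad = max(col, len(line) + 1) if line else col
            line = line.ljust(pad) + text
        out.append(line)
    return "\n".join(out)


# A tiny two-column table, given in arbitrary order.
table = [
    ("Widget", 0, 24), ("12", 60, 24),
    ("Item", 0, 36), ("Qty", 60, 36),
]
rendered = render_layout(table)
```

Because both second-column values start at the same character offset, a downstream tool can split the rendered lines by column position and recover the tabular structure.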

When Performance is Critical...

Of course, all of these features are no good if adding PDFTextStream to your application slows everything to a crawl or causes you to overflow your batch windows. Thankfully, you can have it all: in addition to offering you second-to-none accuracy and conversion features, PDFTextStream is the fastest PDF content extraction component on the market.

PDFTextStream is the undisputed performance leader within the Java market. The early-access releases of PDFTextStream for .NET and Python are holding their own as well, both coming within a few percentage points of the fastest native (C/C++) PDF extraction tools for those platforms.

When you require superior extraction performance, PDFTextStream is the only viable choice.

Learn more about PDFTextStream's key differentiators and how they drive business value:

Comprehensive PDF Format Support >>

Tons of Features >>

Superior Performance >>
