
What's New in v2.x

We are very proud to introduce you to PDFTextStream v2.0 (read the press release). Having worked on this release for nearly 18 months, we are confident that you will find it as compelling today as PDFTextStream v1.0 was two years ago.

Each of the major changes introduced in PDFTextStream v2.0 is the direct result of listening to customers who constantly push us to deliver more. These changes include:

New in v2.1: Updating Interactive AcroForms

Starting with v2.1, PDFTextStream can update persistent form field values in PDF documents that contain interactive AcroForms. This is in addition to its existing ability to extract form field values. Document workflows, and PDF documents with AcroForms in particular, continue to migrate in astounding numbers to 100% paperless solutions. Given this trend, it's good to know that PDFTextStream now delivers robust form filling and can write out the updated PDF documents, so your enterprise or products can deliver customized PDF forms for customers, business partners, government compliance, and data archival requirements.

Now Available for .NET and Python

One of the most frequently requested "features" was availability on platforms other than Java. And, while PDFTextStream's origins are rooted firmly in the Java world, we have found ways to bring it to two more very popular, very powerful platforms: .NET and Python.

PDFTextStream.NET and PDFTextStream.Python don't ask you to compromise: you get all of the performance, functionality, robustness, and features provided by PDFTextStream for Java, but on the development platform of your choice. Rarely is life this kind.

Chinese, Japanese, Korean (CJK) Text Extraction

Today's marketplace is global, and surviving means being able to work well with others, regardless of their location or language. So, it is fitting that PDFTextStream v2.0 now supports extracting Chinese, Japanese, and Korean (CJK) text from PDF documents (as well as other, lesser-known double-byte character sets).

PDFTextStream v2.0's CJK text extraction capabilities aren't half-baked or bolted on; they've been built into the library at the lowest levels. This enables PDFTextStream v2.0 to:

  • properly extract CJK text written horizontally as well as vertically
  • properly segment and order chunks of CJK text on each page of a PDF (rather than blindly outputting the CJK text in the order it is encoded in the PDF file, as is done by some PDF extraction libraries)
  • do all of this with no performance penalty or additional cost to you
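To make the ordering point concrete, here is a minimal sketch of what "segment and order by position" can mean. It is not PDFTextStream's algorithm or API. The Chunk type, the reading_order function, the tolerance value, and the assumption that y grows downward from the top of the page are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A run of text with a position (hypothetical type for this sketch)."""
    text: str
    x: float  # left edge, in page units
    y: float  # distance from the top of the page (assumption: y grows downward)


def reading_order(chunks, vertical=False, tolerance=2.0):
    """Group chunks into lines (or columns) by position and return their text.

    Horizontal writing: lines run top to bottom, chunks left to right.
    Vertical writing: columns run right to left, chunks top to bottom.
    """
    if vertical:
        major = lambda c: -c.x  # rightmost column first
        minor = lambda c: c.y
    else:
        major = lambda c: c.y   # topmost line first
        minor = lambda c: c.x

    groups, current = [], []
    for chunk in sorted(chunks, key=major):
        # Start a new line/column when the chunk is too far from the current one.
        if current and abs(major(chunk) - major(current[0])) > tolerance:
            groups.append(current)
            current = []
        current.append(chunk)
    if current:
        groups.append(current)

    return ["".join(c.text for c in sorted(g, key=minor)) for g in groups]


# Chunks listed in "encoded" order, which differs from reading order.
horizontal = [Chunk("世界", 30, 10), Chunk("你好", 10, 10), Chunk("再见", 10, 30)]
vertical = [Chunk("き", 10, 10), Chunk("書", 50, 30), Chunk("縦", 50, 10)]

print(reading_order(horizontal))
print(reading_order(vertical, vertical=True))
```

A real extractor also has to detect the writing direction and cope with multi-column pages; this sketch takes the direction as a parameter and assumes a single block of text.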

Huge Performance Boost

PDFTextStream's performance has always led the pack, but PDFTextStream v2.0 raises the bar yet again. PDFTextStream v2.0 is up to twice as fast as PDFTextStream v1.4, so it blows away the competition more than ever before.

This translates into increased efficiencies and reduced costs for your IT department, improved productivity for your staff and users, and a better experience for your customers.

New Unstructured Content Tools

Finding the mission-critical data trapped in the sea of unstructured content in your enterprise and integrating it into your existing systems is one of the most valuable contributions an IT team can make. PDFTextStream v2.0 introduces some functionality that can make tackling PDF-bound data extraction and conversion goals easier:

Table Structure Derivation

Extracting tabular data from PDF documents is one of the most common unstructured content tasks. PDFTextStream v2.0 makes this job a lot easier by:

  • automatically recognizing most visually-defined tables (those that use visible lines to indicate columns and rows),
  • deriving their structure,
  • and providing a powerful API for accessing the data held in such tables
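As a rough illustration of the idea behind deriving structure from visible lines, the sketch below assigns text chunks to cells using the positions of the ruling lines. It is an assumption-laden toy, not PDFTextStream's implementation or API: the derive_table function and the (x, y, text) chunk format are hypothetical, and it assumes the ruling-line positions have already been detected.

```python
from bisect import bisect_right


def derive_table(col_lines, row_lines, chunks):
    """Place text chunks into a grid of cells bounded by ruling lines.

    col_lines: x positions of the vertical rules
    row_lines: y positions of the horizontal rules (y grows downward)
    chunks:    iterable of (x, y, text) tuples
    """
    col_lines = sorted(col_lines)
    row_lines = sorted(row_lines)
    n_cols = len(col_lines) - 1
    n_rows = len(row_lines) - 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]

    for x, y, text in chunks:
        # Index of the rule immediately left of / above the chunk.
        col = bisect_right(col_lines, x) - 1
        row = bisect_right(row_lines, y) - 1
        # Ignore text that falls outside the ruled area.
        if 0 <= col < n_cols and 0 <= row < n_rows:
            grid[row][col] = (grid[row][col] + " " + text).strip()
    return grid


table = derive_table(
    col_lines=[0, 100, 200],
    row_lines=[0, 20, 40],
    chunks=[(110, 5, "Qty"), (10, 5, "Item"), (10, 25, "Widget"),
            (110, 25, "3"), (250, 5, "outside")],
)
for row in table:
    print(row)
```

Once the cells are known, the data can be read by row and column index instead of being picked out of a flat text stream.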

VisualOutputTarget

Some kinds of tabular content (or other data formats) cannot be recognized as tables. In such circumstances, extracting the text of the source PDF files while retaining each page's visual layout can simplify the conversion of unstructured content into structured data elements. PDFTextStream v2.0 makes this possible via the new VisualOutputTarget class, which keeps the text extracted from each PDF file formatted as it appears when viewed in Adobe Acrobat, for example.
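The general technique of layout-preserving extraction can be sketched as placing each text chunk on a character grid according to its page position. The code below only illustrates that idea and does not show how VisualOutputTarget works internally. The render_layout function, the (x, y, text) chunk format, and the fixed char_width and line_height values are hypothetical.

```python
def render_layout(chunks, char_width=6.0, line_height=12.0):
    """Render (x, y, text) chunks as plain text that mirrors the page layout.

    Assumes y grows downward and that every character is char_width wide.
    """
    rows = {}
    for x, y, text in chunks:
        row = int(y // line_height)
        col = int(x // char_width)
        line = rows.setdefault(row, [])
        # Pad with spaces so the chunk starts at its own column.
        if len(line) < col:
            line.extend(" " * (col - len(line)))
        line[col:col + len(text)] = list(text)

    if not rows:
        return ""
    # Emit every row up to the last one, keeping blank lines for vertical gaps.
    return "\n".join(
        "".join(rows.get(r, [])).rstrip() for r in range(max(rows) + 1)
    )


page = render_layout(
    [(60, 0, "Total"), (0, 0, "Name"), (0, 12, "Ann"), (60, 12, "42")]
)
print(page)
```

Because the columns stay aligned in the output, downstream code can split each line at fixed offsets even when no ruling lines mark the table.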

"And so much more!"

There are many additional features and beneficial tweaks included in PDFTextStream v2.0. These include:

  • PDF File Merging
  • Lucene support (v1.2 - v2.2)
  • Improved font support (Type1C improvements)
  • Improved handling of rotated text

Please refer to the change log for a full list of the changes present in PDFTextStream v2.0.
