
PDFTextStream Change Log

Changes in PDFTextStream v2.2.1

  • PDFTextStream.NET now ships with ikvm v0.3.4, which fixes a number of problems that prevented PDFTextStream from functioning properly across multiple AppDomains (598)
  • Added PDFTextStream.loadLicense(URL) function (475)
  • Added a 'spacing scale' property to VisualOutputTarget which allows applications to control the amount of horizontal whitespace that should be emitted per physical amount of whitespace found in the source document (528)
  • PDFTextStream will now attempt to load a license file from the host application's current directory before checking the current classpath / AppDomain (661)
  • Fixed a problem where pathological embedded Unicode character encodings were causing PDFTextStream to emit strings of control characters rather than reasonable extracted content (428)
  • Fixed a bug in PDFTextStream's handling of cross reference entries that caused fatal errors in some documents (620)
  • Fixed a problem where UTF-16 encoded bookmark titles were not being decoded properly (618)
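
The license lookup order described above (current directory first, then the classpath) can be sketched as follows. This only illustrates the ordering and is not PDFTextStream's implementation; the `LicenseLocator` class, its `locate` method, and the `example.license` file name are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of the PDFTextStream API: illustrates
// "check the current directory first, then fall back to the classpath".
final class LicenseLocator {

    // Returns the URL of the first match, or null if neither location has the file.
    static URL locate(Path workingDir, String fileName, ClassLoader loader) {
        Path candidate = workingDir.resolve(fileName);
        if (Files.isRegularFile(candidate)) {
            try {
                return candidate.toUri().toURL();
            } catch (MalformedURLException e) {
                // Unusable path: fall through to the classpath lookup.
            }
        }
        return loader.getResource(fileName);
    }

    public static void main(String[] args) {
        Path cwd = Paths.get(System.getProperty("user.dir"));
        // "example.license" is a placeholder name chosen for this sketch.
        URL found = locate(cwd, "example.license", LicenseLocator.class.getClassLoader());
        System.out.println(found == null ? "no license file found" : found.toString());
    }
}
```

A URL located this way is the kind of value that could be handed to the PDFTextStream.loadLicense(URL) function listed above.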

Changes in PDFTextStream v2.2.0

  • Added support for Apache Lucene v2.1 and v2.2 to PDFTextStream's Lucene integration module (com.snowtide.pdf.lucene.PDFDocumentFactory)
  • Added com.snowtide.pdf.PDFTextStreamConfig, which enables simple static and runtime configuration of PDFTextStream
  • Added new PDFTextStream constructors that accept customized PDFTextStreamConfig instances, and a setConfiguration(PDFTextStreamConfig) function to set a PDFTextStream instance's configuration at runtime
  • PDFTextStream now joins adjacent rectangles that have similar stroke and fill colors, which improves various page segmentation results
  • Improved table detection processes to adaptively recognize very small "variant" table cells
  • Improved pdfts.examples.XMLOutputTarget to build an XML DOM Document instead of constructing XML using a StringBuffer; block elements now include a type attribute of "table" if the block is a table
  • Significantly improved the quality of PDF documents generated when merging PDF files (com.snowtide.pdf.util.MergeUtil) and when saving updated PDF forms (com.snowtide.pdf.forms.AcroForm#writeUpdatedDocument(OutputStream))
  • Rotated text blocks are now properly grouped within bounded regions
  • Changed pdfts.cjk.disable and pdfts.mmap.disable system properties to pdfts.cjk.enable and pdfts.mmap.enable, respectively
  • Fixed an overflow bug in PDFTextStream's PDF data parser
  • Fixed a bug where the ascent and descent characteristics of some fonts were defaulting to improper values
  • Fixed a bug where lines and rectangles drawn with a Separation color space were not being recognized properly
  • Fixed a bug where an error would result when reading a PDF file with a non-conforming linebreak sequence after the 'stream' tag
  • Fixed a bug where tables containing underlined text would not be recognized properly
  • Fixed a bug where edges of rectangles were improperly recognized as text underlines
  • Fixed a bug where PDFTextStream wouldn't recognize PDF data stream filter name abbreviations
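
Because the pdfts.cjk.disable and pdfts.mmap.disable properties were renamed in this release, an application upgrading from an earlier version may want to detect the old names at startup. The following is a minimal sketch of such a check; `StalePropertyCheck` is a hypothetical helper, not part of PDFTextStream. It deliberately does not translate values, since the accepted values are not stated in this change log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical startup check, not part of PDFTextStream: reports pre-2.2.0
// property names that are still set so they can be migrated by hand.
final class StalePropertyCheck {

    // Old name, new name (taken from the v2.2.0 entry above).
    private static final String[][] RENAMES = {
        {"pdfts.cjk.disable", "pdfts.cjk.enable"},
        {"pdfts.mmap.disable", "pdfts.mmap.enable"},
    };

    static List<String> findStale(Properties props) {
        List<String> warnings = new ArrayList<>();
        for (String[] rename : RENAMES) {
            if (props.getProperty(rename[0]) != null) {
                warnings.add(rename[0] + " is set; v2.2.0 uses " + rename[1] + " instead");
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        for (String warning : findStale(System.getProperties())) {
            System.err.println(warning);
        }
    }
}
```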

Changes in PDFTextStream v2.1.6

  • Added com.snowtide.pdf.util.TableUtils, which provides a set of CSV conversion functions for exporting the contents of tables
  • Added the option to specify the path of the PDFTextStream license file via the pdftslicensepath environment variable or system property
  • Added com.snowtide.pdf.PDFTextStream.loadLicense(String), a programmatic way to specify the path from which to load the PDFTextStream license file
  • Changed PDFTextStream's default page segmentation algorithms to not eliminate empty table cells, making it simpler to export tabular content to Excel, etc.
  • Fixed bug in VisualOutputTarget where vertically-adjacent lines of text were being inappropriately combined
  • Fixed text encoding bug where text extracted from PDF documents generated by Adobe InDesign v4.0 - v5.0 would be "scrambled", or appear as a series of Chinese glyphs
  • Fixed bug where AFM font mappings were sometimes applied in an incorrect order, leading to spot errors in text extracts
  • Fixed bug where certain embedded Type1 font encodings were not being loaded correctly, resulting in single-character extraction errors
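
To see why retaining empty table cells simplifies export, consider a generic CSV writer over a grid of cell strings. The sketch below is not the TableUtils API (whose signatures are not shown in this change log); `CsvSketch` and its methods are hypothetical, and the quoting follows the common RFC 4180 conventions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Generic CSV writer for a grid of cell strings. This is NOT the TableUtils API;
// it only illustrates why retained empty cells keep columns aligned.
final class CsvSketch {

    // Quote a field if it contains a comma, a double quote, or a line break.
    static String escape(String cell) {
        String value = cell == null ? "" : cell;
        if (value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // Join cells with commas and rows with CRLF.
    static String toCsv(List<List<String>> rows) {
        return rows.stream()
                .map(row -> row.stream().map(CsvSketch::escape).collect(Collectors.joining(",")))
                .collect(Collectors.joining("\r\n"));
    }

    public static void main(String[] args) {
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Item", "Qty", "Note"),
                Arrays.asList("Widget", "", "back-ordered, ships later"));
        System.out.println(toCsv(rows));
    }
}
```

The empty "Qty" cell in the second row still produces a field, so "Note" stays in the third column; if the empty cell had been dropped, every later value in that row would shift left.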

Changes in PDFTextStream v2.1.5

  • Significant improvements in the handling and standard output of rotated content
  • Added com.snowtide.pdf.layout.TextUnit.getTheta()

Changes in PDFTextStream v2.1.3

  • Added com.snowtide.pdf.Font.isItalic() -- indicates whether a font is italicized
  • Added com.snowtide.pdf.layout.TextUnit.isUnderlined() -- indicates whether a character is underlined
  • Added tagging of italic text regions in pdfts.examples.XMLOutputTarget

Changes in PDFTextStream v2.1.2

  • Fixed page rotation detection bug when processing PDF documents generated by Crystal Reports

Changes in PDFTextStream v2.1.1

  • Significant improvements in output of VisualOutputTarget, especially for pages with many different font sizes
  • Fixed calculation of character widths for Type0 fonts that have a recognized AFM base font name

Changes in PDFTextStream v2.1

  • Added support for updating text, checkbox, radio button, and choice interactive form fields
  • Added support for Kodak print job data extraction (%KDK commands) via com.snowtide.pdf.util.KodakPrintData
  • Exposed the AcroFormField.isReadOnly() function
  • Added ByteBuffer-based buildPDFDocument() functions to com.snowtide.pdf.lucene.PDFDocumentFactory
  • Added the pdfts.logfactory and pdfts.loggingtype system variables to simplify the customization of logging via com.snowtide.util.logging.LoggingRegistry
  • java.util.logging is now the default logging toolkit; pdfts.loggingtype may be used to change that. Refer to the LoggingRegistry javadocs for more info.
  • Improved documentation significantly
  • Fixed a problem where merged PDF documents that contained empty dictionaries would be improperly generated
  • Fixed a problem where the "rich text" value of text interactive form fields would not be loaded
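
With java.util.logging as the default toolkit, log verbosity can be adjusted through the standard JDK API. The following is a minimal sketch; the logger name "com.snowtide" is an assumption based on the package names in this change log, and the actual logger names are documented in the LoggingRegistry javadocs rather than here. `LoggingSetup` is a hypothetical helper.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper: lowers log verbosity using the standard java.util.logging API.
final class LoggingSetup {

    // Assumption: loggers are named after the com.snowtide package prefix.
    // A strong reference is kept because the JDK holds loggers weakly, so a
    // level set on an otherwise unreferenced logger could be lost.
    private static final Logger PDF_LOGGER = Logger.getLogger("com.snowtide");

    // Only WARNING and more severe records pass this logger (and child
    // loggers that do not set their own level).
    static Logger quiet() {
        PDF_LOGGER.setLevel(Level.WARNING);
        return PDF_LOGGER;
    }

    public static void main(String[] args) {
        Logger logger = quiet();
        logger.info("suppressed at WARNING level");
        logger.warning("still reported");
    }
}
```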

Changes in PDFTextStream v2.0.5

  • Fixed handling of text spacing that was causing some columnated text to overrun column boundaries improperly
  • Fixed a problem where text from adjacent lines would be inappropriately intermingled
  • Changed unlicensed functionality so that evaluation use would not require a special evaluation license file; specifically, PDFTextStream will randomize some digits in text extracts when it is operating unlicensed, and the 8-page extract limitation has been removed

Changes in PDFTextStream v2.0.2

  • Added com.snowtide.pdf.RegionOutputTarget to support region-specific content extraction
  • Added ability to derive encoding and spatial metrics of Type3 fonts; added pdfts.type3.derive system property to disable derivation if necessary (359)
  • Fixed problem with com.snowtide.pdf.VisualOutputTarget, where lines would sometimes be inappropriately combined (356)

Changes in PDFTextStream v2.0.1

  • Better indication of corrupted or otherwise unreadable PDF files (com.snowtide.pdf.FaultyPDFException)
  • Added pipe(OutputHandler) function to com.snowtide.pdf.layout.Line
  • Added pdfts.mmap.disable system property option to disable memory-mapping of PDF files - avoids JDK bug #4724038 (355)

Changes in PDFTextStream v2.0

  • PDFTextStream now available for .NET and Python
  • Added support for extraction of Chinese, Japanese, and Korean text (CJK)
  • Added support for accessing derived table structure (com.snowtide.pdf.layout.Table)
  • Significantly improved performance
  • Significantly improved accuracy of extraction of rotated text
  • Added support for Lucene v1.9 and v2.0
  • Added "visual" text layout output target (com.snowtide.pdf.VisualOutputTarget)
  • Added PDF merge capability (com.snowtide.pdf.util.PDFMergeUtil)
  • Added support for Type1C embedded font files (274)
  • Fixed issue where some bookmarks would have invalid page number attributes (i.e. -1) (324)
  • Fixed issues where blocks, lines, and textunits that represented rotated text reported inaccurate positions on the page
  • Fixed issue where xref table was not being rebuilt when object locator was simply missing (338)
  • Eliminated com.snowtide.pdf.PDFTextStreamOptions (deprecated in v1.3)

Changes in PDFTextStream v1.4

  • Added support for interactive PDF forms (AcroForms) (com.snowtide.pdf.forms.* and com.snowtide.pdf.PDFTextStream.getFormData()) (118)
  • Added support for derivation of 'graphical' font encoding (Type3) (297)
  • Added com.snowtide.pdf.OutputHandler base class for OutputTarget
  • Added PDFTextStream constructor that takes a java.nio.ByteBuffer, enabling completely in-memory operation
  • Added an example class that extracts form data as XML (pdfts.examples.XMLFormExport)
  • Added sample implementation of com.snowtide.pdf.OutputHandler that outputs PDF text as XML, indicating document structure and where bolded text ranges exist (pdfts.examples.XMLOutputTarget)
  • Added sample OutputHandler implementation that exports PDF text content as an XHTML document (pdfts.examples.GoogleHTMLOutputHandler)
  • Fixed bug where inline images were not being properly skipped (308)
  • Fixed bug where destination bounds of some bookmarks and annotations were not being properly set (307)
  • Fixed bug where text properties (font size, character encoding, etc) would persist beyond where they should (298)
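
For the in-memory operation mentioned above, an application first needs the PDF bytes in a java.nio.ByteBuffer. The sketch below shows one way to prepare such a buffer with standard JDK calls (it uses java.nio.file, which postdates this release); `InMemoryPdf` is a hypothetical helper, and the exact signature of the ByteBuffer-based PDFTextStream constructor is not given in this change log, so the constructor call itself is left out.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: reads a whole PDF file into a heap ByteBuffer.
final class InMemoryPdf {

    // The returned buffer is positioned at 0 with its limit at the file length.
    static ByteBuffer load(Path pdf) throws IOException {
        return ByteBuffer.wrap(Files.readAllBytes(pdf));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: InMemoryPdf <file.pdf>");
            return;
        }
        ByteBuffer buffer = load(Paths.get(args[0]));
        System.out.println(buffer.remaining() + " bytes buffered");
        // The buffer would then be passed to the ByteBuffer-based PDFTextStream
        // constructor; consult the javadocs for its exact parameters.
    }
}
```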

Changes in PDFTextStream v1.3.6

  • Fixed potential OutOfMemoryError caused by complex graphical regions (295)
  • Fixed bug where out-of-date content might be extracted from updated PDF documents (296)

Changes in PDFTextStream v1.3.5

  • Added PDF annotation API (com.snowtide.pdf.annot.*) (76)
  • Added PDF bookmark API (com.snowtide.pdf.Bookmark and com.snowtide.pdf.PDFTextStream.getBookmarks()) (284)
  • Significantly improved performance parsing PDF data containing very complex illustrations (282)
  • Improved triage procedures for handling damaged or malformed PDF files (292)
  • Fixed bug where com.snowtide.pdf.Page.getPageNumber() was reporting 1-indexed page numbers; it now properly reports 0-indexed page numbers (283)
  • Fixed parsing bug related to zero-length PDF names (290)

Changes in PDFTextStream v1.3.4

  • Improved rectangle and line detection to avoid skipping graphics that impact text layout (272)
  • Improved the algorithm used to calculate the number of line breaks to be outputted between lines of text (271)
  • Improved detection and handling of malformed PDF documents to prevent potential infinite loops (278)
  • Fixed compatibility problem with PDFs generated by IBM Manyimage tool
  • Fixed compatibility problem with PDFs generated by SAP R/3 (276)
  • Fixed error thrown when some blank pages are encountered (270)

Changes in PDFTextStream v1.3.3

  • Expanded support for referenced form XObjects; results in more complete text extracts (263)
  • Improved font lookup routines; now caching frequently-referenced fonts for improved performance
  • Fixed logging classloading issue on JDK 1.3.1_01

Changes in PDFTextStream v1.3.2

  • Significant performance enhancement through improved usage of java.nio.* classes; available only on JDK 1.4+

Changes in PDFTextStream v1.3.1

  • Fixed integration with JDK v1.4 java.util.logging toolkit

Changes in PDFTextStream v1.3

  • Added ability to retrieve PDF document page attributes (height, width, rotation, etc) (94)
  • Added ability to retrieve PDF document pages one at a time (94)
  • Added ability to retrieve PDF document encryption parameters (99)
  • Added ability to retrieve PDF file specification version number (91)
  • Added pipe() method to PDFTextStream and retrieved PDF pages, allowing easy redirection of content to a buffer or file (92)
  • Significantly improved page segmentation and document read-ordering, resulting in more semantically-consistent text extracts
  • Significantly improved extraction of rotated text
  • Significantly improved extraction of line-bounded tables (107)
  • Deprecated PDFTextStreamOptions class: strictEncoding and page header options no longer used (87, 98)
  • PDFTextStream now always produces Unicode text; the ASCII-only option is no longer provided, as it proved to be unreliable (87)
  • Fixed some minor Unicode text extraction issues related to selecting the proper character encoding for Type 1 fonts (86)
  • Fixed PDFTextStream's implementation of the PDF graphics state stack to more closely conform to the PDF spec (90)
  • Fixed problem where certain monospaced characters might be omitted from output (35)
  • Fixed problem where text might be scrambled on a line that contains certain monospaced text (182)

Changes in PDFTextStream v1.2

  • Added support for retrieving document-level Adobe XMP data (document metadata in an XML format) (66)
  • Added support for PDF v1.5 files encrypted using crypt filters that specify an invalid decryption key length (63)
  • Improved overview documentation of metadata access in Javadoc and Developer's Guide (70)
  • Fixed support for decrypting updated PDF v1.4 files encrypted with 128-bit passwords (62)
  • Fixed internal error that might have occurred in connection with processing updated PDF documents (72)

Changes in PDFTextStream v1.1.2

  • Enhanced the core parsing routines to accept PDF files that use improper (or nonexistent) string escape sequences
  • Fixed a bug that caused hard errors when processing some PDF v1.5 documents.
  • Fixed a bug where a particular text mapping (hex / CIDFont mappings) used in some PDFs would be misinterpreted, resulting in space characters being outputted instead of 'regular' characters

Changes in PDFTextStream v1.1.1

  • Fixed a problem where some PDFs that use a particular type of TrueType font were converted into useless text content

Changes in PDFTextStream v1.1

  • JDK v1.3 is now fully supported.
  • Significant improvements have been made in the layout and formatting of rotated text.
  • All logging is now channeled through Jakarta's commons-logging library to enable usage of logging toolkits other than log4j.