PDFTextStream Change Log
Changes in PDFTextStream v2.2.1
- PDFTextStream.NET now ships with ikvm v0.3.4, which fixes a number of problems that prevented PDFTextStream from functioning properly across multiple AppDomains (598)
- Added PDFTextStream.loadLicense(URL) function (475)
- Added a 'spacing scale' property to VisualOutputTarget which allows applications to control the amount of horizontal whitespace that should be emitted per physical amount of whitespace found in the source document (528)
- PDFTextStream will now attempt to load a license file from the host application's current directory before checking the current classpath / AppDomain (661)
- Fixed a problem where pathological embedded Unicode character encodings were causing PDFTextStream to output strings of control characters rather than reasonable extracted content (428)
- Fixed a bug in PDFTextStream's handling of cross reference entries that caused fatal errors in some documents (620)
- Fixed a problem where UTF-16 encoded bookmark titles were not being decoded properly (618)
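For context on the UTF-16 bookmark title fix above: PDF text strings that begin with the byte order mark FE FF are encoded as UTF-16BE, while strings without it use PDFDocEncoding. The following is a minimal sketch of that distinction, not PDFTextStream's actual implementation. The class and method names are hypothetical, and ISO-8859-1 is used only as an approximation of PDFDocEncoding.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; not part of the PDFTextStream API.
final class PdfTextStringSketch {

    /** Decodes a raw PDF text string, honoring a leading UTF-16BE byte order mark. */
    static String decode(byte[] raw) {
        boolean hasBom = raw.length >= 2
                && (raw[0] & 0xFF) == 0xFE
                && (raw[1] & 0xFF) == 0xFF;
        if (hasBom) {
            // Skip the two BOM bytes and decode the remainder as UTF-16BE.
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 stands in for PDFDocEncoding. The two agree for
        // ASCII and most of the Latin-1 range, but not for every code point.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] title = {(byte) 0xFE, (byte) 0xFF,
                0x00, 0x43, 0x00, 0x68, 0x00, 0x2E, 0x00, 0x20, 0x00, 0x31};
        System.out.println(decode(title));
    }
}
```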
Changes in PDFTextStream v2.2.0
- Added support for Apache Lucene v2.1 and v2.2 to PDFTextStream's
Lucene integration module (com.snowtide.pdf.lucene.PDFDocumentFactory)
- Added com.snowtide.pdf.PDFTextStreamConfig, which enables simple static and runtime configuration of PDFTextStream
- Added new PDFTextStream constructors that accept customized
PDFTextStreamConfig instances, and a
setConfiguration(PDFTextStreamConfig) function to set a PDFTextStream
instance's configuration at runtime
- PDFTextStream now joins adjacent rectangles that have similar
stroke and fill colors, which improves various page segmentation results
- Improved table detection processes to adaptively recognize very small "variant" table cells
- Improved pdfts.examples.XMLOutputTarget to build an XML DOM
Document instead of constructing XML using a StringBuffer; block
elements now include a type attribute of "table" if the block is a table
- Significantly improved the quality of PDF documents generated when
merging PDF files (com.snowtide.pdf.util.MergeUtil) and when saving
updated PDF forms
(com.snowtide.pdf.forms.AcroForm#writeUpdatedDocument(OutputStream))
- Rotated text blocks are now properly grouped within bounded regions
- Changed pdfts.cjk.disable and pdfts.mmap.disable system properties to pdfts.cjk.enable and pdfts.mmap.enable, respectively
- Fixed an overflow bug in PDFTextStream's PDF data parser
- Fixed a bug where the ascent and descent characteristics of some fonts were defaulting to improper values
- Fixed a bug where lines and rectangles drawn with a Separation color space were not being recognized properly
- Fixed a bug where an error would result when reading a PDF file with a non-conforming linebreak sequence after the 'stream' tag
- Fixed a bug where tables containing underlined text would not be recognized properly
- Fixed a bug where edges of rectangles were improperly recognized as text underlines
- Fixed a bug where PDFTextStream wouldn't recognize PDF data stream filter name abbreviations
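For context on the filter name abbreviation fix above: the PDF specification defines short filter names for inline images, and some files also use them on ordinary data streams. The following is a minimal sketch of normalizing them to their full names. It is not PDFTextStream's code, and the class and method names are hypothetical.

```java
import java.util.Map;

// Illustrative sketch only; not part of the PDFTextStream API.
final class FilterNameSketch {

    // Abbreviated filter names and their full equivalents, per the PDF specification.
    private static final Map<String, String> ABBREVIATIONS = Map.of(
            "AHx", "ASCIIHexDecode",
            "A85", "ASCII85Decode",
            "LZW", "LZWDecode",
            "Fl", "FlateDecode",
            "RL", "RunLengthDecode",
            "CCF", "CCITTFaxDecode",
            "DCT", "DCTDecode");

    /** Returns the full filter name for an abbreviation, or the input unchanged. */
    static String canonicalName(String name) {
        return ABBREVIATIONS.getOrDefault(name, name);
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("Fl"));
        System.out.println(canonicalName("FlateDecode"));
    }
}
```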
Changes in PDFTextStream v2.1.6
- Added com.snowtide.pdf.util.TableUtils, which provides a set of CSV conversion functions for exporting the contents of tables
- Added the option to specify the path from which to load the PDFTextStream license file via the pdftslicensepath environment variable or system property
- Added com.snowtide.pdf.PDFTextStream.loadLicense(String), a
programmatic way to specify the path from which to load the
PDFTextStream license file
- Changed PDFTextStream's default page segmentation algorithms to not
eliminate empty table cells, making it simpler to export tabular
content to Excel, etc.
- Fixed bug in VisualOutputTarget where vertically-adjacent lines of text were being inappropriately combined
- Fixed text encoding bug where text extracted from PDF documents
generated by Adobe InDesign v4.0 - v5.0 would be "scrambled", or appear
to be a series of Chinese glyphs
- Fixed bug where AFM font mappings were sometimes applied in an incorrect order, leading to spot errors in text extracts
- Fixed bug where certain embedded Type1 font encodings were not
being loaded correctly, resulting in single-character extraction errors
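The note above about keeping empty table cells matters for CSV export because each row then keeps the same number of fields. The following is a minimal, generic sketch of CSV conversion that preserves empty cells. It does not use or reproduce com.snowtide.pdf.util.TableUtils; the class, the method names, and the String[][] table representation are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the TableUtils implementation.
final class TableCsvSketch {

    /** Converts rows of cell text to CSV, emitting empty fields for empty or null cells. */
    static String toCsv(String[][] rows) {
        StringBuilder out = new StringBuilder();
        for (String[] row : rows) {
            List<String> fields = new ArrayList<>();
            for (String cell : row) {
                fields.add(escape(cell == null ? "" : cell));
            }
            out.append(String.join(",", fields)).append("\r\n");
        }
        return out.toString();
    }

    /** Quotes a field if it contains a comma, quote, or line break; doubles embedded quotes. */
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        String[][] table = {
                {"Item", "Qty", "Note"},
                {"Widget", "", "5, boxed"}
        };
        System.out.print(toCsv(table));
    }
}
```

Because the empty Qty cell is kept as an empty field, the Note value stays in the third column when the output is opened in a spreadsheet.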
Changes in PDFTextStream v2.1.5
- Significant improvements in the handling and standard output of rotated content
- Added com.snowtide.pdf.layout.TextUnit.getTheta()
Changes in PDFTextStream v2.1.3
- Added com.snowtide.pdf.Font.isItalic() -- indicates whether a font is italicized
- Added com.snowtide.pdf.layout.TextUnit.isUnderlined() -- indicates whether a character is underlined
- Added tagging of italic text regions in pdfts.examples.XMLOutputTarget
Changes in PDFTextStream v2.1.2
- Fixed page rotation detection bug when processing PDF documents generated by Crystal Reports
Changes in PDFTextStream v2.1.1
- Significant improvements in output of VisualOutputTarget, especially for pages with many different font sizes
- Fixed calculation of character widths for Type0 fonts that have a recognized AFM base font name
Changes in PDFTextStream v2.1
- Added support for updating text, checkbox, radio button, and choice interactive form fields
- Added support for Kodak print job data extraction (%KDK commands) via com.snowtide.pdf.util.KodakPrintData
- Exposed the AcroFormField.isReadOnly() function
- Added ByteBuffer-based buildPDFDocument() functions to com.snowtide.pdf.lucene.PDFDocumentFactory
- Added the pdfts.logfactory and pdfts.loggingtype system variables to simplify the customization of logging via com.snowtide.util.logging.LoggingRegistry
- java.util.logging is now the default logging toolkit; pdfts.loggingtype may be used to change that. Refer to the LoggingRegistry javadocs for more info.
- Improved documentation significantly
- Fixed a problem where merged PDF documents that contained empty dictionaries would be improperly generated
- Fixed a problem where the "rich text" value of text interactive form fields would not be loaded
Changes in PDFTextStream v2.0.5
- Fixed handling of text spacing that was causing some columnated text to overrun column boundaries improperly
- Fixed a problem where text from adjacent lines would be inappropriately intermingled
- Changed unlicensed functionality so that evaluation use would not require a special evaluation license file; specifically, PDFTextStream will randomize some digits in text extracts when it is operating unlicensed, and the 8-page extract limitation has been removed
Changes in PDFTextStream v2.0.2
- Added com.snowtide.pdf.RegionOutputTarget to support region-specific content extraction
- Added ability to derive encoding and spatial metrics of Type3 fonts; added pdfts.type3.derive system property to disable derivation if necessary (359)
- Fixed problem with com.snowtide.pdf.VisualOutputTarget, where lines would sometimes be inappropriately combined (356)
Changes in PDFTextStream v2.0.1
- Better indication of corrupted or otherwise unreadable PDF files (com.snowtide.pdf.FaultyPDFException)
- Added pipe(OutputHandler) function to com.snowtide.pdf.layout.Line
- Added pdfts.mmap.disable system property option to disable memory-mapping of PDF files - avoids JDK bug #4724038 (355)
Changes in PDFTextStream v2.0
- PDFTextStream now available for .NET and Python
- Added support for extraction of Chinese, Japanese, and Korean text (CJK)
- Added support for accessing derived table structure (com.snowtide.pdf.layout.Table)
- Significantly improved performance
- Significantly improved accuracy of extraction of rotated text
- Added support for Lucene v1.9 and v2.0
- Added "visual" text layout output target (com.snowtide.pdf.VisualOutputTarget)
- Added PDF merge capability (com.snowtide.pdf.util.PDFMergeUtil)
- Added support for Type1C embedded font files (274)
- Fixed issue where some bookmarks would have invalid page number attributes (i.e. -1) (324)
- Fixed issues where blocks, lines, and textunits that represented rotated text reported inaccurate positions on the page
- Fixed issue where xref table was not being rebuilt when object locator was simply missing (338)
- Eliminated com.snowtide.pdf.PDFTextStreamOptions (deprecated in v1.3)
Changes in PDFTextStream v1.4
- Added support for interactive PDF forms (AcroForms) (com.snowtide.pdf.forms.* and com.snowtide.pdf.PDFTextStream.getFormData()) (118)
- Added support for derivation of 'graphical' font encoding (Type3) (297)
- Added com.snowtide.pdf.OutputHandler base class for OutputTarget
- Added PDFTextStream constructor that takes a java.nio.ByteBuffer, enabling completely in-memory operation
- Added an example class that extracts form data as XML (pdfts.examples.XMLFormExport)
- Added sample implementation of com.snowtide.pdf.OutputHandler that outputs PDF text as XML, indicating document structure and where bolded text ranges exist (pdfts.examples.XMLOutputTarget)
- Added sample OutputHandler implementation that exports PDF text content as an XHTML document (pdfts.examples.GoogleHTMLOutputHandler)
- Fixed bug where inline images were not being properly skipped (308)
- Fixed bug where destination bounds of some bookmarks and annotations were not being properly set (307)
- Fixed bug where text properties (font size, character encoding, etc) would persist beyond where they should (298)
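The ByteBuffer constructor noted above lets an application supply PDF data that is already in memory. The following is a minimal sketch of preparing such a buffer with standard java.nio calls only. The class and method names are hypothetical, and the PDFTextStream constructor call itself is left as a comment because its exact signature is not given in this change log.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative sketch only; not part of the PDFTextStream API.
final class InMemoryPdfSketch {

    /** Reads an entire file into a heap ByteBuffer positioned at the start of the data. */
    static ByteBuffer loadFully(Path pdf) throws IOException {
        return ByteBuffer.wrap(Files.readAllBytes(pdf));
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: InMemoryPdfSketch <file.pdf>");
            return;
        }
        ByteBuffer data = loadFully(Paths.get(args[0]));
        System.out.println("Loaded " + data.remaining() + " bytes");
        // Assumption: 'data' would then be passed to the ByteBuffer-accepting
        // PDFTextStream constructor described in the entry above.
    }
}
```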
Changes in PDFTextStream v1.3.6
- Fixed potential OutOfMemoryError caused by complex graphical regions (295)
- Fixed bug where out-of-date content might be extracted from updated PDF documents (296)
Changes in PDFTextStream v1.3.5
- Added PDF annotation API (com.snowtide.pdf.annot.*) (76)
- Added PDF bookmark API (com.snowtide.pdf.Bookmark and com.snowtide.pdf.PDFTextStream.getBookmarks()) (284)
- Significantly improved performance parsing PDF data containing very complex illustrations (282)
- Improved triage procedures for handling damaged or malformed PDF files (292)
- Fixed bug where com.snowtide.pdf.Page.getPageNumber() was reporting 1-indexed page numbers; it now properly reports 0-indexed page numbers (283)
- Fixed parsing bug related to zero-length PDF names (290)
Changes in PDFTextStream v1.3.4
- Improved rectangle and line detection to avoid skipping graphics that impact text layout (272)
- Improved the algorithm used to calculate the number of line breaks to be outputted between lines of text (271)
- Improved detection and handling of malformed PDF documents to prevent potential infinite loops (278)
- Fixed compatibility problem with PDFs generated by IBM Manyimage tool
- Fixed compatibility problem with PDFs generated by SAP R/3 (276)
- Fixed error thrown when some blank pages are encountered (270)
Changes in PDFTextStream v1.3.3
- Expanded support for referenced form XObjects; results in more complete text extracts (263)
- Improved font lookup routines; now caching frequently-referenced fonts for improved performance
- Fixed logging classloading issue on JDK 1.3.1_01
Changes in PDFTextStream v1.3.2
- Significant performance enhancement through improved usage of java.nio.* classes; available only on JDK 1.4+
Changes in PDFTextStream v1.3.1
- Fixed integration with JDK v1.4 java.util.logging toolkit
Changes in PDFTextStream v1.3
- Added ability to retrieve PDF document page attributes (height, width, rotation, etc) (94)
- Added ability to retrieve PDF document pages one at a time (94)
- Added ability to retrieve PDF document encryption parameters (99)
- Added ability to retrieve PDF file specification version number (91)
- Added pipe() method to PDFTextStream and retrieved PDF pages, allowing easy redirection of content to a buffer or file (92)
- Significantly improved page segmentation and document read-ordering, resulting in more semantically-consistent text extracts
- Significantly improved extraction of rotated text
- Significantly improved extraction of line-bounded tables (107)
- Deprecated PDFTextStreamOptions class: strictEncoding and page header options no longer used (87, 98)
- PDFTextStream now always produces Unicode text; the ASCII-only option is no longer provided, as it proved to be unreliable (87)
- Fixed some minor Unicode text extraction issues related to selecting the proper character encoding for Type 1 fonts (86)
- Fixed PDFTextStream's implementation of the PDF graphics state stack to more closely conform to the PDF spec (90)
- Fixed problem where certain monospaced characters might be omitted from output (35)
- Fixed problem where text might be scrambled on a line that contains certain monospaced text (182)
Changes in PDFTextStream v1.2
- Added support for retrieving document-level Adobe XMP data (document metadata in an XML format) (66)
- Added support for PDF v1.5 files encrypted using crypt filters that specify an invalid decryption key length (63)
- Improved overview documentation of metadata access in Javadoc and Developer's Guide (70)
- Fixed support for decrypting updated PDF v1.4 files encrypted with 128-bit passwords (62)
- Fixed internal error that might have occurred in connection with processing updated PDF documents (72)
Changes in PDFTextStream v1.1.2
- Enhanced the core parsing routines to accept PDF files that use improper (or nonexistent) string escape sequences
- Fixed a bug that caused hard errors when processing some PDF v1.5 documents.
- Fixed a bug where a particular text mapping (hex / CIDFont mappings) used in some PDFs would be misinterpreted, resulting in space characters being outputted instead of 'regular' characters
Changes in PDFTextStream v1.1.1
- Fixed a problem where some PDFs that use a particular type of TrueType font were converted into useless text content
Changes in PDFTextStream v1.1
- JDK v1.3 is now fully supported.
- Significant improvements have been made in the layout and formatting of rotated text.
- All logging is now channeled through Jakarta's commons-logging library to enable usage of logging toolkits other than log4j.