Fixed issue where PDFTextStream would fail to initialize when the default system locale was set to Shift_JIS (i.e. SJIS, MS932, Windows-31J)
Fixed an issue where certain Chinese, Japanese, and Korean fonts were not being loaded properly when specific encoding config data was missing.
Fixed an octal string parsing bug that could lead to a PDF parsing failure.
Added crop box attribute to com.snowtide.pdf.Page interface
An expanded set of control characters is now treated as whitespace.
Added support for non-compliant PDF documents produced by TXT2PDF for OS/390.
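For readers unfamiliar with the octal string syntax mentioned above: in a PDF literal string, a backslash may be followed by one to three octal digits that encode a single byte. The following is a minimal sketch of that decoding rule as the PDF specification describes it. It is not PDFTextStream's parser, and the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

/** Illustrative sketch only; not part of PDFTextStream. */
final class PdfLiteralStringSketch {

    /** Decodes the body of a PDF literal string (the text between the parentheses). */
    static byte[] decode(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < body.length()) {
            char c = body.charAt(i++);
            if (c != '\\' || i == body.length()) {
                // Ordinary character; a trailing backslash is kept literally
                // (a choice made for this sketch).
                out.write(c);
                continue;
            }
            char e = body.charAt(i);
            if (e >= '0' && e <= '7') {
                // Octal escape: consume at most three octal digits.
                int value = 0;
                int digits = 0;
                while (digits < 3 && i < body.length()
                        && body.charAt(i) >= '0' && body.charAt(i) <= '7') {
                    value = value * 8 + (body.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF); // high-order overflow is ignored
            } else {
                i++;
                switch (e) {
                    case 'n': out.write('\n'); break;
                    case 'r': out.write('\r'); break;
                    case 't': out.write('\t'); break;
                    case 'b': out.write('\b'); break;
                    case 'f': out.write('\f'); break;
                    case '\r': // backslash + end-of-line is a line continuation
                        if (i < body.length() && body.charAt(i) == '\n') i++;
                        break;
                    case '\n':
                        break;
                    default:
                        out.write(e); // covers \( \) \\ and unknown escapes
                }
            }
        }
        return out.toByteArray();
    }
}
```

Note that an escape ends after three digits even if more digits follow, so \0053 decodes to the byte 5 followed by the character '3'.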
Changes in PDFTextStream v2.3.1
Added methods to VisualOutputTarget to enable the optional exclusion of rotated content from its output (523)
Fixed a bug where rotated characters were reporting a rotation angle (theta) of 0 when presented to VisualOutputTarget. (519)
Fixed a bug where use of PDFTextStream.NET in a multithreaded environment could produce garbled or missing text extracts in very limited circumstances. (512)
Added support for PDFs that contain malformed arrays in their graphics output streams (509)
Fixed a bug where text rendered using a Type3 font that has a proper unicode mapping was being omitted from extracts (507)
Significantly improved the emission of whitespace between words on lines with large amounts of tracking (506)
Fixed character mapping for 'ã' and '·' ("middle dot") (502)
Fixed a bug affecting VisualOutputTarget and RegionOutputTarget where smaller characters would not be included in resulting text extracts. (499)
Fixed an issue where string values held in compressed object streams were being re-encrypted (primarily affecting key/value PDF attributes) (495)
Fixed an issue where PDF documents generated by PDFSharp were improperly handled, leading to significant degradation of extraction accuracy. (490)
Fixed an issue where CFF font encodings were being applied inappropriately, potentially leading to garbled extracts. (479)
Fixed a bug related to zero-length cross-reference entry codes that was resulting in an improper FaultyPDFException being thrown (450)
Changes in PDFTextStream v2.3.0
Added an .isStruckThrough() method to com.snowtide.pdf.TextUnit, indicating whether a character has a strikethrough drawn through it.
Improved PDFTextStream's support for embedded character mappings.
The calculation of whitespace between words has been fixed to properly account for whitespace that is explicitly encoded in the source PDF documents.
Improved PDFTextStream's handling of composite content encodings, which previously could fail, resulting in some ranges of PDF content being 'ignored' during extraction.
Fixed a bug in VisualOutputTarget where text from a single line would be split over multiple lines
Improved vertical alignment of text extracted using VisualOutputTarget
Improved VisualOutputTarget-produced extracts to eliminate spurious additional whitespace between closely-adjacent words
Changes in PDFTextStream v2.2.5
Added support for extracting XFA forms data as XML
Significantly improved the performance of text extraction using VisualOutputTarget
Added support for PDF documents larger than 2GB
Fixed a bug where the encodings from embedded Type1 fonts were not being applied properly in some circumstances.
Fixed a problem where newer content in updated PDF documents was sometimes being ignored.
Fixed a problem where PDFDocEncoding-encoded bookmarks and metadata were not being decoded properly
Added .getDestinationName() method to com.snowtide.pdf.Bookmark
Changes in PDFTextStream v2.2.1
PDFTextStream.NET now ships with ikvm v0.3.4, which fixes a number of problems that prevented PDFTextStream from functioning properly across multiple AppDomains (598)
Added PDFTextStream.loadLicense(URL) function (475)
Added a 'spacing scale' property to VisualOutputTarget which allows applications to control the amount of horizontal whitespace that should be emitted per physical amount of whitespace found in the source document (528)
PDFTextStream will now attempt to load a license file from the host application's current directory before checking the current classpath / AppDomain (661)
Fixed a problem where pathological embedded Unicode character encodings were causing PDFTextStream to output strings of control characters rather than reasonable extracted content. (428)
Fixed a bug in PDFTextStream's handling of cross reference entries that caused fatal errors in some documents (620)
Fixed a problem where UTF-16 encoded bookmark titles were not being decoded properly (618)
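As background to the bookmark title fix above: PDF text strings such as bookmark titles are either UTF-16BE, marked by a leading byte order mark (0xFE 0xFF), or PDFDocEncoding. A rough sketch of that dispatch follows. The names are hypothetical, and ISO-8859-1 is used only as a stand-in for PDFDocEncoding, which differs from it in some code ranges (for example parts of 0x18-0x1F and 0x80-0xA0).

```java
import java.nio.charset.StandardCharsets;

/** Illustrative sketch only; not part of PDFTextStream. */
final class PdfTextStringSketch {

    /** Decodes the raw bytes of a PDF text string (e.g. a bookmark title). */
    static String decode(byte[] raw) {
        boolean hasUtf16Bom = raw.length >= 2
                && (raw[0] & 0xFF) == 0xFE
                && (raw[1] & 0xFF) == 0xFF;
        if (hasUtf16Bom) {
            // Skip the two-byte byte order mark and decode the rest as UTF-16BE.
            return new String(raw, 2, raw.length - 2, StandardCharsets.UTF_16BE);
        }
        // ASSUMPTION: ISO-8859-1 approximates PDFDocEncoding here. A real
        // decoder would need a dedicated PDFDocEncoding table.
        return new String(raw, StandardCharsets.ISO_8859_1);
    }
}
```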
Changes in PDFTextStream v2.2.0
Added support for Apache Lucene v2.1 and v2.2 to PDFTextStream's Lucene integration module (com.snowtide.pdf.lucene.PDFDocumentFactory)
Added com.snowtide.pdf.PDFTextStreamConfig, which enables simple static and runtime configuration of PDFTextStream
Added new PDFTextStream constructors that accept customized PDFTextStreamConfig instances, and a setConfiguration(PDFTextStreamConfig) function to set a PDFTextStream instance's configuration at runtime
PDFTextStream now joins adjacent rectangles that have similar stroke and fill colors, which improves various page segmentation results
Improved table detection processes to adaptively recognize very small "variant" table cells
Improved pdfts.examples.XMLOutputTarget to build an XML DOM Document instead of constructing XML using a StringBuffer; block elements now include a type attribute of "table" if the block is a table
Significantly improved the quality of PDF documents generated when merging PDF files (com.snowtide.pdf.util.MergeUtil) and when saving updated PDF forms (com.snowtide.pdf.forms.AcroForm#writeUpdatedDocument(OutputStream))
Rotated text blocks are now properly grouped within bounded regions
Changed pdfts.cjk.disable and pdfts.mmap.disable system properties to pdfts.cjk.enable and pdfts.mmap.enable, respectively
Fixed an overflow bug in PDFTextStream's PDF data parser
Fixed a bug where the ascent and descent characteristics of some fonts were defaulting to improper values
Fixed a bug where lines and rectangles drawn with a Separation color space were not being recognized properly
Fixed a bug where an error would result when reading a PDF file with a non-conforming linebreak sequence after the 'stream' keyword
Fixed a bug where tables containing underlined text would not be recognized properly
Fixed a bug where edges of rectangles were improperly recognized as text underlines
Fixed a bug where PDFTextStream wouldn't recognize PDF data stream filter name abbreviations
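For reference on the filter name abbreviations mentioned above: the PDF specification defines a small set of abbreviated filter names (for inline images), each standing for a full filter name. The sketch below expands them with a lookup table. The class and method names are hypothetical, and this is not how PDFTextStream itself is implemented.

```java
import java.util.Map;

/** Illustrative sketch only; not part of PDFTextStream. */
final class FilterNameSketch {

    // Abbreviated filter names and their full equivalents, per the PDF specification.
    private static final Map<String, String> ABBREVIATIONS = Map.of(
            "AHx", "ASCIIHexDecode",
            "A85", "ASCII85Decode",
            "LZW", "LZWDecode",
            "Fl", "FlateDecode",
            "RL", "RunLengthDecode",
            "CCF", "CCITTFaxDecode",
            "DCT", "DCTDecode");

    /** Returns the full filter name; names that are not abbreviations pass through unchanged. */
    static String canonical(String name) {
        return ABBREVIATIONS.getOrDefault(name, name);
    }
}
```

Passing unknown names through unchanged keeps the decision about unsupported filters with the caller.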
Changes in PDFTextStream v2.1.6
Added com.snowtide.pdf.util.TableUtils, which provides a set of CSV conversion functions for exporting the contents of tables
Added options to specify the path from which to load the PDFTextStream license file, via the pdftslicensepath environment variable or system property
Added com.snowtide.pdf.PDFTextStream.loadLicense(String), a programmatic way to specify the path from which to load the PDFTextStream license file
Changed PDFTextStream's default page segmentation algorithms to not eliminate empty table cells, making it simpler to export tabular content to Excel, etc.
Fixed bug in VisualOutputTarget where vertically-adjacent lines of text were being inappropriately combined
Fixed text encoding bug where text extracted from PDF documents generated by Adobe InDesign v4.0 - v5.0 would be "scrambled", or appear to be a series of Chinese glyphs
Fixed bug where AFM font mappings were sometimes applied in an incorrect order, leading to spot errors in text extracts
Fixed bug where certain embedded Type1 font encodings were not being loaded correctly, resulting in single-character extraction errors
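To illustrate why retaining empty table cells matters for the CSV export noted above, here is a minimal, generic sketch of turning rows of cell text into CSV with RFC 4180 style quoting. It does not use or reproduce the TableUtils API; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative sketch only; not the com.snowtide.pdf.util.TableUtils API. */
final class CsvSketch {

    /** Quotes a cell if it contains a comma, quote, or line break. */
    static String escape(String cell) {
        boolean needsQuoting = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        if (!needsQuoting) {
            return cell;
        }
        // Embedded quotes are doubled, and the whole cell is wrapped in quotes.
        return "\"" + cell.replace("\"", "\"\"") + "\"";
    }

    /** Joins rows into CSV text; empty cells are kept so columns stay aligned. */
    static String toCsv(List<List<String>> rows) {
        return rows.stream()
                .map(row -> row.stream()
                        .map(CsvSketch::escape)
                        .collect(Collectors.joining(",")))
                .collect(Collectors.joining("\r\n"));
    }
}
```

If empty cells were dropped before this step, the values that follow them would shift into the wrong columns when the file is opened in a spreadsheet.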
Changes in PDFTextStream v2.1.5
Significant improvements in the handling and standard output of rotated content
Added com.snowtide.pdf.layout.TextUnit.getTheta()
Changes in PDFTextStream v2.1.3
Added com.snowtide.pdf.Font.isItalic() -- indicates whether a font is italicized
Added com.snowtide.pdf.layout.TextUnit.isUnderlined() -- indicates whether a character is underlined
Added tagging of italic text regions in pdfts.examples.XMLOutputTarget
Changes in PDFTextStream v2.1.2
Fixed page rotation detection bug when processing PDF documents generated by Crystal Reports
Changes in PDFTextStream v2.1.1
Significant improvements in output of VisualOutputTarget, especially for pages with many different font sizes
Fixed calculation of character widths for Type0 fonts that have a recognized AFM base font name
Changes in PDFTextStream v2.1
Added support for updating text, checkbox, radio button, and choice interactive form fields
Added support for Kodak print job data extraction (%KDK commands) via com.snowtide.pdf.util.KodakPrintData
Exposed the AcroFormField.isReadOnly() function
Added ByteBuffer-based buildPDFDocument() functions to com.snowtide.pdf.lucene.PDFDocumentFactory
Added the pdfts.logfactory and pdfts.loggingtype system variables to simplify the customization of logging via com.snowtide.util.logging.LoggingRegistry
java.util.logging is now the default logging toolkit; pdfts.loggingtype may be used to change that. Refer to the LoggingRegistry javadocs for more info.
Improved documentation significantly
Fixed a problem where merged PDF documents that contained empty dictionaries would be improperly generated
Fixed a problem where the "rich text" value of text interactive form fields would not be loaded
Changes in PDFTextStream v2.0.5
Fixed handling of text spacing that was causing some columnated text to overrun column boundaries improperly
Fixed a problem where text from adjacent lines would be inappropriately intermingled
Changed unlicensed functionality so that evaluation use would not require a special evaluation license file; specifically, PDFTextStream will randomize some digits in text extracts when it is operating unlicensed, and the 8-page extract limitation has been removed
Changes in PDFTextStream v2.0.2
Added com.snowtide.pdf.RegionOutputTarget to support region-specific content extraction
Added ability to derive encoding and spatial metrics of Type3 fonts; added pdfts.type3.derive system property to disable derivation if necessary (359)
Fixed problem with com.snowtide.pdf.VisualOutputTarget, where lines would sometimes be inappropriately combined (356)
Changes in PDFTextStream v2.0.1
Better indication of corrupted or otherwise unreadable PDF files (com.snowtide.pdf.FaultyPDFException)
Added pipe(OutputHandler) function to com.snowtide.pdf.layout.Line
Added pdfts.mmap.disable system property option to disable memory-mapping of PDF files - avoids JDK bug #4724038 (355)
Changes in PDFTextStream v2.0
PDFTextStream now available for .NET and Python
Added support for extraction of Chinese, Japanese, and Korean text (CJK)
Added support for accessing derived table structure (com.snowtide.pdf.layout.Table)
Significantly improved performance
Significantly improved accuracy of extraction of rotated text
Added support for Lucene v1.9 and v2.0
Added "visual" text layout output target (com.snowtide.pdf.VisualOutputTarget)
Added PDF merge capability (com.snowtide.pdf.util.PDFMergeUtil)
Added support for Type1C embedded font files (274)
Fixed issue where some bookmarks would have invalid page number attributes (i.e. -1) (324)
Fixed issues where blocks, lines, and textunits that represented rotated text reported inaccurate positions on the page
Fixed issue where xref table was not being rebuilt when object locator was simply missing (338)
Eliminated com.snowtide.pdf.PDFTextStreamOptions (deprecated in v1.3)
Changes in PDFTextStream v1.4
Added support for interactive PDF forms (AcroForms) (com.snowtide.pdf.forms.* and com.snowtide.pdf.PDFTextStream.getFormData()) (118)
Added support for derivation of 'graphical' font encoding (Type3) (297)
Added com.snowtide.pdf.OutputHandler base class for OutputTarget
Added PDFTextStream constructor that takes a java.nio.ByteBuffer, enabling completely in-memory operation
Added an example class that extracts form data as XML (pdfts.examples.XMLFormExport)
Added sample implementation of com.snowtide.pdf.OutputHandler that outputs PDF text as XML, indicating document structure and where bolded text ranges exist (pdfts.examples.XMLOutputTarget)
Added sample OutputHandler implementation that exports PDF text content as an XHTML document (pdfts.examples.GoogleHTMLOutputHandler)
Fixed bug where inline images were not being properly skipped (308)
Fixed bug where destination bounds of some bookmarks and annotations were not being properly set (307)
Fixed bug where text properties (font size, character encoding, etc) would persist beyond where they should (298)
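As a sketch of the in-memory operation enabled by the ByteBuffer constructor noted above, the following loads a PDF file's bytes into a java.nio.ByteBuffer using only JDK classes. The class and method names are hypothetical, and java.nio.file requires a newer JDK than the ones this release targeted. The call into PDFTextStream is left as a comment because these notes do not show the constructor's full signature.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative sketch only; not part of PDFTextStream. */
final class InMemoryPdfSketch {

    /** Reads the whole file into memory and wraps it in a ByteBuffer. */
    static ByteBuffer load(Path pdf) throws IOException {
        byte[] bytes = Files.readAllBytes(pdf);
        // The returned buffer is positioned at 0 with its limit at the file length.
        return ByteBuffer.wrap(bytes);
    }

    // The resulting buffer would then be passed to the ByteBuffer-accepting
    // PDFTextStream constructor; no file handle needs to stay open afterwards.
}
```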
Changes in PDFTextStream v1.3.6
Fixed potential OutOfMemoryError caused by complex graphical regions (295)
Fixed bug where out-of-date content might be extracted from updated PDF documents (296)
Changes in PDFTextStream v1.3.5
Added PDF annotation API (com.snowtide.pdf.annot.*) (76)
Added PDF bookmark API (com.snowtide.pdf.Bookmark and com.snowtide.pdf.PDFTextStream.getBookmarks()) (284)
Significantly improved performance parsing PDF data containing very complex illustrations (282)
Improved triage procedures for handling damaged or malformed PDF files (292)
Fixed bug where com.snowtide.pdf.Page.getPageNumber() was reporting 1-indexed page numbers; it now properly reports 0-indexed page numbers (283)
Fixed parsing bug related to zero-length PDF names (290)
Changes in PDFTextStream v1.3.4
Improved rectangle and line detection to avoid skipping graphics that impact text layout (272)
Improved the algorithm used to calculate the number of line breaks to be outputted between lines of text (271)
Improved detection and handling of malformed PDF documents to prevent potential infinite loops (278)
Fixed compatibility problem with PDFs generated by IBM Manyimage tool
Fixed compatibility problem with PDFs generated by SAP R/3 (276)
Fixed error thrown when some blank pages are encountered (270)
Changes in PDFTextStream v1.3.3
Expanded support for referenced form XObjects; results in more complete text extracts (263)
Improved font lookup routines; now caching frequently-referenced fonts for improved performance
Fixed logging classloading issue on JDK 1.3.1_01
Changes in PDFTextStream v1.3.2
Significant performance enhancement through improved usage of java.nio.* classes; available only on JDK 1.4+
Changes in PDFTextStream v1.3.1
Fixed integration with JDK v1.4 java.util.logging toolkit
Changes in PDFTextStream v1.3
Added ability to retrieve PDF document page attributes (height, width, rotation, etc) (94)
Added ability to retrieve PDF document pages one at a time (94)
Added ability to retrieve PDF document encryption parameters (99)
Added ability to retrieve PDF file specification version number (91)
Added pipe() method to PDFTextStream and retrieved PDF pages, allowing easy redirection of content to a buffer or file (92)
Significantly improved page segmentation and document read-ordering, resulting in more semantically-consistent text extracts
Significantly improved extraction of rotated text
Significantly improved extraction of line-bounded tables (107)
Deprecated PDFTextStreamOptions class: strictEncoding and page header options no longer used (87, 98)
PDFTextStream now always produces Unicode text; the ASCII-only option is no longer provided, as it proved to be unreliable (87)
Fixed some minor Unicode text extraction issues related to selecting the proper character encoding for Type 1 fonts (86)
Fixed PDFTextStream's implementation of the PDF graphics state stack to more closely conform to the PDF spec (90)
Fixed problem where certain monospaced characters might be omitted from output (35)
Fixed problem where text might be scrambled on a line that contains certain monospaced text (182)
Changes in PDFTextStream v1.2
Added support for retrieving document-level Adobe XMP data (document metadata in an XML format) (66)
Added support for PDF v1.5 files encrypted using crypt filters that specify an invalid decryption key length (63)
Improved overview documentation of metadata access in Javadoc and Developer's Guide (70)
Fixed support for decrypting updated PDF v1.4 files encrypted with 128-bit passwords (62)
Fixed internal error that might have occurred in connection with processing updated PDF documents (72)
Changes in PDFTextStream v1.1.2
Enhanced the core parsing routines to accept PDF files that use improper (or nonexistent) string escape sequences
Fixed a bug that caused hard errors when processing some PDF v1.5 documents.
Fixed a bug where a particular text mapping (hex / CIDFont mappings) used in some PDFs would be misinterpreted, resulting in space characters being output instead of 'regular' characters
Changes in PDFTextStream v1.1.1
Fixed a problem where some PDFs that use a particular type of TrueType font were converted into useless text content
Changes in PDFTextStream v1.1
JDK v1.3 is now fully supported.
Significant improvements have been made in the layout and formatting of rotated text.
All logging is now channeled through Jakarta's commons-logging library to enable usage of logging toolkits other than log4j.