Frequently Asked Questions

What is Adobe PDF?

Portable Document Format (PDF) is a document file format developed by Adobe, designed primarily to facilitate the distribution and archival of content. You can think of a PDF file as an uneditable Microsoft Word document.

You can read more about what PDF is and what it is used for on Adobe's website and on Wikipedia's PDF entry.

PDF has become an indispensable tool for distributing and archiving documents, but it can be difficult to integrate PDF documents into existing data processing systems. Our PDFTextStream product aims to make this integration possible and easily attainable.

What languages / character sets does PDFTextStream support?

Starting in v2.0, PDFTextStream provides full Unicode support for nearly every language on Earth. This includes Chinese, Japanese, and Korean, in both horizontal and vertical text orientations -- generally the most difficult languages to properly support.

The only notable languages that are not yet supported are those where text runs from right to left -- this includes Arabic and Hebrew. We are working to include support for such languages in a future PDFTextStream release.

Normalized Extracted Text Layout

The formatting of the text extracts provided by PDFTextStream is defined by the OutputHandler that is used to "collect" the PDF text events generated by PDFTextStream.

The default OutputHandler, OutputTarget, is optimized for performance and use in semantically-sensitive environments: search, indexing, summarization, etc. In these kinds of environments, maintaining the spacing of text elements (including table columns and such) is mostly unnecessary.

Solution

There are a number of alternate OutputHandler implementations included with PDFTextStream. For maintaining the layout of each page of PDF content so that the text extracts look as much like the original PDF as possible, use VisualOutputTarget.

Making this change is simple in most cases -- just replace all of your references to OutputTarget with VisualOutputTarget.

If you are using the java.io.Reader interface provided by PDFTextStream, you will need to switch to using a VisualOutputTarget and the accompanying .pipe(OutputHandler) functions (found on PDFTextStream, Page, and Block objects). We may make it possible to specify a different default OutputHandler in a future PDFTextStream release, which would let you continue using the java.io.Reader interface alone while still using alternate OutputHandler implementations.
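To make the difference between a normalizing handler and a layout-preserving handler concrete, here is a minimal, self-contained sketch. It does not use the PDFTextStream API at all: TextRun, RunCollector, NormalizingCollector, and LayoutCollector are hypothetical stand-in types invented for this illustration, and the real OutputHandler implementations are considerably more sophisticated. The sketch only shows the general idea that the same stream of text events can be collected into differently formatted output.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in types for illustration only; these are NOT PDFTextStream classes.
final class TextRun {
    final int column;   // character column where the run starts on its line
    final String text;  // the text of the run
    TextRun(int column, String text) { this.column = column; this.text = text; }
}

interface RunCollector {
    void onRun(TextRun run);
    String result();
}

// Joins runs with single spaces: enough for search and indexing, but column alignment is lost.
final class NormalizingCollector implements RunCollector {
    private final StringBuilder out = new StringBuilder();
    public void onRun(TextRun run) {
        if (out.length() > 0) out.append(' ');
        out.append(run.text);
    }
    public String result() { return out.toString(); }
}

// Pads with spaces up to each run's column so the line keeps its visual layout.
final class LayoutCollector implements RunCollector {
    private final StringBuilder out = new StringBuilder();
    public void onRun(TextRun run) {
        // If a run would overlap the previous one, fall back to a single separating space.
        if (out.length() > 0 && out.length() >= run.column) out.append(' ');
        while (out.length() < run.column) out.append(' ');
        out.append(run.text);
    }
    public String result() { return out.toString(); }
}

final class CollectorDemo {
    // Feeds every run to the collector and returns the collected text.
    static String collect(List<TextRun> runs, RunCollector collector) {
        for (TextRun run : runs) collector.onRun(run);
        return collector.result();
    }

    public static void main(String[] args) {
        List<TextRun> row = Arrays.asList(
            new TextRun(0, "Widget"), new TextRun(12, "4"), new TextRun(20, "19.99"));
        System.out.println(collect(row, new NormalizingCollector()));
        System.out.println(collect(row, new LayoutCollector()));
    }
}
```

In this toy model the normalizing collector produces a compact line suited to indexing, while the layout collector keeps each value at its original column, which is the property you want when table columns must line up.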

What is the Snowtide Logo?

In the earliest days of Snowtide's existence (circa 2001), we were aiming to build search and data mining tools tailored to the specialized needs of professional researchers in medicine and the sciences. When formulating our original corporate image, we wanted research professionals to identify with our logo.

After much brainstorming, we stumbled upon the concept for our current logo, which is a depiction of a water molecule (with the two atoms of hydrogen at the bottom and on the right, and the oxygen atom at the top-left). This was meant as a subtle nod to the scientists and researchers we hoped to sell our search and data mining tools to, and had the added benefit of playing off of the company name (water molecule, snow, tide, etc.).

Our search and data mining products did not find success in the marketplace; however, PDFTextStream, our follow-up product introduced in 2004, has found deep and broad adoption. Although PDFTextStream is not primarily intended for use by researchers or scientists, we decided to stick with the molecule logo. It has become well-associated with the Snowtide name, and we generally believe in sticking with what works.

Shrinking PDFTextStream

The standard PDFTextStream for Java download contains two JAR files in the lib directory. One, named PDFTextStream-2.x.jar, is the larger of the two; it includes everything.

If you need a smaller JAR file, and do not need to extract Chinese, Japanese, or Korean (CJK) text, you can use the second one in the lib directory, named PDFTextStream-2.x-NOCJK.jar. This JAR file does not include the character encodings needed for extracting CJK text, which shaves the file size down by over 2.5MB.

What makes PDFTextStream different from open source libraries?

This goes back to why PDFTextStream is priced the way it is. We believe that PDFTextStream offers significantly more overall value due to the levels of accuracy, performance, and support that it delivers, even in light of its pricing.

If your project does not require the highest degrees of these qualities, then you will likely find one of the open source libraries acceptable.

In addition, we also offer consulting and custom development services that yield PDFTextStream-based solutions to tackle the most difficult and mission-critical content and data extraction problems. If you have such a problem, and wish to tap the expertise and experience we bring to these efforts, then using PDFTextStream is mandatory -- we simply could not accomplish what we do if we were to use any other library.

What was the inspiration for PDFTextStream?

In Snowtide's earliest days (circa 2001-2002), we were focussed on building search and data mining tools for professional researchers. As such, we were very interested in finding a high-quality library that would enable our software to extract content from PDF documents so that it could provide search functionality for PDF content.

However, we consistently found all of the available PDF content extraction libraries to be unacceptable in various ways. Some had significant accuracy problems, many had APIs that unfortunately presented a literal representation of a PDF document's data structures (which are extremely complicated and ill-suited for high-performance, accurate content extraction), many had serious PDF file format incompatibilities, and nearly all of them were quite slow.

So, we embarked on an effort to build our own PDF content extraction functionality. We quickly discovered why all of the other libraries that we had evaluated had various flaws that we found unacceptable -- the problem of PDF text extraction is a very difficult and complex one. We further found it likely that if we could build a better mousetrap in this context, we would find a very open and appreciative market for that mousetrap.

Two years later in the summer of 2004, we released PDFTextStream. It set (and continues to maintain) a new gold standard for PDF content extraction accuracy, performance, PDF file format compatibility, and "developer friendliness" (thanks to its significantly simpler API). With our recent release of v2.0, we believe we are making good on our aspirations to build that better mousetrap, making quality PDF content extraction readily available on multiple application platforms.

Why is PDFTextStream so expensive?

We are asked this question quite often. Truthfully, we don't think it's very expensive at all, given what it delivers for our customers:

  1. It provides a great deal of functionality and performance that is unmatched in the marketplace.
  2. A great deal of work was and continues to be required to ensure that PDFTextStream delivers that level of functionality and performance.
  3. We devote significant resources to ensuring that PDFTextStream is the most up-to-date PDF content extraction library. This requires continuously monitoring new PDF format specifications from Adobe and watching for new variations of the PDF format that are generated by third-party PDF file generators. It is because of this that PDFTextStream successfully extracts content out of PDF documents that many other PDF libraries choke on.
  4. Support. We provide pre- and post-sales support of a caliber that is difficult to come by these days. This includes answering complex technical questions in hours instead of days, providing deployment and integration support (usually for free), and analyzing problem PDF documents and providing patches in days instead of months.

In short, we are committed to delivering the highest possible quality product and support, and we do not wish to compromise that commitment by cutting corners.

Why should I enroll in Premium Support?

Our Premium Support service maximizes the return your organization sees from deploying PDFTextStream, and simplifies future budgeting.

Your IT staff will get priority handling of their support inquiries (with industry-appropriate SLAs), automatically receive new updates and releases of PDFTextStream (including major releases!), and get complimentary assistance with most deployment and integration issues.

Premium Support also simplifies your budgeting process. Enrolling in Premium Support entitles you to all new PDFTextStream releases, ensuring that you never have to pay for a major software upgrade again.

What if I don't want to enroll in Snowtide's Premium Support?

Do nothing -- Premium Support enrollment is entirely optional. You will not be billed for Premium Support services unless you specifically request that the licenses you purchase be enrolled in the program.

Who is behind Snowtide?

Snowtide is driven by a group of computer science and engineering professionals. Biographical information about each of the principal players is available on the Snowtide Team page.

What causes PDFTextStream to emit empty text extracts?

There are some (usually rare) situations where extracting text from PDF documents is not possible:

  1. PDF documents whose pages are simply a series of images. This is most common with PDF documents that have been scanned from physical documents, but which have not had a text "layer" added by an optical character recognition (OCR) process.
  2. In some very rare cases (less than 0.1% of all PDF documents in our testing), a PDF document may contain text that uses a font that "draws" glyphs using images, rather than referring to an actual character. In these cases, PDFTextStream will yield either empty text extracts, or extracts that are "junk" -- a series of nonsensical characters.

In principle, these issues could be solved by embedding an OCR process within PDFTextStream. This may be done for some future release.

PDFTextStream Bug

Finally, it is possible that you have stumbled across a bug in PDFTextStream. The easiest way to test for this possibility is to attempt to copy-and-paste text from the PDF file in question using Adobe Acrobat. If you can successfully do this (and the pasted text looks correct), but PDFTextStream is not delivering text, or the text it is delivering is incorrect in some way, then you have likely discovered a bug in PDFTextStream. In this case, please open a support ticket with us, and we'll begin working to resolve the problem straight away.

Disabling PDF File Memory Mapping

Due to an unfortunate bug in Java’s implementation of memory-mapped files (view JDK bug entry), it is possible that a PDF file opened and processed by PDFTextStream will remain locked even after the PDFTextStream instance’s close() function has been called, and PDFTextStream has released all of the filesystem handles it has allocated. This locking behaviour (which is known to occur only on Windows) will prevent the PDF file from being deleted or moved until Java’s garbage collector eliminates certain JDK-internal objects that are used to track and manage the previously memory-mapped PDF file.
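The underlying JDK behaviour can be observed without PDFTextStream. The following sketch uses only standard java.nio classes: it memory-maps a temporary file, closes the channel, reads through the still-valid mapping, and then attempts to delete the file. The class and method names (MmapLockDemo, readThroughMapping) are made up for this example. On Windows the delete is expected to fail while the mapped buffer has not yet been garbage collected; on other platforms it normally succeeds.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative demo class; not part of PDFTextStream.
final class MmapLockDemo {

    // Maps a temp file, closes the channel, reads via the mapping, then tries to delete the file.
    static String readThroughMapping() {
        try {
            Path tmp = Files.createTempFile("mmap-demo", ".bin");
            Files.write(tmp, "hello".getBytes(StandardCharsets.US_ASCII));

            MappedByteBuffer buf;
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            }
            // The channel is closed here, but the mapping stays valid until the
            // buffer is garbage collected; there is no public API to unmap it.
            byte[] bytes = new byte[buf.remaining()];
            buf.get(bytes);

            boolean deleted = tmp.toFile().delete();
            System.out.println("deleted while mapped: " + deleted);
            if (!deleted) {
                tmp.toFile().deleteOnExit(); // best-effort cleanup if the file is still locked
            }
            return new String(bytes, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read: " + readThroughMapping());
    }
}
```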

The solution in this case is to force PDFTextStream to not memory map source PDF files. This is done by setting the pdfts.mmap.enable system property to N, or by setting the memory mapping property in com.snowtide.pdf.PDFTextStreamConfig to false. For details on how to set system properties for PDFTextStream, please refer to the appropriate appendix in the Developer's Guide (available in all PDFTextStream downloads, or here).

(Note that in PDFTextStream releases prior to v2.2.0, pdfts.mmap.disable was used as the key system property; starting with v2.2.0, pdfts.mmap.disable was removed, and replaced with the pdfts.mmap.enable property.)
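As a minimal sketch, the system property can be set programmatically with the standard System.setProperty call. The example assumes that PDFTextStream reads pdfts.mmap.enable when its classes are first used, so the property is set at the very start of the program; that timing is an assumption, and the Developer's Guide remains the authoritative reference. The MmapConfig class and its method name are hypothetical and exist only for this illustration.

```java
// Illustrative helper; the class and method names are not part of PDFTextStream.
final class MmapConfig {
    // Property name taken from the text above (PDFTextStream v2.2.0 and later).
    static final String MMAP_PROPERTY = "pdfts.mmap.enable";

    // Sets the property to "N" so that source PDF files are not memory-mapped.
    static void disableMemoryMapping() {
        System.setProperty(MMAP_PROPERTY, "N");
    }

    public static void main(String[] args) {
        // Assumption: do this before any PDFTextStream class is loaded or used.
        disableMemoryMapping();
        System.out.println(MMAP_PROPERTY + "=" + System.getProperty(MMAP_PROPERTY));
        // ... open and process PDF files from here on ...
    }
}
```

The same property can also be supplied on the JVM command line with the standard -D option, for example -Dpdfts.mmap.enable=N, which avoids any question of ordering within your own code.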
