
What was the inspiration for PDFTextStream?

In Snowtide's earliest days (circa 2001-2002), we were focused on building search and data mining tools for professional researchers. To offer search over PDF content, we needed a high-quality library that would let our software extract content from PDF documents.

However, we consistently found all of the available PDF content extraction libraries to be unacceptable in various ways. Some had significant accuracy problems, many had APIs that presented a literal representation of a PDF document's data structures (which are extremely complicated and ill-suited for high-performance, accurate content extraction), many had serious PDF file format incompatibilities, and nearly all of them were quite slow.

So, we embarked on an effort to build our own PDF content extraction functionality. We quickly discovered why the libraries we had evaluated were flawed: PDF text extraction is a very difficult and complex problem. We also concluded that if we could build a better mousetrap in this context, we would likely find an open and appreciative market for it.

Two years later, in the summer of 2004, we released PDFTextStream. It set (and continues to maintain) a new gold standard for PDF content extraction accuracy, performance, PDF file format compatibility, and "developer friendliness" (thanks to its significantly simpler API). With our recent release of v2.0, we believe we are making good on our aspirations to build that better mousetrap, making quality PDF content extraction readily available on multiple application platforms.
