PDFTextStream has two main goals when it extracts the text content of a PDF document: do it accurately, and do it fast.
Which of those two attributes is more important to your application is something only you can decide. However, in many environments, text extraction performance is critical. That's why we're glad to be able to make such a bold statement without reservation:
PDFTextStream is the fastest component available for extracting text and metadata from PDF documents.
Thankfully, we have the numbers to back this claim up. We randomly selected 1000 PDF files from those uploaded by users of PDFTextOnline; the set represents all known variations of the PDF specification and dozens of languages and character sets. Using these files, we ran a series of benchmark tests that compared the performance of PDFTextStream with four of the most widely used PDF libraries capable of extracting text content from PDF documents. (Details of our benchmarking methodology are given further down.)
Figure 1. Relative performance of PDF text extraction libraries across 1000 randomly-selected PDF documents. Cumulative processing times are normalized to PDFTextStream v2.0’s processing time, which was given a score of 100. Larger scores (and longer bars) are better.
Figure 2. Summary benchmark results table, showing for each component benchmarked: number of errors and timeouts over the set of 1000 test PDF documents, total processing time, and relative performance (normalized to PDFTextStream v2.0's processing time, which was assigned a value of 100).
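The normalization described in the captions can be read as a simple ratio. The following is a minimal sketch, assuming the score is computed as 100 times the reference time divided by the library's time (so that faster libraries score higher); the exact formula is not stated here, and the class name, method name, and timings are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RelativeScore {
    /**
     * Assumed formula: a score of 100 means "as fast as the reference";
     * a library that takes twice as long scores 50.
     */
    static double score(double referenceSeconds, double librarySeconds) {
        if (referenceSeconds <= 0 || librarySeconds <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        return 100.0 * referenceSeconds / librarySeconds;
    }

    public static void main(String[] args) {
        // Made-up timings for illustration only; these are not benchmark results.
        Map<String, Double> totals = new LinkedHashMap<>();
        totals.put("reference", 200.0);
        totals.put("libraryA", 250.0);
        totals.put("libraryB", 400.0);
        double ref = totals.get("reference");
        totals.forEach((name, secs) ->
            System.out.printf("%-10s %6.1f%n", name, score(ref, secs)));
    }
}
```

Under this reading, a library scoring 50 took twice as long as the reference over the same set of documents.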
We will discuss only the bottom-line results here, presented above. Fully-detailed benchmark results are also available:
We believe the results speak for themselves -- PDFTextStream v2.0 is the fastest PDF text extraction component. As shown in Figure 1, PDFTextStream is around 13% faster than the next fastest known PDF text extraction component (xpdf's pdftotext utility, which is actually written in native C/C++), and around 2.25x (yes, 225%) faster than PDFBox, the next-fastest Java PDF text extraction library.
Further, as Figure 2 shows, PDFTextStream is more reliable, robust, and predictable as well. Across the 1000 PDF files in the benchmark collection, PDFTextStream experienced zero errors and zero timeouts. (A timeout occurs when a library does not finish processing a PDF file in under 60 seconds; it is reasonable to assume that a library has fallen into an infinite loop if this happens.) Some of the other libraries in the test did experience errors and timeouts.
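One common way to enforce a per-document time limit like the 60-second rule is to run each extraction on a worker thread and wait on it with a deadline. The sketch below illustrates that general technique with the standard java.util.concurrent API; it is not the benchmark's actual code, and the TimedExtraction class and Outcome enum are hypothetical names.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimedExtraction {
    enum Outcome { OK, ERROR, TIMEOUT }

    /** Runs one extraction task and classifies how it ended. */
    static Outcome run(Callable<String> task, long timeout, TimeUnit unit) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "extraction-worker");
            t.setDaemon(true); // a stuck worker must not keep the JVM alive
            return t;
        });
        Future<String> future = pool.submit(task);
        try {
            future.get(timeout, unit);
            return Outcome.OK;
        } catch (TimeoutException e) {
            future.cancel(true); // best effort: this only interrupts the worker
            return Outcome.TIMEOUT;
        } catch (ExecutionException e) {
            return Outcome.ERROR; // the task threw while processing the file
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Outcome.ERROR;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

A call such as `TimedExtraction.run(task, 60, TimeUnit.SECONDS)` would apply the 60-second limit. Note that interruption cannot stop code that never checks for it, which is why the worker is a daemon thread.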
xpdf's pdftotext utility also finished the benchmark error- and timeout-free. However, when processing PDF documents adhering to v1.6 of the PDF file format specification (corresponding to Adobe Acrobat 7), it did warn a number of times that it only supports v1.5 of the PDF document spec. (PDFTextStream supports all versions of the PDF document specification.)
Our aim here was to determine (as objectively as possible) which PDF text extraction library provides the best overall performance.
To accomplish this, we developed a benchmarking testbed, consisting of a set of test Java classes and accompanying scripts. A single main class was developed (com.snowtide.pdf.test.TestPerformance) that contained the timing infrastructure. This main test class, by default, tested the performance of the PDFTextStream library. A number of subclasses were then developed that extended this main class to test the performance of each of the competing PDF libraries. This approach had the advantage of ensuring that the critical timing infrastructure remained unchanged regardless of the library being tested. (See below for information on the approaches used in connection with the test classes developed for each library.)
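The design described above, where one class owns the timing and subclasses supply only the library-specific extraction, resembles the template method pattern. The following is a minimal sketch of that idea and not the actual com.snowtide.pdf.test.TestPerformance source: all class and method names are hypothetical, and the base class is made abstract here for brevity, whereas the real main class tested PDFTextStream by default.

```java
import java.util.List;

/** Hypothetical base harness: owns the timing, delegates the extraction. */
abstract class ExtractionBenchmark {
    private int errors;

    /** Library-specific step; a subclass calls into the library under test. */
    protected abstract String extractText(String pdfPath) throws Exception;

    /**
     * Times extractText for every file and returns the total in nanoseconds.
     * Declared final so that no subclass can alter how time is measured.
     */
    public final long runAll(List<String> pdfPaths) {
        errors = 0;
        long totalNanos = 0;
        for (String path : pdfPaths) {
            long start = System.nanoTime();
            try {
                extractText(path);
            } catch (Exception e) {
                errors++; // count the failure and move on to the next file
            }
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos;
    }

    /** Number of files whose extraction threw during the last runAll call. */
    public final int errorCount() {
        return errors;
    }
}

/** Stand-in "library" that exists only to make the sketch runnable. */
final class EchoBenchmark extends ExtractionBenchmark {
    @Override
    protected String extractText(String pdfPath) {
        return "text of " + pdfPath;
    }
}
```

Because the measurement loop lives in one place, every library is timed by exactly the same code, which is the fairness property the paragraph above emphasizes.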
This timing infrastructure had two very important attributes that ensured a fair test for all libraries involved:
The PDF documents used as test cases in the benchmark tests were randomly selected from PDF documents uploaded by users of our PDFTextOnline service. This selection represents a wide variety of document types (e.g. presentations, academic papers, corporate reports, white papers, technical documentation) and producers (e.g. Adobe Acrobat, Adobe PageMaker, PDFWriter, InDesign, QuarkXPress, Oracle Reports), as well as languages and character sets. Such diversity in the kinds of PDF documents presented to each library tested gives some confidence that the results of the benchmarks will correspond to real-world performance.
All benchmarking was run on a Sun Fire X2100 server with a 2.0GHz AMD Opteron 146 processor and 3GB of memory, running Red Hat Enterprise Linux 4.0. The Java VM used reported this from running java -version:
Note that the benchmarks used the "server" configuration of the JVM -- this most closely matches the configuration likely to be used in a typical enterprise software deployment environment.
No other applications or non-system processes were running at the same time as the test, and all non-essential services and scheduled jobs were halted. A Python script was used to automate the testing and to collect the results output by the main test class and its subclasses. Those results are presented above.
Anyone is welcome to inspect and confirm our findings. The test classes we developed for these benchmarks are available for download:
Here are links to each of the benchmarked components:
We also are glad to provide anyone with the full set of 1000 PDF documents we used in the benchmark. However, these are available by request only -- the full archive of these PDF documents, even when compressed, is 261MB. It would therefore be unwise for us to post such a large file for public download. So, if you would like to download the archive of the 1000 PDF documents used in the benchmark, simply send us an email, and we'll provide you with a download link.
We welcome any comments or suggestions you might have for how we can make this benchmark more accurate, fair, comprehensive, etc.; please feel free to contact us if you have any ideas.
We are working on similar benchmarks for the .NET and Python platforms, in which PDFTextStream.NET and PDFTextStream.Python will participate, respectively. Given that PDFTextStream for Java is faster than xpdf's pdftotext utility (as shown in these benchmark results), we are quite confident that PDFTextStream.NET and PDFTextStream.Python will not disappoint. Why do we think this?
There are scads of PDF libraries available, both commercial and open source. The libraries we have included here have always been included in our benchmarks (except for the recent addition of pdftotext). However, we would be glad to add other libraries to this benchmark, as long as they:
If there is a PDF library you would like to see added to this benchmark, do let us know.
In order to squeeze every possible drop of performance from each library, we developed adapters (some of which were heavily based upon code examples included with each library) to streamline and optimize the libraries' methods for reading PDF text. In some cases, this meant setting certain flags or calling particular methods in each library to provide hints that only the text and metadata content of each source PDF file were of interest. In others, we eliminated file-based output (as included in some sample code for some libraries) and replaced it with in-memory output.
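To illustrate the last point, here is a minimal sketch of swapping file-based output for an in-memory buffer. It assumes a library call that streams extracted text to a java.io.Writer; writeExtractedText is a hypothetical stand-in for such a call and is not the API of any of the libraries tested.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

final class InMemoryOutput {
    /** Hypothetical stand-in for a library call that streams text to a Writer. */
    static void writeExtractedText(String[] pages, Writer out) throws IOException {
        for (String page : pages) {
            out.write(page);
            out.write('\n');
        }
    }

    /** Collects the text in memory where sample code might use a FileWriter. */
    static String extractToString(String[] pages) {
        StringWriter buffer = new StringWriter();
        try {
            writeExtractedText(pages, buffer);
        } catch (IOException e) {
            // StringWriter itself does not fail; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }
}
```

Writing to memory keeps disk I/O out of the measured time, so the timing reflects the extraction work itself.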
Below are some additional noteworthy items specific to each library tested.
pdftotext is available as a command-line executable, so it could not be plugged in to the base benchmarking code that we built for the other libraries. However, because it is a command-line utility, it was trivial to write a script that would execute pdftotext for each of the PDF documents in the benchmark collection and measure how long each spawned pdftotext process ran.
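Timing a spawned process can be sketched as follows. This is an illustration in Java rather than the script actually used; the ProcessTimer class and its Result type are hypothetical, and ProcessBuilder.Redirect.DISCARD requires Java 9 or later.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class ProcessTimer {
    /** Outcome of one run: wall-clock time, exit code, and whether the limit was hit. */
    static final class Result {
        final long nanos;
        final int exitCode;
        final boolean timedOut;

        Result(long nanos, int exitCode, boolean timedOut) {
            this.nanos = nanos;
            this.exitCode = exitCode;
            this.timedOut = timedOut;
        }
    }

    /** Starts the command, waits up to timeoutSeconds, and reports how long it ran. */
    static Result time(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        // Discard output so a full pipe buffer can never stall the child process.
        builder.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        long start = System.nanoTime();
        Process process = builder.start();
        boolean finished = process.waitFor(timeoutSeconds, TimeUnit.SECONDS);
        long elapsed = System.nanoTime() - start;
        if (!finished) {
            process.destroyForcibly(); // treat as a timeout and kill the process
            return new Result(elapsed, -1, true);
        }
        return new Result(elapsed, process.exitValue(), false);
    }
}
```

For example, `ProcessTimer.time(List.of("pdftotext", "sample.pdf", "-"), 60)` would time one run with text sent to standard output, assuming pdftotext is on the PATH and sample.pdf exists. A measurement taken this way includes process startup time.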
PJ does not include any built-in methods for extracting text from PDF documents. Therefore, we developed a text extraction layer that interfaced with the PJ library to enable it to perform rudimentary extraction. The result was completely unformatted text (i.e. no linebreaks, paragraphs, or page breaks), and no metadata extraction (which should have given the PJ benchmarks an advantage over the rest of the libraries tested, which all extracted metadata for each source PDF document).
JPedal was benchmarked using one of its included examples, org.jpedal.examples.text.ExtractTextAsWordlist. This example code was optimized to operate entirely in-memory (it had been spooling text content out to disk), and to always extract the metadata from each PDF it was tested with (PDFTextStream always provides access to a PDF document's metadata).
While JPedal does track the positioning attributes of each block of text, there is no apparent way to have the library output formatted text to correspond with the layout of the source PDF document (such formatting would presumably have to be done by the JPedal user/developer).
PDFBox was benchmarked using an optimized version of one of its included classes, org.pdfbox.ExtractText. We modified the code to operate entirely in-memory (it had been spooling text content out to disk and/or standard out), and to always extract the metadata from each PDF it was tested with (PDFTextStream always provides access to a PDF document's metadata).