GWAVA's Retain Archives and Indexes 5,000,000 PDF Documents Each Day with PDFTextStream

GWAVA is successfully using PDFTextStream to enable its Retain email archiving and retention solution to archive and index millions of PDF email attachments each day. This pairing enables GWAVA to deliver to its customers more comprehensive and capable email archiving processes that comply with corporate auditing requirements and government regulatory measures.

GWAVA is the leading provider of software solutions for Novell GroupWise, and its Retain email archiving product is a key part of its offering in the GroupWise community. Retain integrates with GroupWise's email handling infrastructure to archive and index emails passing through a GroupWise instance according to rules set by system administrators. These archives and their attendant search indexes are critical to many organizations' operations, as they are central to fulfilling many auditing protocols and regulatory compliance measures.

The Challenge

"Snowtide is the perfect fit and solution for GWAVA users. Throughout our extensive testing, PDFTextStream proved itself to be far and away the best PDF content extraction solution available on the market." Charles Taite, CEO & Co-Founder, GWAVA

The days of plain text or simple HTML email passed long ago, and GWAVA needed Retain to archive and index all of the various types of email attachments. PDF documents are among the most important and common of these, so it was clear that Retain needed to be able to archive and index them. This was doubly important because the organizations with some of the most stringent regulatory and auditing requirements (such as law firms, government agencies, and institutions of higher education) depend heavily on emailing PDF documents as part of their normal workflow.

With this in mind, GWAVA set out to find a software component that would allow Retain to fold the textual content of PDF email attachments into its existing indexing and archiving processes. This component would need to yield highly accurate content extraction results with the greatest possible performance, and be easy to integrate and maintain within the Retain codebase.

The Solution

This search was led by Michael Bell, GWAVA’s Vice President of Research and Development. His team built a PDF text extraction framework in order to thoroughly test potential solutions, and proceeded to put a variety of PDF libraries through their paces. In the end, the GWAVA team chose PDFTextStream.

"PDFTextStream was the only solution that clearly had the primary goal of quality text extraction, rather than handling that as an afterthought." Michael Bell, Vice President of R & D, GWAVA

This decision was fundamentally based upon PDFTextStream's singular focus on PDF content extraction and the benefits it provides because of that focus. “Each of the other products we tried had different problems: many were slow and unreliable, most had poor international character set support,” says Bell. “PDFTextStream was the only solution that clearly had the primary goal of quality text extraction, rather than handling that as an afterthought.”

Results

PDFTextStream enabled GWAVA to add PDF file attachment archiving and indexing to its Retain product, thereby helping its customers comply with critical auditing regulations. PDFTextStream is now distributed worldwide as part of the Retain solution. And according to Bell, “We don’t have exact statistics, but it would be reasonable to estimate that PDFTextStream is extracting content from approximately 5 million PDF documents each day across our entire installed base.”

Integrating PDFTextStream into Retain was simple, too. “Compared to other libraries we evaluated, working with PDFTextStream has been very easy,” says Bell. “It took us no more than ten minutes to implement it.”