If you're not familiar with how Lucene works, please refer to the Lucene project's documentation.
Since Lucene by itself will accept and process only plain text, some kind of adapter must be used that can extract plain text from PDF files in order for those files' content to be added to a Lucene index. PDFTextStream goes one step further than just extracting text from PDF files to be used with Lucene -- it provides a complete set of Lucene integration classes that enables a Lucene user to easily add PDF document content to Lucene indexes. Conceptually, how PDFTextStream and its integration classes relate to Lucene is shown here:
Two PDFTextStream classes provide the Lucene integration functionality:
com.snowtide.pdf.lucene.PDFDocumentFactory and
com.snowtide.pdf.lucene.DocumentFactoryConfig.
Using these classes is very straightforward:
- A DocumentFactoryConfig instance is created and configured. This configuration determines how
  Lucene will index a PDF file processed by PDFTextStream (i.e. which fields will be indexed,
  tokenized, and/or stored), and what names will be assigned to the various fields that make up
  the index record (called a 'Document' in Lucene parlance).
- That DocumentFactoryConfig instance is passed, along with a PDF file (or PDF file data in the
  form of a java.io.InputStream), into one of the static buildPDFDocument() methods provided by
  the PDFDocumentFactory class.
- The PDFDocumentFactory.buildPDFDocument() method returns an org.apache.lucene.document.Document
  instance. The Lucene Document class represents a single document, which corresponds to a single
  record in the Lucene index to which it is added. The Document instances created by the
  PDFDocumentFactory.buildPDFDocument() methods derive their fields' contents from the text and
  metadata attributes that PDFTextStream extracts from the source PDF file. Their field names and
  index attributes (whether to store, index, and/or tokenize each field's contents) come from the
  configuration held by the DocumentFactoryConfig instance that was created in the first step.
- Once a Lucene Document instance is obtained from the PDFDocumentFactory class, it can be passed
  directly into Lucene's indexing process (typically via an org.apache.lucene.index.IndexWriter
  object), which will add the Document to an open index.
It's a wonderful thing when the code needed to do something is shorter than the space needed to explain all that it
does for you. Here's some heavily-commented example code that does everything described above
using a sample PDF file and Lucene index:
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;
import java.io.*;
import com.snowtide.pdf.PDFTextStream;
import com.snowtide.pdf.lucene.DocumentFactoryConfig;
import com.snowtide.pdf.lucene.PDFDocumentFactory;

public class EasyLuceneIntegration {
    /**
     * Simple method that adds the contents of the provided PDF document
     * to the Lucene index via an already-open Lucene IndexWriter.
     */
    public static void addPDFToIndex (IndexWriter openIndex, File pdfFile)
            throws IOException {
        // create and configure new DocumentFactoryConfig instance
        DocumentFactoryConfig config = new DocumentFactoryConfig();

        // set the name to be used for the main body of text extracted from the
        // PDF file, and set it to not be stored, but to be tokenized and indexed
        config.setMainTextFieldName("body_text");
        config.setTextSettings(false, true, true);

        // only copy the PDF metadata attributes into Lucene Document instances produced
        // by PDFDocumentFactory that we explicitly map
        // via DocumentFactoryConfig.setFieldName()
        config.setCopyAllPDFAttrs(false);

        // cause PDF metadata attribute values to be stored, tokenized, and indexed
        config.setPDFAttrSettings(true, true, true);

        // explicitly set the names that should be used for the fields that are
        // created in the Lucene Document instance -- otherwise, default PDF
        // names will be used that will likely not be picked up when the index is
        // searched.
        // For example, the default name for the modification date
        // field in PDF files is 'ModDate', but our example Lucene index stores
        // the modification dates of Documents with the name 'mod_date'. The
        // third setFieldName() call below establishes the correct mapping.
        config.setFieldName(PDFTextStream.ATTR_AUTHOR, "creator");
        config.setFieldName(PDFTextStream.ATTR_CREATION_DATE, "creation_date");
        config.setFieldName(PDFTextStream.ATTR_MOD_DATE, "mod_date");

        // actually generate the Lucene Document instance from the PDF file using the
        // configuration we've just built, and add the Document to the Lucene index
        Document doc = PDFDocumentFactory.buildPDFDocument(pdfFile, config);
        openIndex.addDocument(doc);
    }
}
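The comments in the example describe two naming rules: an attribute with an explicit mapping is
indexed under the mapped name, and an unmapped attribute is either copied under its default PDF
name or left out, depending on the copy-all setting. The following is a toy model of those rules
as the comments state them, using plain maps only. The class and method names are hypothetical
and are not part of PDFTextStream, and the sketch makes no claim about how the library implements
this internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model only; these names are hypothetical, not PDFTextStream's API. */
final class FieldNameMappingModel {

    /**
     * Resolves PDF attribute names to index field names.
     * Mapped attributes take their mapped name; unmapped attributes keep their
     * PDF name when copyAll is true and are dropped when copyAll is false.
     */
    static Map<String, String> resolve(Map<String, String> pdfAttrs,
                                       Map<String, String> fieldNames,
                                       boolean copyAll) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> attr : pdfAttrs.entrySet()) {
            String mapped = fieldNames.get(attr.getKey());
            if (mapped != null) {
                // explicit mapping wins, e.g. 'ModDate' becomes 'mod_date'
                fields.put(mapped, attr.getValue());
            } else if (copyAll) {
                // no mapping: fall back to the default PDF attribute name
                fields.put(attr.getKey(), attr.getValue());
            }
            // no mapping and copyAll == false: the attribute is skipped
        }
        return fields;
    }
}
```

With copyAll set to false, as in the example, an unmapped attribute such as 'Producer' would not
appear in the resulting field set at all, so every field in the index carries a name the search
side already expects.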
It should be clear now that PDFTextStream provides a remarkably easy-to-use Lucene integration
module, and one that will readily scale in the most demanding of Lucene-based indexing
environments (given PDFTextStream's core performance characteristics).
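In practice an index is usually fed many PDF files rather than one. As a minimal sketch of that,
the helper below walks a single directory and hands each PDF file to a callback. Everything in it
is an assumption made for illustration: PdfBatchIndexer and PdfHandler are hypothetical names,
the callback stands in for a call to addPDFToIndex() so that the sketch compiles without Lucene
or PDFTextStream on the classpath, and subdirectories are deliberately ignored.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper; not part of PDFTextStream or Lucene. */
final class PdfBatchIndexer {

    /** Hypothetical callback that stands in for addPDFToIndex(). */
    interface PdfHandler {
        void handle(File pdfFile) throws IOException;
    }

    /** Returns the regular files directly inside dir whose names end in ".pdf". */
    static List<File> listPdfFiles(File dir) {
        List<File> pdfs = new ArrayList<>();
        File[] children = dir.listFiles();
        if (children == null) {
            return pdfs; // dir does not exist or is not a directory
        }
        Arrays.sort(children); // stable, predictable processing order
        for (File f : children) {
            String name = f.getName().toLowerCase(Locale.ROOT);
            if (f.isFile() && name.endsWith(".pdf")) {
                pdfs.add(f);
            }
        }
        return pdfs;
    }

    /** Passes each PDF in dir to the handler and returns how many were handled. */
    static int indexAll(File dir, PdfHandler handler) throws IOException {
        int count = 0;
        for (File pdf : listPdfFiles(dir)) {
            handler.handle(pdf);
            count++;
        }
        return count;
    }
}
```

Assuming an already-open IndexWriter named writer, the callback could be wired to the example
class as PdfBatchIndexer.indexAll(dir, pdf -> EasyLuceneIntegration.addPDFToIndex(writer, pdf)).
An IOException from any one file stops the loop here; a production version might prefer to log
the failure and continue.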
The conceptual overview and code sample provided in this techtip should get you most of the way
towards making Apache Lucene play nicely with PDF documents using PDFTextStream, much to the benefit
of your applications and projects.