Indexing PDF documents with Lucene
Apache Lucene is a full-text search engine written in Java. It is well suited to applications that need 'built-in' search functionality: it's fast, works well with any kind of document structure, and is relatively painless to build around.
Lucene is focused on text indexing, and as such, it does not natively handle popular document formats such as Word, PDF, HTML, etc. Rather, it requires the use of external tools or libraries to convert any such documents into collections of text fields, which can then be easily indexed. In conjunction with Snowtide's open source lucene-pdf library, PDFxStream fills this role to help Lucene index content sourced from PDF documents.
Please refer to the documentation in the lucene-pdf project's README for information on how to obtain the lucene-pdf library itself, additional code samples, and a summary of the more in-depth tutorial/documentation presented here.
lucene-pdf enables Lucene indexing of PDF documents with two classes: com.snowtide.pdf.lucene.LucenePDFDocumentFactory and com.snowtide.pdf.lucene.LucenePDFConfiguration. Using them is very straightforward:
- A LucenePDFConfiguration instance is created. This configuration determines how content from a PDF file processed by PDFxStream will be used to construct index records (called Documents in Lucene parlance).
- That LucenePDFConfiguration instance is passed along with an open PDF file into one of the static buildPDFDocument() methods provided by LucenePDFDocumentFactory.
- The LucenePDFDocumentFactory.buildPDFDocument() method returns an org.apache.lucene.document.Document instance. The Lucene Document instances that are created by the LucenePDFDocumentFactory.buildPDFDocument() methods derive their fields' contents from the text and metadata attributes extracted from the source PDF file by PDFxStream, and their field names and index attributes (whether to store, index, and/or tokenize each field's contents) from the configuration held by the LucenePDFConfiguration instance that was created in the first step.
- Once a Lucene Document instance is obtained from the LucenePDFDocumentFactory class, it can be passed directly into Lucene's indexing process (typically via an org.apache.lucene.index.IndexWriter), which will add the Document to an open index.
It's a wonderful thing when the code needed to do something is shorter than the space needed to explain all that it does for you. Here's some heavily-commented example code that does everything described above using a sample PDF file and Lucene index:
import com.snowtide.pdf.lucene.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.document.Document;
import com.snowtide.PDF;
import java.io.*;

public class EasyLuceneIntegration {
    /**
     * Simple method that adds the contents of the provided PDF document to the
     * Lucene index via an already-open Lucene IndexWriter.
     */
    public static void addPDFToIndex (IndexWriter openIndex, File pdfFile)
            throws IOException {
        // create and configure new LucenePDFConfiguration instance
        LucenePDFConfiguration config = new LucenePDFConfiguration();

        // set the name to be used for the main body of text extracted from the
        // PDF file, and set it to be tokenized and indexed, but not be stored
        config.setBodyTextFieldName("body_text");
        config.setBodyTextSettings(false, true, true);

        // only copy the PDF metadata attributes into Lucene Document instances
        // produced by LucenePDFDocumentFactory that we explicitly map
        // via LucenePDFConfiguration.setMetadataFieldMapping()
        config.setCopyAllPDFMetadata(false);

        // cause PDF metadata attribute values to be stored, tokenized, and indexed
        config.setMetadataSettings(true, true, true);

        // Explicitly set the names that should be used for the fields that are
        // created in the Lucene Document instance -- otherwise, default PDF
        // names will be used that will likely not be picked up when the index
        // is searched.
        // For example, the default name for the modification date
        // field in PDF files is 'ModDate', but our example Lucene index stores
        // the modification dates of Documents with the name 'mod_date'. The
        // third setMetadataFieldMapping() call below establishes the correct mapping.
        config.setMetadataFieldMapping(com.snowtide.pdf.Document.ATTR_AUTHOR, "creator");
        config.setMetadataFieldMapping(com.snowtide.pdf.Document.ATTR_CREATION_DATE, "creation_date");
        config.setMetadataFieldMapping(com.snowtide.pdf.Document.ATTR_MOD_DATE, "mod_date");

        // actually generate the Lucene Document instance from the PDF file
        // using the configuration we've just built
        com.snowtide.pdf.Document pdf = PDF.open(pdfFile);
        Document doc;
        try {
            doc = LucenePDFDocumentFactory.buildPDFDocument(pdf, config);
        } finally {
            // close the PDF even if building the Lucene Document fails
            pdf.close();
        }

        // add the Document to the Lucene index
        openIndex.addDocument(doc);
    }
}
Additional examples are available in the lucene-pdf project's README.
Customizing Lucene Document fields
Unless a LucenePDFConfiguration instance is provided in the call to one of the buildPDFDocument() methods, the fields in the created Lucene Documents take on the defaults provided by the PDF file. For example, the default name of the creation date attribute included in the metadata of some PDF files is CreationDate, so that will be the name assigned to the field in the Lucene Document that contains the value of that attribute.
Allowing these default names to be used for the fields in each Lucene Document is convenient, but is probably not what you want; few Lucene indexes will have used those defaults when being built. In order to seamlessly integrate PDFxStream into your Lucene installation, you will want to customize how the Document instances are built. For this, you should use LucenePDFConfiguration.
Typically, a single LucenePDFConfiguration instance will be created and configured for each Lucene index that PDF content needs to be added to.
The main body of text contained in a PDF file is added to a Lucene Document object as just another named field. This name defaults to the value defined by com.snowtide.pdf.lucene.LucenePDFConfiguration.DEFAULT_MAIN_TEXT_FIELD_NAME ("text"), but can be set either via the LucenePDFConfiguration constructor, or by a setter method on a LucenePDFConfiguration instance (setBodyTextFieldName(), as used in the example above).
Also, the names used to identify the extracted metadata attributes can be customized. For example, a PDF file might contain these attributes:
Attribute Name | Attribute Value |
---|---|
Creator | Microsoft Word |
Author | Kate Burneson |
CreationDate | Mar 30, 2002 08:12:44 AM -0800 |
Using the default attribute names is likely not appropriate if this example PDF file's content is to be added to a Lucene index that has, for example, document author fields named authored_by and creation time/date stamps named create_dt. The default field names can be mapped to their desired replacements easily, using the com.snowtide.pdf.lucene.LucenePDFConfiguration.setMetadataFieldMapping(String, String) method:
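LucenePDFConfiguration config = new LucenePDFConfiguration();
// the standard attribute name constants are defined on com.snowtide.pdf.Document
config.setMetadataFieldMapping(com.snowtide.pdf.Document.ATTR_AUTHOR, "authored_by");
config.setMetadataFieldMapping(com.snowtide.pdf.Document.ATTR_CREATION_DATE, "create_dt");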
This will cause any invocation of a LucenePDFDocumentFactory.buildPDFDocument() method that uses the config object to build Lucene Document instances that use the name authored_by for any Author PDF metadata attribute, and create_dt for any CreationDate attribute. Note that the most common PDF document attributes have standardized names, which are fixed as static final constants in com.snowtide.pdf.Document. All such constant fields have an ATTR prefix to identify them as standard document attribute names.
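The naming rule described above can be modelled in a few lines of plain Java. This is only an illustrative sketch: it does not call lucene-pdf or PDFxStream, and the class MetadataFieldNaming, the method resolveFields, and the copyAll parameter are hypothetical names invented here. It assumes that mapped attributes take their mapped name, and that unmapped attributes keep their PDF name only when all metadata is being copied (the behavior the setCopyAllPDFMetadata() comment in the earlier example suggests).

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Illustrative model only: mimics the field-naming rule described in the
 * text. It is NOT lucene-pdf's implementation and does not call the library.
 */
class MetadataFieldNaming {

    /**
     * Returns the field name -> value pairs that a set of PDF metadata
     * attributes would end up under.
     *
     * @param pdfAttributes attribute name -> value, as found in the PDF
     * @param fieldMapping  PDF attribute name -> desired Lucene field name
     * @param copyAll       assumption: when true, unmapped attributes keep
     *                      their PDF names; when false, they are dropped
     */
    static Map<String, String> resolveFields(Map<String, String> pdfAttributes,
                                             Map<String, String> fieldMapping,
                                             boolean copyAll) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> attr : pdfAttributes.entrySet()) {
            String mappedName = fieldMapping.get(attr.getKey());
            if (mappedName != null) {
                // an explicit mapping always wins
                fields.put(mappedName, attr.getValue());
            } else if (copyAll) {
                // no mapping: fall back to the attribute's own PDF name
                fields.put(attr.getKey(), attr.getValue());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // the example attributes from the table above
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("Creator", "Microsoft Word");
        attrs.put("Author", "Kate Burneson");
        attrs.put("CreationDate", "Mar 30, 2002 08:12:44 AM -0800");

        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("Author", "authored_by");
        mapping.put("CreationDate", "create_dt");

        System.out.println(resolveFields(attrs, mapping, true).keySet());
        System.out.println(resolveFields(attrs, mapping, false).keySet());
    }
}
```

Run against the example attributes, the model yields the field names Creator, authored_by, and create_dt when everything is copied, and only authored_by and create_dt when copying is restricted to mapped attributes.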
Storing vs. Indexing vs. Tokenizing
Fields in every Lucene document have three attributes associated with them, typically referred to as store, index, and token. These attributes control how Lucene processes each field when it is added to an index as a part of a Document instance (a full discussion of these attributes and how they impact Lucene indexing and searching is beyond the scope of this guide; please refer to Lucene's documentation for more information).
The values to be used for store, index, and token when creating named fields in Lucene Documents can be set for PDF document attributes via com.snowtide.pdf.lucene.LucenePDFConfiguration.setMetadataSettings(boolean, boolean, boolean). The values provided to this method are used for all fields created for PDF document attributes. All of these settings default to true.
The values for store, index, and token for the main body of text read out of PDF files can be set via com.snowtide.pdf.lucene.LucenePDFConfiguration.setBodyTextSettings(boolean, boolean, boolean). The defaults for these settings are false, true, and true, respectively.
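To make the three flags and their stated defaults concrete, here is a small plain-Java sketch. It is a model for illustration only: FieldSettingsSketch and its methods are hypothetical names, nothing here calls Lucene or lucene-pdf, and the one-line descriptions are a rough summary of how stored, indexed, and tokenized fields generally behave in Lucene rather than a substitute for its documentation.

```java
/**
 * Illustrative model only: holds the three per-field flags discussed above,
 * in the same (store, index, token) order the text uses, together with the
 * defaults the text states. It does not call Lucene or lucene-pdf.
 */
final class FieldSettingsSketch {
    final boolean store;
    final boolean index;
    final boolean tokenize;

    FieldSettingsSketch(boolean store, boolean index, boolean tokenize) {
        this.store = store;
        this.index = index;
        this.tokenize = tokenize;
    }

    /** Defaults stated in the text for PDF metadata attributes: all true. */
    static FieldSettingsSketch metadataDefaults() {
        return new FieldSettingsSketch(true, true, true);
    }

    /** Defaults stated in the text for body text: false, true, true. */
    static FieldSettingsSketch bodyTextDefaults() {
        return new FieldSettingsSketch(false, true, true);
    }

    /** Rough plain-language summary of what the flag combination means. */
    String describe() {
        String search;
        if (!index) {
            search = "not searchable";
        } else if (tokenize) {
            search = "searchable by individual terms";
        } else {
            search = "searchable only as one exact value";
        }
        String retrieve = store
                ? "original value retrievable from hits"
                : "original value not retrievable from hits";
        return search + "; " + retrieve;
    }

    public static void main(String[] args) {
        System.out.println("metadata:  " + metadataDefaults().describe());
        System.out.println("body text: " + bodyTextDefaults().describe());
    }
}
```

Read this way, the defaults make body text searchable without keeping a full copy of every PDF's text in the index, while the much smaller metadata values are both searchable and returned with search hits.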
Fin
It should be clear now that, paired with lucene-pdf, PDFxStream can provide remarkably easy Lucene PDF indexing. The conceptual overview and code sample provided here should get you most of the way towards making Lucene play nice with PDF documents, much to the benefit of your applications and projects.
Finally, while lucene-pdf is suitable for many typical Lucene PDF indexing jobs, there may be aspects of your project's requirements that it cannot meet (e.g. taking advantage of some of the more esoteric document indexing parameters available in more recent versions of Lucene). In that case, its (liberally-licensed, MIT) source can serve as a useful starting point, exhibiting how PDF data can be extracted using PDFxStream and turned into Lucene Documents; feel free to import it into your projects and modify it to suit your needs.