27.2. Building Indexes

27.2.1. Creating a New Index

Index creation and updating capabilities are implemented in both the Zend_Search_Lucene module and Java Lucene. You can use either implementation.

The PHP code listing below shows how to index a file using the Zend_Search_Lucene indexing API:

<?php
// Create index
$index = Zend_Search_Lucene::create('/data/my-index');

$doc = new Zend_Search_Lucene_Document();

// Store the document URL to identify it in search results.
$doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));

// Index document content
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docContent));

// Add document to the index.
$index->addDocument($doc);
?>

Newly added documents can be retrieved from the index immediately.
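As a minimal sketch of this behavior, the listing below adds one document and searches for it right away. It assumes Zend Framework is on the include path; the index path, URL, and text are placeholders chosen for illustration.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Placeholder path; create() builds a new, empty index there.
$index = Zend_Search_Lucene::create('/data/my-index');

$doc = new Zend_Search_Lucene_Document();
// Placeholder URL and text, used only for this example.
$doc->addField(Zend_Search_Lucene_Field::Text('url', 'http://example.com/page.html'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', 'An introduction to Lucene indexing'));
$index->addDocument($doc);

// Search for the document that was just added.
$hits = $index->find('contents:lucene');
foreach ($hits as $hit) {
    // 'url' is a stored field, so it can be read back from the hit.
    echo $hit->url, "\n";
}
?>
```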

27.2.2. Updating Index

The same procedure is used to update an existing index. The only difference is that the open() method is called instead of the create() method:

<?php
// Open existing index
$index = Zend_Search_Lucene::open('/data/my-index');

$doc = new Zend_Search_Lucene_Document();
// Store the document URL to identify it in search results.
$doc->addField(Zend_Search_Lucene_Field::Text('url', $docUrl));
// Index document content
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $docContent));

// Add document to the index.
$index->addDocument($doc);
?>

27.2.3. Updating Documents

The Lucene index file format doesn't support updating documents in place. To update a document, remove it from the index and add it again.

The Zend_Search_Lucene::delete() method takes an internal index document id. This id can be retrieved from a query hit through its 'id' property:

<?php
$removePath = ...;
$hits = $index->find('path:' . $removePath);
foreach ($hits as $hit) {
    $index->delete($hit->id);
}
?>
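Combining the removal with a re-add gives the full update procedure. The listing below is a sketch only: replaceDocument() is a hypothetical helper, not part of Zend_Search_Lucene, and it assumes that documents carry a 'path' field (as in the query above) and a 'contents' field.

```php
<?php
require_once 'Zend/Search/Lucene.php';

/**
 * Hypothetical helper: replace every document whose 'path' field
 * matches $path with a freshly built document.
 */
function replaceDocument($index, $path, $content)
{
    // Remove the old version(s); delete() takes the internal document id.
    foreach ($index->find('path:' . $path) as $hit) {
        $index->delete($hit->id);
    }

    // Re-add the document with its new content.
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('path', $path));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $content));
    $index->addDocument($doc);
}
?>
```

The re-added document receives a new internal id, so any id stored from an earlier search should not be reused after an update.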

27.2.4. Retrieving Index Size

Zend_Search_Lucene provides two methods for retrieving the index size.

Zend_Search_Lucene::maxDoc() returns one greater than the largest possible document number. This is the overall number of documents in the index, including deleted documents, so the method has a synonym: Zend_Search_Lucene::count().

Zend_Search_Lucene::numDocs() returns the total number of non-deleted documents.

<?php
$indexSize = $index->count();
$documents = $index->numDocs();
?>

The Zend_Search_Lucene::isDeleted($id) method can be used to check whether a document is deleted:

<?php
for ($count = 0; $count < $index->maxDoc(); $count++) {
    if ($index->isDeleted($count)) {
        echo "Document #$count is deleted.\n";
    }
}
?>

Index optimization removes deleted documents and compacts the ids of the remaining documents, so a document's internal id may change.

27.2.5. Index Optimization

A Lucene index consists of segments. Each segment is a completely independent set of data.

Lucene index segment files can't be updated in place by design: updating a segment requires reorganizing it completely. See the Lucene index file formats documentation for details (http://lucene.apache.org/java/docs/fileformats.html). New documents are therefore added to the index by creating a new segment.

An increasing number of segments reduces the quality of the index, but index optimization restores it. Optimization consists of merging several segments into one. This process doesn't update existing segments either: it generates a new, larger segment that replaces the set of old segments, and then updates the segment list (the 'segments' file).

Full index optimization can be invoked by calling Zend_Search_Lucene::optimize(). It merges all index segments into a new one:

<?php
// Open existing index
$index = Zend_Search_Lucene::open('/data/my-index');

// Optimize index.
$index->optimize();
?>
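Because optimization removes deleted documents, its effect can be observed through the size methods described earlier. The listing below is a small sketch under that assumption; the index path is a placeholder, and the exact numbers depend on the index contents.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Placeholder path to an existing index.
$index = Zend_Search_Lucene::open('/data/my-index');

// count() includes deleted documents, numDocs() does not.
printf("before: %d total, %d not deleted\n", $index->count(), $index->numDocs());

$index->optimize();

// After optimization the two values would be expected to match.
printf("after:  %d total, %d not deleted\n", $index->count(), $index->numDocs());
?>
```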

Automatic index optimization is performed to keep the index in a consistent state.

Automatic optimization is an iterative process controlled by several index options. It merges very small segments into larger ones, then merges those larger segments into even larger ones, and so on.
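The iterative idea can be illustrated with a deliberately simplified model. The listing below is not the algorithm used by Zend_Search_Lucene; simulateSegments() is a hypothetical function that assumes a segment is flushed every $maxBufferedDocs documents and that $mergeFactor segments of equal size are merged as soon as they exist. The two parameters borrow their names from the options described in the following subsections.

```php
<?php
/**
 * Toy model of cascading segment merges (hypothetical, for illustration).
 * Returns the sizes of the on-disk segments, oldest first. Documents
 * still waiting in the in-memory buffer are not included.
 */
function simulateSegments($numDocs, $maxBufferedDocs = 10, $mergeFactor = 10)
{
    $segments = array();
    $buffered = 0;

    for ($i = 0; $i < $numDocs; $i++) {
        $buffered++;
        if ($buffered < $maxBufferedDocs) {
            continue;
        }

        // Flush the in-memory buffer as a new small segment.
        $segments[] = $buffered;
        $buffered = 0;

        // Cascade: while the newest $mergeFactor segments have the
        // same size, replace them with one merged segment.
        while (count($segments) >= $mergeFactor) {
            $tail = array_slice($segments, -$mergeFactor);
            if (count(array_unique($tail)) !== 1) {
                break;
            }
            array_splice($segments, -$mergeFactor, $mergeFactor, array(array_sum($tail)));
        }
    }

    return $segments;
}

// 250 documents: two merged segments of 100 and five small ones of 10.
print_r(simulateSegments(250, 10, 10));
?>
```

In this model the number of segments stays small even as the index grows, which is the purpose of merging small segments into progressively larger ones.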

27.2.5.1. MaxBufferedDocs auto-optimization option

MaxBufferedDocs is the minimum number of documents required before the buffered in-memory documents are written into a new segment.

MaxBufferedDocs can be retrieved with $index->getMaxBufferedDocs() and set with $index->setMaxBufferedDocs($maxBufferedDocs).

The default value is 10.

27.2.5.2. MaxMergeDocs auto-optimization option

MaxMergeDocs is the largest number of documents ever merged by addDocument(). Small values (e.g., less than 10,000) are best for interactive indexing, as this limits the length of pauses while indexing to a few seconds. Larger values are best for batched indexing and speedier searches.

MaxMergeDocs can be retrieved with $index->getMaxMergeDocs() and set with $index->setMaxMergeDocs($maxMergeDocs).

The default value is PHP_INT_MAX.

27.2.5.3. MergeFactor auto-optimization option

MergeFactor determines how often segment indices are merged by addDocument(). With smaller values, less RAM is used while indexing, and searches on unoptimized indices are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indices are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indices that are interactively maintained.

MergeFactor is a good estimate of the average number of segments merged by one auto-optimization pass. Values that are too large produce a large number of segments that are not yet merged into a new one. This may cause a "failed to open stream: Too many open files" error. This limit is system dependent.

MergeFactor can be retrieved with $index->getMergeFactor() and set with $index->setMergeFactor($mergeFactor).

The default value is 10.
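As a sketch of how these options might be combined for batch index creation, the listing below raises MaxBufferedDocs and MergeFactor before adding documents and optimizes once at the end. The values 100 and 20 are illustrative rather than recommended, and the index path and the $documents array are placeholders.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Placeholder path for the new index.
$index = Zend_Search_Lucene::create('/data/my-index');

// Illustrative batch-indexing values, not recommendations.
$index->setMaxBufferedDocs(100);  // buffer more documents before writing a segment
$index->setMergeFactor(20);       // merge segments less often while adding

// Hypothetical input: URL => content pairs.
$documents = array(
    'http://example.com/a.html' => 'Text of the first page',
    'http://example.com/b.html' => 'Text of the second page',
);

foreach ($documents as $url => $content) {
    $doc = new Zend_Search_Lucene_Document();
    $doc->addField(Zend_Search_Lucene_Field::Text('url', $url));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $content));
    $index->addDocument($doc);
}

// Merge all segments into one when the batch is complete.
$index->optimize();
?>
```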

Lucene Java and Luke (Lucene Index Toolbox - http://www.getopt.org/luke/) can also be used to optimize an index.

27.2.6. Limitations

Limitations are platform dependent.

The maximum index size is 2 GB on 32-bit platforms.