27.3. Searching an Index

27.3.1. Building Queries

There are two ways to search the index. The first method uses the query parser to construct a query from a string. The second lets you build your own queries through the Zend_Search_Lucene API.

Before choosing to use the provided Query Parser, please consider the following:

  1. If you are programmatically generating a query string and then parsing it with the query parser then you should seriously consider building your queries directly with the query API. In other words, the query parser is designed for human-entered text, not for program-generated text.

  2. Untokenized fields are best added directly to queries and not through the query parser. If a field's values are generated programmatically by the application, then the query clauses for this field should also be constructed programmatically. An analyzer, which the query parser uses, is designed to convert human-entered text to terms. Program-generated values, like dates, keywords, etc., should be consistently program-generated.

  3. In a query form, fields that are general text should use the query parser. All others, such as date ranges, keywords, etc., are better added directly through the query API. A field with a limited set of values that can be specified with a pull-down menu should not be added to a query string that is subsequently parsed but instead should be added as a TermQuery clause.

  4. Boolean queries let you combine several queries into a new one. This makes them the best way to add extra criteria to a user search defined by a query string.

Both ways use the same API method to search through the index:

<?php
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/data/my_index');

$index->find($query);

?>

The Zend_Search_Lucene::find() method determines the input type automatically and uses the query parser to construct an appropriate Zend_Search_Lucene_Search_Query object from a string.

It is important to note that the query parser uses the standard analyzer to tokenize the separate parts of the query string. Thus, all transformations applied to indexed text are also applied to the query string.

These transformations may include converting text to lower case to make the search case-insensitive, removing stop words, stemming, and more.

In contrast, the query API does not transform or filter input terms, which makes it more suitable for computer-generated or untokenized fields.
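
The practical consequence of this difference can be illustrated with a small stand-alone sketch. The function normalizeLikeAnalyzer() below is hypothetical and is not part of Zend_Search_Lucene; it only imitates the kind of work an analyzer does (lower-casing, splitting into letter tokens, dropping an assumed list of stop words) to show why a raw term passed to the query API may fail to match text that was analyzed at indexing time.

```php
<?php
// Hypothetical stand-in for an analyzer (NOT the library's implementation):
// lower-cases the text, splits it into letter-only tokens and drops a few
// assumed stop words.
function normalizeLikeAnalyzer($text, array $stopWords = array('a', 'an', 'the', 'of'))
{
    $tokens = preg_split('/[^a-z]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_diff($tokens, $stopWords));
}

// Terms as they might end up in the index after analysis.
$indexedTerms = normalizeLikeAnalyzer('The History of PHP');

// A parsed query string goes through the same steps, so its terms line up.
$parsedTerms = normalizeLikeAnalyzer('PHP history');

// A term handed to the query API is used exactly as given.
$rawTerm = 'PHP';

var_dump(in_array($parsedTerms[0], $indexedTerms, true)); // analyzed query term
var_dump(in_array($rawTerm, $indexedTerms, true));        // raw, unanalyzed term
```

When building terms programmatically for a tokenized field, the application therefore has to apply the same normalization itself; for untokenized fields the stored value is matched as-is.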

27.3.1.1. Query parsing

The Zend_Search_Lucene_Search_QueryParser::parse() method may be used to parse a query string into a query object.

This object may be used in the query construction API methods to combine user-entered queries with machine-generated queries.

In some cases this is the only way to search for values within untokenized fields:

<?php
$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);

$pathTerm  = new Zend_Search_Lucene_Index_Term('/data/doc_dir/' . $filename, 'path');
$pathQuery = new Zend_Search_Lucene_Search_Query_Term($pathTerm);

$query = new Zend_Search_Lucene_Search_Query_Boolean();
$query->addSubquery($userQuery, true /* required */);
$query->addSubquery($pathQuery, true /* required */);

$hits = $index->find($query);

The Zend_Search_Lucene_Search_QueryParser::parse() method also takes an optional encoding parameter that specifies the query string encoding:

<?php
$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr, 'iso-8859-5');

If the encoding is omitted, the current locale is used.

It is also possible to specify the default query string encoding with the Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding() method:

<?php
Zend_Search_Lucene_Search_QueryParser::setDefaultEncoding('iso-8859-5');
...
$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);

Zend_Search_Lucene_Search_QueryParser::getDefaultEncoding() returns the current default query string encoding (an empty string means "current locale").

27.3.2. Search Results

The search result is an array of Zend_Search_Lucene_Search_QueryHit objects. Each of these has two properties: $hit->id is the document number within the index and $hit->score is the score of the hit in the search result. Results are ordered by score (top scores come first).

The Zend_Search_Lucene_Search_QueryHit object also exposes each field of the Zend_Search_Lucene_Document found by the hit as a property of the hit. In this example, a hit is returned and the corresponding document has two fields: title and author.

<?php
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/data/my_index');

$hits = $index->find($query);

foreach ($hits as $hit) {
    echo $hit->score;
    echo $hit->title;
    echo $hit->author;
}
?>

Stored fields are always returned in UTF-8 encoding.
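
If the page that displays the results uses a different character set, the field values need to be converted before output. The following is a minimal sketch using PHP's iconv() function; the helper name toPageEncoding() and the sample value are assumptions made for illustration and are not part of Zend_Search_Lucene.

```php
<?php
// Hypothetical helper: converts a UTF-8 field value to the page encoding.
function toPageEncoding($utf8Value, $pageEncoding)
{
    if (strcasecmp($pageEncoding, 'UTF-8') === 0) {
        return $utf8Value; // already in the target encoding
    }
    // iconv() returns false if a character cannot be represented
    // in the target encoding, so callers may want to check the result.
    return iconv('UTF-8', $pageEncoding, $utf8Value);
}

// "Café" written as UTF-8 bytes, standing in for a value such as $hit->title.
$title  = "Caf\xC3\xA9";
$latin1 = toPageEncoding($title, 'ISO-8859-1');

echo strlen($title), ' bytes in UTF-8, ', strlen($latin1), " bytes in ISO-8859-1\n";
```

Pages served as UTF-8 can output the stored values unchanged.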

Optionally, the original Zend_Search_Lucene_Document object can be returned from the Zend_Search_Lucene_Search_QueryHit. You can retrieve the stored parts of the document by calling the hit's getDocument() method and then reading them with the getFieldValue() method:

<?php
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open('/data/my_index');

$hits = $index->find($query);
foreach ($hits as $hit) {
    // return the Zend_Search_Lucene_Document object for this hit
    $document = $hit->getDocument();

    // return a Zend_Search_Lucene_Field object
    // from the Zend_Search_Lucene_Document
    $field = $document->getField('title');

    // return the string value of the Zend_Search_Lucene_Field object
    echo $document->getFieldValue('title');

    // same as getFieldValue()
    echo $document->title;
}
?>

The fields available from the Zend_Search_Lucene_Document object are determined at the time of indexing. The document fields are either indexed, or indexed and stored, in the document by the indexing application (e.g. LuceneIndexCreation.jar).

Note that the document identity ('path' in our example) is also stored in the index and must be retrieved from it.

27.3.3. Results Scoring

Zend_Search_Lucene uses the same scoring algorithms as Java Lucene. Hits in the search result are ordered by score by default. Hits with a greater score come first, and documents with higher scores match the query better than documents with lower scores.

Roughly speaking, search hits that contain the searched term or phrase more frequently will have a higher score.

The score can be retrieved by accessing the score property of a hit:

<?php
$hits = $index->find($query);

foreach ($hits as $hit) {
    echo $hit->id;
    echo $hit->score;
}

The Zend_Search_Lucene_Search_Similarity class is used to calculate the score. See the Scoring Algorithms section of the Extensibility chapter for details.
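
Because hits arrive ordered by descending score by default, an application can cut the list off at a minimum score. The sketch below uses plain objects with made-up id and score values as stand-ins for Zend_Search_Lucene_Search_QueryHit objects, and the threshold $minScore is a hypothetical value that would need tuning for real data.

```php
<?php
// Stand-ins for query hits (made-up values), already in the default order:
// highest score first.
$hits = array(
    (object) array('id' => 7, 'score' => 0.92),
    (object) array('id' => 3, 'score' => 0.41),
    (object) array('id' => 9, 'score' => 0.08),
);

$minScore = 0.25; // hypothetical cut-off

$relevant = array();
foreach ($hits as $hit) {
    if ($hit->score < $minScore) {
        break; // valid only while hits are ordered by descending score
    }
    $relevant[] = $hit;
}

foreach ($relevant as $hit) {
    printf("%d: %.2f\n", $hit->id, $hit->score);
}
```

The early break relies on the default ordering; if the result is sorted by a field instead, every hit has to be checked.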

27.3.4. Search Result Sorting

The search result is sorted by score by default. You can change this by setting a sort field (or fields), sort type and sort order parameters.

The $index->find() call may take several optional parameters:

<?php
$index->find($query [, $sortField [, $sortType [, $sortOrder]]] [, $sortField2 [, $sortType [, $sortOrder]]] ...);

$sortField is the name of a stored field to sort the result by.

$sortType may be omitted or take values SORT_REGULAR (compare items normally, default value), SORT_NUMERIC (compare items numerically), SORT_STRING (compare items as strings).

$sortOrder may be omitted or take values SORT_ASC (sort in ascending order, default value), SORT_DESC (sort in descending order).

Examples:

<?php
$index->find($query, 'quantity', SORT_NUMERIC, SORT_DESC);

<?php
$index->find($query, 'fname', SORT_STRING, 'lname', SORT_STRING);

<?php
$index->find($query, 'name', SORT_STRING, 'quantity', SORT_NUMERIC, SORT_DESC);

Please be careful when using a non-default sort order. It requires retrieving the documents completely from the index and may dramatically slow down search performance.
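
The sort type and sort order values are PHP's own sorting constants, the same ones accepted by array_multisort(). To show the ordering that a call such as find($query, 'name', SORT_STRING, 'quantity', SORT_NUMERIC, SORT_DESC) asks for, the following sketch applies array_multisort() to made-up field values. It illustrates the semantics only and makes no claim about how the library sorts internally.

```php
<?php
// Made-up stored field values standing in for matching documents.
$rows = array(
    array('name' => 'bolt',   'quantity' => '5'),
    array('name' => 'anchor', 'quantity' => '12'),
    array('name' => 'bolt',   'quantity' => '40'),
);

// Build one column per sort field.
$names      = array();
$quantities = array();
foreach ($rows as $row) {
    $names[]      = $row['name'];
    $quantities[] = $row['quantity'];
}

// 'name' as strings (ascending by default), then 'quantity' numerically,
// in descending order; $rows is reordered along with the columns.
array_multisort($names, SORT_STRING, SORT_ASC,
                $quantities, SORT_NUMERIC, SORT_DESC,
                $rows);

foreach ($rows as $row) {
    echo $row['name'], ' ', $row['quantity'], "\n";
}
```

Note that SORT_NUMERIC matters here: compared as strings, '5' would sort after '40'.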

27.3.5. Search Results Highlighting

The Zend_Search_Lucene_Search_Query::highlightMatches() method highlights the terms of an HTML document in the context of a search query:

<?php
$query = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
$hits = $index->find($query);
...
$highlightedHTML = $query->highlightMatches($sourceHTML);

The highlightMatches() method uses the Zend_Search_Lucene_Document_Html class (see the HTML documents section for details) for HTML processing, so it has the same requirements for the HTML source.