Chapter 27. Zend_Search_Lucene

Table of Contents

27.1. Overview
27.1.1. Introduction
27.1.2. Document and Field Objects
27.1.3. Understanding Field Types
27.1.4. HTML documents
27.2. Building Indexes
27.2.1. Creating a New Index
27.2.2. Updating Index
27.2.3. Updating Documents
27.2.4. Retrieving Index size
27.2.5. Index optimization
27.2.6. Limitations
27.3. Searching an Index
27.3.1. Building Queries
27.3.2. Search Results
27.3.3. Results Scoring
27.3.4. Search Result Sorting
27.3.5. Search Results Highlighting
27.4. Query Language
27.4.1. Terms
27.4.2. Fields
27.4.3. Term Modifiers
27.4.4. Proximity Searches
27.4.5. Boosting a Term
27.4.6. Boolean Operators
27.4.7. Grouping
27.4.8. Field Grouping
27.4.9. Escaping Special Characters
27.5. Query Construction API
27.5.1. Query Parser Exceptions
27.5.2. Term Query
27.5.3. Multi-Term Query
27.5.4. Phrase Query
27.6. Character set.
27.6.1. UTF-8 and single-byte character sets support.
27.6.2. Default text analyzer.
27.6.3. UTF-8 compatible text analyzer.
27.7. Extensibility
27.7.1. Text Analysis
27.7.2. Tokens Filtering
27.7.3. Scoring Algorithms
27.7.4. Storage Containers
27.8. Interoperating with Java Lucene
27.8.1. File Formats
27.8.2. Index Directory
27.8.3. Java Source Code
27.9. Advanced
27.9.1. Using index as static property

27.1. Overview

27.1.1. Introduction

Zend_Search_Lucene is a general purpose text search engine written entirely in PHP 5. Since it stores its index on the filesystem and does not require a database server, it can add search capabilities to almost any PHP-driven website. Zend_Search_Lucene supports the following features:

  • Ranked searching - best results returned first

  • Many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more [7]

  • Search by specific field (e.g., title, author, contents)

Zend_Search_Lucene was derived from the Apache Lucene project. The currently supported Lucene version is 2.0. For more information on Lucene, visit http://lucene.apache.org/java/docs/ (http://lucene.apache.org/java/2_0_0/).

27.1.2. Document and Field Objects

Zend_Search_Lucene treats documents as the atomic units of indexing. A document is divided into named fields, and fields have content that can be searched.

A document is represented by the Zend_Search_Lucene_Document object, and this object contains Zend_Search_Lucene_Field objects that represent the fields.

It is important to note that any kind of information can be added to the index. Application-specific information or metadata can be stored in the document fields, and later retrieved with the document during search.

It is the responsibility of your application to control the indexer. This means that data can be indexed from any source that is accessible by your application. For example, this could be the filesystem, a database, an HTML form, etc.

The Zend_Search_Lucene_Field class provides several static methods to create fields with different characteristics:

<?php
$doc = new Zend_Search_Lucene_Document();

// Field is not tokenized, but is indexed and stored within the index.
// Stored fields can be retrieved from the index.
$doc->addField(Zend_Search_Lucene_Field::Keyword('doctype',
                                                 'autogenerated'));

// Field is neither tokenized nor indexed, but is stored in the index.
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
                                                   time()));

// Binary-string-valued field that is neither tokenized nor indexed,
// but is stored in the index.
$doc->addField(Zend_Search_Lucene_Field::Binary('icon',
                                                $iconData));

// Field is tokenized and indexed, and is stored in the index.
$doc->addField(Zend_Search_Lucene_Field::Text('annotation',
                                              'Document annotation text'));

// Field is tokenized and indexed, but is not stored in the index.
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents',
                                                  'My document content'));

?>

Each of these methods (except Zend_Search_Lucene_Field::Binary()) takes an optional $encoding parameter that specifies the encoding of the input data.

The encoding may differ between documents, and between fields within a single document:

<?php
$doc = new Zend_Search_Lucene_Document();
$doc->addField(Zend_Search_Lucene_Field::Text('title', $title, 'iso-8859-1'));
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents, 'utf-8'));
?>

If the encoding parameter is omitted, the current locale is used at processing time. For example:

<?php
setlocale(LC_ALL, 'de_DE.iso-8859-1');
// ...
$doc->addField(Zend_Search_Lucene_Field::UnStored('contents', $contents));
?>

Field values are always stored in the index, and returned from it, in UTF-8 encoding. Conversion to UTF-8 is performed automatically.

Text analyzers (see below) may also convert text to other encodings. In fact, the default analyzer converts text to the 'ASCII//TRANSLIT' encoding. Be careful with this: such transliteration may depend on the current locale.
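The two conversions mentioned above can be illustrated with PHP's iconv() function. This is only a sketch of the encodings involved, not the library's internal code. The sample string and the locale name are assumptions, and the transliterated result is deliberately not shown because it varies with the locale and the iconv implementation.

```php
<?php
// "Müller" encoded as ISO-8859-1: the u-umlaut is the single byte 0xFC.
$latin1 = "M\xFCller";

// Conversion to UTF-8, the encoding in which field values are kept.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
echo bin2hex($utf8), "\n"; // 4dc3bc6c6c6572

// Transliteration to ASCII, similar in spirit to what the default
// analyzer applies. The outcome depends on the current locale and on the
// iconv implementation; iconv() returns false if it cannot convert.
setlocale(LC_CTYPE, 'de_DE.UTF-8'); // assumed locale, may not be installed
$ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $utf8);
var_dump($ascii);
?>
```

Running the same script under different locales is a quick way to see whether transliteration is stable enough for a given deployment.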

Field names are entirely your own choice.

Java Lucene uses the "contents" field as its default search field. Zend_Search_Lucene searches through all fields by default, but this behavior can be changed. See the "Default search field" chapter for details.
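As a brief sketch of changing that behavior, the snippet below assumes the static Zend_Search_Lucene::setDefaultSearchField() method and that Zend Framework is on the include path. The "contents" field name is only an example.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Restrict unqualified query terms to the "contents" field,
// mirroring the Java Lucene default.
Zend_Search_Lucene::setDefaultSearchField('contents');

// Passing null restores searching through all fields.
Zend_Search_Lucene::setDefaultSearchField(null);
?>
```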

27.1.3. Understanding Field Types

  • Keyword fields are stored and indexed, meaning that they can be searched as well as displayed in search results. They are not split up into separate words by tokenization. Enumerated database fields usually translate well to Keyword fields in Zend_Search_Lucene.

  • UnIndexed fields are not searchable, but they are returned with search hits. Database timestamps, primary keys, file system paths, and other external identifiers are good candidates for UnIndexed fields.

  • Binary fields are not tokenized or indexed, but are stored for retrieval with search hits. They can be used to store any data encoded as a binary string, such as an image icon.

  • Text fields are stored, indexed, and tokenized. Text fields are appropriate for storing information like subjects and titles that need to be searchable as well as returned with search results.

  • UnStored fields are tokenized and indexed, but not stored in the index. Large amounts of text are best indexed using this type of field. Storing data creates a larger index on disk, so if you need to search but not redisplay the data, use an UnStored field. UnStored fields are practical when using a Zend_Search_Lucene index in combination with a relational database. You can index large data fields with UnStored fields for searching, and retrieve them from your relational database by using a separate field as an identifier.

    Table 27.1. Zend_Search_Lucene_Field Types

    Field Type  Stored  Indexed  Tokenized  Binary
    Keyword     Yes     Yes      No         No
    UnIndexed   Yes     No       No         No
    Binary      Yes     No       No         Yes
    Text        Yes     Yes      Yes        No
    UnStored    No      Yes      Yes        No
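The combination of an UnStored field with an UnIndexed identifier, as described for relational databases above, might look roughly like the following. The $articles array is a hypothetical stand-in for a database table, and the index path and the field names are assumptions for illustration. Zend Framework is assumed to be on the include path.

```php
<?php
require_once 'Zend/Search/Lucene.php';

// Hypothetical stand-in for a database table, keyed by primary key.
$articles = array(
    17 => 'A long article body about search engines ...',
    42 => 'Another long article body about databases ...',
);

// Illustrative index location.
$index = Zend_Search_Lucene::create(sys_get_temp_dir() . '/zsl-unstored-example');

foreach ($articles as $id => $body) {
    $doc = new Zend_Search_Lucene_Document();
    // Stored but not searchable: used only to find the row again.
    $doc->addField(Zend_Search_Lucene_Field::UnIndexed('article_id', (string) $id));
    // Searchable but not stored: keeps the index small.
    $doc->addField(Zend_Search_Lucene_Field::UnStored('body', $body));
    $index->addDocument($doc);
}
$index->commit();

foreach ($index->find('databases') as $hit) {
    // The body is not in the index, so fetch it from the "database".
    $id = (int) $hit->article_id;
    echo $id, ': ', $articles[$id], "\n";
}
?>
```

The identifier field is named article_id rather than id because hit objects already expose an id property holding the internal document number.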

27.1.4. HTML documents

Zend_Search_Lucene offers an HTML parsing feature. Documents can be created directly from an HTML file or string:

<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTMLFile($filename);
$index->addDocument($doc);
// ...
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$index->addDocument($doc);
?>

The Zend_Search_Lucene_Document_Html class uses the DOMDocument::loadHTML() and DOMDocument::loadHTMLFile() methods to parse the source HTML, so the HTML does not need to be well formed or to be XHTML. On the other hand, it is sensitive to the encoding specified in the "meta http-equiv" header tag.

The Zend_Search_Lucene_Document_Html class recognizes the document title, the body, and the document header meta tags.

The 'title' field is the /html/head/title value. It is stored within the index, tokenized, and available for searching.

The 'body' field is the body content. It does not include scripts, comments, or tag attributes.

The loadHTML() and loadHTMLFile() methods of the Zend_Search_Lucene_Document_Html class also take an optional second argument. If it is set to true, the body content is also stored within the index and can be retrieved from it. By default, the body is tokenized and indexed, but not stored.

Document header meta tags produce additional document fields. The field name is taken from the 'name' attribute and the field value from the 'content' attribute. The value is tokenized, indexed, and stored, so documents may be searched by their meta tags (for example, by keywords).
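To make the field mapping concrete, here is a small sketch that parses a made-up HTML string and inspects the resulting fields. The markup is an assumption for illustration, and Zend Framework is assumed to be on the include path.

```php
<?php
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Document/Html.php';

// Sample markup, made up for this example.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<title>Sample page</title>'
      . '<meta name="keywords" content="lucene, search">'
      . '</head><body><p>Hello world</p></body></html>';

// Second argument true: the body is stored as well as indexed.
$doc = Zend_Search_Lucene_Document_Html::loadHTML($html, true);

// List the fields produced from the title, body, and named meta tags.
print_r($doc->getFieldNames());
echo $doc->getFieldValue('title'), "\n";
echo $doc->getFieldValue('keywords'), "\n";

// Inspect whether the body field is flagged as stored.
var_dump($doc->getField('body')->isStored);
?>
```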

Parsed documents may be extended with any other fields:

<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('created',
                                                   time()));
$doc->addField(Zend_Search_Lucene_Field::Text('annotation',
                                              'Document annotation text'));
$index->addDocument($doc);
?>

Document links are not included in the generated document, but they can be retrieved with the Zend_Search_Lucene_Document_Html::getLinks() and Zend_Search_Lucene_Document_Html::getHeaderLinks() methods:

<?php
$doc = Zend_Search_Lucene_Document_Html::loadHTML($htmlString);
$linksArray = $doc->getLinks();
$headerLinksArray = $doc->getHeaderLinks();
?>



[7] Term, multi term, phrase queries, boolean expressions and subqueries are supported at this time.