Zend_Search_Lucene works with UTF-8 charset internally. Index files store unicode data in Java's "modified UTF-8 encoding". Zend_Search_Lucene core completely supports it with one exception. [9]
Actual input data encoding may be specified through Zend_Search_Lucene API. Data will be automatically converted into UTF-8 encoding.
However, default text analyzer (which is also used within query parser) uses ctype_alpha() for tokenizing text and queries.
ctype_alpha() is not UTF-8 compatible, so analyzer converts text to 'ASCII//TRANSLIT' encoding before indexing. The same processing is performed during query parsing, so it's done transparently. [10]
Zend_Search_Lucene also contains limited functionaliy utf-8 analyzer. It can be turned on with the following code:
<?php Zend_Search_Lucene_Analysis_Analyzer::setDefault( new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8());
It tokenizes data for indexing in UTF-8 mode and has no problems with any UTF-8 compatible character.
It has two limitations:
treats all non-ascii characters as letters (it's not always true);
is case-sensitive;
Because of these limitations it's not set as default, but may be helpful for someone.
Case insensitivity my be emulated with strtolower()
conversion:
<?php setlocale(LC_CTYPE, 'de_DE.iso-8859-1'); ... Zend_Search_Lucene_Analysis_Analyzer::setDefault( new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8()); ... $doc = new Zend_Search_Lucene_Document(); $doc->addField(Zend_Search_Lucene_Field::UnStored('contents', strtolower($contents))); // Title field for search through (indexed, unstored) $doc->addField(Zend_Search_Lucene_Field::UnStored('title', strtolower($title))); // Title field for retrieving (unindexed, stored) $doc->addField(Zend_Search_Lucene_Field::UnIndexed('_title', $title));
The same conversion has to be performed with query string:
<?php setlocale(LC_CTYPE, 'de_DE.iso-8859-1'); ... Zend_Search_Lucene_Analysis_Analyzer::setDefault( new Zend_Search_Lucene_Analysis_Analyzer_Common_Utf8()); ... $hits = $index->find(strtolower($query));
[9] Zend_Search_Lucene supports only Basic Multilingual Plane (BMP) characters (from 0x0000 to 0xFFFF) and doesn't support "supplementary characters" (characters whose code points are greater than 0xFFFF)
Java 2 represents these characters as a pair of char (16-bit) values, the first from the high-surrogates range (0xD800-0xDBFF), the second from the low-surrogates range (0xDC00-0xDFFF). Then they are encoded as usual UTF-8 characters in six bytes. Standard UTF-8 representation uses four bytes for supplementary characters.
[10] Conversion to 'ASCII//TRANSLIT' may depend on current locale and OS.