Bjorn Lamers
2010-Jun-09 14:48 UTC
[Xapian-discuss] TermGenerator incorrectly tokenizes German text which contains special characters
Dear Xapian users,

I am trying to index some German text with Xapian using the xapian_php bindings. I run Apache 2.2 on Windows using PHP 5.2.13 with the prebuilt Xapian bindings from Flax:

    Xapian Support            enabled
    Xapian Compiled Version   @PACKAGE_VERSION@
    Xapian Linked Version     1.2.0

The problem is that after indexing text which contains special characters like ä, ö, ü and ß, using TermGenerator::index_text (http://xapian.org/docs/sourcedoc/html/classXapian_1_1TermGenerator.html#b358784fa685139e8bdd71d37f39573e), terms get cut off (stopped) after the special character. For example, the term gesundheitsschädlich is indexed as gesundheitsschä and Zgesundheitsschä (stemmed).

All character encodings are set to UTF-8; the MySQL database is also in UTF-8 encoding.

    #1 $lIndexer = new XapianTermGenerator();
    #2 $lStemmer = new XapianStem(XapianHelper::GetStemmer($pLanguage)); // "german"
    #3 $lIndexer->set_stemmer($lStemmer);
    #4 $lDoc = new XapianDocument();
    #5 $lDoc->add_term($lObj->Id);
    #6 $lIndexer->set_document($lDoc);
    #7 $lIndexer->index_text("Nahrungsergänzungsmittel Ausreißer");
    #8 $lIndexer->index_text($lSomeStringFromDb);

In the code example above, the problem only occurs when I index text on line #8. The string which gets indexed on line #7 is indexed correctly ({Zausreiss, Znahrungserganzungsmittel, ausreißer, nahrungsergänzungsmittel}). When I force $lSomeStringFromDb to be in UTF-8 encoding, the tokens are still incorrect. $lSomeStringFromDb can come either from the database or from MemCache.

I checked the character encoding of the different inputs with the PHP function mb_detect_encoding. Strings containing special characters are detected as UTF-8; strings which do not contain special characters are detected as ASCII. The string from #7 is detected as UTF-8.
    mb_detect_encoding("Nahrungsergänzungsmittel Ausreißer") => UTF-8
    mb_detect_encoding("Nahrungserganzungsmittel Ausreisser") => ASCII
    mb_check_encoding("Nahrungsergänzungsmittel Ausreißer", "UTF-8") => true
    mb_check_encoding("Nahrungserganzungsmittel Ausreisser", "UTF-8") => true

When I do the checks on the MemCache/database variable I get these results:

    mb_detect_encoding($lSomeStringFromDb) => ASCII
    mb_check_encoding($lSomeStringFromDb, "UTF-8") => true

No matter what conversions I do, the variable is detected as ASCII.

    // http://www.php.net/manual/en/function.utf8-encode.php#89789
    function fixEncoding($in_str)
    {
        $cur_encoding = mb_detect_encoding($in_str);
        if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8"))
            return $in_str;
        else
            return utf8_encode($in_str);
    } // fixEncoding

    if (mb_detect_encoding($lString) == "ASCII")
        $lString = mb_convert_encoding($lString, "UTF-8", "ASCII");

Only when I add a special character to the variable does it get detected as UTF-8 (in all cases the string was confirmed as valid UTF-8 with mb_check_encoding). But the generated terms are still incorrect.

    mb_detect_encoding(?? ? . $lSomeStringFromDb) => UTF-8
    mb_check_encoding(?? ? . $lSomeStringFromDb, "UTF-8") => true

To sum up my encoding problems: text which contains special characters (German text) is not indexed correctly. Terms are cut off just after a special character. I'm pretty sure my variables/objects are all in UTF-8 format, but they are not properly indexed. When I copy the contents of my variables/objects into string literals in PHP, the content is properly indexed.

What can be the problem with the variables? Why aren't they indexed properly?

Thanks in advance.

Best regards,
Bjorn
Olly Betts
2010-Jun-10 02:57 UTC
[Xapian-discuss] TermGenerator incorrectly tokenizes German text which contains special characters
On Wed, Jun 09, 2010 at 04:48:24PM +0200, Bjorn Lamers wrote:
> I try to index some German text with Xapian using the xapian_php bindings. I
> run Apache 2.2 on Windows using PHP 5.2.13 with the pre build xapian
> bindings from Flax:
> Xapian Support enabled Xapian
> Compiled Version @PACKAGE_VERSION@

Charlie, can you fix that?

> Xapian Linked Version 1.2.0
>
> The problem is that after indexing text which contains special characters
> like ä, ö, ü and ß, using TermGenerator::index_text (
> http://xapian.org/docs/sourcedoc/html/classXapian_1_1TermGenerator.html#b358784fa685139e8bdd71d37f39573e),
> terms get cut off (stopped) after the special character. For example the
> term gesundheitsschädlich is indexed as gesundheitsschä and Zgesundheitsschä
> (stemmed).
>
> All character encodings are set to UTF-8, the MySql database is also in
> UTF-8 encoding.
> *
> #1 $lIndexer = new XapianTermGenerator();
> #2 $lStemmer = new XapianStem(XapianHelper::GetStemmer($pLanguage)); //
> "german"
> #3 $lIndexer->set_stemmer($lStemmer);
> #4 $lDoc = new XapianDocument();
> #5 $lDoc->add_term($lObj->Id);
> #6 $lIndexer->set_document($lDoc);
> #7 $lIndexer->index_text("Nahrungsergänzungsmittel Ausreißer");
> #8 $lIndexer->index_text($lSomeStringFromDb);*
>
> In the code example just above here the problem only occurs when I try to
> index text on line #8. The string which get indexed on line #7 is indexed
> correctly ({Zausreiss, Znahrungserganzungsmittel, ausreißer,
> nahrungsergänzungsmittel}). When I force *$lSomeStringFromDb* to be in UTF-8
> encoding the tokens are also incorrect.
> *$lSomeStringFromDb* can either come from the database or from MemCache.

If it works with a literal string but not with a variable containing that string, it sounds to me like there's something funny about the variable.
What does this show:

    var_dump($lSomeStringFromDb);

I'm wondering if it's an object with some conversion magic which isn't working quite right.

> I checked the character encoding of the different inputs with the
> PHP-method: mb_detect_encoding. Strings containing special characters have
> encoding UTF-8, string which not contain special characters are detected as
> ASCII. The string from #7 is detected as UTF-8.

That's what I'd expect (since ASCII is a subset of UTF-8).

> When I do the checks on the MemCache/database variable I get these results:
> *mb_detect_encoding($lSomeStringFromDb) => ASCII
> mb_check_encoding($lSomeStringFromDb, "UTF-8") => true*
>
> No matter what conversions I do the variable is detected as ASCII.

That's OK if it only has ASCII characters in. A UTF-8 string with only ASCII characters in is byte for byte identical to an ASCII string with the same characters.

Cheers,
    Olly
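
The point about ASCII being a subset of UTF-8 can be checked by looking at the raw bytes of a string rather than at what mb_detect_encoding reports. The following is only a minimal sketch, assuming the mbstring extension is loaded. The sample strings and the describe_bytes() helper are hypothetical illustrations and are not taken from the thread or from Xapian.

```php
<?php
// Hypothetical sample strings; the escapes spell out the raw bytes explicitly
// so the result does not depend on the encoding of this source file.
$ascii  = "Ausreisser";        // ASCII characters only
$utf8   = "Ausrei\xC3\x9Fer";  // "Ausreißer" in UTF-8 (ß is the two bytes C3 9F)
$latin1 = "Ausrei\xDFer";      // "Ausreißer" in ISO-8859-1 (ß is the single byte DF)

// Hypothetical helper: summarise what a string really contains at byte level.
function describe_bytes($s)
{
    return array(
        'bytes'      => strlen($s),                       // byte count, not character count
        'hex'        => bin2hex($s),                      // raw bytes as hex
        'valid_utf8' => mb_check_encoding($s, 'UTF-8'),   // is this well-formed UTF-8?
        'ascii_only' => !preg_match('/[\x80-\xFF]/', $s), // no byte above 0x7F
    );
}

foreach (array('ascii' => $ascii, 'utf8' => $utf8, 'latin1' => $latin1) as $name => $s) {
    echo $name, ': ';
    var_dump(describe_bytes($s));
}

// Converting an ASCII-only string to UTF-8 changes nothing, byte for byte.
var_dump(mb_convert_encoding($ascii, 'UTF-8', 'ASCII') === $ascii);

// A Latin-1 string needs a real conversion before it is valid UTF-8.
var_dump(mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1') === $utf8);
```

Running the same helper on the variable that misbehaves (for example describe_bytes($lSomeStringFromDb)) would show whether the special characters are actually present as UTF-8 byte sequences, present in some other encoding, or already missing before the string reaches index_text.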