R. Mattes
2005-Aug-09 15:51 UTC
[Xapian-discuss] Xapian::Queryparser / Encoding Problem (Utf8)
Well, the subject line says it all - what's the status of the UTF-8 support in the query parser? I recall some messages on the list recently but haven't heard of any updates. This is becoming a major showstopper for our project (all data is in UTF-8 and I'd hate to have to rewrite the indexer to recode the data).

I guess I could have a look at the lemon source, but it has been a while since I last wrote lemon grammars (and never for C++).

TIA
Ralf Mattes
Richard Boulton
2005-Aug-10 15:29 UTC
[Xapian-discuss] Xapian::Queryparser / Encoding Problem (Utf8)
On Tue, 2005-08-09 at 15:21 +0200, R. Mattes wrote:
> Well, the subject line says it all - what's the status
> of the UTF-8 support in the query parser? I recall some
> messages in the list recently but haven't heard of any
> updates. This starts to be a major showstopper for our
> project (all data is in UTF-8 and I'd hate to have to
> rewrite the indexer to recode the data).
> I guess I could have a look at the lemon source but it
> has been a while since I last wrote lemon grammars (and
> never for c++).

I believe that there haven't been any updates since the last flurry of messages on the list. (But feel free to check the commit logs for the relevant module.)

Part of the problem has been that the stemming algorithms used not to support UTF-8 - however, the upstream algorithms (at http://snowball.tartarus.org/) now support this quite happily. However, other changes to the output of the stemmers have also occurred since the algorithms were imported into the Xapian source tree, so updating the algorithms has been waiting for a major release (since changing the stemming algorithms will force all databases to be rebuilt with the new algorithms).

That said, don't let that stop you taking a look at the work, changing the algorithms locally, and submitting a patch.

The query parser itself shouldn't need too much work - you'll probably need to look at the accent normalising code (see accentnormalisingitor.h and symboltab.h).

Oh, and note that the very latest English stemming algorithm from Snowball makes use of apostrophe characters if it's presented with them, so it would be good to stop stripping them out of the input to the stemmer if the language is English.

--
Richard Boulton <richard@tartarus.org>
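
For illustration only, the two points above (accent normalising on UTF-8 input, and leaving apostrophes in place for English) could be sketched roughly as follows. This is not Xapian code and does not reflect what accentnormalisingitor.h does internally; the function names `strip_accents` and `prepare_term` and the `language` parameter are hypothetical, and the sketch assumes that accent normalising simply means dropping combining marks after Unicode decomposition.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks after canonical decomposition (assumed behaviour)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def prepare_term(raw: bytes, language: str = "english") -> str:
    """Decode a UTF-8 term and clean it up before it is handed to a stemmer.

    Hypothetical helper: apostrophes are kept only for English, since the
    newer English Snowball stemmer can make use of them.
    """
    text = strip_accents(raw.decode("utf-8")).lower()
    keep_apostrophe = language == "english"
    return "".join(
        ch for ch in text
        if ch.isalnum() or (keep_apostrophe and ch == "'")
    )
```

Called as `prepare_term("Résumé's".encode("utf-8"))`, the sketch keeps the apostrophe; passing a different `language` value strips it. Only the ASCII apostrophe is handled here; typographic quotes would need extra mapping.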