thr3ads.net - Xapian discuss - [Xapian-discuss] Xapian::Queryparser / Encoding Problem (Utf8) [Aug 2005]


R. Mattes

2005-Aug-09 15:51 UTC

[Xapian-discuss] Xapian::Queryparser / Encoding Problem (Utf8)

Well, the subject line says it all - what's the status 
of the UTF-8 support in the query parser? I recall some
messages in the list recently but haven't heard of any
updates. This starts to be a major showstopper for our
project (all data is in UTF-8 and I'd hate to have to
rewrite the indexer to recode the data).
I guess I could have a look at the lemon source but it
has been a while since I last wrote lemon grammars (and
never for c++).

 TIA Ralf Mattes

Richard Boulton

2005-Aug-10 15:29 UTC

On Tue, 2005-08-09 at 15:21 +0200, R. Mattes wrote:
> Well, the subject line says it all - what's the status 
> of the UTF-8 support in the query parser? I recall some
> messages in the list recently but haven't heard of any
> updates. This starts to be a major showstopper for our
> project (all data is in UTF-8 and I'd hate to have to
> rewrite the indexer to recode the data).
> I guess I could have a look at the lemon source but it
> has been a while since I last wrote lemon grammars (and
> never for c++).

I believe that there haven't been any updates since the last flurry of
messages on the list.  (But feel free to check the commit logs for the
relevant module.)

Part of the problem has been that the stemming algorithms did not
previously support UTF-8.  The upstream algorithms (at
http://snowball.tartarus.org/) now support it quite happily, but other
changes to the output of the stemmers have also been made since the
algorithms were imported into the Xapian source tree.  Updating the
algorithms has therefore been waiting for a major release, since changing
the stemming algorithms will force all databases to be rebuilt with the
new algorithms.  That said, don't let that stop you from taking a look at
the work and changing them locally (and submitting a patch...)

The query parser itself shouldn't need too much work - you'll probably
need to look at the accent normalising code (see accentnormalisingitor.h
and symboltab.h).
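
To illustrate the kind of change involved, here is a minimal sketch of a
UTF-8-aware accent normalising step.  It is not the code from
accentnormalisingitor.h: the names decode_utf8 and fold_accents and the
small mapping table are hypothetical, and the decoder is simplified (it
does not reject overlong encodings or surrogates).  The idea is to decode
whole code points rather than treating each byte as a character, and to
copy anything not in the table through unchanged.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Returns U+FFFD for malformed input (simplified validation only).
static char32_t decode_utf8(const std::string& s, std::size_t& i) {
    unsigned char c = static_cast<unsigned char>(s[i++]);
    if (c < 0x80) return c;
    int extra;
    char32_t cp;
    if ((c & 0xE0) == 0xC0)      { extra = 1; cp = c & 0x1F; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
    else return 0xFFFD;  // stray continuation byte or invalid lead byte
    while (extra-- > 0) {
        if (i >= s.size()) return 0xFFFD;  // truncated sequence
        unsigned char cc = static_cast<unsigned char>(s[i]);
        if ((cc & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (cc & 0x3F);
        ++i;
    }
    return cp;
}

// Replace a few accented Latin letters with their unaccented base letter.
// The table is illustrative only; a real one would be much larger.
std::string fold_accents(const std::string& in) {
    static const std::unordered_map<char32_t, const char*> table = {
        {0x00E0, "a"}, {0x00E1, "a"}, {0x00E2, "a"}, {0x00E4, "a"},
        {0x00E7, "c"},
        {0x00E8, "e"}, {0x00E9, "e"}, {0x00EA, "e"}, {0x00EB, "e"},
        {0x00EF, "i"}, {0x00F1, "n"}, {0x00F6, "o"}, {0x00FC, "u"},
    };
    std::string out;
    std::size_t i = 0;
    while (i < in.size()) {
        std::size_t start = i;
        char32_t cp = decode_utf8(in, i);
        auto it = table.find(cp);
        if (it != table.end())
            out += it->second;                 // mapped: emit base letter
        else
            out.append(in, start, i - start);  // unmapped: copy bytes as-is
    }
    return out;
}
```

Copying unmapped sequences byte for byte means text outside the table
(for example CJK) passes through untouched, so no UTF-8 encoder is needed.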

Oh, and note that the very latest English stemming algorithm from
Snowball makes use of apostrophe characters if it's presented with them,
so it would be good to stop stripping them out of the input to the
stemmer when the language is English.
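
As a rough sketch of that suggestion (the function prepare_for_stemmer
and its "english" language string are hypothetical, not part of the
Xapian API), the term clean-up step could keep apostrophes only when the
stemmer language is English, and pass non-ASCII bytes through so that
UTF-8 sequences stay intact:

```cpp
#include <cctype>
#include <string>

// Lowercase ASCII letters, drop ASCII punctuation, and keep apostrophes
// only for English.  Bytes >= 0x80 belong to UTF-8 multi-byte sequences
// and are copied unchanged.
std::string prepare_for_stemmer(const std::string& word,
                                const std::string& language) {
    const bool keep_apostrophe = (language == "english");
    std::string out;
    for (unsigned char c : word) {
        if (c >= 0x80) {
            out += static_cast<char>(c);
        } else if (std::isalnum(c)) {
            out += static_cast<char>(std::tolower(c));
        } else if (c == '\'' && keep_apostrophe) {
            out += '\'';
        }
        // Any other ASCII punctuation is dropped.
    }
    return out;
}
```

With this, "Ralf's" would reach an English stemmer as "ralf's" but a
German one as "ralfs".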

-- 
Richard Boulton <richard@tartarus.org>
