On Sun, Apr 10, 2005 at 12:45:28PM -0400, info@bannershift.com wrote:
> I would like to know if xapian supports utf8.
>
> It is possible to add document data in utf8 format ?
>
> For example
>
> documen.set_data(utf8_description);
The document data is just an opaque blob as far as the library is
concerned. So you can put whatever you like in there.
However, omega (and omindex and scriptindex) impose a certain structure
on the document data - they use it to store a list of NAME=VALUE pairs,
one per line.
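
As a rough illustration of that layout, here is a minimal sketch of
reading NAME=VALUE pairs back out of a document data string. It uses only
the C++ standard library; parse_fields is a hypothetical helper, not part
of Xapian or omega, and the handling of repeated names and of lines
without '=' is an assumption made for the example. Values are treated as
raw bytes, so utf-8 text passes through unchanged.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Xapian or omega): parse document data
// laid out as NAME=VALUE pairs, one per line.
// Assumptions: lines without '=' are skipped, and if a name repeats the
// first value is kept.
std::map<std::string, std::string> parse_fields(const std::string& data) {
    std::map<std::string, std::string> fields;
    std::istringstream in(data);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type eq = line.find('=');
        if (eq == std::string::npos) continue;  // not a NAME=VALUE line
        // emplace() does not overwrite an existing key.
        fields.emplace(line.substr(0, eq), line.substr(eq + 1));
    }
    return fields;
}
```

The field names are up to whoever writes the data; something like
parse_fields("url=/a.html\ncaption=Example\n") would give a map with the
two keys "url" and "caption".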
Two parts of the core library currently make character set assumptions:
the stemmers and the query parser. Both assume latin1. The assumption
isn't very deeply embedded, and it's something I plan to fix:
http://www.xapian.org/cgi-bin/bugzilla/show_bug.cgi?id=30
It'll require tweaking Snowball to produce utf-8 stemmers - there was
some discussion on the Snowball list about this a few months ago:
http://thread.gmane.org/gmane.comp.search.snowball/668
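
To see why a latin1 assumption matters for utf-8 input, here is a small
sketch (standard C++ only, not code from Xapian or Snowball). In latin1
every byte is one character, but in utf-8 an accented letter such as "é"
takes two bytes, so code that walks the string byte by byte sees an extra
"character". utf8_length is a hypothetical helper and assumes well-formed
utf-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count utf-8 code points by counting every byte
// that is not a continuation byte (bit pattern 10xxxxxx).
// Assumes the input is well-formed utf-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the utf-8 string "caf\xc3\xa9" ("café"), size() reports 5 bytes while
utf8_length() reports 4 code points; a byte-per-character stemmer or
parser would be working with the former.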
Cheers,
Olly