Kevin SoftDev
2006-Mar-02 18:55 UTC
[Xapian-discuss] Different Collation (utf8_slovak_ci, utf8_danish_ci, latin1_german1_ci) etc.
One issue left for me to figure out is that in different languages there are different characters and Xapian takes only English characters. Therefore many words entered by users that contain their own language's special characters will not return any results. MySQL offers different collations ... Now that I see how much data Xapian can search, perhaps I could expand my index spider to different European countries, but how will I deal with different collations? Interesting question.

Kevin
http://nitra.net
Olly Betts
2006-Mar-02 20:12 UTC
[Xapian-discuss] Different Collation (utf8_slovak_ci, utf8_danish_ci, latin1_german1_ci) etc.
On Thu, Mar 02, 2006 at 10:55:22AM -0800, Kevin SoftDev wrote:
> One issue left for me to figure out is that in different languages there are
> different characters and Xapian takes only English characters.

No, it doesn't only take English characters. Xapian::Stem and Xapian::QueryParser currently assume iso8859-1 (which covers most western European languages, plus some others), but should be fixed to be able to handle utf-8 fairly soon.

Everything else treats the data as opaque, so is agnostic about encoding issues. The core library is zero-byte safe, so wide characters should be fine too. I've found (and fixed) code in the bindings (and SWIG) which isn't zero-byte safe, but I haven't done a thorough audit, so you may still hit issues there - if you do, please report them, as they're easy to fix once identified.

> Therefore many words entered by users that contain their own language's special
> characters will not return any results. MySQL offers different collations ...

Assuming by a collation you mean a total order on pairs of strings, I don't plan to implement that in Xapian, because I think it's better addressed externally (mainly for reasons of efficiency).

The sort order aspect of a collation would affect TermIterators, but it would be expensive to make a TermIterator return terms in anything other than the natural order. I think if you need terms ordered in a particular way, it's better to gather those you want and then sort them.

The "different character strings comparing equal" aspect really needs to be handled by converting them to a canonical form when generating terms. Otherwise you're going to need to do an "OR" query for any single term which is affected by this.

You could potentially allow a single collation to be specified for ordering (and treating as equal or not) terms at the Btree table manager level, but you couldn't change it except by rebuilding the database from scratch, and it would complicate the Btree manager a lot.

Cheers,
Olly
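
As a rough illustration of the two external approaches described above (converting to a canonical form when generating terms, and gathering terms and then sorting them), here is a minimal Python sketch. It does not use the Xapian API at all; canonical_term and sort_terms are hypothetical helper names, and the folding rule (case-fold, then strip combining accents) is only one possible choice of canonical form, not a real language-specific collation. Letters with no Unicode decomposition, such as "ø" or "æ", pass through unchanged and would need an explicit mapping.

```python
import unicodedata


def canonical_term(word: str) -> str:
    """Fold a word to a canonical form (hypothetical helper).

    Assumption: "canonical" here means case-folded with combining
    accents removed. The same function must be applied both when
    generating terms at index time and to user input at query time.
    """
    # Case-fold first (e.g. "ß" becomes "ss"), then decompose so that
    # accents become separate combining characters.
    decomposed = unicodedata.normalize("NFKD", word.casefold())
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def sort_terms(terms):
    """Sort already-gathered terms outside the database (hypothetical helper).

    The folded form is the primary key; the raw term breaks ties so
    the result is deterministic.
    """
    return sorted(terms, key=lambda t: (canonical_term(t), t))


if __name__ == "__main__":
    # Index time and query time produce the same term.
    print(canonical_term("Žilina"), canonical_term("zilina"))
    # Terms gathered in natural order, re-sorted externally.
    print(sort_terms(["zebra", "Žilina", "apple", "Ähre"]))
```

Because both the indexer and the query front end call the same folding function, a user typing "zilina" and a document containing "Žilina" end up with the same term, so no "OR" query is needed. For a true locale-aware ordering, the sort key could be swapped for something like locale.strxfrm or an ICU collator, at the cost of depending on the locales installed on the machine.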