On Fri, Mar 14, 2008 at 11:14:56PM +0000, Colin Bell wrote:
> I was wondering if anyone ever came across a problem I seem to be
> having. I'm indexing in text files using some basic code written in
> C++. The text files may or may not be in UTF-8, ISO 8859-1 or possibly
> (but very rarely) even some other format - I have no way of knowing.
There are ways to detect the character set of a file, though not always
100% reliably.
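One cheap heuristic is to check whether the bytes are well-formed UTF-8: text in ISO-8859-1 that contains accented characters will almost never pass by accident. Here is a minimal sketch of such a check using only the standard library. The function name looks_like_utf8 is made up for illustration, and this is not code from Xapian. Note that pure ASCII passes too, and a failed check only tells you the file is not UTF-8, not which other encoding it uses.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8. Rejects overlong forms,
// UTF-16 surrogates, and code points above U+10FFFF.
// (Hypothetical helper for illustration; not part of Xapian.)
bool looks_like_utf8(const std::string& s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        if (b < 0x80) {  // plain ASCII
            ++i;
            continue;
        }
        std::size_t len;
        unsigned long cp;
        if (b >= 0xC2 && b <= 0xDF) {
            len = 2; cp = b & 0x1F;
        } else if (b >= 0xE0 && b <= 0xEF) {
            len = 3; cp = b & 0x0F;
        } else if (b >= 0xF0 && b <= 0xF4) {
            len = 4; cp = b & 0x07;
        } else {
            return false;  // stray continuation byte or invalid lead byte
        }
        if (i + len > n) return false;  // sequence truncated at end of input
        for (std::size_t k = 1; k < len; ++k) {
            if ((p[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if (len == 3 && cp < 0x800) return false;  // overlong 3-byte form
        if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}
```

For example, the ISO-8859-1 bytes "caf\xE9" fail this check, while the UTF-8 bytes "caf\xC3\xA9" pass.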
> Question is, does Xapian convert non-UTF-8 characters when it stores
> the document? I think I read that UTF-8 is the default encoding for
> Xapian, which is exactly what I am after.
Most of Xapian treats things as opaque data. The classes which need
to know are Xapian::Stem, Xapian::QueryParser, and
Xapian::TermGenerator. The UTF-8 parsing used by the latter two will
treat invalid sequences as if they were ISO-8859-1, which for
real-world examples will almost always do the right thing when fed
ISO-8859-1. Xapian::Stem uses Snowball's UTF-8 parsing code currently -
I'm not sure how that handles invalid sequences.
> The reason I'm asking is that I am getting some seriously corrupted
> characters in the index. When they are displayed on Tomcat I get a
> "sun.io.MalformedInputException" when trying to display the search
> results. I have set the page's charset to UTF-8 and apparently this
> error is thrown when the streamreader detects characters that are
> not proper UTF-8 characters.
If you set document data, document values, or directly add terms (using
Document::add_posting() or Document::add_term()) then you'll get back
what you put in verbatim. So if you pass in something which is invalid
UTF-8, it will still be invalid.
If you pass data through Xapian::Utf8Iterator before doing anything with
it, then this will fix bad UTF-8. This is essentially what omindex
does to deal with this problem.
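To illustrate the idea without depending on Xapian, here is a standalone sketch of that kind of clean-up: decode the input as UTF-8, treat any byte that doesn't start a valid sequence as an ISO-8859-1 character, and re-encode everything as UTF-8. The names append_utf8, decode_one and to_valid_utf8 are made up for this example. This is not Xapian's implementation, and the exact rules Xapian::Utf8Iterator applies to invalid input may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Append code point `cp` to `out`, encoded as UTF-8.
static void append_utf8(std::string& out, unsigned long cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Try to decode one well-formed UTF-8 sequence from p[0..n), n >= 1.
// Returns the number of bytes consumed, or 0 if the bytes are invalid.
static std::size_t decode_one(const unsigned char* p, std::size_t n,
                              unsigned long& cp) {
    const unsigned char b = p[0];
    std::size_t len;
    if (b < 0x80) { cp = b; return 1; }
    if (b >= 0xC2 && b <= 0xDF) { len = 2; cp = b & 0x1F; }
    else if (b >= 0xE0 && b <= 0xEF) { len = 3; cp = b & 0x0F; }
    else if (b >= 0xF0 && b <= 0xF4) { len = 4; cp = b & 0x07; }
    else return 0;  // stray continuation byte or invalid lead byte
    if (len > n) return 0;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
        if ((p[k] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (p[k] & 0x3F);
    }
    if (len == 3 && cp < 0x800) return 0;  // overlong
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF)) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  // surrogate
    return len;
}

// Return a valid UTF-8 copy of `in`. Bytes which are not part of a
// well-formed UTF-8 sequence are interpreted as ISO-8859-1.
std::string to_valid_utf8(const std::string& in) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(in.data());
    const std::size_t n = in.size();
    std::string out;
    out.reserve(n);
    std::size_t i = 0;
    while (i < n) {
        unsigned long cp;
        std::size_t len = decode_one(p + i, n - i, cp);
        if (len == 0) {  // invalid here: fall back to ISO-8859-1
            cp = p[i];
            len = 1;
        }
        append_utf8(out, cp);
        i += len;
    }
    return out;
}
```

You would run text through something like to_valid_utf8() before calling Document::set_data() or Document::add_term(), so that whatever comes back out of the index is always valid UTF-8. Input which is already valid UTF-8 comes out unchanged.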
Cheers,
Olly