I see that TermGenerator::index_text() can take a Utf8Iterator argument, but Document::add_term() etc simply take a std::string. Are std::string arguments presumed to be UTF8 strings? If "sometimes," where or where not? Apologies if I've missed the docs on this...
James Aylett
2011-Nov-14 12:20 UTC
[Xapian-discuss] std::string arguments presumed to be UTF8?
On 14 Nov 2011, at 11:54, Liam wrote:

> I see that TermGenerator::index_text() can take a Utf8Iterator argument,
> but Document::add_term() etc simply take a std::string.
>
> Are std::string arguments presumed to be UTF8 strings? If "sometimes,"
> where or where not?

I believe the situation is as follows:

* std::string should never be presumed to be UTF8. Terms, for instance, are just treated internally as byte arrays (but are commonly used to store strings, hence using std::string for convenience in C++).
* The TermGenerator, and a few other pieces of Xapian, *do* act on UTF8, since they operate at a level that is dealing with actual characters, so there has to be a defined encoding.

Unfortunately, this isn't terribly clear from the documentation.

J

-- 
James Aylett
talktorex.co.uk - xapian.org - devfort.com
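
To illustrate the distinction drawn above between a std::string treated as a byte array and a std::string interpreted as UTF8, here is a minimal sketch in plain C++. It deliberately makes no Xapian calls; the helper names byte_length and utf8_code_points are hypothetical and exist only for this example, and the code point count assumes the input is well-formed UTF8.

```cpp
#include <cstddef>
#include <string>

// What a byte-oriented consumer sees: the raw number of bytes,
// with no encoding assumed.
std::size_t byte_length(const std::string& s) {
    return s.size();
}

// What a character-oriented consumer sees, ASSUMING s is well-formed UTF-8:
// count every byte that is not a continuation byte (bit pattern 10xxxxxx).
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

For the UTF8 bytes "caf\xC3\xA9" the two functions disagree (5 bytes, 4 code points), while for the Latin-1 bytes "caf\xE9" the byte length is 4. Both are perfectly valid std::string values, which is why any code that deals in characters rather than bytes has to fix an encoding.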