David Spencer
2008-Sep-11 06:21 UTC
[Xapian-discuss] Semantics of terms - do they have to be C style, '\0' terminated strings?
Briefly: my goal is to index a series of 64-bit numeric values with every document. I see that Document::add_term takes a C++ string argument. I believe that C++ strings allow bytes with a value of 0x0 (or '\0'), but I'm pretty sure everything after the first such byte is lost if you then call string.c_str() and treat the result as a C string.

Is it defined anywhere whether the string you pass to add_term has to be "C style" or not?

At a glance the code base doesn't call c_str() much and the cases I saw had to do with filenames, so this might be OK. I just wanted to check whether this is, say, guaranteed by the contract or whether it is just too dubious.

thx
Dave
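To make the concern concrete, here is a minimal sketch using only the standard library. The helper names (encode_u64, decode_u64, c_style_length) are hypothetical and not part of Xapian. It packs a 64-bit value into an 8-byte std::string and shows that the string keeps all 8 bytes, while anything that reads it through c_str() as a C string stops at the first zero byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical helper: pack a 64-bit value into an 8-byte string,
// most significant byte first (big-endian).
std::string encode_u64(std::uint64_t value) {
    std::string term(8, '\0');
    for (int i = 7; i >= 0; --i) {
        term[i] = static_cast<char>(value & 0xff);
        value >>= 8;
    }
    return term;
}

// Hypothetical helper: inverse of encode_u64.
std::uint64_t decode_u64(const std::string& term) {
    assert(term.size() == 8);
    std::uint64_t value = 0;
    for (unsigned char byte : term) {
        value = (value << 8) | byte;
    }
    return value;
}

// Length that a C-style consumer would see: strlen stops at the
// first zero byte, so embedded zeros truncate the term.
std::size_t c_style_length(const std::string& term) {
    return std::strlen(term.c_str());
}
```

For example, encode_u64(1) has size() == 8, but c_style_length reports 0 because the first byte is already zero. Assuming the API keeps the full std::string, a call such as doc.add_term(encode_u64(id)) would store all 8 bytes. Big-endian order is an arbitrary choice here; it has the side effect that byte-wise comparison of the terms matches numeric order.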
Olly Betts
2008-Sep-11 06:48 UTC
[Xapian-discuss] Semantics of terms - do they have to be C style, '\0' terminated strings?
On Wed, Sep 10, 2008 at 11:21:53PM -0700, David Spencer wrote:
> Is it defined anywhere whether the string you pass to add_term has to be "C
> style" or not?

All handling of "data" strings in the C++ API is zero-byte clean. The C# and Java bindings aren't currently though. And the quartz and flint backends internally do some messing around with zero bytes in terms, which essentially means that each zero byte in a term counts twice towards the term length limit at the moment. But the limit is a bit over 240, so that's rarely an issue.

> At a glance the code base doesn't call c_str() much and the cases I saw had
> to do with filenames, so this might be OK,

Yes, filenames can't contain zero bytes, and OS/library calls take nul-terminated strings as const char * (or similar), so calling c_str() in such cases isn't a problem.

Cheers,
Olly
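As a rough illustration of the length accounting described above, the following sketch counts each zero byte twice. The function names are hypothetical and the limit is passed in by the caller, since the exact figure is only given as "a bit over 240"; this is not how the backends are implemented, just the arithmetic.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: length of a term when every zero byte
// counts twice, as described for the quartz and flint backends.
std::size_t effective_term_length(const std::string& term) {
    const std::size_t zeros = static_cast<std::size_t>(
        std::count(term.begin(), term.end(), '\0'));
    return term.size() + zeros;
}

// Hypothetical helper: check a term against a caller-supplied limit.
bool fits_term_limit(const std::string& term, std::size_t limit) {
    return effective_term_length(term) <= limit;
}
```

Under this accounting, an 8-byte term holding a 64-bit value costs at most 16 bytes (when every byte is zero), which is well within a limit of around 240.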