Peter Karman
2005-Dec-07 03:01 UTC
[Xapian-discuss] word context, numeric values, and characters
I have a few assumptions about Xapian's features that I'd like confirmed. I've read the API docs, some of the example .cc files, the mailing list and the wiki, and want to know if I'm understanding correctly.

1. Contextual data. The convention for storing contextual information about words (i.e., which tag they appear in in HTML, or which field/column in a db) is to prefix the term with a string, and then map that string using add_prefix() or add_boolean_prefix() when constructing a query.

For example, the html "<title>foo</title>" could be indexed as "Tfoo". A query for "title:foo" would be parsed with add_prefix("title","T") and that would generate a match for "Tfoo".

Am I understanding that process correctly?

2. add_value() and set_data() require char* arguments; there is no support for an int or other numeric value. How then does sorting work for numeric values?

3. add_term() and add_posting() do not parse the passed char* string at all; it is indexed as-is. Any parsing (stemming, splitting into words on non-word characters) must happen before adding to the db. indextext.cc is one example in the omega package of text parsing prior to adding to the db. Correct?

Thanks in advance for clarifying.

pek

--
Peter Karman . http://peknet.com/ . peter@peknet.com
Olly Betts
2005-Dec-07 03:11 UTC
[Xapian-discuss] word context, numeric values, and characters
On Tue, Dec 06, 2005 at 09:01:10PM -0600, Peter Karman wrote:
> For example, the html "<title>foo</title>" could be indexed as "Tfoo". A
> query for "title:foo" would be parsed with add_prefix("title","T") and that
> would generate a match for "Tfoo".
>
> Am I understanding that process correctly?

Yes, that's spot on.

> 2. add_value() and set_data() require char* arguments; there is no support
> for an int or other numeric value. How then does sorting work for numeric
> values?

(Actually, they take std::string rather than char*, but you can pass a char* or const char* and C++ will convert it automatically...)

If you want to set a numeric value, you'll need to convert it to a string first (although a convenience overload which handled this for you might be handy, particularly for add_value).

Currently numeric sorting isn't supported directly, though if you left-pad the values with zeros or spaces to a fixed width you can get the same effect with a string sort. The plan is to allow a user-specified sort functor (similar in style to Xapian::MatchDecider).

> 3. add_term() and add_posting() do not parse the passed char* string at
> all; it is indexed as-is. Any parsing (stemming, splitting into words on
> non-word characters) must happen before adding to the db. indextext.cc is
> one example in the omega package of text parsing prior to adding to the db.

Yes.

Cheers,
Olly
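
The left-padding trick mentioned above can be sketched in plain C++ as follows. This is only an illustration, not code from Xapian or omega: pad_number() and sorted_encoded() are hypothetical helper names, the default width of 10 is an arbitrary assumption, and only non-negative integers are handled.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Xapian): encode a non-negative integer
// as a fixed-width, zero-padded decimal string, so that a plain string
// comparison orders the encoded values the same way as the numbers.
std::string pad_number(std::uint64_t n, std::size_t width = 10) {
    std::string digits = std::to_string(n);
    if (digits.size() > width) {
        // A value wider than the field would break the ordering guarantee.
        throw std::length_error("value does not fit in the chosen width");
    }
    return std::string(width - digits.size(), '0') + digits;
}

// Encode a list of numbers and sort the resulting strings lexicographically,
// which is the kind of ordering a string-based sort would apply.
std::vector<std::string> sorted_encoded(const std::vector<std::uint64_t>& values,
                                        std::size_t width = 10) {
    std::vector<std::string> out;
    out.reserve(values.size());
    for (std::uint64_t v : values) {
        out.push_back(pad_number(v, width));
    }
    std::sort(out.begin(), out.end());
    return out;
}
```

Without padding, "9" sorts after "10" as a string; with a width of 4 the encoded forms "0009" and "0010" sort in numeric order. The padded string is what one would then pass as the string argument of add_value(). The width has to be chosen large enough for the largest value expected, and negative or fractional numbers would need a different encoding.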