Jim Lynch
2004-Sep-15  20:49 UTC
[Xapian-discuss] What are the separators that scriptindex uses?
I've been asked to find out what scriptindex considers separators. Whitespace, obviously. What is done with special characters?

The reason for the question is that my data contains some strange stuff, like output from core dumps, source code for various programming languages like assembly, part numbers (not just numbers, of course) and other weird collections of funny characters. Fortunately no Unicode just yet. I'm trying to get a feel for how difficult it's going to be to search for this stuff and what the rules might be.

Also, can I assume omega uses the same set of separators? For instance, if I look for something like PARAM_DEV-445*Foggy, will it be found? Will it be multiple terms?

BTW, how are phrase searches these days?

Thanks,
Jim.
Olly Betts
2004-Sep-17  10:48 UTC
[Xapian-discuss] What are the separators that scriptindex uses?
On Wed, Sep 15, 2004 at 03:48:54PM -0400, Jim Lynch wrote:
> I've been asked to find out what are considered separators for
> scriptindex?

Essentially, non-alphanumerics. But there's special handling for things like "N.A.T.O.", "C++", and "AT&T".

> The reason for the question is that my data contains some
> strange stuff, like output from core dumps, source code for various
> programming languages like assembly, part numbers (not just numbers, of
> course) and other weird collections of funny characters. Fortunately no
> unicode just yet. I'm trying to get a feel for how difficult it's going
> to be to search for this stuff and what the rules might be.

The following characters are treated as "phrase makers" by the QueryParser:

_/\@'*.-

so for example an email address is indexed as separate words, and a search for it triggers a phrase search.

> Also can I assume omega uses the same set of separators?

Pretty much. The indexer and QueryParser are designed to work together.

> For instance if I look for something like PARAM_DEV-445*Foggy, will it
> be found? Will it be multiple terms?

It'll be indexed as 4 terms, and searched for as a phrase of those 4 terms.

> BTW, how are phrase searches these days?

Why do you ask? Did you have a problem with them before? As far as I know they work correctly. They're inherently more expensive than non-phrase searches, and there are a couple of bugzilla entries for related enhancements: one to improve term AND "some phrase"; the other to reduce the number of cases where a phrase query is required (e.g. "e-mail" uses a phrase at present).

Cheers,
Olly
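
For anyone reading this thread later, here is a rough Python sketch of the rule described in the reply above. It is only an illustration, not Xapian's code: the helper names split_terms and is_phrase_query are made up, it assumes plain ASCII input, and it ignores the special handling for things like "N.A.T.O.", "C++" and "AT&T", as well as any case folding or stemming the real indexer may apply.

```python
import re

# The characters the reply lists as QueryParser "phrase makers".
PHRASE_MAKERS = set("_/\\@'*.-")


def split_terms(text):
    """Split text on runs of non-alphanumeric ASCII characters.

    Rough approximation of "separators are essentially non-alphanumerics";
    the special cases mentioned in the thread are deliberately not handled.
    """
    return [term for term in re.split(r"[^A-Za-z0-9]+", text) if term]


def is_phrase_query(word):
    """Guess whether a single query word would be searched as a phrase.

    Assumption: a word becomes a phrase when it splits into several terms
    and every character joining them is one of the phrase makers.
    """
    separators = re.findall(r"[^A-Za-z0-9]", word)
    return len(split_terms(word)) > 1 and all(
        ch in PHRASE_MAKERS for ch in separators
    )


if __name__ == "__main__":
    for word in ("PARAM_DEV-445*Foggy", "jim@example.com", "Foggy"):
        print(word, split_terms(word), is_phrase_query(word))
```

Under these assumptions, PARAM_DEV-445*Foggy splits into the four terms PARAM, DEV, 445 and Foggy, and is flagged as a phrase because _, - and * are all in the phrase-maker list.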