Eric Wong
2018-Jul-19 20:32 UTC
choosing between probabilistic and boolean prefixes for terms
Hi all,

public-inbox allows searching for git blob names (e.g. "badc0ffee")
in patches. Initially, I chose to use add_prefix for probabilistic
terms, since I assumed it could be a superset of what boolean
searching offered. Unfortunately, it doesn't seem to be the case
because stemming is interfering.

So switching to boolean filtering seems to work; and it is
fine for mechanical searches I plan on doing:

  https://public-inbox.org/meta/20180716040734.30104-1-e at 80x24.org/

Now I wonder, is there a way to get the best-of-both-worlds so
a human can still use wildcards?

public-inbox also allows searches on pathnames, and maybe that
should use boolean filtering, too...

My setup for the query parser isn't anything special:

    our $LANG = 'english';

    sub stemmer { Search::Xapian::Stem->new($LANG) }

    sub qp {
        my ($self) = @_;

        my $qp = $self->{query_parser};
        return $qp if $qp;

        # new parser
        $qp = Search::Xapian::QueryParser->new;
        $qp->set_default_op(OP_AND);
        $qp->set_database($self->{xdb});
        $qp->set_stemmer($self->stemmer);
        $qp->set_stemming_strategy(STEM_SOME);
        $qp->set_max_wildcard_expansion(100);
        $qp->add_valuerangeprocessor(
            Search::Xapian::NumberValueRangeProcessor->new(YYYYMMDD, 'd:'));
        $qp->add_valuerangeprocessor(
            Search::Xapian::NumberValueRangeProcessor->new(DT, 'dt:'));

In any case, all the code is available via:

    git clone https://public-inbox.org/public-inbox
Olly Betts
2018-Jul-25 04:45 UTC
choosing between probabilistic and boolean prefixes for terms
On Thu, Jul 19, 2018 at 08:32:23PM +0000, Eric Wong wrote:
> public-inbox allows searching for git blob names (e.g. "badc0ffee")
> in patches. Initially, I chose to use add_prefix for probabilistic
> terms, since I assumed it could be a superset of what boolean
> searching offered. Unfortunately, it doesn't seem to be the case
> because stemming is interfering.
>
> So switching to boolean filtering seems to work; and it is
> fine for mechanical searches I plan on doing:
>
> https://public-inbox.org/meta/20180716040734.30104-1-e at 80x24.org/
>
> Now I wonder, is there a way to get the best-of-both-worlds so
> a human can still use wildcards?

I struggle to think of a situation in which one would want to
wildcard search for a git sha...

> public-inbox also allows searches on pathnames, and maybe that
> should use boolean filtering, too...

...but for a pathname that's more believable.

Currently you can't specify a different stemmer (or stemming mode) per
prefix. Perhaps that should be supported - there are common cases such
as "author" fields where the stemming can be harmful, but currently
you'd have to have a separate text entry field for the author search to
support that directly.

I think you could use add_prefix() with a FieldProcessor object since
that gets passed the term without stemming, but FieldProcessor isn't
wrapped by Search::Xapian (the SWIG-based Perl bindings do wrap it, but
the API isn't 100% the same as Search::Xapian's so you'd need to test
and probably adjust some of your code to port to that - it is the future
for using Xapian from Perl, but I've been hoping to sort out the
incompatibilities before pushing it more).

There isn't currently a flag to enable wildcards for boolean terms but
that could be supported I think. It mostly isn't by default because it
seems less useful, and because it's assumed you could have any character
in a boolean term and "*" being special works against that. Some of the
options to limit expansion don't really make sense for a boolean filter,
but I guess that's a case of "well don't do that then".

Cheers,
    Olly
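
Since there is no flag to enable wildcards for boolean terms, one possible workaround is for the application to expand a trailing "*" itself and OR the resulting terms into a boolean filter. The following is only a rough sketch of that idea in plain Python, not Xapian's API and not public-inbox's code: the function name expand_wildcard, the "XPATH" term prefix and the sample pathnames are all hypothetical, and a real application would read the candidate terms from the database's term list rather than from an in-memory list.

```python
from bisect import bisect_left


def expand_wildcard(sorted_terms, prefix, pattern, max_expansion=100):
    """Expand a trailing-'*' pattern against a sorted list of boolean terms.

    sorted_terms  -- all indexed terms, in sorted order (hypothetical input)
    prefix        -- term prefix for the field, e.g. "XPATH" (hypothetical)
    pattern       -- what the user typed, e.g. "lib/search*"
    max_expansion -- cap on the number of expanded terms
    """
    if not pattern.endswith("*"):
        # No wildcard: treat the input as one exact boolean term.
        return [prefix + pattern]

    lead = prefix + pattern[:-1]
    matches = []
    # Terms sharing a leading string are contiguous in a sorted list,
    # so start at the first candidate and walk until the prefix stops matching.
    i = bisect_left(sorted_terms, lead)
    while i < len(sorted_terms) and sorted_terms[i].startswith(lead):
        matches.append(sorted_terms[i])
        if len(matches) > max_expansion:
            raise ValueError("wildcard %r expands to more than %d terms"
                             % (pattern, max_expansion))
        i += 1
    return matches


if __name__ == "__main__":
    # Hypothetical pathname terms, as they might be indexed without stemming.
    terms = sorted([
        "XPATHlib/search.pm",
        "XPATHlib/searchidx.pm",
        "XPATHscript/index",
        "XPATHt/search.t",
    ])
    print(expand_wildcard(terms, "XPATH", "lib/search*"))
```

Only a trailing "*" is treated as special here, so any other character can still appear in a term; a literal trailing "*" in a term would need some escaping convention, which this sketch does not attempt.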