Alexander Lind wrote:> I've been thinking about implementing some faceting support in my
Xapian
> classes.
> Has anyone implemented anything like this before?
I (and Olly) have actually been working on exactly this - there is
support for doing this in SVN HEAD, but we may still tweak the API for
it a bit more before the next release.
Basically, we've added a "matchspy" interface, which works very
like a
match decider, except that Xapian guarantees to call the matchspy for
every document which matches the query and is "considered for the
MSet"
(various optimisations mean that this guarantee is not provided for
MatchDeciders - in future, we plan to call them as late as possible in
the match process, so they won't see quite a few of the matching
documents). Note that not all the matching documents will always be
"considered for the MSet" - this is a little complicated, but
basically
you can guarantee that at least as many documents as are specified in
the "checkatleast" parameter to get_mset() will be considered (if
there
actually that many matching documents), but more documents may also be
considered. Thus, a matchspy can easily be set to be called on a much
larger number of matches than are returned in the mset, but a limit on
the number of matches passed to the matchspy also applies, to avoid slow
processing for very large result sets.
We also provide a couple of standard MatchSpy implementations (defined
in the new header file "matchspy.h"). One of these counts up the
occurrences of values in specified slots in the documents which are
presented to it: if the facets are stored in the value slots, this gives
the matching facets.
We've also implemented some code such that a facet can contain arbitrary
numbers (serialised to strings in an appropriate way), and some code
which will pick appropriate ranges to use to divide these numbers into a
managable number of groups. This can be used to allow a numerical value
(eg, price) to be used as a facet.
> It seems like Omega does not support faceted searches, is this correct?
That is (currently) correct. I don't have plans to add it, but it
wouldn't be terribly hard to add using the support which will be in 1.0.3.
> What I would like to do specifically, is issue a query like
> "bicycle facet:color"
>
> Alternative 1:
> 1. Perform main search (for bicycle).
> 2. Perform as many searches on the returned mset as there are facets (in
> this case all the colors).
> 3. Store the count for each color, and return it together with the
> search results from point 1.
>
> Alternative 2:
> 1. Perform main search (for bicycle).
> 2. Iterate through the entire mset, counting occurrences of each color,
> and storing the counts.
> 3. Return counts and results from point 1.
>
> Can anyone say if there are any pitfalls with either of these
> strategies?
Alternative 1 doesn't scale to facets with very many possible values: a
search needs to be performed for every possible value.
Alternative 2 will almost work, but requires the calculated mset to be
very large - we normally only calculate a small subset of the mset (eg,
top 10 documents), but we'd like the counts for the facet values to be
relevant to all the matching documents, not just the top few matches.
The matchspy approach has neither of these pitfalls.
--
Richard