Hi, I'm trying to implement something like Google's "similar results omitted" and I'm not sure how to go about it.

Specifically, what I have are a number of sets of documents, and I'd like to get the best documents from *each* set. So, making up some relevance scores, say I have documents from set A with scores 10, 9, 8, 7 and from set B with scores 6, 5. If I am only to return 3 documents to the user, I'd like to return 10, 9, and 6 - i.e. I'd like to filter out the similar documents 8 and 7 from A.

What's the best way to go about this? I've come up with a few ideas:

One option is to do multiple smaller searches - one per set - but this seems inefficient. Is the cost of a search weighted towards setting up the search, or is it more affected by the number of desired results? What happens when I'm searching 20 or more different sets?

Another option would be to use MatchDecider and only "OK" the first few documents from each set. Will documents be tested by the MatchDecider in any particular order - e.g. highest relevance first?

A third option is to do a standard relevancy search over all sets. If one set dominates the result, I can search again filtering out that set and merge the results, repeating if necessary. Again the question here is which part of the search is most expensive.

Finally, the fourth option is to just return way more documents than I need and then filter them manually.

Do any of these seem good? Or am I missing an easier way to do this?

Many thanks in advance!

- Robby
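[The desired behaviour above can be sketched in plain Python - the function name, the set labels, and the cap of two documents per set are all illustrative, not Xapian API:]

```python
def top_n_capped(scored_docs, n, max_per_set):
    """Pick the n highest-scoring documents, keeping at most
    max_per_set documents from any one set.

    scored_docs: iterable of (score, set_id) pairs.
    """
    taken = {}   # set_id -> how many docs kept so far
    result = []
    # Walk the documents from highest score down.
    for score, set_id in sorted(scored_docs, reverse=True):
        if taken.get(set_id, 0) < max_per_set:
            taken[set_id] = taken.get(set_id, 0) + 1
            result.append(score)
            if len(result) == n:
                break
    return result

# Set A has scores 10, 9, 8, 7; set B has 6, 5.
docs = [(10, "A"), (9, "A"), (8, "A"), (7, "A"), (6, "B"), (5, "B")]
print(top_n_capped(docs, 3, 2))  # [10, 9, 6]
```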
On Tue, Sep 12, 2006 at 02:35:42PM -0500, Robby Walker wrote:
> I'm trying to implement something like Google's "similar results
> omitted" and I'm not sure how to go about it.
>
> One option is to do multiple smaller searches - one per set - this
> seems inefficient? Is the cost of search weighted towards setting up
> the search or is it more affected by the number of desired results?

For a large system, I/O will dominate. Once the blocks involved are in the cache, the search is much faster - try repeating a search which was slow and you'll generally find it's much quicker. So running two searches for similar things will typically be quicker than twice the time to do one, and the worst-case performance is definitely much better. Plus, if this is a web search, you only incur the browser/server round-trip overhead once, so the perceived time to the end user is certainly less than doubled.

> What happens when I'm searching 20 or more different sets?

It's hard to give a general answer - it will depend on many things, not least what an acceptable search time is for your users.

> Another option would be to use MatchDecider and only "OK" the first
> few documents from each set. Is it true that documents will be tested
> by MatchDecider in any sort of order - e.g. highest relevancy first?

The order is essentially unspecified. It's probably currently ascending docid order, but even that's not something you should rely on - I plan to implement a new optimisation to delay applying the MatchDecider so we end up testing fewer documents, and that will change the order.

> A third option is to do a standard relevancy search over all sets. If
> one set dominates this result, I can search again filtering out that
> set and merge the results. Repeat if necessary. Again the question
> here is what part of search is most expensive.
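[The one-search-per-set option can be sketched in plain Python - the ranked lists stand in for per-set result sets, and the function and parameter names are illustrative:]

```python
import heapq

def merge_per_set_searches(per_set_results, n, per_set_limit):
    """Simulate one small search per set: take only the first
    per_set_limit hits from each set's ranked list, then merge
    the candidates and return the n best overall scores."""
    candidates = []
    for ranked in per_set_results:
        # Each per-set "search" only needs its first few hits,
        # so each individual search stays cheap.
        candidates.extend(ranked[:per_set_limit])
    return heapq.nlargest(n, candidates)

# Two sets, each already ranked best-first.
set_a = [10, 9, 8, 7]
set_b = [6, 5]
print(merge_per_set_searches([set_a, set_b], 3, 2))  # [10, 9, 6]
```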
> Finally, the fourth option is to just return way more documents than I
> need and then go filter it manually.

The main problem with the last option is that many of the matcher's optimisations are less effective when you ask for a larger MSet. But it is perhaps cheaper than doing N searches.

> Any of these seem good? Or am I missing an easier way to do this?

My thought would be to enhance the collapse feature to allow collapsing to leave N documents with an identical collapse key instead of just one. It's probably a comparable amount of work to trying to do it outside of Xapian, and you get a much more satisfactory solution.

Cheers,
    Olly
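[The proposed collapse behaviour - keep the first N documents for each collapse key rather than just one - could be approximated outside Xapian like this; plain Python with illustrative names, whereas the real collapse feature runs inside the matcher:]

```python
from collections import defaultdict

def collapse(ranked_docs, collapse_max):
    """Walk a ranked list (best first) and keep at most
    collapse_max documents per collapse key, dropping the rest.

    ranked_docs: (docid, collapse_key) pairs, best first.
    """
    seen = defaultdict(int)  # collapse_key -> docs kept so far
    kept = []
    for docid, key in ranked_docs:
        if seen[key] < collapse_max:
            seen[key] += 1
            kept.append(docid)
    return kept

# Docs 1-4 share collapse key "A", docs 5-6 share key "B".
ranked = [(1, "A"), (2, "A"), (3, "A"), (4, "A"), (5, "B"), (6, "B")]
print(collapse(ranked, 2))  # [1, 2, 5, 6]
```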
On Tue, Sep 12, 2006 at 02:35:42PM -0500, Robby Walker wrote:
> Specifically, what I have are a number of sets of documents, and I'd
> like to get the best documents from *each* set. So, making up some
> relevance scores, say I have documents from set A with scores 10,9,8,7
> and from set B with score 6,5. If I am only to return 3 documents to
> the user I'd like to return 10,9, and 6. i.e. I'd like to filter out
> the similar documents 8 and 7 from A.

How are you choosing the numbers to return from each set? I can think of a variety of approaches which would result in that match set, but I can't be sure which one you mean :-)

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                    xapian.org
  james@tartarus.org                                  uncertaintydivision.org