Hi, I'm trying to implement something like Google's "similar results omitted" and I'm not sure how to go about it.

Specifically, what I have are a number of sets of documents, and I'd like to get the best documents from *each* set. So, making up some relevance scores, say I have documents from set A with scores 10, 9, 8, 7 and from set B with scores 6, 5. If I am only to return 3 documents to the user, I'd like to return 10, 9, and 6 - i.e. I'd like to filter out the similar documents 8 and 7 from A.

What's the best way to go about this? I've come up with a few ideas:

One option is to do multiple smaller searches - one per set - but this seems inefficient. Is the cost of a search weighted towards setting up the search, or is it more affected by the number of desired results? What happens when I'm searching 20 or more different sets?

Another option would be to use MatchDecider and only "OK" the first few documents from each set. Will documents be tested by the MatchDecider in any particular order - e.g. highest relevance first?

A third option is to do a standard relevancy search over all sets. If one set dominates the result, I can search again filtering out that set and merge the results, repeating if necessary. Again the question here is which part of the search is most expensive.

Finally, the fourth option is to just return way more documents than I need and then filter them manually.

Do any of these seem good? Or am I missing an easier way to do this?

Many thanks in advance!

- Robby
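[The desired behaviour above can be sketched in plain Python - the function name, the set labels, and the cap of two documents per set are all illustrative, not Xapian API:]

```python
def top_n_capped(scored_docs, n, max_per_set):
    """Pick the n highest-scoring documents, keeping at most
    max_per_set documents from any one set.

    scored_docs: iterable of (score, set_id) pairs.
    """
    taken = {}   # set_id -> how many docs kept so far
    result = []
    # Walk the documents from highest score down.
    for score, set_id in sorted(scored_docs, reverse=True):
        if taken.get(set_id, 0) < max_per_set:
            taken[set_id] = taken.get(set_id, 0) + 1
            result.append(score)
            if len(result) == n:
                break
    return result

# Set A has scores 10, 9, 8, 7; set B has 6, 5.
docs = [(10, "A"), (9, "A"), (8, "A"), (7, "A"), (6, "B"), (5, "B")]
print(top_n_capped(docs, 3, 2))  # [10, 9, 6]
```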
On Tue, Sep 12, 2006 at 02:35:42PM -0500, Robby Walker wrote:
> I'm trying to implement something like Google's "similar results
> omitted" and I'm not sure how to go about it.
>
> One option is to do multiple smaller searches - one per set - this
> seems inefficient? Is the cost of search weighted towards setting up
> the search or is it more affected by the number of desired results?

For a large system, I/O will dominate. Once the blocks involved are in the cache, the search is much faster - try repeating a search which was slow and you'll generally find it's much quicker. So running two searches for similar things will typically be quicker than twice the time to do one, and the worst-case performance is definitely much better. Plus, if this is a web search, you only incur the browser/server round-trip overhead once, so the perceived time to the end user is certainly less than doubled.

> What happens when I'm searching 20 or more different sets?

It's hard to give a general answer - it will depend on many things, not least what an acceptable search time is for your users.

> Another option would be to use MatchDecider and only "OK" the first
> few documents from each set. Is it true that documents will be tested
> by MatchDecider in any sort of order - e.g. highest relevancy first?

The order is essentially unspecified. It's probably currently ascending docid order, but even that's not something you should rely on - I plan to implement a new optimisation to delay applying the MatchDecider so we end up testing fewer documents, and that will change the order.

> A third option is to do a standard relevancy search over all sets. If
> one set dominates this result, I can search again filtering out that
> set and merge the results. Repeat if necessary. Again the question
> here is what part of search is most expensive.
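[The one-search-per-set option can be sketched in plain Python - the ranked lists stand in for per-set result sets, and the function and parameter names are illustrative:]

```python
import heapq

def merge_per_set_searches(per_set_results, n, per_set_limit):
    """Simulate one small search per set: take only the first
    per_set_limit hits from each set's ranked list, then merge
    the candidates and return the n best overall scores."""
    candidates = []
    for ranked in per_set_results:
        # Each per-set "search" only needs its first few hits,
        # so each individual search stays cheap.
        candidates.extend(ranked[:per_set_limit])
    return heapq.nlargest(n, candidates)

# Two sets, each already ranked best-first.
set_a = [10, 9, 8, 7]
set_b = [6, 5]
print(merge_per_set_searches([set_a, set_b], 3, 2))  # [10, 9, 6]
```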
> Finally, the fourth option is to just return way more documents than I
> need and then go filter it manually.

The main problem with the last option is that many of the matcher's optimisations are less effective when you ask for a larger MSet. But it is perhaps cheaper than doing N searches.

> Any of these seem good? Or am I missing an easier way to do this?

My thought would be to enhance the collapse feature to allow collapsing to leave N documents with an identical collapse key instead of just one. It's probably a comparable amount of work to trying to do it outside of Xapian, and you get a much more satisfactory solution.

Cheers,
    Olly
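[The proposed collapse behaviour - keep the first N documents for each collapse key rather than just one - could be approximated outside Xapian like this; plain Python with illustrative names, whereas the real collapse feature runs inside the matcher:]

```python
from collections import defaultdict

def collapse(ranked_docs, collapse_max):
    """Walk a ranked list (best first) and keep at most
    collapse_max documents per collapse key, dropping the rest.

    ranked_docs: (docid, collapse_key) pairs, best first.
    """
    seen = defaultdict(int)  # collapse_key -> docs kept so far
    kept = []
    for docid, key in ranked_docs:
        if seen[key] < collapse_max:
            seen[key] += 1
            kept.append(docid)
    return kept

# Docs 1-4 share collapse key "A", docs 5-6 share key "B".
ranked = [(1, "A"), (2, "A"), (3, "A"), (4, "A"), (5, "B"), (6, "B")]
print(collapse(ranked, 2))  # [1, 2, 5, 6]
```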
On Tue, Sep 12, 2006 at 02:35:42PM -0500, Robby Walker wrote:
> Specifically, what I have are a number of sets of documents, and I'd
> like to get the best documents from *each* set. So, making up some
> relevance scores, say I have documents from set A with scores 10,9,8,7
> and from set B with score 6,5. If I am only to return 3 documents to
> the user I'd like to return 10,9, and 6. i.e. I'd like to filter out
> the similar documents 8 and 7 from A.

How are you choosing the numbers to return from each set? I can think of a variety of approaches which would result in that match set, but I can't be sure which one you mean :-)

James

-- 
/--------------------------------------------------------------------------\
  James Aylett                                                    xapian.org
  james@tartarus.org                                  uncertaintydivision.org