Henry
2008-Dec-15 11:12 UTC
[Xapian-discuss] Xapian's scoring/sorting compared to Google's
Greets,

For the sake of argument and general discussion, let's assume you have a value similar to Google's PageRank which you use for secondary sorting (ie, relevance first, then pagerank).

Is this the best approach to use for sorting (to approach the general results of Google in a simplistic fashion)?

My gut impression from Google results is that this is /roughly/ what they're doing, or am I wrong? Is Google sorting by PageRank first, *then* result relevance?

Any comments?

Thanks
Henry
Olly Betts
2008-Dec-16 04:07 UTC
[Xapian-discuss] Xapian's scoring/sorting compared to Google's
On Mon, Dec 15, 2008 at 01:12:05PM +0200, Henry wrote:
> For the sake of argument and general discussion, let's assume you have
> a value similar to Google's PageRank which you use for secondary
> sorting (ie, relevance first, then pagerank).
>
> Is this the best approach to use for sorting (to approach the general
> results of Google in a simplistic fashion)?

My suggestion for using a "page reputation" score such as PageRank would be to apply an extra weight contribution to each match using Xapian::PostingSource, though that hasn't been in a release yet, so you'll have to use SVN trunk at present.

> My gut impression from Google results is that this is /roughly/ what
> they're doing, or am I wrong? Is Google sorting by PageRank first,
> *then* result relevance?

Actually, I personally doubt PageRank as such features much, if at all, in Google's document ranking these days - people have worked out how to game it too well, and it seems unlikely that more than ten years of development work by Google's thousands of employees hasn't found something better. Microsoft Research certainly claim to have done so:

http://portal.acm.org/citation.cfm?id=1135881

Google undoubtedly do still perform analysis of the network of links between pages (there is certainly useful information in there), but I suspect it bears at most a passing resemblance to PageRank.

I heard a talk by one of the Google "search quality" team last year - of course he didn't go into much detail, but interestingly PageRank was only mentioned when talking about the history of Google...

Anyway, the trick to using a query-independent weight for web-scale search is that you order the documents in your database by decreasing query-independent weight. If you want your results ordered *only* by the query-independent weight, then you can simply stop when you've found enough matches!
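If you also want to include a relevance weighting such as BM25, the ever-decreasing possible contribution from the query-independent weight will still let you stop much sooner: once the best total any remaining document could reach (its query-independent weight plus the maximum possible relevance weight) can't beat the weakest result you already hold, nothing further down the list can change the answer. You can implement this technique using Xapian::PostingSource fairly easily.

Cheers,
Olly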
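The early-termination idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Xapian code and not how Xapian::PostingSource is implemented: the names `top_k`, `query_score` and `max_query_score` are hypothetical, the documents are assumed to arrive already sorted by decreasing query-independent weight, and the final score is assumed to be a simple sum of that weight and the relevance weight.

```python
import heapq


def top_k(docs, query_score, max_query_score, k):
    """Return (results, examined) for the k best-scoring matches.

    docs: iterable of (doc_id, static_weight), sorted by DECREASING
          static_weight (the query-independent weight).
    query_score: hypothetical callable, doc_id -> relevance weight,
          or None when the document does not match the query.
    max_query_score: upper bound on anything query_score can return.
    """
    if k <= 0:
        return [], 0

    heap = []      # min-heap of (total, doc_id); heap[0] is the weakest hit kept
    examined = 0
    for doc_id, static_weight in docs:
        # Every remaining document has static weight <= this one, so its
        # total is at most static_weight + max_query_score. If that cannot
        # beat the weakest result we hold, we can stop scanning.
        if len(heap) == k and static_weight + max_query_score <= heap[0][0]:
            break
        examined += 1
        relevance = query_score(doc_id)
        if relevance is None:
            continue  # not a match for this query
        total = static_weight + relevance
        if len(heap) < k:
            heapq.heappush(heap, (total, doc_id))
        elif total > heap[0][0]:
            heapq.heapreplace(heap, (total, doc_id))
    return sorted(heap, reverse=True), examined
```

With `max_query_score` set to 0 and `query_score` returning 0.0 for every match, this reduces to the first case in the message: the scan stops as soon as k matches have been found. With a non-zero bound, the scan runs a little further, but still ends well before the end of the list when the static weights fall off quickly.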