Hi,

I was reading an article recently about how Google ranks results (among many other things, of course) based on the proximity of the search terms in the source documents. In addition, the position of the search terms in the search query string itself is also taken into consideration when determining how important each term is.

Does Xapian do something similar - at least for the first part?

For example, if I search for 'Olly Betts' (without double quotes) in two documents, the first of which has the terms 'Olly' and 'Betts' widely separated, and the second of which contains 'Olly Betts' right next to each other, will the latter document score higher? Please tell me it will.

I can understand the position information in the search string itself not being used, but surely term proximity is used?

Thanks

goran kent writes:
> Hi,
>
> I was reading an article recently about how Google ranks results
> (among many other things, of course) based on the proximity of the
> search terms in the source documents. In addition, the position of
> the search terms in the search query string itself is also taken into
> consideration when determining how important each term is.
>
> Does Xapian do something similar - at least for the first part?
>
> For example, if I search for 'Olly Betts' (without double quotes) in
> two documents, the first of which has the terms 'Olly' and 'Betts'
> widely separated, and the second of which contains 'Olly Betts' right
> next to each other, will the latter document score higher? Please
> tell me it will.

Hopefully one of the Xapian developers will refute me, but I think that Xapian does no such thing, leaving such things to the application software.

Recoll has an option to automatically add a phrase search to simple queries, in order to obtain the effect you describe, but it's off by default because phrase/proximity searches can be very slow, especially if the terms are common.

By the way, Google's handling of common-word phrases looks nothing short of magic to my insufficiently advanced mind, and I'd be quite interested in an explanation of how they do it. I've been playing with indexing adjacent common terms as an n-gram, but the index size grows so fast that I'm losing a lot of the performance improvements. It would appear that some of the Google PhDs are actually hired for good reason :)

Possibly, another approach for an automatic proximity boost would be to prune the common terms from the generated phrase, but this looks a bit like admitting defeat, and we're left with the "to be or not to be" issue (a phrase made up entirely of common words).

If someone has shareable ideas in this area, I'd be quite willing to experiment.

jf
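
As a rough illustration of the "automatically add a phrase search to simple queries" idea described above, here is a minimal C++ sketch. It is not Recoll's actual code: `split_terms` and `add_phrase_boost` are hypothetical helpers, and the output assumes a query syntax with `AND`, `OR`, parentheses and double-quoted phrases, which a parser such as Xapian's QueryParser should accept. It only handles plain word lists and leaves anything already containing quotes untouched.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a simple query on whitespace.
std::vector<std::string> split_terms(const std::string& input) {
    std::istringstream in(input);
    std::vector<std::string> terms;
    std::string term;
    while (in >> term) terms.push_back(term);
    return terms;
}

// Hypothetical helper: rewrite  olly betts  as
//   (olly AND betts) OR "olly betts"
// Documents containing the exact phrase match both branches, so they pick
// up extra weight from the phrase branch and should rank higher.
std::string add_phrase_boost(const std::string& user_query) {
    // Leave the query alone if the user already wrote an explicit phrase.
    if (user_query.find('"') != std::string::npos) return user_query;

    const std::vector<std::string> terms = split_terms(user_query);
    // A single term (or an empty query) has no phrase to add.
    if (terms.size() < 2) return user_query;

    std::string all_terms;
    std::string phrase;
    for (std::size_t i = 0; i < terms.size(); ++i) {
        if (i > 0) {
            all_terms += " AND ";
            phrase += " ";
        }
        all_terms += terms[i];
        phrase += terms[i];
    }
    return "(" + all_terms + ") OR \"" + phrase + "\"";
}
```

The rewritten string would then be handed to the query parser in place of the user's input. The cost mentioned above still applies: the added phrase branch needs positional data and can be slow when the terms are common.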

On Sun, Sep 4, 2011 at 8:10 PM, Jean-Francois Dockes <jf at dockes.org> wrote:
> > For example, if I search for 'Olly Betts' (without double quotes) in
> > two documents, the first of which has the terms 'Olly' and 'Betts'
> > widely separated, and the second of which contains 'Olly Betts' right
> > next to each other, will the latter document score higher? Please
> > tell me it will.
>
> Hopefully one of the Xapian developers will refute me, but I think that
> Xapian does no such thing, leaving such things to the application
> software.

This is rather sad indeed - one would think this is rather fundamental in determining how important a document is.

It reminds me of search on gmane.org - almost utterly useless because of this issue (and also no ranking based on links - but that is the implementation, not Xapian per se). You'll get search results with a bazaar of highlighted terms, but no consideration of term proximity. Gmane should be a showcase for Xapian.

For example:
http://search.gmane.org/?query=search+the+list&author=&group=gmane.discuss&sort=relevance&DEFAULTOP=and&xP=Zsearch%09Zlist&xFILTERS=Gdiscuss---A

The second result has both terms in close proximity (in the title *and* the body), yet is not ranked 1st.

I wish I had the money to sponsor development of this and other important issues - rather than support for more languages like Lua, et al, or tweaking Omega. Search performance and ranking should reign supreme for a project like Xapian.

Reminds me of http://trac.xapian.org/ticket/326 - chert (without patches, but even with, it's still bad) is 7x SLOWER than the older flint format. That's embarrassing. Yes, one can argue that chert *may* perform better with larger indexes, but hell, that's still a bad start... Can you imagine trying to justify/explain that kind of degradation in a commercial product? You'd be laughed right out of the conference room.

Anyway, we can but hope. :)

On Tue, 6 Sep 2011 08:35:32 +0200, goran kent <gorankent at gmail.com> wrote:
> On Sun, Sep 4, 2011 at 8:10 PM, Jean-Francois Dockes <jf at dockes.org> wrote:
>>> For example, if I search for 'Olly Betts' (without double quotes) in
>>> two documents, the first of which has the terms 'Olly' and 'Betts'
>>> widely separated, and the second of which contains 'Olly Betts' right
>>> next to each other, will the latter document score higher? Please
>>> tell me it will.
>>
>> Hopefully one of the Xapian developers will refute me, but I think that
>> Xapian does no such thing, leaving such things to the application
>> software.
>
> This is rather sad indeed - one would think this is rather fundamental
> in determining how important a document is.

A Google search for "xapian proximity weight" finds this:

http://lists.tartarus.org/pipermail/xapian-discuss/2006-December/003037.html

-- 
All the best,
Tim.

On Tue, Sep 6, 2011 at 1:24 PM, goran kent <gorankent at gmail.com> wrote:
> Crikey, NEAR should be implicit with a general-purpose search engine,
> IMHO. However, I confess, I wasn't even aware of NEAR, so I'll be
> doing some digg'n.

In fact, there should be a flag to default all searches to NEAR (if there *is*, forgive my ignorance).

On Tue, Sep 06, 2011 at 01:35:15PM +0200, goran kent wrote:
> In fact, there should be a flag to default all searches to NEAR (if
> there *is*, forgive my ignorance).

    queryparser->set_default_op(Xapian::Query::OP_NEAR);

Cheers,
    Olly
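
To make concrete what a NEAR-style match has to check, here is a minimal C++ sketch for two terms. It is an illustration only, not Xapian's implementation: `occurs_near` is a hypothetical function, and the window semantics (both occurrences must fit inside `window` consecutive word positions, so adjacent words need a window of 2) are an assumption made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper. `a` and `b` hold the word positions of two terms
// within one document, each sorted in ascending order. Returns true if
// some occurrence of each term lies fewer than `window` positions apart.
bool occurs_near(const std::vector<unsigned>& a,
                 const std::vector<unsigned>& b,
                 unsigned window) {
    std::size_t i = 0;
    std::size_t j = 0;
    // Walk both sorted lists in step, always advancing the smaller
    // position, which visits every candidate for the closest pair.
    while (i < a.size() && j < b.size()) {
        const unsigned lo = std::min(a[i], b[j]);
        const unsigned hi = std::max(a[i], b[j]);
        if (hi - lo < window) return true;
        if (a[i] < b[j]) {
            ++i;
        } else {
            ++j;
        }
    }
    return false;
}
```

A check like this needs the position lists of every query term for each candidate document, which is why proximity matching costs more than a plain term match, particularly for common terms with long position lists.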