thr3ads.net - Xapian discuss - [Xapian-discuss] Ranking and term proximity [Sep 2011]

goran kent

2011-Sep-04 13:43 UTC

[Xapian-discuss] Ranking and term proximity

Hi,

I was reading an article recently about how google ranks results
(among many other things of course) based on the proximity of the
search terms in the source documents.  In addition, the position of
the search terms in the search query string itself is also taken into
consideration when determining how important each term is.

Does Xapian do something similar - at least for the first part?

For example, suppose I search for 'Olly Betts' (without double quotes)
across two documents: in the first, the terms 'Olly' and 'Betts' are
widely separated; in the second, the terms 'Olly Betts' appear right
next to each other. Will the latter document score higher?  Please
tell me it will.

I can understand the position information in the search string itself
not being used, but surely term proximity is used?

Thanks

Jean-Francois Dockes

2011-Sep-04 18:11 UTC

goran kent writes:
 > Hi,
 > 
 > I was reading an article recently about how google ranks results
 > (among many other things of course) based on the proximity of the
 > search terms in the source documents.  In addition, the position of
 > the search terms in the search query string itself is also taken into
 > consideration when determining how important each term is.
 > 
 > Does Xapian do something similar - at least for the first part?
 > 
 > For example, if I search for 'Olly Betts' - without double quotes in
 > two documents the first of which the terms 'Olly' and 'Betts' are
 > widely separated, and the second contains the terms 'Olly Betts' right
 > next to each other, will the latter document score higher?  Please
 > tell me it is.

Hopefully one of the Xapian developers will refute me, but I think that
Xapian does no such thing, leaving such things to the application
software.

Recoll has an option to automatically add a phrase search to simple
queries, in order to obtain the effect you describe, but it's off by
default because phrase/proximity searches can be very slow, especially if
the terms are common.
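
To make the effect being discussed concrete, here is a minimal sketch that is independent of Xapian and Recoll. It assumes each document can supply a sorted list of word positions per term; `min_term_distance` and `proximity_boost` are hypothetical helper names invented for this illustration, and the boost formula is an arbitrary choice, not what any of the engines mentioned here actually compute.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <vector>

// Hypothetical helper (not a Xapian or Recoll API): smallest distance, in
// word positions, between any occurrence of term A and any occurrence of
// term B. Both lists must be sorted in ascending order.
// Returns -1 if either term does not occur in the document.
long min_term_distance(const std::vector<long>& a, const std::vector<long>& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
    if (a.empty() || b.empty()) return -1;

    long best = std::numeric_limits<long>::max();
    std::size_t i = 0, j = 0;
    // Linear merge over the two sorted lists.
    while (i < a.size() && j < b.size()) {
        long d = std::labs(a[i] - b[j]);
        if (d < best) best = d;
        // Advance whichever list currently points at the smaller position.
        if (a[i] < b[j]) ++i; else ++j;
    }
    return best;
}

// Hypothetical boost (an assumption for illustration): 1.0 when the terms
// are adjacent, shrinking toward 0 as the gap grows, 0.0 if a term is absent.
double proximity_boost(long distance) {
    if (distance < 0) return 0.0;
    return 1.0 / static_cast<double>(std::max(1L, distance));
}
```

With this scheme a document where 'Olly' is at position 3 and 'Betts' at position 4 gets a larger boost than one where the two terms are 99 positions apart. An application would add such a boost on top of the engine's ordinary weight.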

By the way, Google's handling of common-word phrases looks nothing short of
magic to my insufficiently advanced mind, and I'd be quite interested in an
explanation of how they do it.

I've been playing with indexing adjacent common terms as an n-gram, but the
index size grows so fast that I'm losing a lot of the performance
improvements. It would appear that some of the Google PhDs are actually
hired for good reason :)
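
As an illustration of the n-gram idea only (not Recoll's actual code), the sketch below emits one combined index term for each pair of adjacent common words. The function name `common_bigrams`, the `_` separator, and the caller-supplied set of common words are all assumptions made for this example.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: for every pair of adjacent tokens that are BOTH in
// the caller-supplied set of common words, emit a combined "left_right"
// term. Indexing these extra terms lets a phrase made of common words be
// looked up as a single term instead of through slow positional matching.
std::vector<std::string> common_bigrams(
        const std::vector<std::string>& tokens,
        const std::unordered_set<std::string>& common) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i + 1 < tokens.size(); ++i) {
        if (common.count(tokens[i]) != 0 && common.count(tokens[i + 1]) != 0) {
            out.push_back(tokens[i] + "_" + tokens[i + 1]);
        }
    }
    return out;
}
```

Every adjacent pair of common words adds a new term, and the number of distinct pairs grows much faster than the number of distinct words, which is consistent with the index growth described above.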

Possibly, another approach for an automatic proximity boost would be to prune
the common terms from the generated phrase, but this looks a bit like
admitting defeat, and we're still left with the "to be or not to be" issue,
where every term in the phrase is common.

If someone has shareable ideas in this area, I'd be quite willing to
experiment.

jf

goran kent

2011-Sep-06 06:35 UTC

On Sun, Sep 4, 2011 at 8:10 PM, Jean-Francois Dockes <jf at dockes.org> wrote:
> > For example, if I search for 'Olly Betts' - without double quotes in
> > two documents the first of which the terms 'Olly' and 'Betts' are
> > widely separated, and the second contains the terms 'Olly Betts' right
> > next to each other, will the latter document score higher?  Please
> > tell me it is.
>
> Hopefully one of the Xapian developers will refute me, but I think that
> Xapian does no such thing, leaving such things to the application
> software.

This is rather sad indeed - one would think this is rather fundamental
in determining how important a document is.

It reminds me of search on gmane.org - almost utterly useless because
of this issue (and also no ranking based on links - but this is
implementation, not xapian per se).  You'll get search results with a
bazaar of highlighted terms, but no consideration for term proximity.
Gmane.org should be a showcase for Xapian.

For example:
http://search.gmane.org/?query=search+the+list&author=&group=gmane.discuss&sort=relevance&DEFAULTOP=and&xP=Zsearch%09Zlist&xFILTERS=Gdiscuss---A

The second result has both terms in close proximity (the title *and*
body), yet is not ranked 1st.

I wish I had the money to sponsor development of this and other
important issues - rather than support for more languages such as Lua,
or tweaking Omega.  Search performance and ranking should reign
supreme for a project like Xapian.  Reminds me of
http://trac.xapian.org/ticket/326 - chert (without patches, but even
with them it's still bad) is 7x SLOWER than the older flint format.
That's embarrassing.  Yes, one can argue that chert *may* perform
better with larger indexes, but hell, that's still a bad start...  Can
you imagine trying to justify/explain that kind of degradation in a
commercial product?  You'd be laughed right out the conference room.

Anyway, we can but hope.

:)

Tim Brody

2011-Sep-06 08:01 UTC

On Tue, 6 Sep 2011 08:35:32 +0200, goran kent <gorankent at gmail.com> wrote:
> On Sun, Sep 4, 2011 at 8:10 PM, Jean-Francois Dockes <jf at dockes.org> wrote:
>> > For example, if I search for 'Olly Betts' - without double quotes in
>> > two documents the first of which the terms 'Olly' and 'Betts' are
>> > widely separated, and the second contains the terms 'Olly Betts' right
>> > next to each other, will the latter document score higher?  Please
>> > tell me it is.
>>
>> Hopefully one of the Xapian developers will refute me, but I think that
>> Xapian does no such thing, leaving such things to the application
>> software.
>
> This is rather sad indeed - one would think this is rather fundamental
> in determining how important a document is.

A Google search for "xapian proximity weight" finds this:
http://lists.tartarus.org/pipermail/xapian-discuss/2006-December/003037.html

-- 
All the best,
Tim.

goran kent

2011-Sep-06 11:35 UTC

On Tue, Sep 6, 2011 at 1:24 PM, goran kent <gorankent at gmail.com> wrote:
> Crikey, NEAR should be implicit with a general-purpose search engine,
> IMHO.  However, I confess, I wasn't even aware of NEAR, so I'll be
> doing some digg'n.

In fact, there should be a flag to default all searches to NEAR (if
there *is*, forgive my ignorance).

Olly Betts

2011-Sep-15 06:41 UTC

On Tue, Sep 06, 2011 at 01:35:15PM +0200, goran kent wrote:
> In fact, there should be a flag to default all searches to NEAR (if
> there *is*, forgive my ignorance).

    queryparser->set_default_op(Xapian::Query::OP_NEAR);

Cheers,
    Olly
