thr3ads.net - Xapian discuss - [Xapian-discuss] Theoretical question [Jan 2006]

If this information is useful, please help other people find it:
Share via:

Chris

2006-Jan-17 01:12 UTC

[Xapian-discuss] Theoretical question

I've been reading the docs on the internal construction of Xapian.
There's
discussion of autopruning and operator decay in the Matching section.

Elsewhere, though, it says that postings lists are stored in doc_id order, 
instead of wdf order, which suggests that there could be high-ranking 
documents at the end of a postings list.

How can autoprune and operator decay really have much effect, then? You 
would almost always have to go to the end of every list.

Example: let's say we have 1000 documents, and we need to return the top 10 
for a single-word query. On average, the top 10 will be scattered uniformly 
across a postings list which is sorted in doc_id order, which means that at 
least one of them will commonly be found 90% or 95% of the way into the 
list.

Chris

2006-Jan-17 04:50 UTC

head link

[Xapian-discuss] Theoretical question

I've been reading the docs on the internal construction of Xapian.
There's
discussion of autopruning and operator decay in the Matching section.

Elsewhere, though, it says that postings lists are stored in doc_id order,
instead of wdf order, which suggests that there could be high-ranking
documents at the end of a postings list.

How can autoprune and operator decay really have much effect, then? You
would almost always have to go to the end of every list.

Example: let's say we have 1000 documents, and we need to return the top 10
for a single-word query. On average, the top 10 will be scattered uniformly
across a postings list which is sorted in doc_id order, which means that at
least one of them will commonly be found 90% or 95% of the way into the
list.

shef

2006-Jan-18 09:12 UTC

head link

[Xapian-discuss] Theoretical question

I tried to post this a couple times through gmane, but
it failed. Sorry if it ends up duplicating...
*****************************************

I've been reading the docs on the internal
construction of Xapian. There's discussion of
autopruning and operator decay in the Matching
section.

Elsewhere, though, it says that postings lists are
stored in doc_id order, instead of wdf order, which
suggests that there could be high-ranking documents at
the end of a postings list.

How can autoprune and operator decay really have much
effect, then? You would almost always have to go to
the end of every list.

Example: let's say we have 1000 documents, and we need
to return the top 10 for a single-word query. On
average, the top 10 will be scattered uniformly across
a postings list which is sorted in doc_id order, which
means that at least one of them will commonly be found
90% or 95% of the way into the list.



__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com

Olly Betts

2006-Feb-03 03:52 UTC

head link

[Xapian-discuss] Theoretical question

On Mon, Jan 16, 2006 at 06:53:05PM -0600, Chris wrote:> I've been reading the docs on the internal construction of Xapian.
There's
> discussion of autopruning and operator decay in the Matching section.
> 
> Elsewhere, though, it says that postings lists are stored in doc_id order, 
> instead of wdf order, which suggests that there could be high-ranking 
> documents at the end of a postings list.
> 
> How can autoprune and operator decay really have much effect, then? You 
> would almost always have to go to the end of every list.
The matcher will need to go to the end of the posting list for one term,
but at that point operator decay and autopruning can start to happen.

Also, for some operators (e.g. AND) the matcher can use skip_to on
the posting lists.  Often that avoids even needing to look at many
chunks of a long posting list.

We've not really looked in detail at how often the various optimisations
actually fire and how much work they save when they do.  But empirically
the combined effect is a marked speed-up.  Coincidentally I'm hoping to
collaborate on a paper studying this.
> Example: let's say we have 1000 documents, and we need to return the
top 10
> for a single-word query.
A single term query has no query operators, so operator decay can't
happen.
> On average, the top 10 will be scattered uniformly across a postings
> list which is sorted in doc_id order, which means that at least one of
> them will commonly be found 90% or 95% of the way into the list. 
Yes, in this case Xapian's matcher will simply scan the whole of that
term's posting list.  Unless you store the posting lists sorted into
decreasing wdf order, that's unavoidable.

Sorting is a possible alternative approach, though document length
factors into the calculation so you can't simply take the first 10
entries from the posting list for a single term query.  Also update is
harder if posting lists are sorted by wdf (delete in particular).  And
we can't compress using the very effective delta-docid scheme we
currently use.

A hybrid approach might work, storing "high" wdf postings in separate
lists to "low" wdf ones, perhaps with multiple bands.

But I'm resistant to making major changes like these without evidence
that they'll improve performance.  If you're interested in investigating
I'd suggest prototyping a scheme you think might work better and then
we can compare it with the current approach.

Cheers,
    Olly

Xapian discuss - Jan 2006 - Theoretical question

[Xapian-discuss] Theoretical question

[Xapian-discuss] Theoretical question

[Xapian-discuss] Theoretical question

[Xapian-discuss] Theoretical question