I've been reading the docs on the internal construction of Xapian. There's discussion of autopruning and operator decay in the Matching section. Elsewhere, though, it says that postings lists are stored in doc_id order, instead of wdf order, which suggests that there could be high-ranking documents at the end of a postings list. How can autoprune and operator decay really have much effect, then? You would almost always have to go to the end of every list. Example: let's say we have 1000 documents, and we need to return the top 10 for a single-word query. On average, the top 10 will be scattered uniformly across a postings list which is sorted in doc_id order, which means that at least one of them will commonly be found 90% or 95% of the way into the list.
I've been reading the docs on the internal construction of Xapian. There's discussion of autopruning and operator decay in the Matching section. Elsewhere, though, it says that postings lists are stored in doc_id order, instead of wdf order, which suggests that there could be high-ranking documents at the end of a postings list. How can autoprune and operator decay really have much effect, then? You would almost always have to go to the end of every list. Example: let's say we have 1000 documents, and we need to return the top 10 for a single-word query. On average, the top 10 will be scattered uniformly across a postings list which is sorted in doc_id order, which means that at least one of them will commonly be found 90% or 95% of the way into the list.
I tried to post this a couple times through gmane, but it failed. Sorry if it ends up duplicating... ***************************************** I've been reading the docs on the internal construction of Xapian. There's discussion of autopruning and operator decay in the Matching section. Elsewhere, though, it says that postings lists are stored in doc_id order, instead of wdf order, which suggests that there could be high-ranking documents at the end of a postings list. How can autoprune and operator decay really have much effect, then? You would almost always have to go to the end of every list. Example: let's say we have 1000 documents, and we need to return the top 10 for a single-word query. On average, the top 10 will be scattered uniformly across a postings list which is sorted in doc_id order, which means that at least one of them will commonly be found 90% or 95% of the way into the list. __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
On Mon, Jan 16, 2006 at 06:53:05PM -0600, Chris wrote:> I've been reading the docs on the internal construction of Xapian. There's > discussion of autopruning and operator decay in the Matching section. > > Elsewhere, though, it says that postings lists are stored in doc_id order, > instead of wdf order, which suggests that there could be high-ranking > documents at the end of a postings list. > > How can autoprune and operator decay really have much effect, then? You > would almost always have to go to the end of every list.The matcher will need to go to the end of the posting list for one term, but at that point operator decay and autopruning can start to happen. Also, for some operators (e.g. AND) the matcher can use skip_to on the posting lists. Often that avoids even needing to look at many chunks of a long posting list. We've not really looked in detail at how often the various optimisations actually fire and how much work they save when they do. But empirically the combined effect is a marked speed-up. Coincidentally I'm hoping to collaborate on a paper studying this.> Example: let's say we have 1000 documents, and we need to return the top 10 > for a single-word query.A single term query has no query operators, so operator decay can't happen.> On average, the top 10 will be scattered uniformly across a postings > list which is sorted in doc_id order, which means that at least one of > them will commonly be found 90% or 95% of the way into the list.Yes, in this case Xapian's matcher will simply scan the whole of that term's posting list. Unless you store the posting lists sorted into decreasing wdf order, that's unavoidable. Sorting is a possible alternative approach, though document length factors into the calculation so you can't simply take the first 10 entries from the posting list for a single term query. Also update is harder if posting lists are sorted by wdf (delete in particular). And we can't compress using the very effective delta-docid scheme we currently use. A hybrid approach might work, storing "high" wdf postings in separate lists to "low" wdf ones, perhaps with multiple bands. But I'm resistant to making major changes like these without evidence that they'll improve performance. If you're interested in investigating I'd suggest prototyping a scheme you think might work better and then we can compare it with the current approach. Cheers, Olly