thr3ads.net - Xapian devel - [Xapian-devel] Project: Posting list encoding improvements [Mar 2012]

If this information is useful, please help other people find it:
Share via:

Weixian Zhou

2012-Mar-31 05:25 UTC

[Xapian-devel] Project: Posting list encoding improvements

Hi Xapianers:
My name is Weixian Zhou, Computer Science student of University at Buffalo,
State University of New York. I am interested in the project of posting
list encoding improvements and weighting schemes. I have some questions
toward them.
1) After read the comments in brass_postlist.cc, I am still not very clear
about the detailed structure of postings list. If you can provide some
simple examples/graphs will be very straightforward.
2) My instant idea to make list smaller: use gamma codes to encode the gap
between docids instead of docids.

Last question towards the project of weighting schemes: Do we need only to
implement existing weighting scheme instead of coming up with new ideas?
And our mission is to find a weighting scheme that could replace the
default BM25 in Xapian?


-- 
Weixian Zhou
Department of Computer Science and Engineering
University at Buffalo, SUNY
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20120331/58c7dedd/attachment.htm>

Olly Betts

2012-Apr-03 04:47 UTC

head link

[Xapian-devel] Project: Posting list encoding improvements

On Sat, Mar 31, 2012 at 01:25:31AM -0400, Weixian Zhou
wrote:> My name is Weixian Zhou, Computer Science student of University at Buffalo,
> State University of New York. I am interested in the project of posting
> list encoding improvements and weighting schemes. I have some questions
> toward them.
If you're equally interested in both, I'd probably steer you towards the
posting list encoding project - we've had quite a lot of interest in the
weighting schemes project already.
> 1) After read the comments in brass_postlist.cc, I am still not very clear
> about the detailed structure of postings list. If you can provide some
> simple examples/graphs will be very straightforward.
Each term's posting list is stored in one or more chunks.  Each chunk is
up to about 2KB of encoded data, and is stored in a B-tree, with a key
built from the term name and the first docid in the chunk.  This allows
us to skip ahead in a posting list efficiently - we just look up the
key for the docid we want which puts us on the right chunk (or the one
before it if that docid is in a gap between chunks).

Within each chunk, there's a simple header and then the docids and wdfs
are encoding in a fairly simple way using variable length integers, with
7 bits of each byte used for data and the most significant bit used to
indicate if there are more bytes for this integer, so 0 to 127 are
encoded in one byte, 128 to 16383 in two bytes, etc.  We store the
(docid delta - 1) and the wdf in this way.

This isn't the most compact encoding by any means, but it is reasonably
compact and easy to update - you can splice in new entries without
having to re-encode a whole chunk, or do a lot of bit shifts.

However, for users not worried about efficiency of updating existing
documents, a more compact encoding would be good.
> 2) My instant idea to make list smaller: use gamma codes to encode the gap
> between docids instead of docids.
Why gamma codes?

I've not done comparisons myself, but Managing Gigabytes suggests
interpolative coding is the best choice for encoding posting lists (of
those it considers, which includes gamma).  Interpolative coding comes
up as the most compact and similar time to encode and decode.  If you
have access to the second edition, the table is on page 129.  Or you
may be able to see it in the preview link here:

https://www.google.com/search?tbm=bks&q=%22either+the+appropriate+model+must+be+read+off+disk%22

I've not done experiments, but we use interpolative coding for
positional data so we already have a well tested implementation.
> Last question towards the project of weighting schemes: Do we need only to
> implement existing weighting scheme instead of coming up with new ideas?
Yes, we're more interested there in implementing proven approaches than
trying to do original research.
> And our mission is to find a weighting scheme that could replace the
> default BM25 in Xapian?
I'm not sure that's the mission exactly.  The idea is to implement more
schemes which are likely to be useful to users.  They might be a better
in some situations, or give faster results, or be useful a baseline for
researchers, or be useful for some other reason.

If there's a good argument for changing the default from BM25, that's
definitely an option.  We wouldn't remove BM25 though.

Cheers,
    Olly

Apparently Analagous Threads

Search for more possibly parallel threads

Xapian devel - Mar 2012 - Project: Posting list encoding improvements

[Xapian-devel] Project: Posting list encoding improvements

[Xapian-devel] Project: Posting list encoding improvements

Apparently Analagous Threads