On Fri, Mar 08, 2013 at 10:21:46PM +0530, chinmay naik
wrote:> i am interested to work on the project "posting list
> encoding improvements".
> I am a newbie to Xapian but would like to get involved and get a head
start.
> I am currently trying to understand the current encoding for the posting
> lists and trying to understand the drawbacks of the current format.I was
> not able to make out much of the brass_postlist.cc
Each postlist is stored as one or more chunks (each less than about
2000 bytes). We split them into chunks so we can easily and efficiently
skip ahead in the list.
This comment in the middle of the file describes the format of
these chunks:
/** The format of a postlist is:
*
* Split into chunks. Key for first chunk is the termname (encoded as
* length : name). Key for subsequent chunks is the same, followed by the
* document ID of the first document in the chunk (encoded as length of
* representation in first byte, and then docid).
*
* A chunk (except for the first chunk) contains:
*
* 1) bool - true if this is the last chunk.
* 2) difference between final docid in chunk and first docid.
* 3) wdf for the first item.
* 4) increment in docid to next item, followed by wdf for the item.
* 5) (4) repeatedly.
*
* The first chunk begins with the number of entries, the collection
* frequency, then the docid of the first document, then has the header of a
* standard chunk.
*/
The integer values are encoded using a variable length integer format,
where 7 bits in every byte are from the integer and the top bit says
whether this is the last byte of that integer or not.
The main drawback of this format is that it isn't as compact as some
of the alternatives (there are a number of encodings described in the
academic literature, and probably elsewhere). It isn't too bad though
- frequent terms will tend to have small deltas, and wdf is rarely more
than 127, so it will generally average not much more than 2 bytes per
posting.
The big plus point it has is that it is very easy to append to an
existing chunk (which is a common operation when adding a new document,
or batches of new documents). It is also relatively easy to insert
or remove entries in the middle.
If we used something like interpolative coding, we would have to rewrite
the whole chunk to append, insert, or remove entries. But that overhead
might not actually be all that large, or if it is large enough to
matter, we could potentially rewrite chunks to a different encoding once
a database is built.
Even if we don't need to do such rewriting, I suspect it makes sense to
support multiple chunk encodings, as some will probably work well in
general, while others may work especially well in certain situations.
For example, a term indexed for use as a boolean filter will have wdf=0
everywhere (so we don't need to store the wdf for every document), and
if it indexes a lot of documents, the document id list is likely to
encode best using interpolative coding. So a chunk format which was
used for wdf=0 and dense postings would work nicely there.
Cheers,
Olly