On Sat, Feb 22, 2014 at 10:41:19PM +0800, Hurricane Tong
wrote:> I am now confused with the Interpolative Coding Algorithm, I failed in
> grasping the main idea of this algorithm by reading xapian source
> code. And it seems that there is not much resource about this
> algorithm in the Internet.
The best reference I know of is the book "Managing Gigabytes". In the
second edition, see pages 126-127 (starting with the first new paragraph
of page 126).
You can probably see it here (assuming the link isn't tied to my
browser session):
http://books.google.com/books?id=2F74jyPl48EC&pg=PA126&lpg=PA126&dq=interpolative+code&source=bl&ots=5QiOEu7Xaa&sig=32JCNZUwQPrZZxW3LmTqHOYnP9E&hl=en&sa=X&ei=mOwIU8LPF8XZkAXin4DwBw&redir_esc=y#v=onepage&q=interpolative%20code&f=false
Otherwise, try searching google books for "interpolative code" and
find
it from there.
> In bitstream.h, I'm confused with the function
> void BitWriter::encode( size_t value, size_t outof ),
> I can't grasp the exact meaning of the two parameters.
It encodes "value" to the bitstream, where we know that:
0 <= value && value < outof
> And in
> void BrassPositionListTable::pack(string & s,const
vector<Xapian::termpos> & vec) const,
> I don't know why this function calls
> wr.encode(vec[0], vec.back());
> wr.encode(vec.size() - 2, vec.back() - vec[0]);
For interpolative coding, we need to know the end points and number of
entries to be able to decode, so we encode these first.
Just before the two lines you quote, we wrote vec.back() (the last
element of the range) to the bitstream. The entries of vec are of
type Xapian::termpos, which is an unsigned type, so vec[0] >= 0,
and since vec is sorted and contains no duplicates, and we know it
contains more than one element, we know that vec[0] < vec.back() (since
they are different elements).
So 0 <= vec[0] and vec[0] < vec.back() and we can encode vec[0] "out
of"
vec.back().
Since vec contains more than one element vec.size() >= 2 or:
0 <= vec.size() - 2
And vec.size() <= vec.back() - vec[0] + 1 (equality is when they are
contiguous, adding gaps makes the difference bigger), which is
equivalent to:
vec.size() - 2 < vec.back() - vec[0]
So we can encode vec.size() - 2 "out of" vec.back() - vec[0].
We could instead encode vec.size() - 2 out of vec.back() and then
vec[0] out of vec.back() - (vec.size() - 2). It looks like that's
less good on average though - for the testsuite it comes out 18%
worse.
Cheers,
Olly