On Wed, Mar 12, 2014 at 12:30:09PM +0800, Hurricane Tong
wrote:> I optimized the code of VSEncoder,
> and encode/decode the whole list, linux.postlist.
>
> Encoding time :
> VSEncoder 103429ms
> InterpolativeEncoder 684ms
>
> Decoding time:
> VSDecoder 756ms
> InterpolativeDecoder 925ms
What where the sizes in each case?
Is your code online somewhere? I'd be interested in taking a look.
> ( compile with -o2, and the Interpolative Encoder/Decoder is just from
> the Xapian's code, bitstream.cc )
>
> VS algorithm performs better in decoding, at the price of encoding time.
>
> I analyse the VSEncoder, most time is spent on getting the optimal
> split, I try to do some optimization, but it seems very difficult to
> make significant progress.
>
> I don't think the position list of a term will be frequently modified,
> so I think it is acceptable to sacrifice some encoding time for faster
> decoding.
You've flipped to talking about position lists here - the choice of
encoding is something to think about there too, though it's really
unlikely they'd be nearly as long as linux.postlist, and the pattern
will be a little different (e.g. the same word occurring at adjacent
positions is rare for most words).
It is true that position lists don't need unpacking and repacking,
but if encoding actually takes 150 times longer in practice, that's
probably not something we want to do by default. It could be an
option for users who want to squeeze out every last bit of search
speed though.
For postlists, we would need to decode and reencode chunks when
documents are deleted or modified - there I think the extra encode
time would also be too much for general use.
But linux.postlist is rather large - we'll probably never encode
such a big chunk of data in one go. Our postlist chunks are currently
about 1000 entries at most I'd guess.
What happens if you try the first 1000 lines of linux.postlist?
Cheers,
Olly