thr3ads.net - Xapian devel - [Xapian-devel] Optimized VSEncoding [Mar 2014]

If this information is useful, please help other people find it:
Share via:

Hurricane Tong

2014-Mar-13 02:41 UTC

[Xapian-devel] Optimized VSEncoding

Hi,

The size of the string generated by VSEncoder is 12592387,
while that by InterpolativeEncoder is 8554817.

When only encoding the first 1000lines, 
both cost 0ms to decode and VS cost 1ms to encode while Interpolative cost 0ms,
1000lines is just too little to catch the difference in my test.

I upload the source code to
https://github.com/HurricaneTong/Xapian/tree/master/VSEncoder
The bitstream.cpp and bitstream.h is about VSEncoder.
The Interpolative.cpp and Interpolative is about InterpolativeEncoder from the
Xapian's code.
I haven't write comment on the code, but in main.cpp, 
double CalTimeVS( const vector<unsigned int>& src ) gives an example
about how to use VSEncoder and VSDecoder.

Best 
Regards

------------------
Shangtong Zhang,Second Year Undergraduate,
School of Computer Science,
Fudan University, China.

------------------ Original ------------------
From:  "Olly Betts";<olly at survex.com>;
Date:  Wed, Mar 12, 2014 04:37 PM
To:  "Hurricane Tong"<zhangshangtong.cpp at qq.com>; 
Cc:  "xapian-devel"<xapian-devel at lists.xapian.org>; 
Subject:  Re: [Xapian-devel] Optimized VSEncoding

On Wed, Mar 12, 2014 at 12:30:09PM +0800, Hurricane Tong
wrote:> I optimized the code of VSEncoder,
> and encode/decode the whole list, linux.postlist.
> 
> Encoding time :  
> VSEncoder 103429ms
> InterpolativeEncoder 684ms
> 
> Decoding time:
> VSDecoder 756ms
> InterpolativeDecoder 925ms
What where the sizes in each case?

Is your code online somewhere?  I'd be interested in taking a look.
> ( compile with -o2, and the Interpolative Encoder/Decoder is just from
> the Xapian's code, bitstream.cc )
> 
> VS algorithm performs better in decoding, at the price of encoding time.
> 
> I analyse the VSEncoder, most time is spent on getting the optimal
> split, I try to do some optimization, but it seems very difficult to
> make significant progress.
> 
> I don't think the position list of a term will be frequently modified,
> so I think it is acceptable to sacrifice some encoding time for faster
> decoding.
You've flipped to talking about position lists here - the choice of
encoding is something to think about there too, though it's really
unlikely they'd be nearly as long as linux.postlist, and the pattern
will be a little different (e.g. the same word occurring at adjacent
positions is rare for most words).

It is true that position lists don't need unpacking and repacking,
but if encoding actually takes 150 times longer in practice, that's
probably not something we want to do by default.  It could be an
option for users who want to squeeze out every last bit of search
speed though.

For postlists, we would need to decode and reencode chunks when
documents are deleted or modified - there I think the extra encode
time would also be too much for general use.

But linux.postlist is rather large - we'll probably never encode
such a big chunk of data in one go.  Our postlist chunks are currently
about 1000 entries at most I'd guess.

What happens if you try the first 1000 lines of linux.postlist?

Cheers,
    Olly
.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20140313/75395ca7/attachment-0002.html>

Richard Boulton

2014-Mar-13 08:01 UTC

head link

[Xapian-devel] Optimized VSEncoding

On 13 March 2014 02:41, Hurricane Tong <zhangshangtong.cpp at qq.com>
wrote:>
> When only encoding the first 1000lines,
> both cost 0ms to decode and VS cost 1ms to encode while Interpolative cost
> 0ms,
> 1000lines is just too little to catch the difference in my test.
>
When measuring small times like that there are two approaches you can try:

 - run the test code lots of times.  Ideally, with different data each time
so that you don't get too many cache effects (ie, with the same data, you
might find that subsequent runs are much faster than the first).  So, in
this case, you could try encoding the whole thing, but in chunks of 1000.

 - run the test under a tool like valgrind's "cachegrind", which
measures
the theoretical performance on a particular CPU architecture, and gives
precise counts of operations performed.

-- 
Richard
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20140313/744acf8d/attachment-0002.html>

Olly Betts

2014-Mar-13 10:49 UTC

head link

[Xapian-devel] Optimized VSEncoding

On Thu, Mar 13, 2014 at 10:41:02AM +0800, Hurricane Tong
wrote:> When only encoding the first 1000lines,
> both cost 0ms to decode and VS cost 1ms to encode while Interpolative cost
0ms,
> 1000lines is just too little to catch the difference in my test.
As Richard says, there are better ways to do timing tests than calling
a fairly low resolution timer like clock() before and after a single
run.

We also recommend using Linux for development work rather than Windows.
That would help here as there are better open source profiling tools
available on Linux.
> I upload the source code to
https://github.com/HurricaneTong/Xapian/tree/master/VSEncoder
Thanks.

You're inserting repeatedly at the start of vector S - each time you do
that the entire current contents of the vector have to be copied up one
place to make room, which means that inserting N items has cost O(N*N).

But vector::push_back() is O(1), so pushing back N items has cost O(N).
And we can just flip the loop which uses the items to run backwards
through S instead of forwards.

I've attached a quick patch to show what I mean. I haven't checked the
output is the same as before, so this might not be quite right - the
patch is really just to make it clearer what I'm talking about.

Making this change reduces the encoding time by a factor of about 3.6
for me for the linux.postlist case (though if the patch isn't quite
right this reduction may not be either). And that's still ~14 times as
long as interpolative coding takes, but I suspect there's further scope
for speeding it up.

Cheers,
Olly
-------------- next part --------------
A non-text attachment was scrubbed...
Name: vsencoder-speedup.patch
Type: text/x-diff
Size: 811 bytes
Desc: not available
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20140313/ee297f71/attachment-0002.patch>

Hurricane Tong

2014-Mar-14 01:03 UTC

head link

[Xapian-devel] Optimized VSEncoding

Hi,

Thanks for your patch, it works well, the output is the same with what it is
before.
To do further optimization, I try to use more subtle bit operation in encoding,
it does make a little progress.
The get_optimal_split cost 72% of the time for encoding, but I can hardly apply
more optimization to it.
I will try to optimize the dynamic programming algorithm.

>>We also recommend using Linux for development work rather than Windows.

I have a ubuntu installed in my machine, but in ubuntu, my machine cannot awake
from hiberating, sleeping....
I tried many methods but finally failed, so if not necessary, I don't switch
to ubuntu.

Best
Regards

------------------
Shangtong Zhang,Second Year Undergraduate,
School of Computer Science,
Fudan University, China.

------------------ Original ------------------
From:  "Olly Betts";<olly at survex.com>;
Date:  Thu, Mar 13, 2014 06:49 PM
To:  "Hurricane Tong"<zhangshangtong.cpp at qq.com>; 
Cc:  "xapian-devel"<xapian-devel at lists.xapian.org>; 
Subject:  Re: [Xapian-devel] Optimized VSEncoding

On Thu, Mar 13, 2014 at 10:41:02AM +0800, Hurricane Tong
wrote:> When only encoding the first 1000lines, 
> both cost 0ms to decode and VS cost 1ms to encode while Interpolative cost
0ms,
> 1000lines is just too little to catch the difference in my test.
As Richard says, there are better ways to do timing tests than calling
a fairly low resolution timer like clock() before and after a single
run.

We also recommend using Linux for development work rather than Windows.
That would help here as there are better open source profiling tools
available on Linux.
> I upload the source code to
https://github.com/HurricaneTong/Xapian/tree/master/VSEncoder
Thanks.

You're inserting repeatedly at the start of vector S - each time you do
that the entire current contents of the vector have to be copied up one
place to make room, which means that inserting N items has cost O(N*N).

But vector::push_back() is O(1), so pushing back N items has cost O(N).
And we can just flip the loop which uses the items to run backwards
through S instead of forwards.

I've attached a quick patch to show what I mean.  I haven't checked the
output is the same as before, so this might not be quite right - the
patch is really just to make it clearer what I'm talking about.

Making this change reduces the encoding time by a factor of about 3.6
for me for the linux.postlist case (though if the patch isn't quite
right this reduction may not be either).  And that's still ~14 times as
long as interpolative coding takes, but I suspect there's further scope
for speeding it up.

Cheers,
    Olly
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.xapian.org/pipermail/xapian-devel/attachments/20140314/2376a902/attachment-0002.html>

Possibly Parallel Threads

Search for more maybe matching threads

Xapian devel - Mar 2014 - Optimized VSEncoding

[Xapian-devel] Optimized VSEncoding

[Xapian-devel] Optimized VSEncoding

[Xapian-devel] Optimized VSEncoding

[Xapian-devel] Optimized VSEncoding

Possibly Parallel Threads