Hi,

The size of the string generated by VSEncoder is 12592387, while that by
InterpolativeEncoder is 8554817.

When only encoding the first 1000 lines, both cost 0ms to decode, and VS
cost 1ms to encode while Interpolative cost 0ms. 1000 lines is just too
little to catch the difference in my test.

I uploaded the source code to
https://github.com/HurricaneTong/Xapian/tree/master/VSEncoder

bitstream.cpp and bitstream.h are about VSEncoder; Interpolative.cpp and
Interpolative.h are about InterpolativeEncoder from Xapian's code. I
haven't written comments on the code yet, but in main.cpp,
double CalTimeVS( const vector<unsigned int>& src )
gives an example of how to use VSEncoder and VSDecoder.

Best Regards

------------------
Shangtong Zhang, Second Year Undergraduate,
School of Computer Science, Fudan University, China.

------------------ Original ------------------
From: "Olly Betts" <olly at survex.com>
Date: Wed, Mar 12, 2014 04:37 PM
To: "Hurricane Tong" <zhangshangtong.cpp at qq.com>
Cc: "xapian-devel" <xapian-devel at lists.xapian.org>
Subject: Re: [Xapian-devel] Optimized VSEncoding

On Wed, Mar 12, 2014 at 12:30:09PM +0800, Hurricane Tong wrote:
> I optimized the code of VSEncoder, and encode/decode the whole list,
> linux.postlist.
>
> Encoding time:
> VSEncoder             103429ms
> InterpolativeEncoder     684ms
>
> Decoding time:
> VSDecoder             756ms
> InterpolativeDecoder  925ms

What were the sizes in each case?

Is your code online somewhere? I'd be interested in taking a look.

> ( compiled with -O2; the Interpolative Encoder/Decoder is just from
> Xapian's code, bitstream.cc )
>
> The VS algorithm performs better in decoding, at the price of encoding
> time.
>
> I analysed the VSEncoder; most time is spent on getting the optimal
> split. I tried to do some optimization, but it seems very difficult to
> make significant progress.
>
> I don't think the position list of a term will be frequently modified,
> so I think it is acceptable to sacrifice some encoding time for faster
> decoding.

You've flipped to talking about position lists here - the choice of
encoding is something to think about there too, though it's really
unlikely they'd be nearly as long as linux.postlist, and the pattern
will be a little different (e.g. the same word occurring at adjacent
positions is rare for most words).

It is true that position lists don't need unpacking and repacking, but
if encoding actually takes 150 times longer in practice, that's probably
not something we want to do by default. It could be an option for users
who want to squeeze out every last bit of search speed though.

For postlists, we would need to decode and reencode chunks when
documents are deleted or modified - there I think the extra encode time
would also be too much for general use.

But linux.postlist is rather large - we'll probably never encode such a
big chunk of data in one go. Our postlist chunks are currently about
1000 entries at most, I'd guess. What happens if you try the first 1000
lines of linux.postlist?

Cheers,
    Olly
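[For readers following along: the interpolative coder being compared against achieves the smaller size above by coding each element relative to known bounds. This is a generic sketch of how binary interpolative coding's size can be counted - it is not Xapian's actual bitstream.cc implementation, just an illustration of the scheme:]

```cpp
#include <cstdint>
#include <vector>

// Bits needed for a value in [0, range]: ceil(log2(range + 1)).
static unsigned bits_for(uint64_t range) {
    unsigned b = 0;
    while (range > 0) { ++b; range >>= 1; }
    return b;
}

// Count the bits binary interpolative coding would use for the sorted,
// strictly increasing values v[lo..hi], known to lie in [low, high].
// The middle element is coded first within the tightest range implied
// by the bounds and the number of elements on each side, then each
// half is coded recursively with narrowed bounds.
uint64_t interpolative_bits(const std::vector<uint32_t>& v,
                            int lo, int hi,
                            uint32_t low, uint32_t high) {
    if (lo > hi) return 0;
    int mid = (lo + hi) / 2;
    // v[mid] must lie in [low + (mid - lo), high - (hi - mid)], so the
    // number of possible values minus one is:
    uint64_t range = uint64_t(high - (hi - mid)) - (low + (mid - lo));
    return bits_for(range)
         + interpolative_bits(v, lo, mid - 1, low, v[mid] - 1)
         + interpolative_bits(v, mid + 1, hi, v[mid] + 1, high);
}
```

Note how a dense run narrows the ranges quickly, often to zero bits per element - which is why interpolative coding compresses clustered postlists so well.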
On 13 March 2014 02:41, Hurricane Tong <zhangshangtong.cpp at qq.com> wrote:
> When only encoding the first 1000 lines, both cost 0ms to decode, and
> VS cost 1ms to encode while Interpolative cost 0ms. 1000 lines is just
> too little to catch the difference in my test.

When measuring small times like that there are two approaches you can try:

- Run the test code lots of times. Ideally, with different data each
  time so that you don't get too many cache effects (i.e. with the same
  data, you might find that subsequent runs are much faster than the
  first). So, in this case, you could try encoding the whole thing, but
  in chunks of 1000.

- Run the test under a tool like valgrind's "cachegrind", which measures
  the theoretical performance on a particular CPU architecture, and
  gives precise counts of operations performed.

--
Richard
On Thu, Mar 13, 2014 at 10:41:02AM +0800, Hurricane Tong wrote:
> When only encoding the first 1000 lines, both cost 0ms to decode, and
> VS cost 1ms to encode while Interpolative cost 0ms. 1000 lines is just
> too little to catch the difference in my test.

As Richard says, there are better ways to do timing tests than calling a
fairly low resolution timer like clock() before and after a single run.

We also recommend using Linux for development work rather than Windows.
That would help here as there are better open source profiling tools
available on Linux.

> I uploaded the source code to
> https://github.com/HurricaneTong/Xapian/tree/master/VSEncoder

Thanks.

You're inserting repeatedly at the start of vector S - each time you do
that, the entire current contents of the vector have to be copied up one
place to make room, which means that inserting N items has cost O(N*N).
But vector::push_back() is O(1), so pushing back N items has cost O(N),
and we can just flip the loop which uses the items to run backwards
through S instead of forwards.

I've attached a quick patch to show what I mean. I haven't checked the
output is the same as before, so this might not be quite right - the
patch is really just to make it clearer what I'm talking about.

Making this change reduces the encoding time by a factor of about 3.6
for me for the linux.postlist case (though if the patch isn't quite
right, this reduction may not be either). That's still ~14 times as long
as interpolative coding takes, but I suspect there's further scope for
speeding it up.

Cheers,
    Olly

-------------- next part --------------
A non-text attachment was scrubbed...
Name: vsencoder-speedup.patch
Type: text/x-diff
Size: 811 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20140313/ee297f71/attachment-0002.patch>
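[The transformation Olly describes, in minimal form - not the actual patch, just the two patterns side by side. build_with_insert shifts the whole vector on every call, so N inserts cost O(N^2); build_with_push_back appends in O(1) amortised and leaves consumers to walk the vector backwards (or reverse it once):]

```cpp
#include <vector>

// Quadratic version: inserting at the head copies every existing
// element up one slot on each call, so N inserts cost O(N^2).
std::vector<int> build_with_insert(const std::vector<int>& steps) {
    std::vector<int> S;
    for (int x : steps)
        S.insert(S.begin(), x);   // O(size) shift on every call
    return S;
}

// Linear version: push_back is amortised O(1).  Consumers can simply
// iterate S.rbegin()..S.rend(); reversing once here just makes the
// two functions return identical results for comparison.
std::vector<int> build_with_push_back(const std::vector<int>& steps) {
    std::vector<int> S;
    S.reserve(steps.size());
    for (int x : steps)
        S.push_back(x);           // amortised O(1)
    return std::vector<int>(S.rbegin(), S.rend());
}
```

Both produce the same sequence; only the cost differs, which is why the patch flips the consuming loop to run backwards instead of reversing.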
Hi,

Thanks for your patch. It works well - the output is the same as it was
before.

To do further optimization, I tried to use more subtle bit operations in
encoding, and that does make a little progress. get_optimal_split costs
72% of the encoding time, but I can hardly apply more optimization to
it. I will try to optimize the dynamic programming algorithm.

> We also recommend using Linux for development work rather than Windows.

I have Ubuntu installed on my machine, but under Ubuntu my machine
cannot wake from hibernation or sleep. I tried many methods but finally
failed, so unless it becomes necessary, I won't switch to Ubuntu.

Best Regards

------------------
Shangtong Zhang, Second Year Undergraduate,
School of Computer Science, Fudan University, China.

------------------ Original ------------------
From: "Olly Betts" <olly at survex.com>
Date: Thu, Mar 13, 2014 06:49 PM
To: "Hurricane Tong" <zhangshangtong.cpp at qq.com>
Cc: "xapian-devel" <xapian-devel at lists.xapian.org>
Subject: Re: [Xapian-devel] Optimized VSEncoding

[quoted text snipped]
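[For reference, the optimal-split step that dominates the encoding time is an optimal-partition dynamic program of this general shape. The recurrence is the standard one - cost to encode a prefix is the best choice of final block - but the constants (header size, maximum block length) are illustrative, not VSEncoder's actual parameters:]

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative constants, not VSEncoder's real ones.
static const uint64_t HEADER_BITS = 8;   // per-block header cost
static const size_t MAX_BLOCK = 16;      // longest allowed block

// width[k] = bits needed for the k-th gap.  cost[i] = minimal bits to
// encode the first i gaps, where a block pays one header plus its
// length times the widest value it contains:
//   cost[i] = min over len of cost[i-len] + HEADER + len * max width.
// Capping len at MAX_BLOCK makes the DP O(n * MAX_BLOCK) rather than
// O(n^2), which is the usual way to keep this step fast.
uint64_t optimal_split_bits(const std::vector<uint32_t>& width) {
    size_t n = width.size();
    const uint64_t INF = ~uint64_t(0);
    std::vector<uint64_t> cost(n + 1, INF);
    cost[0] = 0;
    for (size_t i = 1; i <= n; ++i) {
        uint32_t maxw = 0;
        for (size_t len = 1; len <= std::min(i, MAX_BLOCK); ++len) {
            maxw = std::max(maxw, width[i - len]);
            cost[i] = std::min(cost[i],
                               cost[i - len] + HEADER_BITS
                                   + uint64_t(len) * maxw);
        }
    }
    return cost[n];
}
```

The inner loop extends the candidate final block leftwards while tracking its running maximum width, so each (i, len) pair costs O(1); shrinking MAX_BLOCK or pruning dominated splits is where further speedups in get_optimal_split would likely come from.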