Weixian Zhou
2012-Apr-20 02:51 UTC
[Xapian-devel] Posting list encoding improvements - pfd encoding & var len encoding comparison program
Hi all, I wrote a program that implement the variable length encoding and fixed length encoding, and compares their index size and speed of search doc length. You can see the comparison result from the attachment snapshot. 1. The posting list is in all memory; 2. The search strategy of fixed length encoding is skipping with exponential step (1, 2, 4, 8, ...). Once exceeds the desired doc id, back to previous step and skip with step 1. 3. The implemented fixed length encoding uses 4 bytes as fixed length. This is not optimal and can be further optimized in PFD. 4. The program generates uniform random doc id gap and doc len to make posting list. *You can access the code via my github: https://github.com/zwxxx/pfd_simple_test* -- Weixian Zhou Department of Computer Science and Engineering University at Buffalo, SUNY -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120419/72da0ab2/attachment.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: compare.png Type: image/png Size: 6772 bytes Desc: not available URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120419/72da0ab2/attachment-0002.png>
Weixian Zhou
2012-Apr-20 03:12 UTC
[Xapian-devel] Posting list encoding improvements - pfd encoding & var len encoding comparison program
Fix a typo in the attachment: The search time of 100000 searches of variable length encoding and fixed length encoding are reversed. For convenience, I attached all source files. On Thu, Apr 19, 2012 at 10:51 PM, Weixian Zhou <ideazwx at gmail.com> wrote:> Hi all, > I wrote a program that implement the variable length encoding and fixed > length encoding, and compares their index size and speed of search doc > length. > You can see the comparison result from the attachment snapshot. > > 1. The posting list is in all memory; > 2. The search strategy of fixed length encoding is skipping with > exponential step (1, 2, 4, 8, ...). Once exceeds the desired doc id, back > to previous step and skip with step 1. > 3. The implemented fixed length encoding uses 4 bytes as fixed length. > This is not optimal and can be further optimized in PFD. > 4. The program generates uniform random doc id gap and doc len to make > posting list. > > *You can access the code via my github: > https://github.com/zwxxx/pfd_simple_test* > -- > Weixian Zhou > Department of Computer Science and Engineering > University at Buffalo, SUNY > >-- Weixian Zhou Department of Computer Science and Engineering University at Buffalo, SUNY -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120419/c9950a69/attachment.html> -------------- next part -------------- A non-text attachment was scrubbed... Name: src.zip Type: application/zip Size: 4654 bytes Desc: not available URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20120419/c9950a69/attachment-0002.zip>
Olly Betts
2012-Apr-20 03:58 UTC
[Xapian-devel] Posting list encoding improvements - pfd encoding & var len encoding comparison program
On Thu, Apr 19, 2012 at 11:12:32PM -0400, Weixian Zhou wrote:> 3. The implemented fixed length encoding uses 4 bytes as fixed length. > This is not optimal and can be further optimized in PFD.[...]> Fix a typo in the attachment: The search time of 100000 searches of > variable length encoding and fixed length encoding are reversed.That's promising I think - the fixed length is quite a bit faster, but almost exactly twice the size with a 4 byte fixed size. But document lengths will probably fit in 2 bytes in many situations (and should almost never need more than 3) so even a simple per-chunk choice of the number of byes to use will often put this about the same in size terms it seems. Cheers, Olly
Reasonably Related Threads
- Project: Posting list encoding improvements
- Project: QueryParser Reimplementation, to Olly Betts and Dan Colish
- [Compile Issue] netcat.c on HP NonStop
- [PATCH] check for default subvolid and act accordingly on install
- [PATCH] Enable ConnectTimeout with ConnectionAttempts