张旭东
2013-Apr-05 13:42 UTC
[Xapian-devel] Question about Project: Posting list encoding improvements
Dear all: My name is Xudong Zhang, from Peking University. I'm very interested in the Project: Posting list encoding improvements, and I have had some experience on encoding, and so I want to be involved in the project. About the project?I have a question to consult: What's the main goal of the improvement of the speed for the encoding and decoding of posting lists? Is it OK to increase the encoding/decoding speed greatly with the slight loss of compression? Below is about my personal information. I'm a master student from Peking University, Beijing, China. My research area is search engine and web mining, and I've been focusing on compression algorithms of posting lists in search engines since last February. I have proposed a novel decoding algorithm called SIMD-PFORDelta, which exploit SIMD instructions in PFORDelta. It is fast and can achieve good compression ratio. Recently, I also did some experiments to test the state-of-the-art and proposed encoding algorithms of posting lists. We used our own search engine called PARADISE and the datasets are GOV2(25 million docs) and ClueWeb09B(50 million docs). Below is some experimental results of docIDs? gap sequence on ClueWeb09B, which is similar to Lemire's results[1] (The implementation of SIMD_pfor_delta is somewhat different to that in Lemire's paper): Algorithm Name Compression Speed(M/s) Decompression Speed(M/s) Compression Ratio simple9 88.2749297 500.8305639 16.72% simple16 51.29381855 488.0998879 15.55% var_byte 492.9379145 519.1921354 26.02% group_var_byte 243.7189985 501.3713526 31.76% group_var_byte_incomp_unary 124.5200775 501.8188062 28.77% group_var_byte_comp_unary 123.1249031 461.8294703 28.70% SIMD_group_var_byte 240.296232 793.0340184 31.76% SIMD_group_var_byte_incomp_unary 126.3003402 1325.49703 28.77% SIMD_group_var_byte_comp_unary 126.3347572 1128.968014 28.70% packed_binary 252.6257354 1151.000784 27.55% SIMD_packed_binary 254.2433557 1507.316133 27.55% pfor_delta 26.99031483 1026.717585 19.53% SIMD_pfor_delta 27.23024567 1580.111396 19.53% rice 68.74742448 79.16529964 15.77% k_gamma 172.9461803 242.8787555 15.24% gamma_opt 61.7750133 59.30571281 14.91% SIMD__kgamma 174.1829249 262.0594328 15.24% Thank you very much! P.S. In D. Lemire's paper: Decoding billions of integers per second through vectorization(CORR'12), SIMD-PORDelta can outperforms VSEncoding w.r.t decoding speed, while remaining nearly the same compression ratio, namely, bits/int. [1] D. Lemire et al. Decoding billions of integers per second through vectorization, CORR?12, http://arxiv.org/abs/1209.2137 Best wishes, Xudong Zhang -- --------------------------- Xudong Zhang +86-13488695609 Team of Search Engine and Web Mining School of Electronic Engineering and Computer Science Peking University, Beijing, 100871, P.R.China -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.xapian.org/pipermail/xapian-devel/attachments/20130405/a80742fe/attachment.html>