Hurricane Tong
2014-May-21 14:14 UTC
[Xapian-devel] the influence of the change of the doclen chunk format
Hi,

I am going to make a patch applying the new doclen chunk format, but I can't figure out what the change will affect. Olly once said that the matching procedure uses the doclen postlist. How can I find all the components that use the doclen postlist?

As the fixed-width format is designed for contiguous docids, I only want to apply this format to the doclen postlist, rather than to ordinary term postlists. So I can't just change the PostlistChunkReader.

And how can I get some real doclen data to test the performance of the new format? Or should I just generate some data randomly?

Best Regards
------------------
Shangtong Zhang,
Second Year Undergraduate,
School of Computer Science,
Fudan University, China.
Olly Betts
2014-May-23 01:32 UTC
[Xapian-devel] the influence of the change of the doclen chunk format
On Wed, May 21, 2014 at 10:14:15PM +0800, Hurricane Tong wrote:
> I am going to make a patch applying new doclen chunk format.
> But I can't figure out the influence of the change.
> Olly once said the procedure of matching will use the doclen postlist.
> How can I find all the components that use the doclen postlist?

I would suggest just profiling some searches. The default weighting uses the doclength, and that's the case which Richard looked at in #326.

There are some particular cases which use the doclen list more heavily, but these don't really need to be able to skip through it efficiently.

> As the fixed-width format is designed for contiguous docids,
> I just want to apply this format to doclen postlist, rather than
> ordinary term postlist.
> So I can't just change the PostlistChunkReader.

Currently the doclength encoding just shares stuff with the postlist encoding for convenience, but you'll need to provide a different handler for it now.

> And how can I get some real doclen data to test the performance of the
> new format?
> Or I just generate some data randomly.

Randomly generated data tends to have different characteristics to real data, so it's better to avoid it for this sort of thing, as you can end up optimising for the wrong things.

I've put two sets of doclen data from real databases here, and a third one will appear shortly (it's taking a while to copy over):

    http://oligarchy.co.uk/xapian/data/doclens/

Format is (docid, doclen) in ascending docid order.

The archives collection is from indexed files, and has 101440 documents present out of 102433 used docids.

The email collection has more missing docids, partly due to spam deletion, but also refiling an email gets handled as a deletion and addition in this system - that has 20934 used out of 33384. I would say this one is probably unusually sparse.

The gmane collection (which is on its way) has no missing docids, and has 114086700 docids.

I dumped these out using:

    xapian-delve /path/to/db -t '' -1 -v|awk '($1 != "Posting"){print $1" "$3}'

I can get more data if you need more.

Cheers,
    Olly
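
For anyone wanting to get a feel for these dumps before benchmarking, here is a minimal Python sketch that reads "docid doclen" lines (the output shape of the awk command above), groups them into runs of consecutive docids, and reports how sparse the collection is. The thread does not specify the fixed-width chunk layout, so the byte estimate below rests on an assumption made purely for illustration: each contiguous run stores every doclen at the width of its largest value. The function names are hypothetical and not part of Xapian.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Run = Tuple[int, List[int]]  # (first docid in the run, doclens in docid order)


def parse_doclens(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (docid, doclen) pairs from whitespace-separated 'docid doclen' lines."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        yield int(fields[0]), int(fields[1])


def contiguous_runs(pairs: Iterable[Tuple[int, int]]) -> List[Run]:
    """Group pairs (already in ascending docid order) into runs of consecutive docids."""
    runs: List[Run] = []
    for docid, doclen in pairs:
        # A docid extends the current run if it directly follows the run's last docid.
        if runs and docid == runs[-1][0] + len(runs[-1][1]):
            runs[-1][1].append(doclen)
        else:
            runs.append((docid, [doclen]))
    return runs


def bytes_needed(value: int) -> int:
    """Smallest whole number of bytes that can hold a non-negative integer."""
    return max(1, (value.bit_length() + 7) // 8)


def summarise(runs: List[Run]) -> Dict[str, int]:
    """Report sparseness plus a payload estimate under the assumed fixed-width layout."""
    present = sum(len(doclens) for _, doclens in runs)
    last_docid = runs[-1][0] + len(runs[-1][1]) - 1 if runs else 0
    # ASSUMPTION: one width per run, chosen to fit that run's largest doclen.
    payload = sum(len(doclens) * bytes_needed(max(doclens)) for _, doclens in runs)
    return {
        "present": present,
        "last_docid": last_docid,
        "runs": len(runs),
        "fixed_width_payload_bytes": payload,
    }
```

Passing an open file of one of the dumps to `parse_doclens`, then the result through `contiguous_runs` and `summarise`, gives the number of documents present, the highest docid, and the number of runs. A sparse collection such as the email one should show many short runs, while a collection with no missing docids collapses to a single run.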