Paul Boddie
2009-Nov-06 16:38 UTC
[Xapian-discuss] The position list table and index size (compared to Lucene)
Hello, I have been using Xapian (and Python) in a biomedical literature search application for several months, having previously used PyLucene (which I've been moving away from because of the awkward build constraints). One issue which arose from a migration from Lucene to Xapian was the approximately four-fold increase in index size when moving to Xapian; this appears to be a product of the term position information, and I have also seen similar reports of index "growth" mentioned in the mailing list archives. For example: http://lists.tartarus.org/pipermail/xapian-discuss/2009-April/006626.html (There are probably better examples than this, but I can't locate more relevant documents at the moment.) Are there any simple explanations for the large differences in index and position list table size? I notice that the documentation, specifically this Wiki page... http://trac.xapian.org/wiki/FlintPositionListTable ...mentions "tname" in the "key format", which I presume means "term name", but in a B-tree I imagine that even storing the term name for each key would still only result in the overhead in doing so being proportional to the document frequency of each term, not to the actual "occurrence" frequency, although this probably wouldn't be a trivial amount for a collection of tens of millions of short documents, which is the kind of collection I've built. I must confess to not having studied the Xapian source code thoroughly at this point. While trying to formulate some kind of explanation, I have also consulted the Lucene index format documentation: http://lucene.apache.org/java/2_9_0/fileformats.html However, it appears to me that Lucene doesn't use a standard B-tree structure, anyway. I've also written a very simple indexer in Python using a highly restrictive index format similar to (but even simpler than) that employed by Lucene, and I can reproduce a significant reduction in index size, although I'm obviously not going to make any performance claims for this solution. Is there anything I can configure or change to affect the index size in Xapian, or is there anything I should be more aware of? Regards, Paul
Olly Betts
2009-Nov-08 10:40 UTC
[Xapian-discuss] The position list table and index size (compared to Lucene)
On Fri, Nov 06, 2009 at 05:38:44PM +0100, Paul Boddie wrote:> One issue which arose from a migration from Lucene to Xapian was the > approximately four-fold increase in index size when moving to Xapian;Four-fold sounds extreme - are you sure you are comparing like with like? For example, indexing your data in the same way with the two systems, storing the same literal data in each, and comparing optimised/compacted sizes or unoptimised/uncompacted sizes, not one of each.> this appears to be a product of the term position information, and I > have also seen similar reports of index "growth" mentioned in the > mailing list archives. For example: > > http://lists.tartarus.org/pipermail/xapian-discuss/2009-April/006626.html > (There are probably better examples than this, but I can't locate more > relevant documents at the moment.) > > Are there any simple explanations for the large differences in index and > position list table size?Well, the article you just linked to discusses one big factor - the termlist table. It seems silly to repeat myself - just read what I wrote before - but to update what's said there a little, this is now optional in trunk, though "imperfect" deletion isn't yet supported. As it also mentions, the chert backend in 1.1.x is more compact than the flint backend which is the default in 1.0.x.> I notice that the documentation, specifically > this Wiki page... > > http://trac.xapian.org/wiki/FlintPositionListTable > > ...mentions "tname" in the "key format", which I presume means "term > name"Yes, "tname" is "term name".> in a B-tree I imagine that even storing the term name for each key would > still only result in the overhead in doing so being proportional to the > document frequency of each termIt should be roughly proportional to the number of documents each term appears in. The most common words in most languages tend to be the shortest, but it would probably save a significant amount of space to have a lexicon which mapped the term name to a unique integer which could be used in the keys to the position table, and in a few other places. I've not yet tried this though. If you measure how much of the position table is actually the positional data the percentage is surprisingly low. I got 14% for gmane, though that will certainly vary, but something like 86% is a mixture of keys, Btree structure, and unused space. We can't eliminate all of that (we do need to somehow store what each stream of positional data is for), but it shouldn't be hard to make a dramatic difference here. This is something I'm likely to be working on later this month.> Is there anything I can configure or change to affect the index size in > Xapian, or is there anything I should be more aware of?Using 1.1.x and chert (set XAPIAN_PREFER_CHERT=1 in the environment until we make it the default) if you aren't scared of potentially having to rebuild your application and index when you upgrade to a new release. If you don't need to be able to delete or replace documents, then with trunk (and 1.1.3 once we release it) you can "rm termlist*" in the database directory after you create it and you'll have a termlist-less database (this is rather new, and will gain an API flag once it's had any kinks knocked out). Cheers, Olly