thr3ads.net - Xapian discuss - [Xapian-discuss] The position list table and index size (compared to Lucene) [Nov 2009]

If this information is useful, please help other people find it:
Share via:

Paul Boddie

2009-Nov-06 16:38 UTC

[Xapian-discuss] The position list table and index size (compared to Lucene)

Hello,

I have been using Xapian (and Python) in a biomedical literature search 
application for several months, having previously used PyLucene (which 
I've been moving away from because of the awkward build constraints). 
One issue which arose from a migration from Lucene to Xapian was the 
approximately four-fold increase in index size when moving to Xapian; 
this appears to be a product of the term position information, and I 
have also seen similar reports of index "growth" mentioned in the 
mailing list archives. For example:

http://lists.tartarus.org/pipermail/xapian-discuss/2009-April/006626.html
(There are probably better examples than this, but I can't locate more 
relevant documents at the moment.)

Are there any simple explanations for the large differences in index and 
position list table size? I notice that the documentation, specifically 
this Wiki page...

http://trac.xapian.org/wiki/FlintPositionListTable

...mentions "tname" in the "key format", which I presume
means "term
name", but in a B-tree I imagine that even storing the term name for 
each key would still only result in the overhead in doing so being 
proportional to the document frequency of each term, not to the actual 
"occurrence" frequency, although this probably wouldn't be a
trivial
amount for a collection of tens of millions of short documents, which is 
the kind of collection I've built.

I must confess to not having studied the Xapian source code thoroughly 
at this point. While trying to formulate some kind of explanation, I 
have also consulted the Lucene index format documentation:

http://lucene.apache.org/java/2_9_0/fileformats.html

However, it appears to me that Lucene doesn't use a standard B-tree 
structure, anyway. I've also written a very simple indexer in Python 
using a highly restrictive index format similar to (but even simpler 
than) that employed by Lucene, and I can reproduce a significant 
reduction in index size, although I'm obviously not going to make any 
performance claims for this solution. Is there anything I can configure 
or change to affect the index size in Xapian, or is there anything I 
should be more aware of?

Regards,

Paul

Olly Betts

2009-Nov-08 10:40 UTC

head link

[Xapian-discuss] The position list table and index size (compared to Lucene)

On Fri, Nov 06, 2009 at 05:38:44PM +0100, Paul Boddie
wrote:> One issue which arose from a migration from Lucene to Xapian was the
> approximately four-fold increase in index size when moving to Xapian;
Four-fold sounds extreme - are you sure you are comparing like with like?

For example, indexing your data in the same way with the two systems, storing
the same literal data in each, and comparing optimised/compacted sizes or
unoptimised/uncompacted sizes, not one of each.
> this appears to be a product of the term position information, and I
> have also seen similar reports of index "growth" mentioned in the
> mailing list archives. For example:
>
> http://lists.tartarus.org/pipermail/xapian-discuss/2009-April/006626.html
> (There are probably better examples than this, but I can't locate more
> relevant documents at the moment.)
>
> Are there any simple explanations for the large differences in index and
> position list table size?
Well, the article you just linked to discusses one big factor - the termlist
table.  It seems silly to repeat myself - just read what I wrote before - but
to update what's said there a little, this is now optional in trunk, though
"imperfect" deletion isn't yet supported.

As it also mentions, the chert backend in 1.1.x is more compact than the flint
backend which is the default in 1.0.x.
> I notice that the documentation, specifically
> this Wiki page...
>
> http://trac.xapian.org/wiki/FlintPositionListTable
>
> ...mentions "tname" in the "key format", which I
presume means "term
> name"
Yes, "tname" is "term name".
> in a B-tree I imagine that even storing the term name for each key would
> still only result in the overhead in doing so being proportional to the
> document frequency of each term
It should be roughly proportional to the number of documents each term appears
in.  The most common words in most languages tend to be the shortest, but it
would probably save a significant amount of space to have a lexicon which
mapped the term name to a unique integer which could be used in the keys to the
position table, and in a few other places.  I've not yet tried this though.

If you measure how much of the position table is actually the positional data
the percentage is surprisingly low.  I got 14% for gmane, though that will
certainly vary, but something like 86% is a mixture of keys, Btree structure,
and unused space.  We can't eliminate all of that (we do need to somehow
store
what each stream of positional data is for), but it shouldn't be hard to
make a dramatic difference here.  This is something I'm likely to be working
on later this month.
> Is there anything I can configure or change to affect the index size in
> Xapian, or is there anything I should be more aware of?
Using 1.1.x and chert (set XAPIAN_PREFER_CHERT=1 in the environment until we
make it the default) if you aren't scared of potentially having to rebuild
your application and index when you upgrade to a new release.

If you don't need to be able to delete or replace documents, then with trunk
(and 1.1.3 once we release it) you can "rm termlist*" in the database
directory
after you create it and you'll have a termlist-less database (this is rather
new, and will gain an API flag once it's had any kinks knocked out).

Cheers,
    Olly

Xapian discuss - Nov 2009 - The position list table and index size (compared to Lucene)

[Xapian-discuss] The position list table and index size (compared to Lucene)

[Xapian-discuss] The position list table and index size (compared to Lucene)