On Mon, Apr 23, 2012 at 10:16:51PM +0800, Jaguar Xiong wrote:
> I did a comparison based on similar steps as in the blog
> (zooie.wordpress.com/2009/07/06/a-comparison-of-open-source-search-engines-and-indexing-twitter),
> against lucene-3.4 and xapian-1.3.0. The overall index sizes are:
> lucene 89M, xapian 189M (chert backend and compacted).
> Since I'm more interested in index size, I dig a little further to dump
> the full term list. There are about 360000 terms from lucene index, and
> about 285000 terms from xapian index.

Which additional terms has Lucene indexed?

> But surprisingly, the termlist.DB of xapian index is already 122M.
It's surprising to hear termlist.DB is ~2/3 of the total size, as it is
usually much less - I guess if you are indexing tweets then that's a
lot of very small documents, and the front coding used in the termlist
entries works better for larger documents.
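
To illustrate why document size matters here, below is a rough sketch of
front coding (prefix compression) over a sorted term list. It is not
Xapian's actual on-disk encoding: the one-byte length fields, the function
names and the sample terms are all assumptions made for illustration.

```python
def shared_prefix_len(a, b):
    """Number of leading bytes that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def plain_size(terms):
    """Bytes needed with a 1-byte length followed by the full term."""
    return sum(1 + len(t.encode("utf-8")) for t in set(terms))


def front_coded_size(terms):
    """Bytes needed when each entry stores only what differs from the
    previous term in sorted order.

    Assumed layout per entry (illustrative only): 1 byte for the
    shared-prefix length, 1 byte for the suffix length, then the suffix.
    """
    total = 0
    prev = b""
    for term in sorted(set(terms)):
        data = term.encode("utf-8")
        shared = shared_prefix_len(prev, data)
        total += 2 + (len(data) - shared)
        prev = data
    return total


# A tweet-sized document: few terms, so sorted neighbours share nothing.
tweet = ["index", "lucene", "size", "xapian"]

# A larger document: more terms, so sorted neighbours often share a prefix.
article = ["index", "indexed", "indexer", "indexes", "indexing",
           "search", "searched", "searcher", "searches", "searching"]

for name, terms in (("tweet", tweet), ("article", article)):
    print(name, "plain:", plain_size(terms),
          "front-coded:", front_coded_size(terms))
```

With these made-up inputs the tweet gains nothing from front coding (the
extra length byte even makes it slightly larger), while the longer term
list shrinks considerably.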
The termlist table stores the list of terms each document contains (and
if you are storing any document values, also the value slots used in
each document).
This information allows Xapian to delete or update a document correctly,
and also allows query expansion. My understanding is that Lucene
doesn't store this information, and handles deletion by adding the
document id to a "deleted" list, which has to be excluded from query
results; this also means the frequency statistics will tend to be
increasingly inaccurate as more documents are deleted or modified.
That's the trade-off in exchange for not having to store the termlist
data.
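
As a toy model of that trade-off (not how either Xapian or Lucene is
actually implemented; the class names are hypothetical), compare an index
which keeps a per-document term list with one which only records deleted
document ids:

```python
class TermlistIndex:
    """Toy index that stores the list of terms for each document."""

    def __init__(self):
        self.postings = {}   # term -> set of document ids
        self.termlist = {}   # document id -> set of terms in that document

    def add(self, docid, terms):
        self.termlist[docid] = set(terms)
        for term in terms:
            self.postings.setdefault(term, set()).add(docid)

    def delete(self, docid):
        # The stored term list says exactly which postings to update.
        for term in self.termlist.pop(docid):
            self.postings[term].discard(docid)

    def doc_freq(self, term):
        return len(self.postings.get(term, ()))

    def search(self, term):
        return set(self.postings.get(term, ()))


class TombstoneIndex:
    """Toy index with no term list; deletions go on a "deleted" list."""

    def __init__(self):
        self.postings = {}
        self.deleted = set()

    def add(self, docid, terms):
        for term in terms:
            self.postings.setdefault(term, set()).add(docid)

    def delete(self, docid):
        # Postings are left alone; the id is only marked as deleted.
        self.deleted.add(docid)

    def doc_freq(self, term):
        # Still counts deleted documents, so it drifts as deletions pile up.
        return len(self.postings.get(term, ()))

    def search(self, term):
        # Deleted ids have to be excluded from every result set.
        return self.postings.get(term, set()) - self.deleted


for index in (TermlistIndex(), TombstoneIndex()):
    index.add(1, ["xapian", "index"])
    index.add(2, ["xapian", "lucene"])
    index.add(3, ["xapian", "size"])
    index.delete(2)
    print(type(index).__name__,
          sorted(index.search("xapian")), index.doc_freq("xapian"))
```

Both variants return the same search results after the deletion, but only
the first still reports an accurate document frequency for the term.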
Xapian doesn't currently support a "deleted" list, but if you don't
want to be able to delete or modify documents, you can just delete
this table from your database ("rm termlist.*") and pretty much
everything else will continue to work. The other things which rely
on the termlist table are listed in the ticket for this issue:
http://trac.xapian.org/ticket/181
If you delete the termlist, then it looks like Xapian would be ~67M vs
Lucene's 89M.
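
If you would rather try this on a copy than on the live database, here is
a minimal sketch. The "termlist.*" pattern is the one from the "rm" command
above; the function names and the example paths are hypothetical, and it
assumes nothing is writing to the database while it is being copied.

```python
import glob
import os
import shutil


def dir_size(path):
    """Total size in bytes of the regular files directly inside path."""
    return sum(
        os.path.getsize(os.path.join(path, name))
        for name in os.listdir(path)
        if os.path.isfile(os.path.join(path, name))
    )


def copy_without_termlist(src_db, dst_db):
    """Copy a database directory, then remove termlist.* from the copy.

    Returns (size_before, size_after) in bytes for the copy. The original
    directory is left untouched. dst_db must not exist yet.
    """
    shutil.copytree(src_db, dst_db)
    before = dir_size(dst_db)
    pattern = os.path.join(glob.escape(dst_db), "termlist.*")
    for path in glob.glob(pattern):
        os.remove(path)
    return before, dir_size(dst_db)
```

For example, copy_without_termlist("tweets.db", "tweets-noterm.db") would
leave the original alone and report how much smaller the copy is.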
> Is there some idea/plan on reducing the index size? I'll be glad if I could
> help.

Brass should be a little smaller than chert, but the difference isn't
going to be dramatic.

There are a few ideas we have to reduce the size - if you're wanting to
help work on this, here are a couple:
* Posting list encodings could be more compact (probably in exchange for
  being more expensive to update, so supporting several encodings and
  picking the appropriate one via heuristics and/or user hints would
  probably be best):
  http://trac.xapian.org/wiki/GSoCProjectIdeas#Project:Postinglistencodingimprovements

* The Btree keys are currently stored in full each time, but within
  almost all blocks, the keys will share a common prefix, so it would
  reduce the space used and allow us to fit more in a block if we just
  stored that prefix once. This would help tables with a lot of small
  entries especially (like the position table).
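
The second idea can be sketched roughly as below. The key format, the
one-byte length fields and the function names are assumptions made for
illustration only; they are not the real Btree block layout.

```python
def pack_block(keys):
    """Split a non-empty block of keys into (common_prefix, remainders)."""
    encoded = [k.encode("utf-8") for k in keys]
    # The prefix common to every key is the prefix shared by the smallest
    # and largest key in byte order.
    first, last = min(encoded), max(encoded)
    n = 0
    while n < len(first) and n < len(last) and first[n] == last[n]:
        n += 1
    return first[:n], [k[n:] for k in encoded]


def unpack_block(prefix, remainders):
    """Rebuild the original keys from the packed form."""
    return [(prefix + r).decode("utf-8") for r in remainders]


def block_sizes(keys):
    """Return (full, shared) byte counts for one block of keys.

    full:   every key stored whole, each with a 1-byte length.
    shared: the common prefix stored once with a 1-byte length, then each
            remainder with a 1-byte length.
    """
    prefix, remainders = pack_block(keys)
    full = sum(1 + len(k.encode("utf-8")) for k in keys)
    shared = 1 + len(prefix) + sum(1 + len(r) for r in remainders)
    return full, shared


# Made-up keys of the form "<term>/<docid>", standing in for a block with
# many small entries for the same term.
keys = ["searching/%06d" % docid for docid in (17, 18, 42, 97)]

prefix, remainders = pack_block(keys)
assert unpack_block(prefix, remainders) == keys
print(block_sizes(keys))
```

The more keys in a block share a long prefix, the bigger the saving, which
is why tables with lots of small entries would benefit most.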
Cheers,
Olly