I came across this benchmark between Xapian & Solr: http://www.anur.ag/blog/2009/03/xapian-and-solr/ According to the benchmark, a doc set that took Solr 34 min to index took Xapian 7 hours. Solr's index is also much smaller - 2.5GB to Xapian's 8.9GB. I'm new to Xapian. Just wondering if results like these are typical? Is indexing speed & size a known issue in Xapian? Or is there some other explanation for the big difference between the Solr & Xapian results?
Michel Pelletier
2009-Apr-17 21:39 UTC
[Xapian-discuss] Indexing speed benchmark - Xapian, Solr
Without being able to look at the code this person wrote to produce the benchmark, it's difficult for us to say.

Recently I was bulk indexing into Xapian and ran out of memory. This was not Xapian's fault: I had an obvious and stupid bug in my code that prevented Python's garbage collector from collecting already-indexed objects. This author may well have run into a similar problem without knowing it, or done something clearly inefficient, like flushing after every single added document. Without code, we'll never know.

On my year-old MacBook Pro laptop, I can bulk index about 90 employment description documents ("job ads") per second, taking about 280 seconds to index 25 thousand documents. These documents are coming out of a relational database and into Xapian. Those 25K documents, which include many terms and values and full document data, take up about 52MB of disk. During the import, the resident process memory of the import script never goes over 60MB.

I agree with the poster that searching Xapian is very fast. :)

-Mike
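To illustrate the two pitfalls mentioned above (holding on to already-indexed objects, and flushing after every document), here is a minimal sketch of batched indexing. It is not the benchmark author's code or Mike's import script. The helper name index_in_batches and the CountingDB stand-in are hypothetical; the only assumption about the database object is that it offers add_document() and commit() methods, as xapian.WritableDatabase does in current releases (older releases called the latter flush()).

```python
def index_in_batches(db, documents, commit_every=10000):
    """Add documents to db, committing once per batch instead of once per document.

    db is any object with add_document() and commit() methods (assumption).
    documents can be a generator, so indexed items are not kept alive here.
    Returns a tuple (documents_added, commits_made).
    """
    if commit_every < 1:
        raise ValueError("commit_every must be at least 1")
    added = 0
    commits = 0
    for doc in documents:
        db.add_document(doc)
        added += 1
        if added % commit_every == 0:
            db.commit()
            commits += 1
    # Commit whatever is left over from the last partial batch.
    if added % commit_every != 0:
        db.commit()
        commits += 1
    return added, commits


class CountingDB:
    """Hypothetical stand-in for a writable database; it only counts calls."""

    def __init__(self):
        self.docs = 0
        self.commits = 0

    def add_document(self, doc):
        self.docs += 1

    def commit(self):
        self.commits += 1


fake_db = CountingDB()
job_ads = (f"job ad {i}" for i in range(25000))  # generator, not a list
result = index_in_batches(fake_db, job_ads, commit_every=10000)
print(result)
```

With 25,000 documents and a batch size of 10,000, the stand-in sees three commits rather than 25,000. Feeding the function a generator means nothing in this loop keeps references to documents that have already been added.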
On Sun, Apr 12, 2009 at 07:26:03AM -0700, Andy wrote:
> I came across this benchmark between Xapian & Solr:
>
> http://www.anur.ag/blog/2009/03/xapian-and-solr/

Note that this is mis-titled - it is really a benchmark between Xappy and Solr, though I don't know how much difference that makes. Richard can probably comment more usefully.

It's good that it actually says what versions were used (many benchmarks seem to fail to), but it's a shame that the benchmark code itself isn't available - an experiment you can't independently reproduce isn't really scientifically valid.

> According to the benchmark, a doc set that took Solr 34 min to index
> took Xapian 7 hours. Solr's index is also much smaller - 2.5GB to
> Xapian's 8.9GB.

That's not a fair size comparison. 2.5GB was the "optimized index size" for Solr. The comparable figure for Xapian is the compacted size, which was 6.5GB.

> I'm new to Xapian. Just wondering if results like these are typical?
> Is indexing speed & size a known issue in Xapian? Or is there some
> other explanation for the big difference between the Solr & Xapian
> results?

Regarding the indexing time, by default Xapian auto-commits every 10000 documents, which is pretty conservative on modern hardware. The article doesn't mention tuning this (by setting XAPIAN_FLUSH_THRESHOLD) so I assume he didn't. If you have plenty of RAM, increasing that will speed up indexing a lot. I'd imagine on the hardware described you could index all million documents in one go, especially since they are truncated to 2000 characters, which is really short. And if you index in one go, the database shouldn't need compacting either.

Ideally the flush size should probably adjust itself, but nobody has done any work on that so far. But it's true that more effort has been put into search speed than indexing speed so far, so there's likely to be a lot of potential for making indexing faster.
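As a rough sketch of the tuning described above, the threshold can be raised from inside a Python indexing script by setting the environment variable before the database is opened. The helper name set_flush_threshold is hypothetical, and the value of one million simply mirrors the "index all million documents in one go" suggestion; whether that fits in RAM on a given machine is an assumption to check, not a recommendation.

```python
import os


def set_flush_threshold(num_docs):
    """Set XAPIAN_FLUSH_THRESHOLD for the current process.

    Call this before opening the writable database, on the assumption
    that the threshold is read when the database is opened.
    """
    if num_docs < 1:
        raise ValueError("num_docs must be at least 1")
    os.environ["XAPIAN_FLUSH_THRESHOLD"] = str(num_docs)


# Ask for a single auto-commit per million documents instead of one per 10000.
set_flush_threshold(1000000)
print(os.environ["XAPIAN_FLUSH_THRESHOLD"])

# The database would be opened only after this point, for example:
#   import xapian
#   db = xapian.WritableDatabase("/path/to/db", xapian.DB_CREATE_OR_OPEN)
```

The same effect can be had from the shell by exporting the variable before starting the indexer, which avoids touching the script at all.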
Database size is something we have been working on a bit, and the new chert backend which will debut shortly in the 1.1.x development series will give smaller databases (especially the postlist table). It'll still be larger than Solr in this case though.

If I understand correctly how Lucene handles document deletion, one big difference between Lucene and Xapian is that Xapian stores the list of terms indexed by each document which allows it to perform a "perfect" delete, while Lucene doesn't store this information and can only flag the document as deleted which means that the stats won't get updated to reflect this change right away:

http://wiki.apache.org/lucene-java/LuceneFAQ#head-9475b5b51f7ca022e03dbd94cb82b4a6c02e3675

  Once a document is deleted it will not appear in TermDocs nor TermPositions enumerations, nor any search results. Attempts to load the document will result in an exception. The presence of this document may still be reflected in the docFreq statistics, and thus alter search scores, though this will be corrected eventually as segments containing deletions are merged.

While having stale stats is not ideal, the size of the termlist table is quite a price to pay for "perfect" deletion if you don't need it for other reasons, and in some situations you never need to delete documents anyway.

We're intending to allow the termlist table to be optional, probably during the 1.1 development series:

http://trac.xapian.org/ticket/181

Once that's done, allowing "imperfect" deletion would be fairly easy.

To give an idea how much difference that would make, Gmane's index (running on the new chert backend) is 130GB of which the termlist table is 62GB. Gmane doesn't currently index positional data - if it did I guess the database would be roughly twice as large, but that's still about a 25% space saving if the termlist table were removed.

Cheers,
Olly
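The Gmane estimate above can be checked with a few lines of arithmetic. This is only a worked restatement of the figures in the message; the factor of two for positional data is Olly's rough guess, and the variable names are made up for the illustration.

```python
# Figures quoted in the message (in GB).
index_gb = 130.0      # Gmane's index on the chert backend
termlist_gb = 62.0    # size of the termlist table within it

# Share of the current index (no positional data) taken by the termlist table.
share_without_positions = termlist_gb / index_gb

# Assumption from the message: with positional data the database would be
# roughly twice as large, while the termlist table stays the same size.
index_with_positions_gb = 2 * index_gb
share_with_positions = termlist_gb / index_with_positions_gb

print(f"termlist share today: {share_without_positions:.0%}")
print(f"termlist share with positional data: {share_with_positions:.0%}")
```

So dropping the termlist table would save a little under half of the current index, and a little under a quarter of the hypothetical index with positions, which is where the "about 25%" figure comes from.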