I'm new to Xapian and need some help; many thanks if anyone replies.

I did a release build of xapian-core-1.0.7 with VS2008 using Charlie Hull's makefiles.

I'm trying to test-index my dataset: some 200'000 docs, each document being (on average) 50 bytes long and having 6 words.

I tried (a) not using a stemmer, (b) calling commit_transaction() every 50/100/etc. docs, (c) not using transactions at all, but in all scenarios indexing runs at ~10 docs/sec, or about 500 bytes per second.

This should probably be ~400 times faster, so I'm clearly doing something wrong. Can anyone give me a hint or direct me to a source on the net to do some reading?

Regards
Celto
cel tix44 wrote:
> I'm new to Xapian & need some help, many thanks if anyone replies.
>
> I did a release build from xapian-core-1.0.7 with VS2008 by using
> Charlie Hull's makefiles.
>
> I'm trying to test-index my dataset -- some 200'000 docs, each
> document being (on average) 50 bytes long and having 6 words.
>
> I tried (a) not to use stemmer, (b) commit_transaction() on every
> 50/100/etc. docs, (c) not to use transactions at all -- but in all
> scenarios indexing goes at ~10 doc/sec or 500 bytes per second.
>
> This should probably be ~400 times faster, I'm clearly doing something
> wrong. Can anyone give me a hint or direct me to a source on the net
> to do some reading?

If you could let us know the platform you're using, and how you're accessing Xapian (which bindings, for example, or directly using C/C++?), and even post the code you're using for your indexer, that would help hugely.

Cheers
Charlie
On Thu, Aug 21, 2008 at 07:17:00PM +1000, cel tix44 wrote:
> I'm trying to test-index my dataset -- some 200'000 docs, each
> document being (on average) 50 bytes long and having 6 words.
>
> I tried (a) not to use stemmer, (b) commit_transaction() on every
> 50/100/etc. docs, (c) not to use transactions at all -- but in all
> scenarios indexing goes at ~10 doc/sec or 500 bytes per second.

Forcing flushes more frequently will be slower, not faster. Assuming you have a decent amount of memory, you want to flush changes in *larger* batches than the default.

To do this, set XAPIAN_FLUSH_THRESHOLD in the environment. The default is 10000. With such short documents, I suspect you could index all 200000 in a single batch.

I've not seen any performance studies of Xapian on Windows, and I'm not aware of any large deployments, so it is possible that the Windows VM subsystem just sucks badly for Xapian's usage patterns.

If you're still struggling to get good performance after setting XAPIAN_FLUSH_THRESHOLD, I'd suggest trying the same code on a similar spec box running Linux or similar to see how it compares. If there's a problem here it might be possible to improve things.

Cheers,
Olly
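
For reference, a minimal sketch of setting XAPIAN_FLUSH_THRESHOLD from inside a C++ indexer instead of in the shell (at a Windows command prompt, `set XAPIAN_FLUSH_THRESHOLD=200000` before starting the indexer has the same intent). The helper names `set_flush_threshold` and `current_flush_threshold` are made up for illustration and are not part of Xapian. The sketch assumes that Xapian reads the variable when the writable database is opened, so the call has to come before that, and that on Windows the indexer and the Xapian library share the same C runtime; neither assumption is confirmed in this thread.

```cpp
#include <stdlib.h>   // getenv, strtoul, setenv (POSIX) / _putenv_s (MSVC)
#include <sstream>
#include <string>

// Hypothetical helper, not a Xapian API: sets XAPIAN_FLUSH_THRESHOLD for the
// current process. Returns true if the environment was updated.
bool set_flush_threshold(unsigned long docs_per_flush) {
    std::ostringstream os;
    os << docs_per_flush;
    const std::string value = os.str();
#ifdef _WIN32
    return _putenv_s("XAPIAN_FLUSH_THRESHOLD", value.c_str()) == 0;
#else
    return setenv("XAPIAN_FLUSH_THRESHOLD", value.c_str(), 1) == 0;
#endif
}

// Hypothetical helper: reports the value currently in the environment, or
// 10000 (the default mentioned above) if the variable is unset.
unsigned long current_flush_threshold() {
    const char *p = getenv("XAPIAN_FLUSH_THRESHOLD");
    return p ? strtoul(p, NULL, 10) : 10000UL;
}
```

In an indexer, `set_flush_threshold(200000)` would be called once at startup, before the `Xapian::WritableDatabase` is constructed, and any explicit per-batch commits or flushes would be dropped so that the larger threshold actually takes effect.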