Hello,

In the series of stupid questions comes sequel #5 or so: I was wondering
about indexing and what the best practices are.

Right now I fetch items from the database, add them to the index, with
terms etc., and commit every 1000 documents.

Does this sound reasonable, or should I wait until the end and commit
then, or something else altogether?

What I think I am seeing is that it takes longer and longer for each
commit, but that might be something else altogether?

- andreas
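[The commit-every-N loop described above can be sketched as follows. This is a minimal illustration of the batching logic only, using counting stubs in place of a real Xapian WritableDatabase; the function and parameter names are made up for the example.]

```python
def index_in_batches(docs, add_document, commit, batch_size=1000):
    """Add documents one by one, committing every batch_size documents,
    then commit whatever remains at the end."""
    pending = 0
    for doc in docs:
        add_document(doc)
        pending += 1
        if pending >= batch_size:
            commit()
            pending = 0
    if pending:
        commit()  # flush the final partial batch

# Demonstration with counting stubs instead of a real database:
commits = []
index_in_batches(range(2500),
                 add_document=lambda doc: None,
                 commit=lambda: commits.append(1),
                 batch_size=1000)
print(len(commits))  # 3 commits: after 1000, after 2000, and the final 500
```

With a real index, `add_document` and `commit` would be the corresponding database calls; raising `batch_size` trades memory for fewer, larger flushes, which is the trade-off discussed in the replies below.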
On Mon, Apr 16, 2007 at 04:19:43PM +0200, Andreas Marienborg wrote:

> Right now I fetch items from the database, add them to the index,
> with terms etc., and commit every 1000 documents.
>
> Does this sound reasonable, or should I wait until the end and
> commit then, or something else altogether?

Unless memory is tight, you are likely to get better throughput by
flushing larger batches less often. Batches of a million or even more
can make sense for some applications.

> What I think I am seeing is that it takes longer and longer for each
> commit, but that might be something else altogether?

It will generally take longer to flush a batch to a larger database.

If you're trying to build a really large database from scratch, or to
add a large batch to an existing database, and you don't care about
having partial results available for searching, it is likely to be
fastest to index the new documents into one or more new databases, and
then use xapian-compact to merge the existing database and all the new
ones into a replacement for the existing database.

Cheers,
    Olly
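[The merge-then-swap approach Olly describes might look roughly like this on the command line. This is a sketch only: the database paths are illustrative, and it assumes the xapian-compact tool from xapian-core is on the PATH and that searching can be paused for the final swap.]

```shell
# Index the new documents into one or more fresh databases first
# (with your usual indexer), then merge everything into a replacement:
xapian-compact existing-db new-db-1 new-db-2 merged-db

# Swap the merged database into place (paths are made up):
mv existing-db existing-db.old
mv merged-db existing-db
```

Because the merge writes a brand-new database, the existing one stays searchable until the final rename.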
Hi,

On Mon, Apr 16, 2007 at 04:19:43PM +0200, Andreas Marienborg wrote:

> Right now I fetch items from the database, add them to the index,
> with terms etc., and commit every 1000 documents.

Apart from creating the index in Xapian, you should also keep in mind
how fast you can get your data out of the database.

The largest SQL database I use consists of 2.8 million bibliographic
entries (only the OPAC of the University Library of Cologne, without
institutes etc.). To actually build the index I use two tables in the
database - one with the terms to index and another with corresponding
serialized data structures consisting of author/title/year etc. I use
MySQL 4.1.x.

Unfortunately MySQL gets quite slow when performing the select over
these two tables - especially after the 1 millionth dataset fetched. I
benchmarked the performance and measured the time to index 1000
datasets. From the 1st to the 200,000th set, the time per 1000 sets
grew from 0.8 seconds to around 3.6 seconds. The time then kept
growing - until after 2 million sets it was around 59 seconds. I first
suspected Xapian and fine-tuned XAPIAN_FLUSH_THRESHOLD, but it soon
became clear that MySQL was responsible.

As an alternative way to index (in my setup), I indexed from two flat
files containing the same data as in the MySQL database. Using flat
files was quite a performance boost: every 1000 sets took around 0.8 to
2.0 seconds to index across the whole 2.8 million bibliographic
records.

Just my 0.02 EUR.

Regards,

Oliver

--
!- Oliver Flimm - Cologne/Germany | flimm@sigtrap.de | http://www.sigtrap.de/ -!
! The Ten Commandments have 279 words; the American Declaration of            !
! Independence has 300 words. The EU regulation on the import of caramel      !
! candies has 25,911 words.                                                   !
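[The flat-file approach Oliver describes can be sketched like this: two parallel text files, one row of terms and one row of serialized data per record, streamed sequentially instead of paging through an SQL result set. The file layout (tab-separated terms, one record per line) and all names here are assumptions for the example.]

```python
import os
import tempfile

def iter_records(terms_path, data_path):
    # Stream two parallel flat files line by line: row i of the terms
    # file pairs with row i of the data file.  Sequential reads stay
    # fast for the whole file, unlike a SELECT that slows down deep
    # into a large result set.
    with open(terms_path) as terms_f, open(data_path) as data_f:
        for terms_line, data_line in zip(terms_f, data_f):
            yield terms_line.rstrip("\n").split("\t"), data_line.rstrip("\n")

# Tiny demonstration with two temporary files:
with tempfile.TemporaryDirectory() as tmp:
    terms_path = os.path.join(tmp, "terms.txt")
    data_path = os.path.join(tmp, "data.txt")
    with open(terms_path, "w") as f:
        f.write("goethe\tfaust\nschiller\traeuber\n")
    with open(data_path, "w") as f:
        f.write("record-1\nrecord-2\n")
    records = list(iter_records(terms_path, data_path))

print(records[0])  # (['goethe', 'faust'], 'record-1')
```

In a real indexing run, each yielded pair would feed one document into the index (terms added as terms, the data line stored as the document data), inside whatever batched-commit loop is used.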