Hello,

In the series of stupid questions comes sequel #5 or so: I was wondering
about indexing and what the best practices are.

Right now I fetch items from the database, add them to the index, with
terms etc., and commit every 1000 documents.

Does this sound reasonable, or should I wait until the end and commit
then, or something else altogether?

What I think I am seeing is that it takes longer and longer for each
commit, but that might be something else altogether?

- andreas
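[The commit-every-N loop described above can be sketched as follows. This is a minimal illustration of the batching logic only, using counting stubs in place of a real Xapian WritableDatabase; the function and parameter names are made up for the example.]

```python
def index_in_batches(docs, add_document, commit, batch_size=1000):
    """Add documents one by one, committing every batch_size documents,
    then commit whatever remains at the end."""
    pending = 0
    for doc in docs:
        add_document(doc)
        pending += 1
        if pending >= batch_size:
            commit()
            pending = 0
    if pending:
        commit()  # flush the final partial batch

# Demonstration with counting stubs instead of a real database:
commits = []
index_in_batches(range(2500),
                 add_document=lambda doc: None,
                 commit=lambda: commits.append(1),
                 batch_size=1000)
print(len(commits))  # 3 commits: after 1000, after 2000, and the final 500
```

With a real index, `add_document` and `commit` would be the corresponding database calls; raising `batch_size` trades memory for fewer, larger flushes, which is the trade-off discussed in the replies below.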
On Mon, Apr 16, 2007 at 04:19:43PM +0200, Andreas Marienborg wrote:

> Right now I fetch items from the database, add them to the index,
> with terms etc., and commit every 1000 documents.
>
> Does this sound reasonable, or should I wait until the end and
> commit then, or something else altogether?

Unless memory is tight, you are likely to get better throughput by
flushing larger batches less often. Batches of a million or even more
can make sense for some applications.

> What I think I am seeing is that it takes longer and longer for each
> commit, but that might be something else altogether?

It will generally take longer to flush a batch to a larger database.

If you're trying to build a really large database from scratch, or to
add a large batch to an existing database, and you don't care about
having partial results available for searching, it is likely to be
fastest to index the new documents into one or more new databases, and
then use xapian-compact to merge the existing database and all the new
ones into a replacement for the existing database.

Cheers,
    Olly
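[The merge-then-swap approach Olly describes might look roughly like this on the command line. This is a sketch only: the database paths are illustrative, and it assumes the xapian-compact tool from xapian-core is on the PATH and that searching can be paused for the final swap.]

```shell
# Index the new documents into one or more fresh databases first
# (with your usual indexer), then merge everything into a replacement:
xapian-compact existing-db new-db-1 new-db-2 merged-db

# Swap the merged database into place (paths are made up):
mv existing-db existing-db.old
mv merged-db existing-db
```

Because the merge writes a brand-new database, the existing one stays searchable until the final rename.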
Hi,

On Mon, Apr 16, 2007 at 04:19:43PM +0200, Andreas Marienborg wrote:

> Right now I fetch items from the database, add them to the index,
> with terms etc., and commit every 1000 documents.

Apart from creating the index in Xapian, you should also keep in mind
how fast you can get your data out of the database.

The largest SQL database I use consists of 2.8 million bibliographic
entries (only the OPAC of the University Library of Cologne, without
institutes etc.). To actually build the index I use two tables in the
database - one with the terms to index and another with corresponding
serialized data structures consisting of author/title/year etc. I use
MySQL 4.1.x.

Unfortunately MySQL gets quite slow when performing the select over
these two tables - especially after the 1 millionth dataset fetched. I
benchmarked the performance and measured the time to index 1000
datasets. From the 1st to the 200,000th set, the time per 1000 sets
grew from 0.8 seconds to around 3.6 seconds. The time then kept
growing - until after 2 million sets it was around 59 seconds. I first
suspected Xapian and fine-tuned XAPIAN_FLUSH_THRESHOLD, but it soon
became clear that MySQL was responsible.

As an alternative way to index (in my setup), I indexed from two flat
files containing the same data as in the MySQL database. Using flat
files was quite a performance boost: every 1000 sets took around 0.8 to
2.0 seconds to index across the whole 2.8 million bibliographic
records.

Just my 0.02 EUR.

Regards,

Oliver

--
!- Oliver Flimm - Cologne/Germany | flimm@sigtrap.de | http://www.sigtrap.de/ -!
! The Ten Commandments have 279 words; the American Declaration of            !
! Independence has 300 words. The EU regulation on the import of caramel      !
! candies has 25,911 words.                                                   !
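[The flat-file approach Oliver describes can be sketched like this: two parallel text files, one row of terms and one row of serialized data per record, streamed sequentially instead of paging through an SQL result set. The file layout (tab-separated terms, one record per line) and all names here are assumptions for the example.]

```python
import os
import tempfile

def iter_records(terms_path, data_path):
    # Stream two parallel flat files line by line: row i of the terms
    # file pairs with row i of the data file.  Sequential reads stay
    # fast for the whole file, unlike a SELECT that slows down deep
    # into a large result set.
    with open(terms_path) as terms_f, open(data_path) as data_f:
        for terms_line, data_line in zip(terms_f, data_f):
            yield terms_line.rstrip("\n").split("\t"), data_line.rstrip("\n")

# Tiny demonstration with two temporary files:
with tempfile.TemporaryDirectory() as tmp:
    terms_path = os.path.join(tmp, "terms.txt")
    data_path = os.path.join(tmp, "data.txt")
    with open(terms_path, "w") as f:
        f.write("goethe\tfaust\nschiller\traeuber\n")
    with open(data_path, "w") as f:
        f.write("record-1\nrecord-2\n")
    records = list(iter_records(terms_path, data_path))

print(records[0])  # (['goethe', 'faust'], 'record-1')
```

In a real indexing run, each yielded pair would feed one document into the index (terms added as terms, the data line stored as the document data), inside whatever batched-commit loop is used.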