On Wed, May 04, 2011 at 08:33:46PM +0530, Parth Gupta wrote:
> Types of Files: text files with .txt extension
> Size of the collection: 11400 documents [1.6 GB]
>
> This takes a lot of time to index and indexing for last 20 hrs or so. I am
> using omindex.
>
> I notice that around 2900 docs are indexed very smoothly and suddenly after
> that indexing becomes very sluggish.
>
> I have tried couple of tricks like replacing the index_text() call to
> index_text_without_positions(). I also tried after setting the
> XAPIAN_FLUSH_THRESHOLD to 1500 documents from 10000 default. Above mentioned
> time is after this tricks.

You probably want to *raise* the threshold, not lower it. Bigger
batches are more efficient, provided you have sufficient memory.
For typical size documents, 10000 is fairly conservative on modern
hardware - you should be able to index 11400 documents in a single
batch I'd think.
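
As a rough sketch of raising the threshold from a wrapper script: the
environment variable has to be set before omindex starts, since the
library reads it when the database is opened. The database path, URL
prefix, document directory and the value 20000 below are placeholders
I've made up, not recommendations, and the sketch assumes omindex is on
your PATH.

```python
import os
import subprocess


def omindex_command(db_path, url_prefix, doc_dir):
    """Build an omindex command line (all three arguments are placeholders)."""
    return ["omindex", "--db", db_path, "--url", url_prefix, doc_dir]


def env_with_flush_threshold(threshold):
    """Copy the current environment and set XAPIAN_FLUSH_THRESHOLD in the copy."""
    env = dict(os.environ)
    env["XAPIAN_FLUSH_THRESHOLD"] = str(threshold)
    return env


def run_omindex(db_path, url_prefix, doc_dir, threshold):
    """Run omindex with the given flush threshold; raises if omindex fails."""
    subprocess.run(
        omindex_command(db_path, url_prefix, doc_dir),
        env=env_with_flush_threshold(threshold),
        check=True,
    )


# Hypothetical usage, with a threshold above the 11400 documents in question:
# run_omindex("/var/lib/omega/data/default", "/", "/srv/docs", 20000)
```

Whether a given value fits in memory depends on your documents, so treat
the number as something to experiment with.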

You've told Xapian to commit every 1500 document changes, so at 3000
docs it will be merging postlist changes - that's why there's apparently
a pause at that point. Once the changes are committed, it should go
faster up to 4500 documents, then up to 6000, etc.
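
To make the arithmetic concrete, here's a small sketch that lists where
the automatic commits would fall for a given threshold. It's a
simplified model (it assumes one document change per file and ignores
the final commit when the database is closed), and commit_points is just
a name I've picked for illustration.

```python
def commit_points(total_docs, flush_threshold):
    """Document counts at which an automatic commit would be triggered."""
    return list(range(flush_threshold, total_docs + 1, flush_threshold))


# With the threshold lowered to 1500 there are seven commits along the way.
print(commit_points(11400, 1500))
# [1500, 3000, 4500, 6000, 7500, 9000, 10500]

# With a threshold above the collection size there are none until the end.
print(commit_points(11400, 20000))
# []
```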

If you do need to index in several batches, you can build several
databases, each smaller than your flush threshold. Then you can either
just search these together, or merge them into a single database with
xapian-compact.
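
A minimal sketch of the bookkeeping for that approach: split the file
list into batches below the threshold, index each batch into its own
database by whatever means suits you, then merge with xapian-compact,
which takes the source databases followed by the destination. The file
names, database names and batch size here are invented for the example.

```python
def split_into_batches(paths, batch_size):
    """Split a list of paths into consecutive batches of at most batch_size."""
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]


def compact_command(source_dbs, dest_db):
    """Build a xapian-compact command line: sources first, destination last."""
    return ["xapian-compact", *source_dbs, dest_db]


# Hypothetical file list standing in for the 11400 .txt files.
files = ["doc%05d.txt" % n for n in range(11400)]
batches = split_into_batches(files, 5000)
print([len(b) for b in batches])  # [5000, 5000, 1400]

# One database per batch (names are placeholders), merged into "merged.db".
batch_dbs = ["batch%d.db" % i for i in range(len(batches))]
print(compact_command(batch_dbs, "merged.db"))
```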

Cheers,
Olly