Kevin Duraj
2007-Feb-07 21:21 UTC
[Xapian-discuss] My new record: Indexing 20 millions docs = 79m9.378s
Gentoo Linux 2.6
8 AMD Opteron 64-bit Processors
32GB Memory
--------------------------------------------------------------------------------

Environment:
------------------
XAPIAN_FLUSH_THRESHOLD=21000000
XAPIAN_FLUSH_THRESHOLD_LENGTH=16000000
XAPIAN_PREFER_FLINT=True

Indexing 20 million documents: --stemmer=none
-------------------------------------------
real    79m9.378s
user    77m28.696s
sys     1m36.654s

# delve /home/kevin/index
---------------------------------------
number of documents = 19999995
average document length = 8.18631

PS: In my scenario, after 25 million records the indexing slows down
significantly (2x-4x), and I do not know why. Could it be because the
B-tree becomes very complex?

- Kevin Duraj
Olly Betts
2007-Feb-09 06:07 UTC
[Xapian-discuss] My new record: Indexing 20 millions docs = 79m9.378s
On Wed, Feb 07, 2007 at 01:21:06PM -0800, Kevin Duraj wrote:
> Gentoo Linux 2.6
> 8 AMD Opteron 64-bit Processors
> 32GB Memory
> --------------------------------------------------------------------------------
>
> Environment:
> ------------------
> XAPIAN_FLUSH_THRESHOLD=21000000
> XAPIAN_FLUSH_THRESHOLD_LENGTH=16000000

Setting XAPIAN_FLUSH_THRESHOLD_LENGTH no longer does anything (it was
removed in September 2004).

> PS: In my scenario after 25 million records the indexing significantly
> slows down (2x-4x) I do not know why? Could it be because of the
> B-Tree become very complex?

That seems unlikely - B-tree complexity grows logarithmically. It's
probably a cache effect: as the working set of a process grows,
performance can suddenly get worse when it just fails to fit in the
available CPU cache. In your case, I suspect it's some key subset of
the working set which is the issue.

When indexing, do you only call WritableDatabase::add_document()? If
so, we should be able to index significantly faster than this by
buffering appended changes in a more compact way.

Cheers,
    Olly
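To illustrate the idea behind "buffering appended changes in a more compact way": when only add_document() is used, each term's pending postings arrive in strictly increasing docid order, so they can be kept as a flat delta-encoded byte stream instead of a general map of modifications. The sketch below is a toy standalone illustration of that technique, not Xapian's actual code; the class and method names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy buffer for pending posting-list changes in the append-only case:
// because docids per term only ever increase, each posting is stored as
// a varint delta from the previous docid plus its wdf, which is far
// more compact than general per-term modification entries.
class AppendOnlyPostingBuffer {
    std::map<std::string, std::vector<uint8_t>> buf_;
    std::map<std::string, uint32_t> last_docid_;

    static void encode_varint(std::vector<uint8_t>& out, uint32_t v) {
        while (v >= 0x80) {
            out.push_back(uint8_t(v & 0x7f) | 0x80);
            v >>= 7;
        }
        out.push_back(uint8_t(v));
    }

public:
    // Record that `term` occurs in `docid` with within-document frequency `wdf`.
    void add_posting(const std::string& term, uint32_t docid, uint32_t wdf) {
        uint32_t& last = last_docid_[term];    // value-initialised to 0
        assert(docid > last);                  // append-only: docids must increase
        std::vector<uint8_t>& postings = buf_[term];
        encode_varint(postings, docid - last); // delta-encode the docid
        encode_varint(postings, wdf);
        last = docid;
    }

    // Decode the buffered postings for `term` back into (docid, wdf) pairs.
    std::vector<std::pair<uint32_t, uint32_t>> postings(const std::string& term) const {
        std::vector<std::pair<uint32_t, uint32_t>> out;
        auto it = buf_.find(term);
        if (it == buf_.end()) return out;
        const std::vector<uint8_t>& bytes = it->second;
        size_t pos = 0;
        uint32_t docid = 0;
        auto decode = [&]() {
            uint32_t v = 0;
            int shift = 0;
            while (bytes[pos] & 0x80) {
                v |= uint32_t(bytes[pos++] & 0x7f) << shift;
                shift += 7;
            }
            v |= uint32_t(bytes[pos++]) << shift;
            return v;
        };
        while (pos < bytes.size()) {
            docid += decode();               // undo the delta encoding
            out.emplace_back(docid, decode());
        }
        return out;
    }
};
```

Once replace_document or delete_document enters the picture this no longer works as-is, since changes are no longer pure appends - which matches the restriction in the prototype patch discussed later in this thread.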
Olly Betts
2007-Feb-12 10:19 UTC
[Xapian-discuss] Re: My new record: Indexing 20 millions docs = 79m9.378s
Kevin Duraj <kevin.softdev@gmail.com> writes:
> - Yes, I did read that XAPIAN_FLUSH_THRESHOLD_LENGTH no longer has any
> effect and was removed; I was just not sure. It was a good decision,
> because I was getting confused about how to balance the number of
> records against the maximum memory used.

I think that really the threshold should be the amount of memory used
to buffer posting list changes, but that's not easy to know as things
currently stand. Ideally the threshold would also tune itself
automatically for good performance by default. We'll get there
eventually...

> - I am building 2 prototypes to measure performance between Lucene .NET
> on Windows and Xapian on Linux. Therefore, for my prototype I am simply
> using scriptindex (/usr/local/bin/scriptindex --stemmer=none
> /home/kevin/index1 indexscript1 $filename) to index 20 million records.
> If Xapian performs better than Lucene, I will write a new search in
> C/C++ and use WritableDatabase::add_document() ... Thank you for the
> suggestion.

Unless you use the "unique" action, scriptindex will just call
add_document anyway, so that's fine as it is.

It might be interesting to see what speedup this patch gives you:

http://oligarchy.co.uk/xapian/patches/xapian-faster-flint-add-document.patch

It implements more compact storage of pending posting list changes from
add_document with flint. Currently replace_document and delete_document
are disabled - it's just a quick prototype which I'm going to test on a
full rebuild of gmane's index, but the gmane machine is still
regenerating the index spool, so I won't be able to test it there for a
few more days.

If this looks promising, we can sort out a better version which doesn't
disable the other methods!

Cheers,
    Olly
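For readers following along: an indexscript maps fields of the input dump to scriptindex actions. A minimal sketch of what a file like indexscript1 might look like is below (the field names here are hypothetical, since Kevin's actual script isn't shown in the thread); input records are `field=value` lines separated by blank lines:

```
id : field boolean=Q
title : field index
text : index
```

The point of Olly's remark is that with a script like this, scriptindex just calls add_document for each record. Adding `unique=Q` to the `id` line would instead make scriptindex replace any existing document with the same Q-prefixed term, which takes a different (and, with the prototype patch, disabled) code path.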