Kevin Duraj
2007-Feb-07 21:21 UTC
[Xapian-discuss] My new record: Indexing 20 millions docs = 79m9.378s
Gentoo Linux 2.6
8 AMD Opteron 64-bit Processors
32GB Memory
--------------------------------------------------------------------------------

Environment:
------------------
XAPIAN_FLUSH_THRESHOLD=21000000
XAPIAN_FLUSH_THRESHOLD_LENGTH=16000000
XAPIAN_PREFER_FLINT=True

Indexing 20 million documents: --stemmer=none
-------------------------------------------
real    79m9.378s
user    77m28.696s
sys     1m36.654s

# delve /home/kevin/index
---------------------------------------
number of documents = 19999995
average document length = 8.18631

PS: In my scenario, after 25 million records the indexing slows down
significantly (2x-4x), and I do not know why. Could it be because the
B-tree becomes very complex?

- Kevin Duraj
Olly Betts
2007-Feb-09 06:07 UTC
[Xapian-discuss] My new record: Indexing 20 millions docs = 79m9.378s
On Wed, Feb 07, 2007 at 01:21:06PM -0800, Kevin Duraj wrote:
> Gentoo Linux 2.6
> 8 AMD Opteron 64-bit Processors
> 32GB Memory
> --------------------------------------------------------------------------------
>
> Environment:
> ------------------
> XAPIAN_FLUSH_THRESHOLD=21000000
> XAPIAN_FLUSH_THRESHOLD_LENGTH=16000000

Setting XAPIAN_FLUSH_THRESHOLD_LENGTH no longer does anything (it was
removed in September 2004).

> PS: In my scenario after 25 million records the indexing significantly
> slows down (2x-4x) I do not know why? Could it be because of the
> B-Tree become very complex?

That seems unlikely - B-tree complexity grows logarithmically. It's
probably a cache effect: as the working set of a process grows,
performance can suddenly get worse when it just fails to fit in the
available CPU cache. In your case, I suspect it's some key subset of
the working set which is the issue.

When indexing, do you only call WritableDatabase::add_document()? If
so, we should be able to index significantly faster than this by
buffering appended changes in a more compact way.

Cheers,
    Olly
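To illustrate the idea behind "buffering appended changes in a more compact way": when only add_document() is used, each term's pending postings arrive in strictly increasing docid order, so they can be kept as a flat delta-encoded byte stream instead of a general map of modifications. The sketch below is a toy standalone illustration of that technique, not Xapian's actual code; the class and method names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy buffer for pending posting-list changes in the append-only case:
// because docids per term only ever increase, each posting is stored as
// a varint delta from the previous docid plus its wdf, which is far
// more compact than general per-term modification entries.
class AppendOnlyPostingBuffer {
    std::map<std::string, std::vector<uint8_t>> buf_;
    std::map<std::string, uint32_t> last_docid_;

    static void encode_varint(std::vector<uint8_t>& out, uint32_t v) {
        while (v >= 0x80) {
            out.push_back(uint8_t(v & 0x7f) | 0x80);
            v >>= 7;
        }
        out.push_back(uint8_t(v));
    }

public:
    // Record that `term` occurs in `docid` with within-document frequency `wdf`.
    void add_posting(const std::string& term, uint32_t docid, uint32_t wdf) {
        uint32_t& last = last_docid_[term];    // value-initialised to 0
        assert(docid > last);                  // append-only: docids must increase
        std::vector<uint8_t>& postings = buf_[term];
        encode_varint(postings, docid - last); // delta-encode the docid
        encode_varint(postings, wdf);
        last = docid;
    }

    // Decode the buffered postings for `term` back into (docid, wdf) pairs.
    std::vector<std::pair<uint32_t, uint32_t>> postings(const std::string& term) const {
        std::vector<std::pair<uint32_t, uint32_t>> out;
        auto it = buf_.find(term);
        if (it == buf_.end()) return out;
        const std::vector<uint8_t>& bytes = it->second;
        size_t pos = 0;
        uint32_t docid = 0;
        auto decode = [&]() {
            uint32_t v = 0;
            int shift = 0;
            while (bytes[pos] & 0x80) {
                v |= uint32_t(bytes[pos++] & 0x7f) << shift;
                shift += 7;
            }
            v |= uint32_t(bytes[pos++]) << shift;
            return v;
        };
        while (pos < bytes.size()) {
            docid += decode();               // undo the delta encoding
            out.emplace_back(docid, decode());
        }
        return out;
    }
};
```

Once replace_document or delete_document enters the picture this no longer works as-is, since changes are no longer pure appends - which matches the restriction in the prototype patch discussed later in this thread.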
Olly Betts
2007-Feb-12 10:19 UTC
[Xapian-discuss] Re: My new record: Indexing 20 millions docs = 79m9.378s
Kevin Duraj <kevin.softdev@gmail.com> writes:
> - Yes, I did read that XAPIAN_FLUSH_THRESHOLD_LENGTH no longer has any
> effect and was removed; I was just not sure. It was a good decision,
> because I was getting confused about how to balance the number of
> records against the maximum memory used.

I think that really the threshold should be the amount of memory used
to buffer posting list changes, but that's not easy to know as things
currently stand. Ideally the threshold would also tune itself
automatically for good performance by default. We'll get there
eventually...

> - I am building 2 prototypes to measure performance between Lucene .NET
> on Windows and Xapian on Linux. Therefore, for my prototype I am simply
> using scriptindex (/usr/local/bin/scriptindex --stemmer=none
> /home/kevin/index1 indexscript1 $filename) to index 20 million records.
> If Xapian performs better than Lucene, I will write a new search in
> C/C++ and use WritableDatabase::add_document() ... Thank you for the
> suggestion.

Unless you use the "unique" action, scriptindex will just call
add_document anyway, so that's fine as it is.

It might be interesting to see what speedup this patch gives you:

http://oligarchy.co.uk/xapian/patches/xapian-faster-flint-add-document.patch

It implements more compact storage of pending posting list changes from
add_document with flint. Currently replace_document and delete_document
are disabled - it's just a quick prototype which I'm going to test on a
full rebuild of gmane's index, but the gmane machine is still
regenerating the index spool, so I won't be able to test it there for a
few more days.

If this looks promising, we can sort out a better version which doesn't
disable the other methods!

Cheers,
    Olly
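For readers following along: an indexscript maps fields of the input dump to scriptindex actions. A minimal sketch of what a file like indexscript1 might look like is below (the field names here are hypothetical, since Kevin's actual script isn't shown in the thread); input records are `field=value` lines separated by blank lines:

```
id : field boolean=Q
title : field index
text : index
```

The point of Olly's remark is that with a script like this, scriptindex just calls add_document for each record. Adding `unique=Q` to the `id` line would instead make scriptindex replace any existing document with the same Q-prefixed term, which takes a different (and, with the prototype patch, disabled) code path.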