Jean-Francois Dockes
2016-Apr-10 14:47 UTC
Xapian 1.3.5 snapshot performance and index size
Hi,

I ran some tests with Recoll to compare Xapian 1.2.22 and 1.3.5 performance.

I mostly used two relatively small document sets (realistic/typical Recoll data subsets). The first set is a 2.2 GB mbox folder with approximately 56K messages in 275 files, producing approximately 64K documents (because of attachments). The second set is an 11 GB folder containing 5300 PDF files (random PDFs harvested from Google).

The machine has an Intel Core i7-4770T CPU @ 2.50GHz (4 cores + hyperthreading), 8 GB of memory and SSD storage.

I repeated most tests multiple times and give the best times here (the variation was not very significant anyway).

PDF directory:
--------------

Xapian 1.2.22
  Index size: 399 MB
  real 3m15s   user 22m19s   sys 1m9s

Xapian 1.3.5
  Index size: 614 MB
  real 3m18s   user 22m21s   sys 1m28s

Mail directory:
---------------

Xapian 1.2.22
  Index size: 615 MB
  real 2m20s   user 7m57s    sys 1m34s

  Approximately 2 minutes of CPU time are spent in the actual Xapian thread (which takes xapian::document objects as input and processes them into the index).

Xapian 1.3.5
  Index size: 794 MB
  real 3m47s   user 7m14s    sys 1m59s

  Approximately 2m40s of CPU time are spent in the Xapian thread.

Indexing performance, interpretation:
-------------------------------------

On the PDF directory, the performance of the Xapian thread is masked by the processing of the PDF input. CPU utilization is good (CPU time / clock time is around 7, against 8 possible threads).

On the mail directory, input processing is less significant, the single index update thread is the bottleneck, and the Xapian version makes a difference: Xapian 1.3 is significantly slower. CPU utilization is lower than with PDF input, because the process is often waiting for the Xapian thread, which is almost never waiting for input. The situation is worse with 1.3 than with 1.2, because 1.3 is slower.
I am not sure why there is so much more difference between the Xapian thread time and the wall time for 1.3, but one possible explanation would be more I/O waits.

The increase in index size between 1.2.22 and 1.3.5 is quite significant, around 50%, concentrated in the positions file.

Phrase queries:
---------------

I ran a query on both versions of the mail index after copying the data to a machine with spinning disks. The queries were run just after a reboot; they find 3 documents (not shown):

Xapian 1.2
  time recoll -t -q '"to be or not to be"'
  real 0m5.766s   user 0m0.108s   sys 0m0.600s

Xapian 1.3
  time recoll -t -q '"to be or not to be"'
  real 0m2.178s   user 0m0.072s   sys 0m0.048s

This is a very significant improvement in phrase query time, which would, I imagine, become even more spectacular on a really big index.

Home directory:
---------------

For another, more realistic data point, I indexed my whole home directory (on SSD): 10 GB, 79K files yielding 112K documents. I crashed the machine while trying to purge the cache for the query tests, so the phrase queries are really cold :)

Xapian 1.2
  Index size: 1758228 kB
  Indexing time: real 10m29s   user 38m40s   sys 13m22s
  Cold phrase query:
    time recoll -t -q '"with a little help from my friends"'
    real 0m0.441s   user 0m0.093s   sys 0m0.028s

Xapian 1.3
  Index size: 2701448 kB
  Indexing time: real 15m2s    user 36m53s   sys 13m24s
  Cold phrase query:
    time recoll -t -q '"with a little help from my friends"'
    real 0m0.175s   user 0m0.103s   sys 0m0.019s

On SSD, phrase searches are also much faster with 1.3, but this would not be significant for the personal use case (it might be a different matter on a public site running myriads of queries, of course).

My conclusion at this point:
----------------------------

I think that most Recoll users will not notice the slightly slower indexing. Some might notice the 50% index size increase; excessive index size is already a relatively rare, but recurring, complaint.
Unless I did something wrong, of course: I'm actually quite surprised by it.

Of course, having faster phrase searches is a good thing. Maybe I have not run the right tests to show the maximum effect of the new code?

As it is, and still hoping that more 1.3 optimization will improve the situation, I have to wonder whether the price paid for faster phrase searches is not a bit too high, given that these are rather infrequent queries, and that the improvement, while very significant, does not completely solve the issue.

jf
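[Editor's aside on reproducing the cold-cache timings above: on Linux the page cache can normally be dropped without a reboot, which avoids both the reboot and the crash mentioned above. A sketch (needs root; the query is just the example phrase from the tests):]

```shell
# Flush dirty pages to disk, then drop the page cache, dentries and
# inodes, so the next query really starts cold.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches

# Time the cold query (recoll -t runs a query from the command line).
time recoll -t -q '"to be or not to be"'
```

Note that this still leaves the recoll binary itself to be paged in on first use, which matters for sub-second timings.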
On Sun, Apr 10, 2016 at 04:47:01PM +0200, Jean-Francois Dockes wrote:
> Some might notice the 50% index size increase; excessive index size is
> already a relatively rare, but recurring, complaint. Unless I did
> something wrong, of course: I'm actually quite surprised by it.

Did you try compacting the resulting databases?

Creating a database by calling add_document() repeatedly would have resulted in a close to compact position table with chert, but that's not true with glass (because the position table is no longer sorted primarily by the document id). But if you compact the result, it should be a fair bit smaller with glass than chert.

Creating a database from scratch is the worst case for this (but of course a common one). In general day-to-day use, this effect should be less marked.

> Of course, having faster phrase searches is a good thing. Maybe I have not
> run the right tests to show the maximum effect of the new code?

The cases that motivated these changes were really those taking tens of seconds (or even minutes for the extreme ones), and they were generally sub-second afterwards - 5.8 to 2.1 seconds is at the unimpressive end of the improvements seen. One particular issue with "to be or not to be" is that we don't currently try to reuse the postlist or positional data for "to" and "be", so it has to decode them twice.

> As it is, and still hoping that more 1.3 optimization will improve the
> situation, I have to wonder whether the price paid for faster phrase searches
> is not a bit too high, given that these are rather infrequent queries, and

It's difficult to make the call on changes like this, but I do feel that searches taking minutes is completely unacceptable.
How much users use phrase searches varies a lot, but even if it's a small fraction of queries, active users will hit such cases and form the impression that the system is unreliable (and on multi-user systems it affects the speed of other queries, as you can end up with the server bogged down by the long-running searches). It's made worse by users often responding to an apparently stalled search by hitting reload in their browser.

> that the improvement, while very significant, does not completely solve the
> issue.

2.1 seconds is slower than I'd like, but it's at least in the realm of "that took a while" rather than "the computer has hung".

We're closing in on 1.4.0, so there's not scope for much of this to change markedly before then. But I do have plans for internal improvements which should help indexing speed and memory usage, and which should be suitable for 1.4.x.

I'm not sure there's an easy solution to the position table not coming out compact in this case. Supporting a choice of key order is possible, but adds some complexity.

Cheers,
    Olly
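[Editor's note: the key-order point above can be illustrated with a toy model. This is not Xapian's actual key encoding, only the ordering behaviour: chert keyed the position table primarily by document id, so docid-ordered add_document() calls emit keys in already-sorted order and the B-tree comes out nearly compact; glass keys primarily by term, so the same insertion order jumps around the keyspace until a compact pass rewrites the table.]

```python
# Toy illustration of position-table key order (NOT Xapian's real
# key encoding). Documents arrive in docid order, each contributing
# positional entries for the same terms.
inserts = [(docid, term)
           for docid in range(1, 4)
           for term in sorted(["be", "not", "or", "to"])]

# chert-style keys (docid primary): insertion order equals sorted
# order, so initial indexing writes the table almost compactly.
chert_keys = [(docid, term) for docid, term in inserts]
assert chert_keys == sorted(chert_keys)

# glass-style keys (term primary): the very same insertion order is
# scattered across the keyspace, leaving a freshly built table far
# from compact until xapian-compact rewrites it in key order.
glass_keys = [(term, docid) for docid, term in inserts]
assert glass_keys != sorted(glass_keys)
```

The upside of the glass order is that all positional data for one term is contiguous at query time, which is exactly what the faster phrase searches exploit.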
Jean-Francois Dockes
2016-Apr-11 07:54 UTC
Xapian 1.3.5 snapshot performance and index size
Olly Betts writes:
> On Sun, Apr 10, 2016 at 04:47:01PM +0200, Jean-Francois Dockes wrote:
> > Some might notice the 50% index size increase; excessive index size is
> > already a relatively rare, but recurring, complaint. Unless I did
> > something wrong, of course: I'm actually quite surprised by it.
>
> Did you try compacting the resulting databases?
>
> Creating a database by calling add_document() repeatedly would have
> resulted in a close to compact position table with chert, but that's not
> true with glass (because the position table is no longer sorted
> primarily by the document id). But if you compact the result, it should
> be a fair bit smaller with glass than chert.
>
> Creating a database from scratch is the worst case for this (but of course
> a common one). In general day-to-day use, this effect should be less
> marked.

I had not compacted. After compacting, the 1.3 index is indeed smaller than the 1.2 one.

> > Of course, having faster phrase searches is a good thing. Maybe I have not
> > run the right tests to show the maximum effect of the new code?
>
> The cases that motivated these changes were really those taking tens of
> seconds (or even minutes for the extreme ones), and they were generally
> sub-second afterwards - 5.8 to 2.1 seconds is at the unimpressive end
> of the improvements seen. One particular issue with "to be or not to
> be" is that we don't currently try to reuse the postlist or
> positional data for "to" and "be", so it has to decode them twice.
>
> > As it is, and still hoping that more 1.3 optimization will improve the
> > situation, I have to wonder whether the price paid for faster phrase searches
> > is not a bit too high, given that these are rather infrequent queries, and
>
> It's difficult to make the call on changes like this, but I do feel
> that searches taking minutes is completely unacceptable.
> How much users use phrase searches varies a lot, but even if it's a small
> fraction of queries, active users will hit such cases and form the
> impression that the system is unreliable (and on multi-user systems it
> affects the speed of other queries, as you can end up with the server
> bogged down by the long-running searches). It's made worse by users often
> responding to an apparently stalled search by hitting reload in their
> browser.
>
> > that the improvement, while very significant, does not completely solve the
> > issue.
>
> 2.1 seconds is slower than I'd like, but it's at least in the realm of
> "that took a while" rather than "the computer has hung".

My spinning-disk machine was actually "too cold": I should have thought a bit more and run a query on another index first, to get the program's text pages into memory. Done that way, "to be or not to be" goes from 11 s to 0.6 s, and "to be of the" goes from 12 s to 0.9 s. Which is of course brilliant! I think I can drop my plan of indexing compound terms for runs of common words :)

> We're closing in on 1.4.0, so there's not scope for much of this to
> change markedly before then. But I do have plans for internal
> improvements which should help indexing speed and memory usage, and
> which should be suitable for 1.4.x.
>
> I'm not sure there's an easy solution to the position table not coming
> out compact in this case. Supporting a choice of key order is possible,
> but adds some complexity.

The question which remains for me is whether I should run xapian-compact after an initial indexing operation. I guess this depends on the amount of expected updates, and there is no easy answer?

jf
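[Editor's note: for reference, compaction is a one-shot external pass with the xapian-compact tool shipped with xapian-core. A sketch of compacting a freshly built index and swapping it into place; the paths are examples only (Recoll's default index directory is usually ~/.recoll/xapiandb), and the indexer must not be running during the swap:]

```shell
# Compact the freshly built database into a new directory.
xapian-compact ~/.recoll/xapiandb ~/.recoll/xapiandb.compact

# Swap the compacted database into place, keeping the original
# around until the result has been checked.
mv ~/.recoll/xapiandb ~/.recoll/xapiandb.old
mv ~/.recoll/xapiandb.compact ~/.recoll/xapiandb
```

As discussed above, a compacted database will tend to lose that compactness again as updates accumulate, so this mainly pays off when the index is mostly built once and then queried.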