Jean-Francois Dockes
2016-Apr-10 14:47 UTC
Xapian 1.3.5 snapshot performance and index size
Hi,

I ran some tests with Recoll to compare Xapian 1.2.22 and 1.3.5 performance.

I mostly used two relatively small document sets (realistic/typical Recoll data subsets). The first set is a 2.2 GB mbox folder with approximately 56K messages in 275 files, producing approximately 64K documents (because of attachments). The second set is an 11 GB folder containing 5300 PDF files (random PDFs harvested from Google).

The machine has an Intel Core i7-4770T CPU @ 2.50GHz (4 cores + hyperthreading), 8 GB of memory and SSD storage.

I repeated most tests multiple times and give the best times here (the variation was not very significant anyway).

PDF directory:
--------------

Xapian 1.2.22
  Index size: 399 MB
  real 3m15s   user 22m19s   sys 1m9s

Xapian 1.3.5
  Index size: 614 MB
  real 3m18s   user 22m21s   sys 1m28s

Mail directory:
---------------

Xapian 1.2.22
  Index size: 615 MB
  real 2m20s   user 7m57s    sys 1m34s

  Approximately 2 minutes of CPU time are spent in the actual Xapian thread (which takes xapian::document objects as input and processes them into the index).

Xapian 1.3.5
  Index size: 794 MB
  real 3m47s   user 7m14s    sys 1m59s

  Approximately 2m40s of CPU time are spent in the Xapian thread.

Indexing performance, interpretation:
-------------------------------------

On the PDF directory, the performance of the Xapian thread is masked by the processing of the PDF input. CPU utilization is good (CPU time / clock time is around 7, against 8 possible threads).

On the mail directory, input processing is less significant, the single index update thread is the bottleneck, and the Xapian version makes a difference: Xapian 1.3 is significantly slower. CPU utilization is lower than with PDF input, because the process is often waiting for the Xapian thread, which is almost never waiting for input. The situation is worse with 1.3 than with 1.2, because 1.3 is slower.
I am not sure why there is so much more difference between the Xapian thread time and the wall time for 1.3, but one possible explanation would be more I/O waits.

The increase in index size between 1.2.22 and 1.3.5 is quite significant, around 50%, concentrated in the positions file.

Phrase queries:
---------------

I ran a query on both versions of the mail index after copying the data to a machine with spinning disks. The queries were run just after a reboot; they find 3 documents (not shown):

Xapian 1.2
  time recoll -t -q '"to be or not to be"'
  real 0m5.766s   user 0m0.108s   sys 0m0.600s

Xapian 1.3
  time recoll -t -q '"to be or not to be"'
  real 0m2.178s   user 0m0.072s   sys 0m0.048s

This is a very significant improvement in phrase query time, which would, I imagine, become even more spectacular on a really big index.

Home directory:
---------------

For another, more realistic data point, I indexed my whole home directory (on SSD): 10 GB, 79K files yielding 112K documents. I crashed the machine while trying to purge the cache for the query tests, so the phrase queries are really cold :)

Xapian 1.2
  Index size: 1758228 kB
  Indexing time: real 10m29s   user 38m40s   sys 13m22s
  Cold phrase query:
    time recoll -t -q '"with a little help from my friends"'
    real 0m0.441s   user 0m0.093s   sys 0m0.028s

Xapian 1.3
  Index size: 2701448 kB
  Indexing time: real 15m2s    user 36m53s   sys 13m24s
  Cold phrase query:
    time recoll -t -q '"with a little help from my friends"'
    real 0m0.175s   user 0m0.103s   sys 0m0.019s

On SSD, phrase searches are also much faster with 1.3, but this would not be significant for the personal use case (it might be a different matter on a public site running myriads of queries, of course).

My conclusion at this point:
----------------------------

I think that most Recoll users will not notice the slightly slower indexing. Some might notice the 50% index size increase; excessive index size is already a relatively rare, but recurring, complaint.
Unless I did something wrong, of course: I'm actually quite surprised by it.

Of course, having faster phrase searches is a good thing. Maybe I have not run the right tests to show the maximum effect of the new code?

As it is, and still hoping that more 1.3 optimization will improve the situation, I have to wonder whether the price paid for faster phrase searches is not a bit too high, given that these are rather infrequent queries, and that the improvement, while very significant, does not completely solve the issue.

jf
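[Editor's aside on reproducing the cold-cache timings above: on Linux the page cache can normally be dropped without a reboot, which avoids both the reboot and the crash mentioned above. A sketch (needs root; the query is just the example phrase from the tests):]

```shell
# Flush dirty pages to disk, then drop the page cache, dentries and
# inodes, so the next query really starts cold.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches

# Time the cold query (recoll -t runs a query from the command line).
time recoll -t -q '"to be or not to be"'
```

Note that this still leaves the recoll binary itself to be paged in on first use, which matters for sub-second timings.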
On Sun, Apr 10, 2016 at 04:47:01PM +0200, Jean-Francois Dockes wrote:
> Some might notice the 50% index size increase; excessive index size is
> already a relatively rare, but recurring, complaint. Unless I did
> something wrong, of course: I'm actually quite surprised by it.

Did you try compacting the resulting databases?

Creating a database by calling add_document() repeatedly would have resulted in a close to compact position table with chert, but that's not true with glass (because the position table is no longer sorted primarily by the document id). But if you compact the result, it should be a fair bit smaller with glass than chert.

Creating a database from scratch is the worst case for this (but of course a common one). In general day-to-day use, this effect should be less marked.

> Of course, having faster phrase searches is a good thing. Maybe I have not
> run the right tests to show the maximum effect of the new code?

The cases that motivated these changes were really those taking tens of seconds (or even minutes for the extreme ones), and they were generally sub-second afterwards - 5.8 to 2.1 seconds is at the unimpressive end of the improvements seen. One particular issue with "to be or not to be" is that we don't currently try to reuse the postlist or positional data for "to" and "be", so it has to decode them twice.

> As it is, and still hoping that more 1.3 optimization will improve the
> situation, I have to wonder whether the price paid for faster phrase searches
> is not a bit too high, given that these are rather infrequent queries, and

It's difficult to make the call on changes like this, but I do feel that searches taking minutes is completely unacceptable.
How much users use phrase searches varies a lot, but even if it's a small fraction of queries, active users will hit such cases and form the impression that the system is unreliable (and on multi-user systems it affects the speed of other queries, as you can end up with the server bogged down by the long-running searches). It's made worse by users often responding to an apparently stalled search by hitting reload in their browser.

> that the improvement, while very significant, does not completely solve the
> issue.

2.1 seconds is slower than I'd like, but it's at least in the realm of "that took a while" rather than "the computer has hung".

We're closing in on 1.4.0, so there's not scope for much of this to change markedly before then. But I do have plans for internal improvements which should help indexing speed and memory usage, and which should be suitable for 1.4.x.

I'm not sure there's an easy solution to the position table not coming out compact in this case. Supporting a choice of key order is possible, but adds some complexity.

Cheers,
    Olly
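[Editor's note: the key-order point above can be illustrated with a toy model. This is not Xapian's actual key encoding, only the ordering behaviour: chert keyed the position table primarily by document id, so docid-ordered add_document() calls emit keys in already-sorted order and the B-tree comes out nearly compact; glass keys primarily by term, so the same insertion order jumps around the keyspace until a compact pass rewrites the table.]

```python
# Toy illustration of position-table key order (NOT Xapian's real
# key encoding). Documents arrive in docid order, each contributing
# positional entries for the same terms.
inserts = [(docid, term)
           for docid in range(1, 4)
           for term in sorted(["be", "not", "or", "to"])]

# chert-style keys (docid primary): insertion order equals sorted
# order, so initial indexing writes the table almost compactly.
chert_keys = [(docid, term) for docid, term in inserts]
assert chert_keys == sorted(chert_keys)

# glass-style keys (term primary): the very same insertion order is
# scattered across the keyspace, leaving a freshly built table far
# from compact until xapian-compact rewrites it in key order.
glass_keys = [(term, docid) for docid, term in inserts]
assert glass_keys != sorted(glass_keys)
```

The upside of the glass order is that all positional data for one term is contiguous at query time, which is exactly what the faster phrase searches exploit.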
Jean-Francois Dockes
2016-Apr-11 07:54 UTC
Xapian 1.3.5 snapshot performance and index size
Olly Betts writes:
> On Sun, Apr 10, 2016 at 04:47:01PM +0200, Jean-Francois Dockes wrote:
> > Some might notice the 50% index size increase; excessive index size is
> > already a relatively rare, but recurring, complaint. Unless I did
> > something wrong, of course: I'm actually quite surprised by it.
>
> Did you try compacting the resulting databases?
>
> Creating a database by calling add_document() repeatedly would have
> resulted in a close to compact position table with chert, but that's not
> true with glass (because the position table is no longer sorted
> primarily by the document id). But if you compact the result, it should
> be a fair bit smaller with glass than chert.
>
> Creating a database from scratch is the worst case for this (but of course
> a common one). In general day-to-day use, this effect should be less
> marked.

I had not compacted. After compacting, the 1.3 index is indeed smaller than the 1.2 one.

> > Of course, having faster phrase searches is a good thing. Maybe I have not
> > run the right tests to show the maximum effect of the new code?
>
> The cases that motivated these changes were really those taking tens of
> seconds (or even minutes for the extreme ones), and they were generally
> sub-second afterwards - 5.8 to 2.1 seconds is at the unimpressive end
> of the improvements seen. One particular issue with "to be or not to
> be" is that we don't currently try to reuse the postlist or
> positional data for "to" and "be", so it has to decode them twice.
>
> > As it is, and still hoping that more 1.3 optimization will improve the
> > situation, I have to wonder whether the price paid for faster phrase searches
> > is not a bit too high, given that these are rather infrequent queries, and
>
> It's difficult to make the call on changes like this, but I do feel
> that searches taking minutes is completely unacceptable.
> How much users use phrase searches varies a lot, but even if it's a small
> fraction of queries, active users will hit such cases and form the
> impression that the system is unreliable (and on multi-user systems it
> affects the speed of other queries, as you can end up with the server
> bogged down by the long-running searches). It's made worse by users often
> responding to an apparently stalled search by hitting reload in their
> browser.
>
> > that the improvement, while very significant, does not completely solve the
> > issue.
>
> 2.1 seconds is slower than I'd like, but it's at least in the realm of
> "that took a while" rather than "the computer has hung".

My spinning-disk machine was actually "too cold": I should have thought a bit more and run a query on another index first, to get the program's text pages into memory. Done that way, "to be or not to be" goes from 11 s to 0.6 s, and "to be of the" goes from 12 s to 0.9 s. Which is of course brilliant! I think I can drop my plan of indexing compound terms for runs of common words :)

> We're closing in on 1.4.0, so there's not scope for much of this to
> change markedly before then. But I do have plans for internal
> improvements which should help indexing speed and memory usage, and
> which should be suitable for 1.4.x.
>
> I'm not sure there's an easy solution to the position table not coming
> out compact in this case. Supporting a choice of key order is possible,
> but adds some complexity.

The question which remains for me is whether I should run xapian-compact after an initial indexing operation. I guess this depends on the amount of expected updates, and there is no easy answer?

jf
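[Editor's note: for reference, compaction is a one-shot external pass with the xapian-compact tool shipped with xapian-core. A sketch of compacting a freshly built index and swapping it into place; the paths are examples only (Recoll's default index directory is usually ~/.recoll/xapiandb), and the indexer must not be running during the swap:]

```shell
# Compact the freshly built database into a new directory.
xapian-compact ~/.recoll/xapiandb ~/.recoll/xapiandb.compact

# Swap the compacted database into place, keeping the original
# around until the result has been checked.
mv ~/.recoll/xapiandb ~/.recoll/xapiandb.old
mv ~/.recoll/xapiandb.compact ~/.recoll/xapiandb
```

As discussed above, a compacted database will tend to lose that compactness again as updates accumulate, so this mainly pays off when the index is mostly built once and then queried.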