Jean-Francois Dockes
2016-Apr-11 07:54 UTC
Xapian 1.3.5 snapshot performance and index size
Olly Betts writes:
> On Sun, Apr 10, 2016 at 04:47:01PM +0200, Jean-Francois Dockes wrote:
> > Some might notice the 50% index size increase. Excessive index size is
> > already a relatively rare, but recurring complaint. Except if I did
> > something wrong: I'm actually quite surprised by it.
>
> Did you try compacting the resulting databases?
>
> Creating a database by calling add_document() repeatedly would have
> resulted in a close to compact position table with chert, but that's not
> true with glass (because the position table is no longer sorted
> primarily by the document id). But if you compact the result, it should
> be a fair bit smaller with glass than chert.
>
> Creating a database from scratch is the worst case for this (but of course
> a common one). In general day to day use, this effect should be less
> marked.

I had not compacted. After compacting, the 1.3 index is indeed smaller than
the 1.2 one.

> > Of course, having faster phrase searches is a good thing. Maybe I have not
> > run the right tests to display the maximum effect of the new code?
>
> The cases that motivated these changes were really those taking tens of
> seconds (or even minutes for the extreme ones), and were generally
> sub-second afterwards - 5.8 to 2.1 seconds is at the unimpressive end
> of the improvements seen. One particular issue with "to be or not to
> be" will be that we don't currently try to reuse the postlist or
> positional data for "to" and "be", so it has to decode them twice.
>
> > As it is, and still hoping that more 1.3 optimization will improve the
> > situation, I have to wonder if the price paid for faster phrase searches
> > is not a bit too high, given that these are rather infrequent queries, and
>
> It's difficult to make the call on changes like this, but I do feel
> that searches taking minutes is completely unacceptable. How much users
> use phrase searches varies a lot, but even if it's a small fraction of
> queries, active users will hit such cases and form the impression that
> the system is unreliable (and for multi-user systems, it affects the
> speed of other queries, as you can end up with the server bogged down
> with the long-running searches). It's made worse by users often
> responding to an apparently stalled search by hitting reload in their
> browser.
>
> > that the improvement, while very significant, does not completely solve the
> > issue.
>
> 2.1 seconds is slower than I'd like, but it's at least in the realms of
> "that took a while" rather than "the computer has hung".

My spinning disk machine was actually "too cold": I should have thought a
bit more and run a query on another index first to get the program text
pages in memory.

This way, "to be or not to be" gets from 11 s to 0.6 s, and "to be of the"
gets from 12 s to 0.9 s. Which is of course brilliant!

I think that I can dump my plan of indexing compound terms for runs of
common words :)

> We're closing in on 1.4.0, so there's not scope for much of this to
> change markedly before then. But I do have plans for internal
> improvements which should help the indexing speed and memory usage, and
> should be suitable for 1.4.x.
>
> I'm not sure there's an easy solution to the position table not coming
> out compact in this case. Supporting a choice of which key order to use
> is possible, but adds some complexity.

The question which remains for me is if I should run xapian-compact after an
initial indexing operation. I guess that this depends on the amount of
expected updates and that there is no easy answer?

jf
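The key-order effect Olly describes (a position table keyed primarily by document id versus one keyed primarily by term) can be illustrated with a toy model. The sketch below is not Xapian code and makes several assumptions of its own: fixed-capacity leaf blocks, a full block split in half unless the new key goes past the very end of the table, one key per (document, term) pair, and no batching of changes. The names `average_fill` and `make_documents` are hypothetical.

```python
import bisect
import random


def average_fill(keys, capacity=8):
    """Insert keys one by one into toy leaf blocks; return the mean fill ratio.

    Assumption of this sketch: a full block is split in half, except when the
    new key sorts after everything stored so far, in which case a fresh block
    is started (pure append).
    """
    blocks = [[]]
    for key in keys:
        # Find the last block whose first key is <= key (block 0 otherwise).
        firsts = [b[0] for b in blocks[1:]]
        i = bisect.bisect_right(firsts, key)
        block = blocks[i]
        if len(block) == capacity:
            if i == len(blocks) - 1 and key > block[-1]:
                blocks.append([key])  # append past the end: no split needed
                continue
            mid = capacity // 2
            right = block[mid:]
            del block[mid:]
            blocks.insert(i + 1, right)
            if key > right[0]:
                block = right
        bisect.insort(block, key)
    return sum(len(b) for b in blocks) / (capacity * len(blocks))


def make_documents(n_docs=200, terms_per_doc=20, vocabulary=100, seed=0):
    """Build hypothetical documents as (docid, sorted list of terms)."""
    rng = random.Random(seed)
    vocab = ["t%03d" % n for n in range(vocabulary)]
    return [(docid, sorted(rng.sample(vocab, terms_per_doc)))
            for docid in range(1, n_docs + 1)]


docs = make_documents()
# Documents are added in docid order, as with repeated add_document() calls.
docid_major = [(docid, term) for docid, terms in docs for term in terms]
term_major = [(term, docid) for docid, terms in docs for term in terms]

docid_major_fill = average_fill(docid_major)
term_major_fill = average_fill(term_major)
print("docid-major key order: fill %.2f" % docid_major_fill)
print("term-major key order:  fill %.2f" % term_major_fill)
```

In this model the docid-major order only ever appends, so the blocks come out full, while the term-major order inserts into the middle of the table and leaves blocks noticeably emptier. That matches the direction of the effect described above, though the real magnitude depends on details the model ignores.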
On Mon, Apr 11, 2016 at 09:54:36AM +0200, Jean-Francois Dockes wrote:
> This way, "to be or not to be" gets from 11 s to 0.6 s, and "to be of
> the" gets from 12 s to 0.9 s. Which is of course brilliant!
>
> I think that I can dump my plan of indexing compound terms for runs of
> common words :)

We had been experimenting with bigrams to accelerate phrases, and not
having to go that route was one motivation for the key order change.
The bigram terms would add significantly to DB size, and to cache
pressure.

> > I'm not sure there's an easy solution to the position table not coming
> > out compact in this case. Supporting a choice of which key order to use
> > is possible, but adds some complexity.
>
> The question which remains for me is if I should run xapian-compact after an
> initial indexing operation. I guess that this depends on the amount of
> expected updates and that there is no easy answer?

I think it's not obvious whether it's a good plan to or not.

Ideally we'd find a way to make it come out more compact to start with.

One thing which could help is making glass more willing to switch to
"sequential mode". If you fancy some more benchmarking, you could
try changing SEQ_START_POINT in backends/glass/glass_table.cc.

It defaults to -10, but I don't think anyone has tried tuning it
recently (this value comes from Martin's original code in commit
26bd647ff6084c60d8869f27d6abbd99e06c3f30 back in 2000 - he may have done
tests to select it, but even if he did, so much has changed since).
Something like -3 or -4 might work well - probably enough that it
shouldn't enable when it's not useful, and by default we ensure at least
4 items fit in a block.

Cheers,
    Olly
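One way to picture what a negative start point such as SEQ_START_POINT does is as a threshold on consecutive sequential insertions. The sketch below is a toy model under that assumed reading, not code from glass_table.cc: a counter starts at the negative value, is incremented by each insertion that directly follows the previous one, is reset by any other insertion, and "sequential mode" counts as on once the counter reaches zero. The function name and the run lengths are hypothetical.

```python
def inserts_in_sequential_mode(run_lengths, seq_start_point):
    """Count insertions made while the toy 'sequential mode' is on.

    run_lengths: lengths of successive runs of adjacent insertions. The
    first insertion of each run is non-sequential and resets the counter
    (an assumption of this sketch).
    """
    in_mode = 0
    for run in run_lengths:
        seq_count = seq_start_point          # reset at the start of a run
        for _ in range(run - 1):             # the rest of the run is sequential
            seq_count = min(seq_count + 1, 0)
            if seq_count == 0:
                in_mode += 1
    return in_mode


# Hypothetical workload: mostly short runs of adjacent insertions.
runs = [5, 8, 12, 3]
for start in (-10, -7, -4, -3, -2):
    print(start, inserts_in_sequential_mode(runs, start))
```

With short runs, a start point of -10 almost never lets the mode switch on, whereas -3 or -4 does so for most of each run. Whether that helps in practice depends on how long the runs of adjacent insertions really are during indexing.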
Jean-Francois Dockes
2016-Apr-12 09:28 UTC
Xapian 1.3.5 snapshot performance and index size
Olly Betts writes:
> On Mon, Apr 11, 2016 at 09:54:36AM +0200, Jean-Francois Dockes wrote:
> > The question which remains for me is if I should run xapian-compact
> > after an initial indexing operation. I guess that this depends on the
> > amount of expected updates and that there is no easy answer?
>
> I think it's not obvious whether it's a good plan to or not.
>
> Ideally we'd find a way to make it come out more compact to start with.
>
> One thing which could help is making glass more willing to switch to
> "sequential mode". If you fancy some more benchmarking, you could
> try changing SEQ_START_POINT in backends/glass/glass_table.cc.
>
> It defaults to -10, but I don't think anyone has tried tuning it
> recently (this value comes from Martin's original code in commit
> 26bd647ff6084c60d8869f27d6abbd99e06c3f30 back in 2000 - he may have done
> tests to select it, but even if he did, so much has changed since).
> Something like -3 or -4 might work well - probably enough that it
> shouldn't enable when it's not useful, and by default we ensure at least
> 4 items fit in a block.

Ok, I tried this, with not much luck. I used a script to edit the
SEQ_START_POINT value, then rebuild and install Xapian, then run the
indexing. Sizes don't change much... Maybe I did something wrong:

https://gist.github.com/medoc92/1ad2a232e4b36e2993ce9adc5789a60a

The output follows (I edited out the unchanging recoll config dumps).
Jf

*******LIB*****************
Tue Apr 12 10:43:14 CEST 2016
#define SEQ_START_POINT (-10)
-rwxr-xr-x 1 root root 30728315 Apr 12 10:43 /usr/lib/libxapian-1.3.so.6
*************************
452.68user 124.94system 4:42.27elapsed 204%CPU (0avgtext+0avgdata 1055204maxresident)k
0inputs+21046192outputs (0major+41137071minor)pagefaults 0swaps
*************************
793244 /home/dockes/.recoll/xapiandb
total 793240
-rw-r--r-- 1 dockes dockes  24150016 Apr 12 10:47 docdata.glass
-rw-r--r-- 1 dockes dockes         0 Apr 12 10:47 flintlock
-rw-r--r-- 1 dockes dockes       130 Apr 12 10:47 iamglass
-rw-r--r-- 1 dockes dockes 577527808 Apr 12 10:47 position.glass
-rw-r--r-- 1 dockes dockes 120905728 Apr 12 10:47 postlist.glass
-rw-r--r-- 1 dockes dockes  89677824 Apr 12 10:47 termlist.glass
*************************

*******LIB*****************
Tue Apr 12 10:48:04 CEST 2016
#define SEQ_START_POINT (-7)
-rwxr-xr-x 1 root root 30728315 Apr 12 10:48 /usr/lib/libxapian-1.3.so.6
*************************
449.64user 124.36system 4:48.82elapsed 198%CPU (0avgtext+0avgdata 1074832maxresident)k
8inputs+22874712outputs (0major+41448062minor)pagefaults 0swaps
*************************
791324 /home/dockes/.recoll/xapiandb
total 791320
-rw-r--r-- 1 dockes dockes  24141824 Apr 12 10:52 docdata.glass
-rw-r--r-- 1 dockes dockes         0 Apr 12 10:52 flintlock
-rw-r--r-- 1 dockes dockes       130 Apr 12 10:52 iamglass
-rw-r--r-- 1 dockes dockes 577921024 Apr 12 10:52 position.glass
-rw-r--r-- 1 dockes dockes 119078912 Apr 12 10:52 postlist.glass
-rw-r--r-- 1 dockes dockes  89153536 Apr 12 10:52 termlist.glass
*************************

*******LIB*****************
Tue Apr 12 10:53:00 CEST 2016
#define SEQ_START_POINT (-4)
-rwxr-xr-x 1 root root 30728315 Apr 12 10:52 /usr/lib/libxapian-1.3.so.6
*************************
451.16user 128.46system 5:35.34elapsed 172%CPU (0avgtext+0avgdata 1060184maxresident)k
16inputs+24076448outputs (0major+41924101minor)pagefaults 0swaps
*************************
789020 /home/dockes/.recoll/xapiandb
total 789016
-rw-r--r-- 1 dockes dockes  24150016 Apr 12 10:58 docdata.glass
-rw-r--r-- 1 dockes dockes         0 Apr 12 10:58 flintlock
-rw-r--r-- 1 dockes dockes       130 Apr 12 10:58 iamglass
-rw-r--r-- 1 dockes dockes 578453504 Apr 12 10:58 position.glass
-rw-r--r-- 1 dockes dockes 115941376 Apr 12 10:58 postlist.glass
-rw-r--r-- 1 dockes dockes  89391104 Apr 12 10:58 termlist.glass
*************************

*******LIB*****************
Tue Apr 12 10:58:43 CEST 2016
#define SEQ_START_POINT (-3)
-rwxr-xr-x 1 root root 30728315 Apr 12 10:58 /usr/lib/libxapian-1.3.so.6
*************************
458.04user 125.02system 5:18.14elapsed 183%CPU (0avgtext+0avgdata 1048328maxresident)k
0inputs+22002000outputs (0major+40947584minor)pagefaults 0swaps
*************************
786756 /home/dockes/.recoll/xapiandb
total 786752
-rw-r--r-- 1 dockes dockes  24150016 Apr 12 11:03 docdata.glass
-rw-r--r-- 1 dockes dockes         0 Apr 12 11:04 flintlock
-rw-r--r-- 1 dockes dockes       130 Apr 12 11:04 iamglass
-rw-r--r-- 1 dockes dockes 577871872 Apr 12 11:03 position.glass
-rw-r--r-- 1 dockes dockes 114171904 Apr 12 11:04 postlist.glass
-rw-r--r-- 1 dockes dockes  89423872 Apr 12 11:03 termlist.glass
*************************

*******LIB*****************
Tue Apr 12 11:04:08 CEST 2016
#define SEQ_START_POINT (-2)
-rwxr-xr-x 1 root root 30728315 Apr 12 11:04 /usr/lib/libxapian-1.3.so.6
*************************
452.14user 122.41system 4:55.79elapsed 194%CPU (0avgtext+0avgdata 1060256maxresident)k
40inputs+22850200outputs (0major+38276837minor)pagefaults 0swaps
*************************
784960 /home/dockes/.recoll/xapiandb
total 784956
-rw-r--r-- 1 dockes dockes  24141824 Apr 12 11:09 docdata.glass
-rw-r--r-- 1 dockes dockes         0 Apr 12 11:09 flintlock
-rw-r--r-- 1 dockes dockes       130 Apr 12 11:09 iamglass
-rw-r--r-- 1 dockes dockes 578920448 Apr 12 11:09 position.glass
-rw-r--r-- 1 dockes dockes 111460352 Apr 12 11:09 postlist.glass
-rw-r--r-- 1 dockes dockes  89251840 Apr 12 11:09 termlist.glass
*************************
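To put a number on "sizes don't change much", the table sizes from the listings above can be compared against the default SEQ_START_POINT of -10. The short script below is only a convenience sketch: the byte counts are copied from the ls output in this message, while the names `SIZES`, `percent_change` and `summarize` are made up for the illustration.

```python
# Table sizes in bytes, copied from the ls output, keyed by SEQ_START_POINT.
SIZES = {
    -10: {"docdata": 24150016, "position": 577527808,
          "postlist": 120905728, "termlist": 89677824},
    -7: {"docdata": 24141824, "position": 577921024,
         "postlist": 119078912, "termlist": 89153536},
    -4: {"docdata": 24150016, "position": 578453504,
         "postlist": 115941376, "termlist": 89391104},
    -3: {"docdata": 24150016, "position": 577871872,
         "postlist": 114171904, "termlist": 89423872},
    -2: {"docdata": 24141824, "position": 578920448,
         "postlist": 111460352, "termlist": 89251840},
}


def percent_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


def summarize(sizes, baseline_key=-10):
    """Per-table and total size change relative to the baseline run."""
    base = sizes[baseline_key]
    rows = {}
    for start, tables in sizes.items():
        row = {name: percent_change(base[name], size)
               for name, size in tables.items()}
        row["total"] = percent_change(sum(base.values()), sum(tables.values()))
        rows[start] = row
    return rows


rows = summarize(SIZES)
for start in sorted(rows):
    row = rows[start]
    print("%4d  position %+6.2f%%  postlist %+6.2f%%  total %+6.2f%%"
          % (start, row["position"], row["postlist"], row["total"]))
```

On these figures position.glass, which dominates the index, moves by well under 1% in every run, while postlist.glass shrinks by roughly 8% at -2. The total therefore drops by only about 1%.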