Jean-Francois Dockes
2016-Apr-12 09:28 UTC
Xapian 1.3.5 snapshot performance and index size
Olly Betts writes:
> On Mon, Apr 11, 2016 at 09:54:36AM +0200, Jean-Francois Dockes wrote:
> > The question which remains for me is if I should run xapian-compact
> > after an initial indexing operation. I guess that this depends on the
> > amount of expected updates and that there is no easy answer ?
>
> I think it's not obvious whether it's a good plan to or not.
>
> Ideally we'd find a way to make it come out more compact to start with.
>
> One thing which could help is making glass more willing to switch to
> "sequential mode". If you fancy some more benchmarking, you could
> try changing SEQ_START_POINT in backends/glass/glass_table.cc.
>
> It defaults to -10, but I don't think anyone has tried tuning it
> recently (this value comes from Martin's original code in commit
> 26bd647ff6084c60d8869f27d6abbd99e06c3f30 back in 2000 - he may have done
> tests to select it, but even if he did, so much has changed since).
> Something like -3 or -4 might work well - probably enough that it
> shouldn't enable when it's not useful, and by default we ensure at least
> 4 items fit in a block.

Ok, I tried this, with not much luck.

I used a script to edit the SEQ_START_POINT value, then rebuild and
install Xapian, then run the indexing.

Sizes don't change much... Maybe I did something wrong,

https://gist.github.com/medoc92/1ad2a232e4b36e2993ce9adc5789a60a

The output follows (I edited out the unchanging recoll config dumps).
Jf

*******LIB*****************
Tue Apr 12 10:43:14 CEST 2016
#define SEQ_START_POINT (-10)
-rwxr-xr-x 1 root root 30728315 Apr 12 10:43 /usr/lib/libxapian-1.3.so.6
*************************
452.68user 124.94system 4:42.27elapsed 204%CPU (0avgtext+0avgdata 1055204maxresident)k
0inputs+21046192outputs (0major+41137071minor)pagefaults 0swaps
*************************
793244 /home/dockes/.recoll/xapiandb
total 793240
-rw-r--r-- 1 dockes dockes  24150016 Apr 12 10:47 docdata.glass
-rw-r--r-- 1 dockes dockes         0 Apr 12 10:47 flintlock
-rw-r--r-- 1 dockes dockes       130 Apr 12 10:47 iamglass
-rw-r--r-- 1 dockes dockes 577527808 Apr 12 10:47 position.glass
-rw-r--r-- 1 dockes dockes 120905728 Apr 12 10:47 postlist.glass
-rw-r--r-- 1 dockes dockes  89677824 Apr 12 10:47 termlist.glass
*************************
*******LIB*****************
Tue Apr 12 10:48:04 CEST 2016
#define SEQ_START_POINT (-7)
-rwxr-xr-x 1 root root 30728315 Apr 12 10:48 /usr/lib/libxapian-1.3.so.6
*************************
449.64user 124.36system 4:48.82elapsed 198%CPU (0avgtext+0avgdata 1074832maxresident)k
8inputs+22874712outputs (0major+41448062minor)pagefaults 0swaps
*************************
791324 /home/dockes/.recoll/xapiandb
total 791320
-rw-r--r-- 1 dockes dockes  24141824 Apr 12 10:52 docdata.glass
-rw-r--r-- 1 dockes dockes         0 Apr 12 10:52 flintlock
-rw-r--r-- 1 dockes dockes       130 Apr 12 10:52 iamglass
-rw-r--r-- 1 dockes dockes 577921024 Apr 12 10:52 position.glass
-rw-r--r-- 1 dockes dockes 119078912 Apr 12 10:52 postlist.glass
-rw-r--r-- 1 dockes dockes  89153536 Apr 12 10:52 termlist.glass
*************************
*******LIB*****************
Tue Apr 12 10:53:00 CEST 2016
#define SEQ_START_POINT (-4)
-rwxr-xr-x 1 root root 30728315 Apr 12 10:52 /usr/lib/libxapian-1.3.so.6
*************************
451.16user 128.46system 5:35.34elapsed 172%CPU (0avgtext+0avgdata 1060184maxresident)k
16inputs+24076448outputs (0major+41924101minor)pagefaults 0swaps
*************************
789020 /home/dockes/.recoll/xapiandb
total 789016
-rw-r--r-- 1 dockes dockes  24150016 Apr 12 10:58 docdata.glass
-rw-r--r-- 1 dockes dockes         0 Apr 12 10:58 flintlock
-rw-r--r-- 1 dockes dockes       130 Apr 12 10:58 iamglass
-rw-r--r-- 1 dockes dockes 578453504 Apr 12 10:58 position.glass
-rw-r--r-- 1 dockes dockes 115941376 Apr 12 10:58 postlist.glass
-rw-r--r-- 1 dockes dockes  89391104 Apr 12 10:58 termlist.glass
*************************
*******LIB*****************
Tue Apr 12 10:58:43 CEST 2016
#define SEQ_START_POINT (-3)
-rwxr-xr-x 1 root root 30728315 Apr 12 10:58 /usr/lib/libxapian-1.3.so.6
*************************
458.04user 125.02system 5:18.14elapsed 183%CPU (0avgtext+0avgdata 1048328maxresident)k
0inputs+22002000outputs (0major+40947584minor)pagefaults 0swaps
*************************
786756 /home/dockes/.recoll/xapiandb
total 786752
-rw-r--r-- 1 dockes dockes  24150016 Apr 12 11:03 docdata.glass
-rw-r--r-- 1 dockes dockes         0 Apr 12 11:04 flintlock
-rw-r--r-- 1 dockes dockes       130 Apr 12 11:04 iamglass
-rw-r--r-- 1 dockes dockes 577871872 Apr 12 11:03 position.glass
-rw-r--r-- 1 dockes dockes 114171904 Apr 12 11:04 postlist.glass
-rw-r--r-- 1 dockes dockes  89423872 Apr 12 11:03 termlist.glass
*************************
*******LIB*****************
Tue Apr 12 11:04:08 CEST 2016
#define SEQ_START_POINT (-2)
-rwxr-xr-x 1 root root 30728315 Apr 12 11:04 /usr/lib/libxapian-1.3.so.6
*************************
452.14user 122.41system 4:55.79elapsed 194%CPU (0avgtext+0avgdata 1060256maxresident)k
40inputs+22850200outputs (0major+38276837minor)pagefaults 0swaps
*************************
784960 /home/dockes/.recoll/xapiandb
total 784956
-rw-r--r-- 1 dockes dockes  24141824 Apr 12 11:09 docdata.glass
-rw-r--r-- 1 dockes dockes         0 Apr 12 11:09 flintlock
-rw-r--r-- 1 dockes dockes       130 Apr 12 11:09 iamglass
-rw-r--r-- 1 dockes dockes 578920448 Apr 12 11:09 position.glass
-rw-r--r-- 1 dockes dockes 111460352 Apr 12 11:09 postlist.glass
-rw-r--r-- 1 dockes dockes  89251840 Apr 12 11:09 termlist.glass
*************************
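
For readers who want the size differences as percentages, the listings above can be tabulated with a short script. This is only a convenience sketch: the byte counts are copied from the ls output in this message, while the names SIZES and percent_change are made up for the illustration and are not part of the gist.

```python
# Table sizes in bytes, copied from the ls -l listings in this message,
# keyed by the SEQ_START_POINT value each build was compiled with.
SIZES = {
    -10: {"position": 577527808, "postlist": 120905728, "termlist": 89677824},
    -7: {"position": 577921024, "postlist": 119078912, "termlist": 89153536},
    -4: {"position": 578453504, "postlist": 115941376, "termlist": 89391104},
    -3: {"position": 577871872, "postlist": 114171904, "termlist": 89423872},
    -2: {"position": 578920448, "postlist": 111460352, "termlist": 89251840},
}


def percent_change(table, value, baseline=-10):
    """Size change of `table` at `value`, in percent of the baseline build."""
    base = SIZES[baseline][table]
    return 100.0 * (SIZES[value][table] - base) / base


if __name__ == "__main__":
    tables = ("position", "postlist", "termlist")
    for value in sorted(SIZES):
        row = "  ".join(f"{t}={percent_change(t, value):+6.2f}%" for t in tables)
        print(f"SEQ_START_POINT={value:>3}: {row}")
```

With these figures the postlist table shrinks by a little under 8% between -10 and -2, the position table stays within a fraction of a percent, and the du totals (793244 versus 784960) differ by about 1%.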
On Tue, Apr 12, 2016 at 11:28:52AM +0200, Jean-Francois Dockes wrote:
> Olly Betts writes:
> > Ideally we'd find a way to make it come out more compact to start with.
> >
> > One thing which could help is making glass more willing to switch to
> > "sequential mode". If you fancy some more benchmarking, you could
> > try changing SEQ_START_POINT in backends/glass/glass_table.cc.
> >
> > It defaults to -10, but I don't think anyone has tried tuning it
> > recently (this value comes from Martin's original code in commit
> > 26bd647ff6084c60d8869f27d6abbd99e06c3f30 back in 2000 - he may have done
> > tests to select it, but even if he did, so much has changed since).
> > Something like -3 or -4 might work well - probably enough that it
> > shouldn't enable when it's not useful, and by default we ensure at least
> > 4 items fit in a block.
>
> Ok, I tried this, with not much luck.

Many thanks for taking a look at this.

If you have the databases from your test around still, what's the
size of the tables in one of them after compaction? It shouldn't
make a difference which version of the output database you compact to
find this.

> I used a script to edit the SEQ_START_POINT value, then rebuild and
> install Xapian, then run the indexing.
>
> Sizes don't change much... Maybe I did something wrong,

I've been pondering your results, and have a few insights.

Looking at the variations in table size, the postlist table actually
benefits more from changing SEQ_START_POINT, with a reduction in size of
8% in the best case, which is pretty significant.

I think the reason it makes more difference there is that the items in
the postlist table tend to be larger, whereas a lot of the positional
data entries are actually very small, so in fact we'll often have
inserted enough items sequentially to have switched to sequential mode
before we need to split a block.

And making the wrong call about an uneven split can make things worse,
as it creates one block that is less than 50% full and one that is much
fuller than 50%. If the next batch of updates doesn't touch the
under-full block but splits the fuller one, we can end up with more
unused space than if we'd just split evenly.

There looks to be scope for improvement here, but it's not as simple as
just reducing SEQ_START_POINT, as I'd naively hoped.

If we had an "oracle" which could predict with perfect foresight where
we should split a block for the best end result, we could expect at
least an 8% improvement for the postlist table, and probably
significantly better. I'd expect good gains for the position table too.

So the question is, can we build at least a useful approximation to an
oracle? And the answer is likely yes, since we have all the data
batched up at the point this is relevant, so we can look ahead to see
what's coming (or pack it in a speculative way, or something along those
lines). I think with care the overhead of doing so can be kept low too.

A change like this isn't going to happen before 1.4.0, but it doesn't
require format changes, so it could be done in 1.4.x.

Cheers,
    Olly
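
The trade-off between even and uneven splits described in this message can be illustrated with a toy model. This is a sketch under loud assumptions and not glass's code: items are all the same size, there are no branch blocks, a "sequential" insert simply means the key is larger than the previous one, and the names mean_fill, capacity and seq_start are hypothetical. Only the idea of a negative starting counter (SEQ_START_POINT, default -10) is taken from the thread.

```python
import bisect


def mean_fill(keys, capacity=8, seq_start=-10):
    """Insert distinct keys into sorted fixed-capacity blocks; return mean fill.

    Toy model only (an assumption for illustration, not glass's algorithm).
    A counter starts at seq_start, goes up by one for each insert whose key
    is larger than the previous key, and resets on any other insert. Once
    it reaches 0 the model is in "sequential mode".
    """
    blocks = []  # list of sorted lists, ordered by their first key
    seq_count = seq_start
    prev = None
    for key in keys:
        if prev is not None and key > prev:
            seq_count = min(seq_count + 1, 0)
        else:
            seq_count = seq_start
        prev = key

        if not blocks:
            blocks.append([key])
            continue

        # Target block: the last block whose first key is <= key.
        firsts = [b[0] for b in blocks]
        i = max(0, bisect.bisect_right(firsts, key) - 1)
        block = blocks[i]

        if len(block) < capacity:
            bisect.insort(block, key)
        elif seq_count >= 0 and key > block[-1]:
            # Sequential mode: leave the full block intact, start a new one.
            blocks.insert(i + 1, [key])
        else:
            # Otherwise split the overflowing block evenly.
            bisect.insort(block, key)
            mid = len(block) // 2
            blocks[i:i + 1] = [block[:mid], block[mid:]]

    return sum(len(b) for b in blocks) / (capacity * len(blocks))


if __name__ == "__main__":
    ascending = range(2000)
    print("never sequential:", mean_fill(ascending, seq_start=-10**9))
    print("seq_start = -10: ", mean_fill(ascending, seq_start=-10))
```

For purely ascending keys the model leaves blocks about half full when it never enters sequential mode and almost completely full when it does. Feeding it shuffled keys, or short ascending runs, is one way to see the case where an uneven split leaves a nearly empty block next to a full one.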
Jean-Francois Dockes
2016-Apr-30 13:04 UTC
Xapian 1.3.5 snapshot performance and index size
Olly Betts writes:
> On Tue, Apr 12, 2016 at 11:28:52AM +0200, Jean-Francois Dockes wrote:
> > Olly Betts writes:
> > > Ideally we'd find a way to make it come out more compact to start with.
> > >
> > > One thing which could help is making glass more willing to switch to
> > > "sequential mode". If you fancy some more benchmarking, you could
> > > try changing SEQ_START_POINT in backends/glass/glass_table.cc.
> > >
> > > It defaults to -10, but I don't think anyone has tried tuning it
> > > recently (this value comes from Martin's original code in commit
> > > 26bd647ff6084c60d8869f27d6abbd99e06c3f30 back in 2000 - he may have done
> > > tests to select it, but even if he did, so much has changed since).
> > > Something like -3 or -4 might work well - probably enough that it
> > > shouldn't enable when it's not useful, and by default we ensure at least
> > > 4 items fit in a block.
> >
> > Ok, I tried this, with not much luck.
>
> Many thanks for taking a look at this.
>
> If you have the databases from your test around still, what's the
> size of the tables in one of them after compaction? It shouldn't
> make a difference which version of the output database you compact to
> find this.

Hi,

Here follow the table sizes before and after compaction, for Xapian
1.3.5 and 1.2.21.

I re-ran the script which indexes after changing SEQ_START_POINT,
probably on a slightly different but equivalent data set (a bunch of
PDFs), and the bad news is that I could not reproduce the earlier
results, which showed a small but consistent variation of index sizes
with SEQ_START_POINT.

In the re-runs, the size variations are rather smaller (but of the
same order), and don't seem to follow an obvious pattern.

I don't know how to explain the change in behaviour, except for having
had a bit of luck the first time, which seems strange. However, given
that the variations were not that significant to begin with (around 1%
of the full index size), I've stopped trying.

Regards,
jf

hm1$ xapian-compact-1.3 .recoll/xapiandb/ .recoll/xapiandb-compacted
postlist: Reduced by 63% 70584K (112016K -> 41432K)
docdata: Reduced by 1% 24K (1888K -> 1864K)
termlist: Reduced by 24% 9016K (36760K -> 27744K)
position: Reduced by 58% 278088K (475960K -> 197872K)
spelling: doesn't exist
synonym: Reduced by 42% 3840K (8936K -> 5096K)

hm1$ ls -l .recoll/xapiandb*
.recoll/xapiandb:
total 635576
-rw-r--r-- 1 dockes dockes   1933312 Apr 30 14:01 docdata.glass
-rw-r--r-- 1 dockes dockes         0 Apr 30 14:01 flintlock
-rw-r--r-- 1 dockes dockes       145 Apr 30 14:01 iamglass
-rw-r--r-- 1 dockes dockes 487383040 Apr 30 14:01 position.glass
-rw-r--r-- 1 dockes dockes 114704384 Apr 30 14:01 postlist.glass
-rw-r--r-- 1 dockes dockes   9150464 Apr 30 14:01 synonym.glass
-rw-r--r-- 1 dockes dockes  37642240 Apr 30 14:01 termlist.glass

.recoll/xapiandb-compacted:
total 274016
-rw-r--r-- 1 dockes dockes   1908736 Apr 30 14:10 docdata.glass
-rw-r--r-- 1 dockes dockes         0 Apr 30 14:10 flintlock
-rw-r--r-- 1 dockes dockes       134 Apr 30 14:11 iamglass
-rw-r--r-- 1 dockes dockes 202620928 Apr 30 14:11 position.glass
-rw-r--r-- 1 dockes dockes  42426368 Apr 30 14:10 postlist.glass
-rw-r--r-- 1 dockes dockes   5218304 Apr 30 14:11 synonym.glass
-rw-r--r-- 1 dockes dockes  28409856 Apr 30 14:10 termlist.glass

Same for Xapian 1.2.21:

hm1$ xapian-compact .recoll/xapiandb/ .recoll/xapiandb-compacted
postlist: Reduced by 63% 78528K (123912K -> 45384K)
record: Reduced by 2% 48K (1904K -> 1856K)
termlist: Reduced by 25% 9432K (37096K -> 27664K)
position: Reduced by 0% 656K (220904K -> 220248K)
spelling: doesn't exist
synonym: Reduced by 46% 4848K (10464K -> 5616K)

hm1$ ls -l .recoll/xapiandb*
.recoll/xapiandb:
total 394336
-rw-r--r-- 1 dockes dockes         0 Apr 30 14:18 flintlock
-rw-r--r-- 1 dockes dockes        28 Apr 30 14:12 iamchert
-rw-r--r-- 1 dockes dockes      3473 Apr 30 14:18 position.baseA
-rw-r--r-- 1 dockes dockes      3473 Apr 30 14:18 position.baseB
-rw-r--r-- 1 dockes dockes 226205696 Apr 30 14:18 position.DB
-rw-r--r-- 1 dockes dockes      1954 Apr 30 14:18 postlist.baseA
-rw-r--r-- 1 dockes dockes      1954 Apr 30 14:18 postlist.baseB
-rw-r--r-- 1 dockes dockes 126885888 Apr 30 14:18 postlist.DB
-rw-r--r-- 1 dockes dockes        46 Apr 30 14:18 record.baseA
-rw-r--r-- 1 dockes dockes        46 Apr 30 14:18 record.baseB
-rw-r--r-- 1 dockes dockes   1949696 Apr 30 14:18 record.DB
-rw-r--r-- 1 dockes dockes       182 Apr 30 14:18 synonym.baseA
-rw-r--r-- 1 dockes dockes       182 Apr 30 14:18 synonym.baseB
-rw-r--r-- 1 dockes dockes  10715136 Apr 30 14:18 synonym.DB
-rw-r--r-- 1 dockes dockes       597 Apr 30 14:18 termlist.baseA
-rw-r--r-- 1 dockes dockes       597 Apr 30 14:18 termlist.baseB
-rw-r--r-- 1 dockes dockes  37986304 Apr 30 14:18 termlist.DB

.recoll/xapiandb-compacted:
total 300816
-rw-r--r-- 1 dockes dockes        28 Apr 30 14:19 iamchert
-rw-r--r-- 1 dockes dockes        13 Apr 30 14:19 position.baseA
-rw-r--r-- 1 dockes dockes      3462 Apr 30 14:19 position.baseB
-rw-r--r-- 1 dockes dockes 225533952 Apr 30 14:19 position.DB
-rw-r--r-- 1 dockes dockes        13 Apr 30 14:19 postlist.baseA
-rw-r--r-- 1 dockes dockes       728 Apr 30 14:19 postlist.baseB
-rw-r--r-- 1 dockes dockes  46473216 Apr 30 14:19 postlist.DB
-rw-r--r-- 1 dockes dockes        13 Apr 30 14:19 record.baseA
-rw-r--r-- 1 dockes dockes        44 Apr 30 14:19 record.baseB
-rw-r--r-- 1 dockes dockes   1900544 Apr 30 14:19 record.DB
-rw-r--r-- 1 dockes dockes        13 Apr 30 14:19 synonym.baseA
-rw-r--r-- 1 dockes dockes       105 Apr 30 14:19 synonym.baseB
-rw-r--r-- 1 dockes dockes   5750784 Apr 30 14:19 synonym.DB
-rw-r--r-- 1 dockes dockes        13 Apr 30 14:19 termlist.baseA
-rw-r--r-- 1 dockes dockes       450 Apr 30 14:19 termlist.baseB
-rw-r--r-- 1 dockes dockes  28327936 Apr 30 14:19 termlist.DB
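
To compare the two backends as a whole, the per-table lines printed by xapian-compact above can be summed. The following is a small sketch, assuming the line format stays exactly as pasted in this message; the regular expression and the names parse_compact_line and totals are made up for the illustration.

```python
import re

# Matches lines such as "postlist: Reduced by 63% 70584K (112016K -> 41432K)".
LINE = re.compile(r"^(\w+): Reduced by (\d+)% (\d+)K \((\d+)K -> (\d+)K\)$")

# xapian-compact summaries copied from this message.
GLASS = """\
postlist: Reduced by 63% 70584K (112016K -> 41432K)
docdata: Reduced by 1% 24K (1888K -> 1864K)
termlist: Reduced by 24% 9016K (36760K -> 27744K)
position: Reduced by 58% 278088K (475960K -> 197872K)
spelling: doesn't exist
synonym: Reduced by 42% 3840K (8936K -> 5096K)
"""

CHERT = """\
postlist: Reduced by 63% 78528K (123912K -> 45384K)
record: Reduced by 2% 48K (1904K -> 1856K)
termlist: Reduced by 25% 9432K (37096K -> 27664K)
position: Reduced by 0% 656K (220904K -> 220248K)
spelling: doesn't exist
synonym: Reduced by 46% 4848K (10464K -> 5616K)
"""


def parse_compact_line(line):
    """Return (table, before_kb, after_kb), or None for non-matching lines."""
    match = LINE.match(line.strip())
    if match is None:
        return None  # e.g. "spelling: doesn't exist"
    table, _percent, _saved, before, after = match.groups()
    return table, int(before), int(after)


def totals(output):
    """Sum the before and after sizes (in K) over all tables in the output."""
    rows = [r for r in map(parse_compact_line, output.splitlines()) if r]
    return sum(r[1] for r in rows), sum(r[2] for r in rows)


if __name__ == "__main__":
    for name, output in (("glass 1.3.5", GLASS), ("chert 1.2.21", CHERT)):
        before, after = totals(output)
        print(f"{name}: {before}K -> {after}K ({100 * (before - after) // before}% smaller)")
```

On these figures the glass database is the larger of the two before compaction and the smaller of the two after it, with the position table accounting for most of the difference in both directions.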