Olly Betts writes:
> On Mon, Jan 21, 2019 at 03:25:01PM +0100, Jean-Francois Dockes wrote:
> > I have had a problem report from a Recoll user about the amount of writes
> > during index creation.
> >
> > https://opensourceprojects.eu/p/recoll1/tickets/67/
> >
> > The issue is that the index is on SSD and that the amount of writes is
> > significant compared to the SSD life expectancy (index size > 250 GB).
> >
> > From the numbers he supplied, it seems to me that the total amount of block
> > writes is roughly quadratic with the index size.
> >
> > First question: is this expected, or is Recoll doing something wrong ?
>
> It isn't expected.
>
> I think this is probably due to a bug which coincidentally was
> discovered earlier this week by Germán M. Bravo. I've now fixed it
> and backported ready for 1.4.10. If you're able to test to confirm
> if this solves your problem that would be very useful - see
> f19bcb96857419469f74f748e7fe8eaccaedc0fd on the RELEASE/1.4 branch:
>
> https://git.xapian.org/?p=xapian;a=commitdiff;h=f19bcb96857419469f74f748e7fe8eaccaedc0fd
>
> Anything which uses a term for a unique document identifier is likely to
> be affected.
>
> Cheers,
> Olly

I have run a number of tests, with data mostly from a Project Gutenberg DVD
and other books, with relatively modest index sizes, from 1 to 24 GB.

Quite curiously, in this range, with all Xapian versions I tried, the total
amount of writes is roughly proportional to the index size to the power 1.5:

    TotalWrites / (IndexSize**1.5) ~= K

So, not quadratic, which is good news. For big indexes, 1.5 is not so good
but probably somewhat expected.

The other good news is that the patch above decreases the amount of writing
by a significant factor, around 4.5 for the biggest index I tried.

The amount of writes is estimated with iostat before/after. The disk has
nothing else to do. In the results, idxflushmb is the number of megabytes of
input text between Xapian commits.
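For reference, the same before/after measurement can be taken straight from the kernel counters that iostat reports. A minimal sketch, assuming Linux and its /proc/diskstats layout (the tenth field of each line is the number of 512-byte sectors written); the device name "sda" and the helper names are placeholders, not what was used for the numbers in this message:

```python
def kb_written(diskstats_text, device):
    """Return the cumulative KB written to `device`, from /proc/diskstats text."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major minor name reads ... ; fields[9] is sectors written,
        # always counted in 512-byte units.
        if len(fields) > 9 and fields[2] == device:
            return int(fields[9]) * 512 // 1024
    raise ValueError(f"device {device!r} not found")


def read_diskstats(path="/proc/diskstats"):
    """Read the raw counters (Linux only)."""
    with open(path) as f:
        return f.read()


# Usage ("sda" is a placeholder device name):
#   before = kb_written(read_diskstats(), "sda")
#   ... run the indexer ...
#   after = kb_written(read_diskstats(), "sda")
#   print("KB written during indexing:", after - before)
```

As with iostat, the difference is only meaningful if nothing else writes to the disk during the run.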

 xapiandb,kb   writes,kb   K*1000    w/sz

xapian 1.4.5, idxflushmb 200
     1544724     6941286     3.62    4.49
     3080540    16312960     3.02    5.30
     4606060    21054756     2.13    4.57
     6123140    33914344     2.24    5.54
     7631788    50452348     2.39    6.61

xapian git master latest, idxflushmb 200
     1402524     1597352     0.96    1.14
     2223076     3291588     0.99    1.48
     2678404     4121024     0.94    1.54
     3842372     7219404     0.96    1.88
     4964132    10850844     0.98    2.19
     6062204    14751196     0.99    2.43
    19677680   125418760     1.44    6.37

xapian git master before patch, idxflushmb 200
    24707840   750228444     6.11   30.36
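The two derived columns can be recomputed from the raw ones. A small sketch, assuming K is simply writes / size**1.5 with both values in KB, and that the last column is the ratio of writes to index size (both readings match the printed figures); the function names are only illustrative:

```python
def k_times_1000(size_kb, writes_kb):
    """K*1000, with K = writes / size**1.5 (both in KB)."""
    return 1000.0 * writes_kb / size_kb ** 1.5


def writes_per_size(size_kb, writes_kb):
    """KB written per KB of final index."""
    return writes_kb / size_kb


# (xapiandb KB, writes KB): first 1.4.5 row and the pre-patch git master row
rows = [(1544724, 6941286), (24707840, 750228444)]
for size_kb, writes_kb in rows:
    print(size_kb, writes_kb,
          f"{k_times_1000(size_kb, writes_kb):.2f}",
          f"{writes_per_size(size_kb, writes_kb):.2f}")
```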

So that was 750 GB of writes for the big index before the patch...

As you can see, my beautiful law does not hold so well for the biggest index :)
(K*1000 = 1.44)

It's not quite the same data though, so I would need more tests, but I
think I'll stop here...
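One way to check the exponent rather than assume 1.5 is a least-squares fit of log(writes) against log(size). A sketch using only the six smaller "git master latest" runs from the results (the 19.7 GB run is left out, since it was not quite the same data); fit_exponent is an illustrative helper, not something used to produce the numbers above:

```python
import math


def fit_exponent(points):
    """Least-squares slope of log(writes) versus log(size)."""
    xs = [math.log(size) for size, _ in points]
    ys = [math.log(writes) for _, writes in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# (xapiandb KB, writes KB), xapian git master latest, idxflushmb 200
git_master = [
    (1402524, 1597352),
    (2223076, 3291588),
    (2678404, 4121024),
    (3842372, 7219404),
    (4964132, 10850844),
    (6062204, 14751196),
]
print(f"fitted exponent: {fit_exponent(git_master):.2f}")
```

On these six runs the fitted slope should land close to 1.5; with so few points it is an indication, not a law.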

The improvement brought by the patch is nice. Still, for people using big
indexes on SSD, the amount of writes remains something to consider, and
splitting the index probably makes sense? What do you think?
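To put a rough number on splitting: if the empirical 1.5 law kept holding (a big assumption, given that the biggest index already deviates from it), n equal parts of size S/n would cost n * K * (S/n)**1.5 = K * S**1.5 / sqrt(n) in total, so sqrt(n) times fewer writes. A back-of-the-envelope sketch; the function name and the default K (about what the patched runs show) are illustrative, and nothing here accounts for the cost of querying several databases:

```python
def estimated_writes_kb(index_kb, parts=1, k=0.001):
    """Total KB written to build `parts` equal sub-indexes, assuming
    writes = k * size**1.5 for each one (sizes in KB). Extrapolation only."""
    part_kb = index_kb / parts
    return parts * k * part_kb ** 1.5


size_kb = 250 * 1024 * 1024  # the 250 GB index from the original report
for parts in (1, 4, 16):
    gb = estimated_writes_kb(size_kb, parts) / 1024 / 1024
    print(f"{parts:2d} part(s): ~{gb:.0f} GB written")
```

The absolute figures extrapolate far beyond the measured 1 to 24 GB range, so only the relative gain between the lines is worth reading.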

I'll run another test tonight with a smaller flush interval to see if it
changes things.

Cheers,
jf