Olly Betts
2018-Jul-10 05:36 UTC
Xapian 1.4.5 "Db block overwritten - are there multiple writers?" with Glass
On Mon, Jul 09, 2018 at 10:29:18AM +0100, Olly Betts wrote:
> The attached patch reset this cursor each time commit() is called, and
> that fixes my C++ reproducer, though I think this ought to work as-is
> and the real bug is at a lower level.

I've dug deeper and that was indeed the case. Here's a patch which
addresses the root cause:

https://oligarchy.co.uk/xapian/patches/glass-cursor-rebuild-fix.patch

For the curious, the bug was in some code to rebuild the cursor when the
underlying table changes in ways which require that. That's a fairly
rare occurrence (with my C++ reproducer it happens 99 times out of 5000
commits).

In chert the equivalent code just marks the cursor's blocks as not yet
read, but in glass cursor blocks are reference counted and shared so we
can't simply do that as it could affect other cursors sharing the same
blocks.

So instead the glass code was leaving them with the contents they
previously had, except for copying the current root block from the
table's "built-in cursor". After the rebuild we seek the cursor to the
same key it was on before, and that mostly works because we follow down
each level in the Btree from the new root, except it can happen that the
old cursor contained a block number which has since been released and
reallocated, and in that case the block doesn't get reread and we try to
use its old contents, which violates the rule that a parent can't be
younger than its child and causes the exception.

The simplest fix is to just reset the rebuilt cursor to match the
current "built-in cursor" at all levels (not just the root), which is
cheap because of the reference counting. And that fixes my C++
reproducer, which I converted from your Python reproducer.

Please test and let me know if this fixes the original problem or not.

Cheers,
Olly
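The failure mode and the fix described above can be illustrated with a toy model. This is only a minimal sketch and not Xapian's code: `Block`, `Cursor`, `make_block`, `rebuild_root_only`, `rebuild_all_levels`, `has_stale_block` and `stale_after_rebuild` are all hypothetical names, a string stands in for block data, and `std::shared_ptr` stands in for glass's reference-counted cursor blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy model only: these types are hypothetical and are NOT Xapian's internals.
struct Block {
    unsigned number;       // block number within the table
    std::string contents;  // stand-in for the block's data
};

// One reference-counted block per B-tree level: index 0 is the leaf level,
// the last element is the root.
using Cursor = std::vector<std::shared_ptr<const Block>>;

std::shared_ptr<const Block> make_block(unsigned number, std::string contents) {
    return std::make_shared<Block>(Block{number, std::move(contents)});
}

// Behaviour before the fix, as described in the message: only the root is
// refreshed from the table's built-in cursor; lower levels keep what they held.
void rebuild_root_only(Cursor& c, const Cursor& built_in) {
    assert(!built_in.empty());
    c.resize(built_in.size());
    c.back() = built_in.back();
}

// Behaviour after the fix, as described in the message: every level is reset
// to match the built-in cursor. Copying a shared_ptr only bumps a reference
// count, so no block data is copied.
void rebuild_all_levels(Cursor& c, const Cursor& built_in) {
    c = built_in;
}

// True if some level holds a block with the same number as the built-in
// cursor's block at that level but different contents, i.e. a cached copy of
// a block that has since been released and reallocated.
bool has_stale_block(const Cursor& c, const Cursor& built_in) {
    for (std::size_t i = 0; i < c.size() && i < built_in.size(); ++i) {
        if (c[i] && built_in[i] && c[i]->number == built_in[i]->number &&
            c[i]->contents != built_in[i]->contents) {
            return true;
        }
    }
    return false;
}

// Runs one scenario and reports whether the rebuilt cursor still holds a
// stale block afterwards.
bool stale_after_rebuild(bool all_levels) {
    // Before the change: leaf block 7 under root block 1.
    Cursor built_in = {make_block(7, "old leaf"), make_block(1, "old root")};
    Cursor user = built_in;  // the user's cursor shares the same blocks

    // The table changes: block 7 is released and reallocated with new
    // contents, and the tree gets a new root.
    built_in = {make_block(7, "new leaf"), make_block(2, "new root")};

    if (all_levels) {
        rebuild_all_levels(user, built_in);
    } else {
        rebuild_root_only(user, built_in);
    }
    return has_stale_block(user, built_in);
}
```

In this model, `stale_after_rebuild(false)` leaves the cursor holding the old copy of block 7 under the new root, which is the situation the message describes, while `stale_after_rebuild(true)` leaves every level sharing the built-in cursor's blocks.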
Sylvain Taverne
2018-Jul-10 12:35 UTC
Xapian 1.4.5 "Db block overwritten - are there multiple writers?" with Glass
Thanks! I'll try it during the week and will let you know if all is fine ;)
Sylvain Taverne
2018-Jul-17 11:08 UTC
Xapian 1.4.5 "Db block overwritten - are there multiple writers?" with Glass
Hello,

The patch seems to fix the problem. I've created a Dockerfile to test your patch:

https://github.com/staverne/xapian_test/blob/master/docker/Dockerfile

I didn't get the corruption errors anymore... So good job!