Olly Betts
2018-Mar-07 20:16 UTC
Xapian 1.4.5 "Db block overwritten - are there multiple writers?" with Glass
On Mon, Mar 05, 2018 at 09:48:52PM +0000, Olly Betts wrote:
> On Mon, Mar 05, 2018 at 08:52:47PM +0100, Sylvain Taverne wrote:
> > I've remarked the error occur when i'm trying to get stored values from a
> > database with a lot of stored values. I can reproduce the error with simple
> > python2 script i've posted on github
> >
> > https://github.com/staverne/xapian_test
> > https://github.com/staverne/xapian_test/blob/master/test_xapian.py
> >
> > The script always works with Chert backend but sometimes fail with Glass
> > backend.
>
> Thanks, that's brilliant - I can reproduce with glass which means I should
> be able to pin down the cause (though it's likely to take a while as it
> took about 45 minutes to fail).

Just to update the status of this - I now have a C++ reproducer, and
have found that the transaction isn't needed to reproduce this. It
also still reproduces when run under eatmydata [1], and together these
bring the time to reproduce down to 4-5 minutes. Still longer than
ideal, but quick enough to start making some progress on narrowing
down what's happening.

Cheers,
Olly

[1] https://www.flamingspork.com/projects/libeatmydata/
Olly Betts
2018-Jul-09 09:29 UTC
Xapian 1.4.5 "Db block overwritten - are there multiple writers?" with Glass
On Wed, Mar 07, 2018 at 08:16:23PM +0000, Olly Betts wrote:
> Just to update the status of this - I now have a C++ reproducer, and
> have found that the transaction isn't needed to reproduce this. It
> also still reproduces when run under eatmydata [1], and together these
> bring the time to reproduce down to 4-5 minutes. Still longer than
> ideal, but quick enough to start making some progress on narrowing
> down what's happening.

I was intending to finish analysing this and address it for 1.4.6, but
once I found the MSet::snippet() bug I wanted to get a fix for that out
promptly.

But I've dug into this some more now. The problem is caused by a cursor
on the postlist table which is used to look up document values.

The attached patch resets this cursor each time commit() is called, and
that fixes my C++ reproducer, though I think this ought to work as-is
and the real bug is at a lower level.

But if you're suffering with this bug, the patch will probably help and
there's not much scope for it making things worse (this cursor member is
lazily created, so if it's NULL when needed a new cursor is created).

Cheers,
Olly

-------------- next part --------------
A non-text attachment was scrubbed...
Name: glass-value-cursor-reset.patch
Type: text/x-diff
Size: 363 bytes
Desc: not available
URL: <http://lists.xapian.org/pipermail/xapian-discuss/attachments/20180709/25f2c2d1/attachment.patch>
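For readers who can't fetch the scrubbed attachment, the idea behind the workaround in this message can be sketched roughly as below. This is a toy model only and not the patch itself: `ToyCursor`, `ToyValueManager` and all of their members are hypothetical names invented for illustration, and none of this is Xapian's actual code. It just shows why dropping a lazily created cursor on commit is low risk: the next lookup simply builds a fresh one.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a B-tree cursor; NOT Xapian's cursor class.
struct ToyCursor {
    int table_revision;  // revision of the table the cursor was built against
};

// Hypothetical stand-in for the object that looks up document values.
class ToyValueManager {
    std::unique_ptr<ToyCursor> cursor_;  // lazily created, null until needed
    int revision_ = 0;
    int cursors_created_ = 0;

  public:
    // Return the cached cursor, creating a new one if there is none.
    ToyCursor& cursor() {
        if (!cursor_) {
            cursor_ = std::make_unique<ToyCursor>(ToyCursor{revision_});
            ++cursors_created_;
        }
        return *cursor_;
    }

    // The workaround: throw the cached cursor away on every commit so that
    // no state from the previous revision survives.
    void commit() {
        ++revision_;
        cursor_.reset();
    }

    int cursors_created() const { return cursors_created_; }
};
```

The cost of this approach is one extra cursor construction per commit for callers that read values, which is why it works as a stopgap but doesn't address whatever is wrong at the lower level.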
Olly Betts
2018-Jul-10 05:36 UTC
Xapian 1.4.5 "Db block overwritten - are there multiple writers?" with Glass
On Mon, Jul 09, 2018 at 10:29:18AM +0100, Olly Betts wrote:
> The attached patch resets this cursor each time commit() is called, and
> that fixes my C++ reproducer, though I think this ought to work as-is
> and the real bug is at a lower level.

I've dug deeper and that was indeed the case. Here's a patch which
addresses the root cause:

https://oligarchy.co.uk/xapian/patches/glass-cursor-rebuild-fix.patch

For the curious, the bug was in some code to rebuild the cursor when the
underlying table changes in ways which require that. That's a fairly
rare occurrence (with my C++ reproducer it happens 99 times out of 5000
commits).

In chert the equivalent code just marks the cursor's blocks as not yet
read, but in glass cursor blocks are reference counted and shared so we
can't simply do that as it could affect other cursors sharing the same
blocks. So instead the glass code was leaving them with the contents
they previously had, except for copying the current root block from the
table's "built-in cursor".

After the rebuild we seek the cursor to the same key it was on before,
and that mostly works because we follow down each level in the Btree
from the new root. However, it can happen that the old cursor contained
a block number which has since been released and reallocated. In that
case the block doesn't get reread and we try to use its old contents,
which violates the rule that a parent can't be younger than its child
and causes the exception.

The simplest fix is to just reset the rebuilt cursor to match the
current "built-in cursor" at all levels (not just the root), which is
cheap because of the reference counting.

And that fixes my C++ reproducer, which I converted from your Python
reproducer. Please test and let me know if this fixes the original
problem or not.

Cheers,
Olly
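The failure mode described above can be illustrated with a small toy model. To be clear, this is a sketch under heavy assumptions and not the glass code or the linked patch: `Block`, `Cursor`, `Disk`, `rebuild_root_only`, `rebuild_all_levels` and `seek_leaf` are all hypothetical names, blocks are reduced to a number plus a string, and the "seek" only rereads a level when the cached block number differs from the wanted one. It models only the stale reuse of a reallocated block number and leaves out the revision check that raises the exception.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, heavily simplified block: a number and some contents.
struct Block {
    unsigned number;
    std::string contents;
};

// Blocks are shared between cursors via reference counting.
using BlockRef = std::shared_ptr<const Block>;
// One entry per B-tree level: [0] is the leaf, back() is the root.
using Cursor = std::vector<BlockRef>;
// Current on-disk state: block number -> current block.
using Disk = std::map<unsigned, BlockRef>;

// Models the buggy rebuild: keep the old levels, copy only the new root.
Cursor rebuild_root_only(const Cursor& old_cursor, const Cursor& built_in) {
    Cursor c = old_cursor;
    c.back() = built_in.back();
    return c;
}

// Models the fix: share every level with the built-in cursor. Copying the
// vector only bumps reference counts, so this is cheap.
Cursor rebuild_all_levels(const Cursor& built_in) {
    return built_in;
}

// Toy seek from root to leaf. path[i] is the block number wanted at level i.
// A level is only reread when the cached block has a different number, so a
// released-and-reallocated block number is mistaken for a cache hit.
std::string seek_leaf(Cursor& c, const Disk& disk,
                      const std::vector<unsigned>& path) {
    for (std::size_t level = c.size(); level-- > 0;) {
        if (c[level]->number != path[level]) {
            c[level] = disk.at(path[level]);
        }
    }
    return c[0]->contents;
}
```

In this model, if leaf block 7 is released and reallocated with new contents, a cursor rebuilt with `rebuild_root_only` still returns the old contents of block 7 after seeking, while one rebuilt with `rebuild_all_levels` returns the current contents and shares its blocks with the built-in cursor.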