Jean-Francois Dockes
2017-May-22 05:45 UTC
Xapian 1.4.3 "Db block overwritten - are there multiple writers?"
Olly Betts writes:
> On Wed, May 17, 2017 at 09:08:32PM +0200, Jean-Francois Dockes wrote:
> > I have a user reporting the following error during recoll indexing:
> >
> > flush() failed: Db block overwritten - are there multiple writers?
> >
> > "flush() failed" is from recoll, the rest is, I think, the text of the
> > Xapian exception.
> >
> > This is with Xapian 1.4.3 on Linux (I asked for more details, should be
> > coming).
> >
> > I don't think that I've ever seen this error, and I also don't think that
> > there have been significant changes to recoll in this area, but as usual, I
> > may be wrong.
>
> What this means is that the database appears to have a child block which
> is newer than its parent block (in the real world children are younger
> than their parents, but in current Xapian DBs the reverse should be the
> case - blocks are copied on write and the parent block points to its
> children, so needs updating whenever any of its children are).
>
> When reading a database, this is possible if a writer has updated that
> part of the tree between reading the parent and reading the child (and
> gives DatabaseModifiedError).
>
> When writing, this shouldn't happen.
>
> As the error suggests, if you manage to get multiple concurrent writers
> this could happen. There's locking which should prevent this, but that
> can be defeated if the lock file is deleted (which people sometimes add
> code to do, misunderstanding how the lock file is used - fcntl() locking
> is used, and the lock file should always be present.).

I don't think that there is code in Recoll doing this. Recoll also has its
own protection against multiple writer processes, and in the normal
configuration, a single thread uses the WritableDatabase. It's also
possible to set things up for multiple writing threads though (with lock
protection in this case). I've asked the user to confirm the thread
configuration.

> Assuming nobody deleted the log file, this could be a Xapian bug. This
> isn't something we're drowning in reports of, so presumably it doesn't
> trigger easily, so finding a way to reproduce would be good.
>
> It could also be memory or disk corruption. We don't currently store
> a checksum for each block, so there's no explicit detection of this.
>
> Or something in the same process wrote to an fd that has since been
> closed and reused for one of the database tables (Xapian avoids reusing
> fds 0, 1 and 2 to avoid this for the standard streams, but it's hard to
> fully protect against this given how fds work).

This is certainly a possibility of course. In this case, we might be able
to get an idea by looking at the actual data (with luck). What would be the
best approach to get a peek?

> Or something else perhaps.
>
> > I've asked the kind user to run xapian-check on the index and post the
> > output.
>
> That's a good thing to check. If xapian-check finds no problems, then
> it's presumably just an in-core issue, which points to a Xapian bug or
> memory issues.

The output of xapian-check follows.

Best regards,
Jf

xapian-check ~/.recoll/xapiandb
record:
baseB blocksize=8K items=943378 lastblock=85955 revision=6207 levels=2 root=18014
B-tree checked okay
record table structure checked OK

termlist:
baseB blocksize=8K items=1886756 lastblock=417475 revision=6207 levels=3 root=83720
B-tree checked okay
termlist table structure checked OK

postlist:
baseB blocksize=8K items=8872525 lastblock=524452 revision=6207 levels=3 root=238
B-tree checked okay
termfreq 197211 != # of entries 197210
collfreq 10861536 != sum wdf 10861533
termfreq 14189 != # of entries 14188
collfreq 98354 != sum wdf 98344
termfreq 9866 != # of entries 9865
collfreq 56453 != sum wdf 56443
termfreq 195141 != # of entries 195137
collfreq 8126093 != sum wdf 8126079
postlist table errors found: 8

position:
baseB blocksize=8K items=180902610 lastblock=1701333 revision=6207 levels=3 root=48617
B-tree checked okay
position table structure checked OK

spelling:
Lazily created, and not yet used.

synonym:
baseB blocksize=8K items=1369690 lastblock=32050 revision=6207 levels=2 root=2
B-tree checked okay
synonym table: Don't know how to check structure

Total errors found: 8
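The remark in this message that deleting the lock file defeats fcntl() locking can be illustrated with a minimal Python sketch. This is not Xapian or Recoll code: the file name "lockfile" and the helpers try_lock, other_process_can_lock and demo are hypothetical, and it assumes a Linux/POSIX system. An fcntl() lock belongs to the file (inode) that was opened, not to its name, so once the lock file is unlinked, another process that re-creates it gets a different file and can lock it while the first holder is still running.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Code run in a separate process: try a non-blocking exclusive fcntl() lock
# on the path given as argument. Exit status 0 means the lock was obtained.
CHILD_CODE = """
import fcntl, os, sys
fd = os.open(sys.argv[1], os.O_RDWR | os.O_CREAT, 0o666)
try:
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.exit(1)
sys.exit(0)
"""


def try_lock(path):
    """Open (creating if needed) the lock file and take an exclusive lock.

    Returns the open file descriptor, or None if another process holds it.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd


def other_process_can_lock(path):
    """Return True if a separate process manages to lock `path`."""
    result = subprocess.run([sys.executable, "-c", CHILD_CODE, path])
    return result.returncode == 0


def demo():
    with tempfile.TemporaryDirectory() as tmpdir:
        lock_path = os.path.join(tmpdir, "lockfile")  # hypothetical name
        fd = try_lock(lock_path)  # the "first writer" takes the lock
        # While the file exists, a second process is refused the lock.
        blocked = not other_process_can_lock(lock_path)
        # Deleting the lock file: the first writer still holds its lock,
        # but on a file that no longer has a name.
        os.unlink(lock_path)
        # A second process re-creates the file and locks the new one.
        defeated = other_process_can_lock(lock_path)
        os.close(fd)
        return blocked, defeated


if __name__ == "__main__":
    print(demo())
```

In this sketch the second process is refused while the lock file is in place, and gets a lock of its own once the file has been deleted, which is the situation that would allow two concurrent writers.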
Olly Betts
2017-May-24 02:40 UTC
Xapian 1.4.3 "Db block overwritten - are there multiple writers?"
On Mon, May 22, 2017 at 07:45:59AM +0200, Jean-Francois Dockes wrote:
> Olly Betts writes:
> > Assuming nobody deleted the log file, this could be a Xapian bug. This

I meant "lock file" not "log file" here.

> > isn't something we're drowning in reports of, so presumably it doesn't
> > trigger easily, so finding a way to reproduce would be good.
> >
> > It could also be memory or disk corruption. We don't currently store
> > a checksum for each block, so there's no explicit detection of this.
> >
> > Or something in the same process wrote to an fd that has since been
> > closed and reused for one of the database tables (Xapian avoids reusing
> > fds 0, 1 and 2 to avoid this for the standard streams, but it's hard to
> > fully protect against this given how fds work).
>
> This is certainly a possibility of course. In this case, we might be able
> to get an idea by looking at the actual data (with luck). What would be the
> best approach to get a peek ?

In this case, the output of xapian-check strongly hints it's unlikely to
be this, or at least not just this.

> > Or something else perhaps.
> >
> > > I've asked the kind user to run xapian-check on the index and post the
> > > output.
> >
> > That's a good thing to check. If xapian-check finds no problems, then
> > it's presumably just an in-core issue, which points to a Xapian bug or
> > memory issues.
>
> The output of xapian-check follows.

> xapian-check ~/.recoll/xapiandb
[...]
> postlist:
> baseB blocksize=8K items=8872525 lastblock=524452 revision=6207 levels=3 root=238
> B-tree checked okay
> termfreq 197211 != # of entries 197210
> collfreq 10861536 != sum wdf 10861533
> termfreq 14189 != # of entries 14188
> collfreq 98354 != sum wdf 98344
> termfreq 9866 != # of entries 9865
> collfreq 56453 != sum wdf 56443
> termfreq 195141 != # of entries 195137
> collfreq 8126093 != sum wdf 8126079
> postlist table errors found: 8
[...]
> Total errors found: 8

Two interesting things here:

Firstly, the parent vs child block revision inconsistency seems to have
gone (xapian-check includes a check for this situation).

Secondly, the only inconsistencies seem to be in the term and collection
frequencies of 4 terms.

I suspect both are a consequence of the exception you originally reported
during commit() (flush() is just a compatibility alias for commit()).
Some updates were made but not committed and then we hit an exception
which meant corresponding updates didn't get applied. Then when the
database is closed, those pending updates get committed, which leaves the
database inconsistent, but also would likely have fixed the mismatching
revisions (if they were on disk) by writing out a new version of the child
and parent.

We do have code to handle clearing pending changes in such cases, but it
looks to me like it's not applied broadly enough. I'll take a look at
addressing that.

However, that only affects what happens after the original exception was
thrown, so couldn't have caused it. Sadly exactly what caused the
original exception is obscured by the effects of this bug and I can't
really narrow down the original exception much - about all I can say is
that if that was due to random corruption or overwriting, it was fairly
localised.

Is this a reproducible (or at least recurring) issue?

Cheers,
Olly
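The two kinds of postlist error quoted in this message compare a term's stored statistics with what its posting list actually contains. A toy Python model of that comparison follows, under stated assumptions: a posting list is represented as a list of (docid, wdf) pairs, the function name check_term_stats is hypothetical, and the message wording only imitates the xapian-check lines above. It is not xapian-check's code.

```python
def check_term_stats(stored_termfreq, stored_collfreq, postings):
    """Compare a term's stored statistics with its posting list.

    postings is a list of (docid, wdf) pairs (an assumed, simplified
    representation). Returns a list of error strings, empty if consistent.
    """
    errors = []
    # termfreq should equal the number of documents in the posting list.
    entries = len(postings)
    if stored_termfreq != entries:
        errors.append(f"termfreq {stored_termfreq} != # of entries {entries}")
    # collfreq should equal the sum of within-document frequencies (wdf).
    sum_wdf = sum(wdf for _docid, wdf in postings)
    if stored_collfreq != sum_wdf:
        errors.append(f"collfreq {stored_collfreq} != sum wdf {sum_wdf}")
    return errors


if __name__ == "__main__":
    # Statistics were updated for a third document (wdf 3), but the
    # matching posting entry never made it into the list.
    postings = [(1, 2), (2, 2)]
    for line in check_term_stats(3, 7, postings):
        print(line)
```

With the made-up data in the example, both checks fail, giving one termfreq line and one collfreq line per affected term, the same pairing seen for each of the 4 terms in the report.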
Jean-Francois Dockes
2017-May-24 08:56 UTC
Xapian 1.4.3 "Db block overwritten - are there multiple writers?"
Olly Betts writes:
> On Mon, May 22, 2017 at 07:45:59AM +0200, Jean-Francois Dockes wrote:
> > Olly Betts writes:
> > > Assuming nobody deleted the log file, this could be a Xapian bug. This
>
> I meant "lock file" not "log file" here.

Sort of had guessed :)

> [...]
> Is this a reproducible (or at least recurring) issue?

I don't know yet, I am going to ask if it would be possible to rebuild the
index and keep monitoring for a recurrence.

Cheers,
jf