Jean-Francois Dockes
2017-May-17 19:08 UTC
Xapian 1.4.3 "Db block overwritten - are there multiple writers?"
Hi,

I have a user reporting the following error during recoll indexing:

  flush() failed: Db block overwritten - are there multiple writers?

"flush() failed" is from recoll, the rest is, I think, the text of the
Xapian exception.

This is with Xapian 1.4.3 on Linux (I asked for more details, should be
coming).

I don't think that I've ever seen this error, and I also don't think that
there have been significant changes to recoll in this area, but as usual, I
may be wrong.

I've asked the kind user to run xapian-check on the index and post the
output.

Anything more I can do?

Cheers,
jf
Olly Betts
2017-May-21 21:54 UTC
Xapian 1.4.3 "Db block overwritten - are there multiple writers?"
On Wed, May 17, 2017 at 09:08:32PM +0200, Jean-Francois Dockes wrote:
> I have a user reporting the following error during recoll indexing:
>
>   flush() failed: Db block overwritten - are there multiple writers?
>
> "flush() failed" is from recoll, the rest is, I think, the text of the
> Xapian exception.
>
> This is with Xapian 1.4.3 on Linux (I asked for more details, should be
> coming).
>
> I don't think that I've ever seen this error, and I also don't think that
> there have been significant changes to recoll in this area, but as usual, I
> may be wrong.

What this means is that the database appears to have a child block which
is newer than its parent block (in the real world children are younger
than their parents, but in current Xapian DBs the reverse should be the
case - blocks are copied on write and the parent block points to its
children, so needs updating whenever any of its children are).

When reading a database, this is possible if a writer has updated that
part of the tree between reading the parent and reading the child (and
gives DatabaseModifiedError).

When writing, this shouldn't happen.

As the error suggests, if you manage to get multiple concurrent writers
this could happen. There's locking which should prevent this, but that
can be defeated if the lock file is deleted (which people sometimes add
code to do, misunderstanding how the lock file is used - fcntl() locking
is used, and the lock file should always be present).

Assuming nobody deleted the lock file, this could be a Xapian bug. This
isn't something we're drowning in reports of, so presumably it doesn't
trigger easily, so finding a way to reproduce would be good.

It could also be memory or disk corruption. We don't currently store
a checksum for each block, so there's no explicit detection of this.

Or something in the same process wrote to an fd that has since been
closed and reused for one of the database tables (Xapian avoids reusing
fds 0, 1 and 2 to avoid this for the standard streams, but it's hard to
fully protect against this given how fds work).

Or something else perhaps.

> I've asked the kind user to run xapian-check on the index and post the
> output.

That's a good thing to check. If xapian-check finds no problems, then
it's presumably just an in-core issue, which points to a Xapian bug or
memory issues.

Cheers,
Olly
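
The parent/child revision rule described above can be illustrated with a
toy model. This is only a sketch under stated assumptions, not Xapian's
implementation: the names Block, check_path, cow_update and
BlockOverwritten are hypothetical, and real blocks live on disk rather
than in Python objects. It shows that a copy-on-write update rewrites
every block on the path to the root, so a child that is newer than its
parent can only appear if something else wrote the child.

```python
# Toy model only: all names here are hypothetical, not Xapian's API.
from dataclasses import dataclass, field


class BlockOverwritten(Exception):
    """Raised when a child block is newer than the parent pointing at it."""


@dataclass
class Block:
    revision: int                                  # revision that wrote this block
    children: list = field(default_factory=list)   # empty for a leaf


def check_path(root, path):
    """Follow the child indexes in `path` from `root`, checking revisions."""
    parent = root
    for index in path:
        child = parent.children[index]
        if child.revision > parent.revision:
            raise BlockOverwritten(
                f"child rev {child.revision} > parent rev {parent.revision}")
        parent = child
    return parent


def cow_update(root, path, new_revision):
    """Copy-on-write update along `path`; returns the new root block.

    Every block on the path is copied and stamped with `new_revision`.
    Blocks off the path are shared with the old tree and keep their
    older revision.
    """
    new_root = Block(new_revision, list(root.children))
    old_node, new_node = root, new_root
    for index in path:
        old_child = old_node.children[index]
        new_child = Block(new_revision, list(old_child.children))
        new_node.children[index] = new_child
        old_node, new_node = old_child, new_child
    return new_root


if __name__ == "__main__":
    root = Block(1, [Block(1), Block(1)])
    new_root = cow_update(root, [0], 2)
    check_path(new_root, [0])        # fine: parent rev 2, child rev 2
    check_path(new_root, [1])        # fine: parent rev 2, child rev 1

    # Simulate a second writer replacing a child without updating the parent.
    root.children[0] = Block(3)
    try:
        check_path(root, [0])
    except BlockOverwritten as exc:
        print("detected:", exc)
```

In this model a reader that hits the exception could simply reopen the
tree at the newer root, which corresponds to the DatabaseModifiedError
case. A single writer should never hit it, which is why the message
asks about multiple writers.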
Jean-Francois Dockes
2017-May-22 05:45 UTC
Xapian 1.4.3 "Db block overwritten - are there multiple writers?"
Olly Betts writes:
> On Wed, May 17, 2017 at 09:08:32PM +0200, Jean-Francois Dockes wrote:
> > I have a user reporting the following error during recoll indexing:
> >
> >   flush() failed: Db block overwritten - are there multiple writers?
> >
> > "flush() failed" is from recoll, the rest is, I think, the text of the
> > Xapian exception.
> >
> > This is with Xapian 1.4.3 on Linux (I asked for more details, should be
> > coming).
> >
> > I don't think that I've ever seen this error, and I also don't think that
> > there have been significant changes to recoll in this area, but as usual,
> > I may be wrong.
>
> What this means is that the database appears to have a child block which
> is newer than its parent block (in the real world children are younger
> than their parents, but in current Xapian DBs the reverse should be the
> case - blocks are copied on write and the parent block points to its
> children, so needs updating whenever any of its children are).
>
> When reading a database, this is possible if a writer has updated that
> part of the tree between reading the parent and reading the child (and
> gives DatabaseModifiedError).
>
> When writing, this shouldn't happen.
>
> As the error suggests, if you manage to get multiple concurrent writers
> this could happen. There's locking which should prevent this, but that
> can be defeated if the lock file is deleted (which people sometimes add
> code to do, misunderstanding how the lock file is used - fcntl() locking
> is used, and the lock file should always be present).

I don't think that there is code in Recoll doing this. Recoll also has its
own protection against multiple writer processes, and in the normal
configuration, a single thread uses the WritableDatabase.

It's also possible to set things up for multiple writing threads though
(with lock protection in this case). I've asked the user to confirm the
thread configuration.

> Assuming nobody deleted the lock file, this could be a Xapian bug. This
> isn't something we're drowning in reports of, so presumably it doesn't
> trigger easily, so finding a way to reproduce would be good.
>
> It could also be memory or disk corruption. We don't currently store
> a checksum for each block, so there's no explicit detection of this.
>
> Or something in the same process wrote to an fd that has since been
> closed and reused for one of the database tables (Xapian avoids reusing
> fds 0, 1 and 2 to avoid this for the standard streams, but it's hard to
> fully protect against this given how fds work).

This is certainly a possibility of course. In this case, we might be able
to get an idea by looking at the actual data (with luck). What would be the
best approach to get a peek?

> Or something else perhaps.
>
> > I've asked the kind user to run xapian-check on the index and post the
> > output.
>
> That's a good thing to check. If xapian-check finds no problems, then
> it's presumably just an in-core issue, which points to a Xapian bug or
> memory issues.

The output of xapian-check follows.

Best regards,
Jf

xapian-check ~/.recoll/xapiandb
record:
baseB blocksize=8K items=943378 lastblock=85955 revision=6207 levels=2 root=18014
B-tree checked okay
record table structure checked OK

termlist:
baseB blocksize=8K items=1886756 lastblock=417475 revision=6207 levels=3 root=83720
B-tree checked okay
termlist table structure checked OK

postlist:
baseB blocksize=8K items=8872525 lastblock=524452 revision=6207 levels=3 root=238
B-tree checked okay
termfreq 197211 != # of entries 197210
collfreq 10861536 != sum wdf 10861533
termfreq 14189 != # of entries 14188
collfreq 98354 != sum wdf 98344
termfreq 9866 != # of entries 9865
collfreq 56453 != sum wdf 56443
termfreq 195141 != # of entries 195137
collfreq 8126093 != sum wdf 8126079
postlist table errors found: 8

position:
baseB blocksize=8K items=180902610 lastblock=1701333 revision=6207 levels=3 root=48617
B-tree checked okay
position table structure checked OK

spelling:
Lazily created, and not yet used.

synonym:
baseB blocksize=8K items=1369690 lastblock=32050 revision=6207 levels=2 root=2
B-tree checked okay
synonym table: Don't know how to check structure

Total errors found: 8
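
To make the postlist mismatches easier to compare, the eight error lines
can be tabulated with a short script. This is a rough sketch, not part
of xapian-check: the regular expression and the function name mismatches
are assumptions based only on the line shapes in the report above, and
it assumes the left-hand number is the stored statistic and the
right-hand number is what xapian-check counted.

```python
import re

# Mismatch lines copied from the postlist section of the report above.
REPORT = """\
termfreq 197211 != # of entries 197210
collfreq 10861536 != sum wdf 10861533
termfreq 14189 != # of entries 14188
collfreq 98354 != sum wdf 98344
termfreq 9866 != # of entries 9865
collfreq 56453 != sum wdf 56443
termfreq 195141 != # of entries 195137
collfreq 8126093 != sum wdf 8126079
"""

# Hypothetical pattern: matches only the two line shapes seen in this report.
MISMATCH = re.compile(
    r"^\s*(termfreq|collfreq) (\d+) != (?:# of entries|sum wdf) (\d+)\s*$")


def mismatches(text):
    """Return (statistic, left, right, left - right) for each mismatch line."""
    rows = []
    for line in text.splitlines():
        m = MISMATCH.match(line)
        if m:
            left, right = int(m.group(2)), int(m.group(3))
            rows.append((m.group(1), left, right, left - right))
    return rows


if __name__ == "__main__":
    for stat, left, right, delta in mismatches(REPORT):
        print(f"{stat}: {left} vs {right} (difference {delta:+d})")
```

In every line the left-hand value is the larger one, by between 1 and
14, and the lines come in termfreq/collfreq pairs, which would be
consistent with four terms each having a few postings fewer than their
statistics claim.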