thr3ads.net - Xapian discuss - Xapian 1.4.3 "Db block overwritten - are there multiple writers?"

Jean-Francois Dockes

2017-May-17 19:08 UTC

Xapian 1.4.3 "Db block overwritten - are there multiple writers?"

Hi,

I have a user reporting the following error during recoll indexing:

    flush() failed: Db block overwritten - are there multiple writers?

"flush() failed" is from recoll, the rest is, I think the text of the Xapian
exception.

This is with Xapian 1.4.3 on Linux (I asked for more details, should be
coming).

I don't think that I've ever seen this error, and I also don't think that
there have been significant changes to recoll in this area, but as usual, I
may be wrong.

I've asked the kind user to run xapian-check on the index and post the
output.

Anything more I can do?

Cheers,

jf

Olly Betts

2017-May-21 21:54 UTC

On Wed, May 17, 2017 at 09:08:32PM +0200, Jean-Francois Dockes wrote:
> I have a user reporting the following error during recoll indexing:
> 
>     flush() failed: Db block overwritten - are there multiple writers?
> 
> "flush() failed" is from recoll, the rest is, I think the text of the Xapian
> exception.
> 
> This is with Xapian 1.4.3 on Linux (I asked for more details, should be
> coming).
> 
> I don't think that I've ever seen this error, and I also don't think that
> there have been significant changes to recoll in this area, but as usual, I
> may be wrong.

What this means is that the database appears to have a child block which
is newer than its parent block (in the real world children are younger
than their parents, but in current Xapian DBs the reverse should be the
case - blocks are copied on write and the parent block points to its
children, so needs updating whenever any of its children are).
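
The invariant described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the `Block` class, `cow_update` and `check` are hypothetical names invented for illustration, and nothing here reflects Xapian's actual on-disk format or code. It shows why, with copy-on-write, a parent is always stamped with a revision at least as new as any of its children, and how a child that is newer than its parent can be detected.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """Toy tree block: a revision stamp plus pointers to child blocks."""
    revision: int
    children: list = field(default_factory=list)


def cow_update(block, path, new_revision):
    """Copy-on-write update of the block reached by following `path`.

    Every block on the path is copied and stamped with `new_revision`,
    because each parent must be rewritten to point at the new copy of
    its child. Blocks off the path are shared, not copied.
    """
    copy = Block(new_revision, list(block.children))
    if path:
        index = path[0]
        copy.children[index] = cow_update(block.children[index], path[1:], new_revision)
    return copy


def check(block):
    """Raise if any child carries a newer revision than its parent."""
    for child in block.children:
        if child.revision > block.revision:
            raise RuntimeError(
                "child revision %d is newer than parent revision %d"
                % (child.revision, block.revision)
            )
        check(child)


if __name__ == "__main__":
    root = Block(1, [Block(1), Block(1)])
    new_root = cow_update(root, [1], 2)
    check(new_root)  # passes: the parent was updated along with its child
```

If some other writer stamped a shared child with a later revision without rewriting the parent, `check` would raise, which is the toy analogue of the situation the error message describes.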

When reading a database, this is possible if a writer has updated that
part of the tree between reading the parent and reading the child (and
gives DatabaseModifiedError).

When writing, this shouldn't happen.

As the error suggests, if you manage to get multiple concurrent writers
this could happen.  There's locking which should prevent this, but that
can be defeated if the lock file is deleted (which people sometimes add
code to do, misunderstanding how the lock file is used - fcntl() locking
is used, and the lock file should always be present).
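
To make the lock-file point concrete, here is a minimal sketch for Linux or another POSIX system. It is not Xapian's locking code: the file name `lockfile` and the helper names are assumptions made up for the example, and Python's `fcntl.lockf()` is used because it wraps `fcntl()` record locking. The lock belongs to the open file, not to the path, so deleting the path lets a second process create and lock a brand new file while the first process still believes it holds the lock.

```python
import fcntl
import os
import subprocess
import sys
import tempfile

# Child program: try to take a non-blocking exclusive fcntl() lock on argv[1].
CHILD = """
import fcntl, os, sys
fd = os.open(sys.argv[1], os.O_RDWR | os.O_CREAT, 0o666)
try:
    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
except OSError:
    sys.exit(1)  # somebody else holds the lock
sys.exit(0)      # lock acquired
"""


def other_process_can_lock(path):
    """Return True if a separate process manages to lock `path`."""
    result = subprocess.run([sys.executable, "-c", CHILD, path])
    return result.returncode == 0


def demo():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "lockfile")  # hypothetical name
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o666)
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)  # the first "writer"

        blocked_before = not other_process_can_lock(path)

        os.unlink(path)  # the mistake: "cleaning up" the lock file
        acquired_after = other_process_can_lock(path)

        os.close(fd)
        return blocked_before, acquired_after


if __name__ == "__main__":
    print(demo())
```

While the file is left in place the second process is refused; once the path has been unlinked, the second process gets a lock on a different file and both would consider themselves the only writer.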

Assuming nobody deleted the lock file, this could be a Xapian bug.  This
isn't something we're drowning in reports of, so presumably it doesn't
trigger easily, so finding a way to reproduce would be good.

It could also be memory or disk corruption.  We don't currently store
a checksum for each block, so there's no explicit detection of this.
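
For illustration only, this is roughly what per-block checksumming would look like. As stated above, Xapian does not currently do this, so the sketch is purely hypothetical: `seal` and `verify` are invented names, and CRC-32 from Python's `zlib` module is just one possible choice of checksum.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Return the payload prefixed with its CRC-32 (4 bytes, big-endian)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def verify(block: bytes) -> bool:
    """Check that the stored CRC-32 matches the payload that follows it."""
    stored = int.from_bytes(block[:4], "big")
    return zlib.crc32(block[4:]) == stored


if __name__ == "__main__":
    block = seal(b"some block contents")
    damaged = bytearray(block)
    damaged[10] ^= 0x01  # flip a single bit, as bad RAM or a bad disk might
    print(verify(block), verify(bytes(damaged)))
```

With such a check, a block damaged in memory or on disk would be reported as corrupt when read, instead of surfacing later as an inconsistency between blocks.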

Or something in the same process wrote to an fd that has since been
closed and reused for one of the database tables (Xapian avoids reusing
fds 0, 1 and 2 to avoid this for the standard streams, but it's hard to
fully protect against this given how fds work).
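
The descriptor-reuse scenario can be reproduced in a few lines. This is a sketch with made-up file names (`app.log`, `table.db`), not anything taken from recoll or Xapian. It relies on the POSIX rule that `open()` returns the lowest descriptor number not currently in use, so a number that was just closed is normally handed out again to the next file opened.

```python
import os
import tempfile


def demo_fd_reuse():
    with tempfile.TemporaryDirectory() as directory:
        log_path = os.path.join(directory, "app.log")     # hypothetical
        table_path = os.path.join(directory, "table.db")  # hypothetical

        # Some code opens a log file, closes it, but keeps the number around.
        stale_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT, 0o666)
        os.close(stale_fd)

        # Later, a database table is opened and receives the freed number.
        table_fd = os.open(table_path, os.O_WRONLY | os.O_CREAT, 0o666)
        reused = table_fd == stale_fd

        if reused:
            # The stale writer thinks it is logging, but hits the table.
            os.write(stale_fd, b"log line\n")

        os.close(table_fd)
        with open(table_path, "rb") as table:
            return reused, table.read()


if __name__ == "__main__":
    print(demo_fd_reuse())
```

The write through the stale number succeeds without any error, which is why this kind of bug is hard to guard against from inside the library.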

Or something else perhaps.

> I've asked the kind user to run xapian-check on the index and post the
> output.

That's a good thing to check.  If xapian-check finds no problems, then
it's presumably just an in-core issue, which points to a Xapian bug or
memory issues.

Cheers,
    Olly

Jean-Francois Dockes

2017-May-22 05:45 UTC

Olly Betts writes:
 > On Wed, May 17, 2017 at 09:08:32PM +0200, Jean-Francois Dockes wrote:
 > > I have a user reporting the following error during recoll indexing:
 > > 
 > >     flush() failed: Db block overwritten - are there multiple writers?
 > > 
 > > "flush() failed" is from recoll, the rest is, I think the text of the Xapian
 > > exception.
 > > 
 > > This is with Xapian 1.4.3 on Linux (I asked for more details, should be
 > > coming).
 > > 
 > > I don't think that I've ever seen this error, and I also don't think that
 > > there have been significant changes to recoll in this area, but as usual, I
 > > may be wrong.
 > 
 > What this means is that the database appears to have a child block which
 > is newer than its parent block (in the real world children are younger
 > than their parents, but in current Xapian DBs the reverse should be the
 > case - blocks are copied on write and the parent block points to its
 > children, so needs updating whenever any of its children are).
 > 
 > When reading a database, this is possible if a writer has updated that
 > part of the tree between reading the parent and reading the child (and
 > gives DatabaseModifiedError).
 > 
 > When writing, this shouldn't happen.
 > 
 > As the error suggests, if you manage to get multiple concurrent writers
 > this could happen.  There's locking which should prevent this, but that
 > can be defeated if the lock file is deleted (which people sometimes add
 > code to do, misunderstanding how the lock file is used - fcntl() locking
 > is used, and the lock file should always be present).

I don't think that there is code in Recoll doing this. Recoll also has its
own protection against multiple writer processes, and in the normal
configuration, a single thread uses the WritableDatabase. It's also
possible to set things up for multiple writing threads though (with lock
protection in this case). I've asked the user to confirm the thread
configuration.

 > Assuming nobody deleted the lock file, this could be a Xapian bug.  This
 > isn't something we're drowning in reports of, so presumably it doesn't
 > trigger easily, so finding a way to reproduce would be good.
 > 
 > It could also be memory or disk corruption.  We don't currently store
 > a checksum for each block, so there's no explicit detection of this.
 > 
 > Or something in the same process wrote to an fd that has since been
 > closed and reused for one of the database tables (Xapian avoids reusing
 > fds 0, 1 and 2 to avoid this for the standard streams, but it's hard to
 > fully protect against this given how fds work).

This is certainly a possibility of course. In this case, we might be able
to get an idea by looking at the actual data (with luck). What would be the
best approach to get a peek?

 > Or something else perhaps.
 > 
 > > I've asked the kind user to run xapian-check on the index and post the
 > > output.
 > 
 > That's a good thing to check.  If xapian-check finds no problems, then
 > it's presumably just an in-core issue, which points to a Xapian bug or
 > memory issues.

The output of xapian-check follows.

Best regards,

Jf

xapian-check ~/.recoll/xapiandb
record:
baseB blocksize=8K items=943378 lastblock=85955 revision=6207 levels=2 root=18014
B-tree checked okay
record table structure checked OK

termlist:
baseB blocksize=8K items=1886756 lastblock=417475 revision=6207 levels=3 root=83720
B-tree checked okay
termlist table structure checked OK

postlist:
baseB blocksize=8K items=8872525 lastblock=524452 revision=6207 levels=3 root=238
B-tree checked okay
termfreq 197211 != # of entries 197210
collfreq 10861536 != sum wdf 10861533
termfreq 14189 != # of entries 14188
collfreq 98354 != sum wdf 98344
termfreq 9866 != # of entries 9865
collfreq 56453 != sum wdf 56443
termfreq 195141 != # of entries 195137
collfreq 8126093 != sum wdf 8126079
postlist table errors found: 8

position:
baseB blocksize=8K items=180902610 lastblock=1701333 revision=6207 levels=3 root=48617
B-tree checked okay
position table structure checked OK

spelling:
Lazily created, and not yet used.

synonym:
baseB blocksize=8K items=1369690 lastblock=32050 revision=6207 levels=2 root=2
B-tree checked okay
synonym table: Don't know how to check structure

Total errors found: 8
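
As a reading aid for the postlist errors above, here is a small sketch that extracts the eight mismatch lines and computes the difference in each pair. The regular expression and the function name `mismatches` are assumptions made for this example and are not part of xapian-check. The sketch takes no position on which side of each `!=` is the stored statistic and which is the recount; it only subtracts the second number from the first.

```python
import re

# The mismatch lines reported for the postlist table, copied from the output.
REPORT = """\
termfreq 197211 != # of entries 197210
collfreq 10861536 != sum wdf 10861533
termfreq 14189 != # of entries 14188
collfreq 98354 != sum wdf 98344
termfreq 9866 != # of entries 9865
collfreq 56453 != sum wdf 56443
termfreq 195141 != # of entries 195137
collfreq 8126093 != sum wdf 8126079
"""

PATTERN = re.compile(
    r"^(termfreq|collfreq) (\d+) != (?:# of entries|sum wdf) (\d+)$"
)


def mismatches(report):
    """Yield (statistic, first, second, first - second) for each line."""
    for line in report.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            first, second = int(match.group(2)), int(match.group(3))
            yield match.group(1), first, second, first - second


if __name__ == "__main__":
    for stat, first, second, delta in mismatches(REPORT):
        print(f"{stat}: {first} vs {second} (difference {delta})")
```

In every pair the first number is the larger one, and the differences are small relative to the totals.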
