Olly Betts writes:
> On Sun, Jan 10, 2016 at 02:53:14AM +0000, Bob Cargill wrote:
> > I am the recoll user mentioned in the first post above. I still have a copy
> > of the (potentially) corrupted index and I did the requested testing.
> >
> > I ran delve -t '' ./xapiandb on the index and it returned a very long list
> > of document IDs, separated by spaces. I then ran delve -t '' ./xapiandb |
> > grep " 6 " and it returned nothing.
> >
> > So, document 6 was not in the list.
> >
> > There were other documents missing from the index as well, so I ran
> > delve -t '' ./xapiandb | head -c 100
> > The first ID was 257, then it began sequentially from 356. Looks like the
> > first approximately 350 document IDs are "missing."
>
> OK, that matches what I suspected was happening.
>
> I've extended xapian-check so it should catch this case - you can get
> the patch here ("Unified Diff" link at the bottom):
>
> http://trac.xapian.org/changeset/ee3bc009d98a7cea8a2944135f38626e73bbcae3/git

Thanks Olly! Bob: I'll ping you from the Recoll issue about running the
new xapian-check (I have built it).

> > I will look into the bug you listed to see if it might be related. If there
> > is anything else that I can do, please let me know.
>
> If that bug is not the cause, it would be good to get to the bottom of this -
> the database shouldn't become corrupt like this.

I remembered something: I could only reproduce issue #645 with separate
read/write database objects, but this one is with recoll 1.21, which uses a
single object, so maybe a different problem.

While a Xapian bug might be involved, there are many reasons why a Recoll
indexer can meet an abrupt end in the general case (not saying this is the
case here). A pulled power cord would be the most radical example. Recoll
usually does not run in a datacenter...

In most cases, the data is replaceable without too much effort, so reliable
detection of an issue is almost as good as assurance that it won't occur.
The latter seems very difficult to attain when running in an uncontrolled
environment.

There is one weird thing though, which is why, in this situation,
replace_document() appears to repeatedly accept data which goes into a
black hole.

Cheers,
jf
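For reference, the check Bob ran with delve can also be done through the
Xapian Python bindings. The fragment below is only a sketch of that idea,
not something from the original exchange: it assumes the Python bindings are
installed and a database directory named ./xapiandb, and it simply walks the
posting list of the empty term (which is what delve -t '' prints) to spot
absent document IDs.

    # Sketch: the same check as "delve -t '' ./xapiandb", done through the
    # Python bindings.  The ./xapiandb path and the gap-scanning loop are
    # illustrative, not taken from Recoll or the original report.
    import xapian

    db = xapian.Database("./xapiandb")

    # The empty term's posting list covers every document the postlist table
    # knows about -- this is the list that delve -t '' prints.
    seen = set(item.docid for item in db.postlist(""))

    print("doccount :", db.get_doccount())
    print("lastdocid:", db.get_lastdocid())

    # Gaps can be legitimate (deleted documents), but a long initial run of
    # absent ids like 1..355 is the symptom described above.
    missing = [did for did in range(1, db.get_lastdocid() + 1)
               if did not in seen]
    print("first missing docids:", missing[:20])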
On Thu, Jan 14, 2016 at 11:04:29AM +0100, Jean-Francois Dockes wrote:
> Olly Betts writes:
> > On Sun, Jan 10, 2016 at 02:53:14AM +0000, Bob Cargill wrote:
> > > I will look into the bug you listed to see if it might be related. If there
> > > is anything else that I can do, please let me know.
> >
> > If that bug is not the cause, it would be good to get to the bottom of this -
> > the database shouldn't become corrupt like this.
>
> I remembered something: I could only reproduce issue #645 with separate
> read/write database objects, but this one is with recoll 1.21, which uses a
> single object, so maybe a different problem.

The underlying bug for #645 was that cursors weren't getting rebuilt in
some situations where they needed to be, and could end up with bad data
in, and that bad data could be stale data. So it's plausible a write
might go to the wrong block, which could explain "lost" data like we
have here.

It could easily be a different problem, but testing with the latest
1.2.x would be useful to make sure we aren't trying to track down a bug
we've already fixed.

> While a Xapian bug might be involved, there are many reasons why a Recoll
> indexer can meet an abrupt end in the general case (not saying this is
> the case here).
> A pulled power cord would be the most radical example. Recoll usually does
> not run in a datacenter...
>
> In most cases, the data is replaceable without too much effort, so that
> reliable detection of an issue is almost as good as assurance that it won't
> occur. The latter seems very difficult to attain when running in an
> uncontrolled environment.

It may not matter for recoll, but more generally we don't want Xapian
databases getting corrupt. And we do aim to survive power failures,
kernel panics, etc - achieving that in all cases is rather hard, but I
don't think that's a reason to drop it as an aim.

Examples of corruption that can be reproduced (even if it's not entirely
on demand) are very useful - if you can see the corruption happen it's
a lot easier to work out what is going wrong than if you just see the
aftermath.

> There is one weird thing though, which is why, in this situation,
> replace_document() appears to repeatedly accept data which goes into a
> black hole.

Are you replacing the document with the same data?

If so, I think what happens is that it looks in the termlist table to
see if the document exists. It does, so it compares the terms and sees
they are the same, and decides there's nothing to do.

It never looks at the document length list, so doesn't see that it is
damaged.

Or if it's different data, but with the same "document length" (i.e.
sum(wdf)), then it'll update the termlist, but spot the length hasn't
changed so again not bother to look at the document length list.

If you replaced the document with a modified version with a different
length, I'd expect this would actually "self-heal".

Cheers,
Olly
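To put the last point in concrete terms: whether the update reaches the
document length list comes down to whether sum(wdf) of the replacement
differs from what is already stored. The fragment below is only a sketch of
that distinction at the API level (Python bindings, with a made-up docid,
terms and path); it is not Recoll's indexing code.

    # Sketch (not Recoll's code): replace a document with a version whose
    # total wdf differs, so the document length list entry must be rewritten
    # instead of the update being skipped.  Docid, terms and path are made up.
    import xapian

    db = xapian.WritableDatabase("./xapiandb", xapian.DB_OPEN)

    docid = 6
    doc = xapian.Document()
    doc.set_data("illustrative replacement data")
    doc.add_term("hello")        # wdf 1
    doc.add_term("world")        # wdf 1 -> document length 2

    # One extra occurrence changes sum(wdf) to 3, so this replacement cannot
    # be short-circuited on "same terms, same length".
    doc.add_term("hello")

    db.replace_document(docid, doc)
    db.commit()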
Olly Betts <olly <at> survex.com> writes:
> On Thu, Jan 14, 2016 at 11:04:29AM +0100, Jean-Francois Dockes wrote:
> > Olly Betts writes:
> > > On Sun, Jan 10, 2016 at 02:53:14AM +0000, Bob Cargill wrote:
> > > > I will look into the bug you listed to see if it might be related. If there
> > > > is anything else that I can do, please let me know.
> > >
> > > If that bug is not the cause, it would be good to get to the bottom of this -
> > > the database shouldn't become corrupt like this.
> >
> > I remembered something: I could only reproduce issue #645 with separate
> > read/write database objects, but this one is with recoll 1.21, which uses a
> > single object, so maybe a different problem.
>
> The underlying bug for #645 was that cursors weren't getting rebuilt in
> some situations where they needed to be, and could end up with bad data
> in, and that bad data could be stale data. So it's plausible a write
> might go to the wrong block, which could explain "lost" data like we
> have here.
>
> It could easily be a different problem, but testing with the latest
> 1.2.x would be useful to make sure we aren't trying to track down a bug
> we've already fixed.

I don't see the most recent xapian in the Ubuntu 14.04 repositories. I have
1.2.16 from the Ubuntu Trusty repositories. The PPA doesn't list a Trusty
version. Is 1.2.22 easy to build? I have not built anything on linux (sorry
for the naive question; I've built under windows in the distant past, so I'm
familiar with compilers, source files, and make).

> It may not matter for recoll, but more generally we don't want Xapian
> databases getting corrupt. And we do aim to survive power failures,
> kernel panics, etc - achieving that in all cases is rather hard, but I
> don't think that's a reason to drop it as an aim.
>
> Examples of corruption that can be reproduced (even if it's not entirely
> on demand) are very useful - if you can see the corruption happen it's
> a lot easier to work out what is going wrong than if you just see the
> aftermath.

I have the log file from recoll from (almost) the beginning. I'm not sure if
I can cull out the source of the errors, but I will take a look at what was
happening with recoll when the errors began.

> > There is one weird thing though, which is why, in this situation,
> > replace_document() appears to repeatedly accept data which goes into a
> > black hole.
>
> Are you replacing the document with the same data?

Yes. All of the files that I mentioned earlier (~350) have not been changed
in years.

> If so, I think what happens is that it looks in the termlist table to
> see if the document exists. It does, so it compares the terms and sees
> they are the same, and decides there's nothing to do.
>
> It never looks at the document length list, so doesn't see that it is
> damaged.

This is likely what happened. Recoll should have provided the same terms for
the file and the same length.
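For anyone wanting to see that mismatch from the outside, the two structures
Olly mentions can be consulted independently through the bindings. The
snippet below is just a sketch (Python bindings, illustrative docid and
path); on a damaged 1.2.x database the exact errors raised may differ.

    # Sketch: compare the termlist table's view of a document with the stored
    # document length.  On a healthy database sum(wdf) over the termlist
    # matches get_doclength(); the corruption discussed here is roughly
    # "the termlist has the document, the length/posting side does not".
    import xapian

    db = xapian.Database("./xapiandb")
    docid = 6  # one of the "missing" documents in this thread

    try:
        wdf_sum = sum(item.wdf for item in db.termlist(docid))
        print("termlist entry present, sum(wdf) =", wdf_sum)
    except xapian.Error as e:
        print("termlist lookup failed:", e)

    try:
        print("stored document length =", db.get_doclength(docid))
    except xapian.Error as e:
        print("document length lookup failed:", e)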
Olly Betts writes:
> On Thu, Jan 14, 2016 at 11:04:29AM +0100, Jean-Francois Dockes wrote:
> > Olly Betts writes:
> > > On Sun, Jan 10, 2016 at 02:53:14AM +0000, Bob Cargill wrote:
> > > > I will look into the bug you listed to see if it might be
> > > > related. If there is anything else that I can do, please let me
> > > > know.
> > >
> > > If that bug is not the cause, it would be good to get to the bottom
> > > of this - the database shouldn't become corrupt like this.
> >
> > I remembered something: I could only reproduce issue #645 with
> > separate read/write database objects, but this one is with recoll
> > 1.21, which uses a single object, so maybe a different problem.
>
> The underlying bug for #645 was that cursors weren't getting rebuilt in
> some situations where they needed to be, and could end up with bad data
> in, and that bad data could be stale data. So it's plausible a write
> might go to the wrong block, which could explain "lost" data like we
> have here.
>
> It could easily be a different problem, but testing with the latest
> 1.2.x would be useful to make sure we aren't trying to track down a bug
> we've already fixed.
>
> > While a Xapian bug might be involved, there are many reasons why a
> > Recoll indexer can meet an abrupt end in the general case (not saying
> > this is the case here).
> > A pulled power cord would be the most radical example. Recoll usually
> > does not run in a datacenter...
> >
> > In most cases, the data is replaceable without too much effort, so
> > that reliable detection of an issue is almost as good as assurance
> > that it won't occur. The latter seems very difficult to attain when
> > running in an uncontrolled environment.
>
> It may not matter for recoll, but more generally we don't want Xapian
> databases getting corrupt. And we do aim to survive power failures,
> kernel panics, etc - achieving that in all cases is rather hard, but I
> don't think that's a reason to drop it as an aim.

It was not my intention to suggest this. As an aside, it *does* matter for
Recoll that its index would survive such events. A few Recoll users have
gigantic indexes (hopefully in sane environments), needing multiple days to
rebuild.

Being oldish and having spent 30 years around data management issues, I just
happen to believe that datacenter RDBMS-type reliability is *not possible*
for the typical Recoll installation, on a random machine, with an arbitrary
filesystem and IO subsystem (haven't there been a few issues around Linux fs
data post-crash consistency?).

This is why I believe that, faced with uncertain reliability, and equipped
with backed-up data, corruption detection is a very important feature, even
if it can't be completely reliable either.

> Examples of corruption that can be reproduced (even if it's not entirely
> on demand) are very useful - if you can see the corruption happen it's
> a lot easier to work out what is going wrong than if you just see the
> aftermath.

And I do intend to provide such examples whenever possible. I was just
trying to make it clear that I was not necessarily looking for a fault in
Xapian code.

> > There is one weird thing though, which is why, in this situation,
> > replace_document() appears to repeatedly accept data which goes into a
> > black hole.
>
> Are you replacing the document with the same data?

Bob answered this, yes, mystery solved.

Cheers,
jf

> If so, I think what happens is that it looks in the termlist table to
> see if the document exists. It does, so it compares the terms and sees
> they are the same, and decides there's nothing to do.
>
> It never looks at the document length list, so doesn't see that it is
> damaged.
>
> Or if it's different data, but with the same "document length" (i.e.
> sum(wdf)), then it'll update the termlist, but spot the length hasn't
> changed so again not bother to look at the document length list.
>
> If you replaced the document with a modified version with a different
> length, I'd expect this would actually "self-heal".
>
> Cheers,
> Olly