Olly Betts writes:
> On Sun, Jan 10, 2016 at 02:53:14AM +0000, Bob Cargill wrote:
> > I am the recoll user mentioned in the first post above. I still have a copy
> > of the (potentially) corrupted index and I did the requested testing.
> >
> > I ran delve -t '' ./xapiandb on the index and it returned a very long list
> > of document IDs, separated by spaces. I then ran delve -t '' ./xapiandb |
> > grep " 6 " and it returned nothing.
> >
> > So, document 6 was not in the list.
> >
> > There were other documents missing from the index as well, so I ran
> > delve -t '' ./xapiandb | head -c 100
> > The first ID was 257, then it began sequentially from 356. Looks like the
> > first approximately 350 document IDs are "missing."
>
> OK, that matches what I suspected was happening.
>
> I've extended xapian-check so it should catch this case - you can get
> the patch here ("Unified Diff" link at the bottom):
>
> http://trac.xapian.org/changeset/ee3bc009d98a7cea8a2944135f38626e73bbcae3/git

Thanks Olly! Bob: I'll ping you from the Recoll issue about running the
new xapian-check (I have built it).

> > I will look into the bug you listed to see if it might be related. If there
> > is anything else that I can do, please let me know.
>
> If that bug is not the cause, it would be good to get to the bottom of this -
> the database shouldn't become corrupt like this.

I remembered something: I could only reproduce issue #645 with separate
read/write database objects, but this one is with recoll 1.21, which uses a
single object, so maybe a different problem.

While a Xapian bug might be involved, there are many reasons why a Recoll
indexer can meet an abrupt end in the general case (not saying this is the
case here). A pulled power cord would be the most radical example. Recoll
usually does not run in a datacenter...

In most cases, the data is replaceable without too much effort, so reliable
detection of an issue is almost as good as assurance that it won't occur.
The latter seems very difficult to attain when running in an uncontrolled
environment.

There is one weird thing though, which is why, in this situation,
replace_document() appears to repeatedly accept data which goes into a
black hole.

Cheers,
jf
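For reference, the check Bob ran with delve can also be done through the
Xapian Python bindings. The fragment below is only a sketch of that idea,
not something from the original exchange: it assumes the Python bindings are
installed and a database directory named ./xapiandb, and it simply walks the
posting list of the empty term (which is what delve -t '' prints) to spot
absent document IDs.

    # Sketch: the same check as "delve -t '' ./xapiandb", done through the
    # Python bindings.  The ./xapiandb path and the gap-scanning loop are
    # illustrative, not taken from Recoll or the original report.
    import xapian

    db = xapian.Database("./xapiandb")

    # The empty term's posting list covers every document the postlist table
    # knows about -- this is the list that delve -t '' prints.
    seen = set(item.docid for item in db.postlist(""))

    print("doccount :", db.get_doccount())
    print("lastdocid:", db.get_lastdocid())

    # Gaps can be legitimate (deleted documents), but a long initial run of
    # absent ids like 1..355 is the symptom described above.
    missing = [did for did in range(1, db.get_lastdocid() + 1)
               if did not in seen]
    print("first missing docids:", missing[:20])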
On Thu, Jan 14, 2016 at 11:04:29AM +0100, Jean-Francois Dockes wrote:
> Olly Betts writes:
> > On Sun, Jan 10, 2016 at 02:53:14AM +0000, Bob Cargill wrote:
> > > I will look into the bug you listed to see if it might be related. If there
> > > is anything else that I can do, please let me know.
> >
> > If that bug is not the cause, it would be good to get to the bottom of this -
> > the database shouldn't become corrupt like this.
>
> I remembered something: I could only reproduce issue #645 with separate
> read/write database objects, but this one is with recoll 1.21, which uses a
> single object, so maybe a different problem.

The underlying bug for #645 was that cursors weren't getting rebuilt in
some situations where they needed to be, and could end up with bad data
in, and that bad data could be stale data. So it's plausible a write
might go to the wrong block, which could explain "lost" data like we
have here.

It could easily be a different problem, but testing with the latest
1.2.x would be useful to make sure we aren't trying to track down a bug
we've already fixed.

> While a Xapian bug might be involved, there are many reasons why a Recoll
> indexer can meet an abrupt end in the general case (not saying this is
> the case here).
> A pulled power cord would be the most radical example. Recoll usually does
> not run in a datacenter...
>
> In most cases, the data is replaceable without too much effort, so that
> reliable detection of an issue is almost as good as assurance that it won't
> occur. The latter seems very difficult to attain when running in an
> uncontrolled environment.

It may not matter for recoll, but more generally we don't want Xapian
databases getting corrupt. And we do aim to survive power failures,
kernel panics, etc - achieving that in all cases is rather hard, but I
don't think that's a reason to drop it as an aim.

Examples of corruption that can be reproduced (even if it's not entirely
on demand) are very useful - if you can see the corruption happen it's
a lot easier to work out what is going wrong than if you just see the
aftermath.

> There is one weird thing though, which is why, in this situation,
> replace_document() appears to repeatedly accept data which goes into a
> black hole.

Are you replacing the document with the same data?

If so, I think what happens is that it looks in the termlist table to
see if the document exists. It does, so it compares the terms and sees
they are the same, and decides there's nothing to do.

It never looks at the document length list, so doesn't see that it is
damaged.

Or if it's different data, but with the same "document length" (i.e.
sum(wdf)), then it'll update the termlist, but spot the length hasn't
changed so again not bother to look at the document length list.

If you replaced the document with a modified version with a different
length, I'd expect this would actually "self-heal".

Cheers,
Olly
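To put the last point in concrete terms: whether the update reaches the
document length list comes down to whether sum(wdf) of the replacement
differs from what is already stored. The fragment below is only a sketch of
that distinction at the API level (Python bindings, with a made-up docid,
terms and path); it is not Recoll's indexing code.

    # Sketch (not Recoll's code): replace a document with a version whose
    # total wdf differs, so the document length list entry must be rewritten
    # instead of the update being skipped.  Docid, terms and path are made up.
    import xapian

    db = xapian.WritableDatabase("./xapiandb", xapian.DB_OPEN)

    docid = 6
    doc = xapian.Document()
    doc.set_data("illustrative replacement data")
    doc.add_term("hello")        # wdf 1
    doc.add_term("world")        # wdf 1 -> document length 2

    # One extra occurrence changes sum(wdf) to 3, so this replacement cannot
    # be short-circuited on "same terms, same length".
    doc.add_term("hello")

    db.replace_document(docid, doc)
    db.commit()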
Olly Betts <olly <at> survex.com> writes:
> On Thu, Jan 14, 2016 at 11:04:29AM +0100, Jean-Francois Dockes wrote:
> > Olly Betts writes:
> > > On Sun, Jan 10, 2016 at 02:53:14AM +0000, Bob Cargill wrote:
> > > > I will look into the bug you listed to see if it might be related. If there
> > > > is anything else that I can do, please let me know.
> > >
> > > If that bug is not the cause, it would be good to get to the bottom of this -
> > > the database shouldn't become corrupt like this.
> >
> > I remembered something: I could only reproduce issue #645 with separate
> > read/write database objects, but this one is with recoll 1.21, which uses a
> > single object, so maybe a different problem.
>
> The underlying bug for #645 was that cursors weren't getting rebuilt in
> some situations where they needed to be, and could end up with bad data
> in, and that bad data could be stale data. So it's plausible a write
> might go to the wrong block, which could explain "lost" data like we
> have here.
>
> It could easily be a different problem, but testing with the latest
> 1.2.x would be useful to make sure we aren't trying to track down a bug
> we've already fixed.

I don't see the most recent xapian in the Ubuntu 14.04 repositories. I have
1.2.16 from the Ubuntu Trusty repositories. The PPA doesn't list a Trusty
version. Is 1.2.22 easy to build? I have not built anything on linux (sorry
for the naive question; I've built under windows in the distant past, so I'm
familiar with compilers, source files, and make).

> It may not matter for recoll, but more generally we don't want Xapian
> databases getting corrupt. And we do aim to survive power failures,
> kernel panics, etc - achieving that in all cases is rather hard, but I
> don't think that's a reason to drop it as an aim.
>
> Examples of corruption that can be reproduced (even if it's not entirely
> on demand) are very useful - if you can see the corruption happen it's
> a lot easier to work out what is going wrong than if you just see the
> aftermath.

I have the log file from recoll from (almost) the beginning. I'm not sure if
I can cull out the source of the errors, but I will take a look at what was
happening with recoll when the errors began.

> > There is one weird thing though, which is why, in this situation,
> > replace_document() appears to repeatedly accept data which goes into a
> > black hole.
>
> Are you replacing the document with the same data?

Yes. All of the files that I mentioned earlier (~350) have not been changed
in years.

> If so, I think what happens is that it looks in the termlist table to
> see if the document exists. It does, so it compares the terms and sees
> they are the same, and decides there's nothing to do.
>
> It never looks at the document length list, so doesn't see that it is
> damaged.

This is likely what happened. Recoll should have provided the same terms for
the file and the same length.
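For anyone wanting to see that mismatch from the outside, the two structures
Olly mentions can be consulted independently through the bindings. The
snippet below is just a sketch (Python bindings, illustrative docid and
path); on a damaged 1.2.x database the exact errors raised may differ.

    # Sketch: compare the termlist table's view of a document with the stored
    # document length.  On a healthy database sum(wdf) over the termlist
    # matches get_doclength(); the corruption discussed here is roughly
    # "the termlist has the document, the length/posting side does not".
    import xapian

    db = xapian.Database("./xapiandb")
    docid = 6  # one of the "missing" documents in this thread

    try:
        wdf_sum = sum(item.wdf for item in db.termlist(docid))
        print("termlist entry present, sum(wdf) =", wdf_sum)
    except xapian.Error as e:
        print("termlist lookup failed:", e)

    try:
        print("stored document length =", db.get_doclength(docid))
    except xapian.Error as e:
        print("document length lookup failed:", e)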
Olly Betts writes:
> On Thu, Jan 14, 2016 at 11:04:29AM +0100, Jean-Francois Dockes wrote:
> > Olly Betts writes:
> > > On Sun, Jan 10, 2016 at 02:53:14AM +0000, Bob Cargill wrote:
> > > > I will look into the bug you listed to see if it might be
> > > > related. If there is anything else that I can do, please let me
> > > > know.
> > >
> > > If that bug is not the cause, it would be good to get to the bottom
> > > of this - the database shouldn't become corrupt like this.
> >
> > I remembered something: I could only reproduce issue #645 with
> > separate read/write database objects, but this one is with recoll
> > 1.21, which uses a single object, so maybe a different problem.
>
> The underlying bug for #645 was that cursors weren't getting rebuilt in
> some situations where they needed to be, and could end up with bad data
> in, and that bad data could be stale data. So it's plausible a write
> might go to the wrong block, which could explain "lost" data like we
> have here.
>
> It could easily be a different problem, but testing with the latest
> 1.2.x would be useful to make sure we aren't trying to track down a bug
> we've already fixed.
>
> > While a Xapian bug might be involved, there are many reasons why a
> > Recoll indexer can meet an abrupt end in the general case (not saying
> > this is the case here).
> > A pulled power cord would be the most radical example. Recoll usually
> > does not run in a datacenter...
> >
> > In most cases, the data is replaceable without too much effort, so
> > that reliable detection of an issue is almost as good as assurance
> > that it won't occur. The latter seems very difficult to attain when
> > running in an uncontrolled environment.
>
> It may not matter for recoll, but more generally we don't want Xapian
> databases getting corrupt. And we do aim to survive power failures,
> kernel panics, etc - achieving that in all cases is rather hard, but I
> don't think that's a reason to drop it as an aim.

It was not my intention to suggest this. As an aside, it *does* matter for
Recoll that its index would survive such events. A few Recoll users have
gigantic indexes (hopefully in sane environments), needing multiple days to
rebuild.

Being oldish and having spent 30 years around data management issues, I just
happen to believe that datacenter RDBMS-type reliability is *not possible*
for the typical Recoll installation, on a random machine, with an arbitrary
filesystem and IO subsystem (haven't there been a few issues around Linux fs
data post-crash consistency?).

This is why I believe that, faced with uncertain reliability, and equipped
with backed-up data, corruption detection is a very important feature, even
if it can't be completely reliable either.

> Examples of corruption that can be reproduced (even if it's not entirely
> on demand) are very useful - if you can see the corruption happen it's
> a lot easier to work out what is going wrong than if you just see the
> aftermath.

And I do intend to provide such examples whenever possible. I was just
trying to make it clear that I was not necessarily looking for a fault in
Xapian code.

> > There is one weird thing though, which is why, in this situation,
> > replace_document() appears to repeatedly accept data which goes into a
> > black hole.
>
> Are you replacing the document with the same data?

Bob answered this, yes, mystery solved.

Cheers,
jf

> If so, I think what happens is that it looks in the termlist table to
> see if the document exists. It does, so it compares the terms and sees
> they are the same, and decides there's nothing to do.
>
> It never looks at the document length list, so doesn't see that it is
> damaged.
>
> Or if it's different data, but with the same "document length" (i.e.
> sum(wdf)), then it'll update the termlist, but spot the length hasn't
> changed so again not bother to look at the document length list.
>
> If you replaced the document with a modified version with a different
> length, I'd expect this would actually "self-heal".
>
> Cheers,
> Olly