Arjen van der Meijden
2004-May-14 18:20 UTC
[Xapian-discuss] postlist-errors due to already fixed bugs or not?
Hi list,

I discovered that our current postlist-database contains, according to quartzcheck, a lot of small errors. The termslist doesn't contain errors and I haven't checked the positionlist.

Here is the last "screen" of output:

First did in this chunk is <= last in prev chunk
First did in this chunk is <= last in prev chunk
termfreq 41152 != # of entries 41326
collfreq 83440 != sum wdf 84190
termfreq 1236 != # of entries 1237
collfreq 5138 != sum wdf 5140
termfreq 13844 != # of entries 13828
collfreq 40767 != sum wdf 40716
collfreq 20173 != sum wdf 20169
termfreq 3086 != # of entries 3071
collfreq 6927 != sum wdf 6895
termfreq 409 != # of entries 408
collfreq 910 != sum wdf 909
termfreq 9375 != # of entries 9367
collfreq 37530 != sum wdf 37486
termfreq 3232 != # of entries 3231
collfreq 7901 != sum wdf 7897
termfreq 3136 != # of entries 3135
collfreq 9988 != sum wdf 9984
collfreq 11389 != sum wdf 11385
collfreq 40486 != sum wdf 40479
termfreq 12330 != # of entries 12308
collfreq 35165 != sum wdf 35089
termfreq 2631 != # of entries 2621
collfreq 34499 != sum wdf 34343
termfreq 1262 != # of entries 1261
collfreq 1995 != sum wdf 1994
termfreq 8991 != # of entries 8977
collfreq 19060 != sum wdf 19013
collfreq 7616 != sum wdf 7614
termfreq 30039 != # of entries 30021
collfreq 5674942 != sum wdf 5674858
termfreq 1270 != # of entries 1267
collfreq 4508 != sum wdf 4499
First did in this chunk is <= last in prev chunk
termfreq 73413 != # of entries 73423
collfreq 281262 != sum wdf 281180
termfreq 9534 != # of entries 9529
collfreq 24972 != sum wdf 24963
termfreq 11142 != # of entries 11127
collfreq 30445 != sum wdf 30386
termfreq 792 != # of entries 791
collfreq 1327 != sum wdf 1326
termfreq 2159 != # of entries 2157
collfreq 4925 != sum wdf 4919
termfreq 430 != # of entries 429
collfreq 1194 != sum wdf 1192
termfreq 14741 != # of entries 14739
collfreq 70992 != sum wdf 70989
termfreq 858 != # of entries 856
collfreq 3730 != sum wdf 3722
termfreq 35831 != # of entries 35817
collfreq 90159 != sum wdf 90118
termfreq 7621 != # of entries 7606
collfreq 24608 != sum wdf 24567
First did in this chunk is <= last in prev chunk
termfreq 99583 != # of entries 99581
collfreq 328043 != sum wdf 327991
termfreq 9018 != # of entries 9017
collfreq 27275 != sum wdf 27271
termfreq 41762 != # of entries 41756
collfreq 264852 != sum wdf 264664
termfreq 975 != # of entries 974
collfreq 4300 != sum wdf 4290
collfreq 4501 != sum wdf 4498
termfreq 4721 != # of entries 4716
collfreq 14232 != sum wdf 14218
Extra bytes after key for first chunk of posting list for term `fotos'
Failed to unpack last chunk flag
Segmentation fault

The "extra bytes after "... error is unique in the output, but the other lines are output over a few hundred/thousand times.

This database has been created with a cvs-version of Apr 7 2004, so it may have gone corrupted due to flaws in this version, which have already been fixed in version 0.8.0.
The database has been created using a cvs omega/scriptindex version of the same date as the xapian-library.

But my question is, can anyone assure me this has been fixed or is otherwise a result of our setup and not a present bug in xapian?

Best regards,

Arjen van der Meijden
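For readers unfamiliar with the three recurring messages in the output quoted above: in Xapian's terminology, a term's termfreq is the number of documents it indexes, its collfreq is the total of its within-document frequencies (wdf), and "did" is a document id. The following is a minimal sketch of those consistency conditions, not quartzcheck's actual code. The function name `check_posting_list` and the in-memory layout (a list of chunks, each a list of `(did, wdf)` pairs) are assumptions made for illustration only.

```python
from typing import List, Tuple

Posting = Tuple[int, int]  # (document id, wdf) -- illustrative layout only


def check_posting_list(termfreq: int, collfreq: int,
                       chunks: List[List[Posting]]) -> List[str]:
    """Return messages, worded like those in the thread, for one term.

    termfreq and collfreq are the statistics stored for the term;
    chunks is the term's posting list split into consecutive chunks.
    """
    problems = []
    entries = 0      # number of postings actually present
    sum_wdf = 0      # total of the wdf values actually present
    last_did = None  # last document id seen in the previous non-empty chunk

    for chunk in chunks:
        # Document ids must keep increasing from one chunk to the next.
        if chunk and last_did is not None and chunk[0][0] <= last_did:
            problems.append("First did in this chunk is <= last in prev chunk")
        for _did, wdf in chunk:
            entries += 1
            sum_wdf += wdf
        if chunk:
            last_did = chunk[-1][0]

    # Stored statistics must agree with what the list really contains.
    if termfreq != entries:
        problems.append(f"termfreq {termfreq} != # of entries {entries}")
    if collfreq != sum_wdf:
        problems.append(f"collfreq {collfreq} != sum wdf {sum_wdf}")
    return problems
```

A consistent list such as `check_posting_list(3, 6, [[(1, 2), (2, 1)], [(3, 3)]])` yields no messages; a stored statistic that has drifted from the list contents yields the corresponding mismatch line.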
Olly Betts
2004-May-15 01:02 UTC
[Xapian-discuss] postlist-errors due to already fixed bugs or not?
On Fri, May 14, 2004 at 08:20:28PM +0200, Arjen van der Meijden wrote:
> I discovered that our current postlist-database contains, according to
> quartzcheck, a lot of small errors. The termslist doesn't contain errors
> and I haven't checked the positionlist.

Quartzcheck checks the Btree structures are consistent, but at present only checks the data held *in* the Btrees for the postlist table.

> Here is the last "screen" of output:

The checker may not resync gracefully after detecting an error (the segmentation fault which ends the checking is an extreme example of this). What are the first few errors?

It's possible the file is fine and that the checking code is wrong. I wrote it quite recently so it's not had much use, and may handle a corner case incorrectly. If I know the first few errors, I can look at that part of the code to see.

Or does this database exhibit problems in normal use too?

> This database has been created with a cvs-version of Apr 7 2004, so it
> may have gone corrupted due to flaws in this version, which have already
> been fixed in version 0.8.0

The last relevant bug fix prior to 0.8.0 was:

Fri Mar 26 22:33:30 GMT 2004  Olly Betts <olly@survex.com>

	* backends/quartz/quartz_database.cc: Fix problems with termfreq and
	  collfreq in postlist getting out of step when a recently modified or
	  deleted document is deleted or remodified.

That sounds similar to what you're seeing, but the version you're using should have this fix in.

> But my question is, can anyone assure me this has been fixed or is
> otherwise a result of our setup and not a present bug in xapian?

Not without further investigation I'm afraid.

Cheers,
Olly
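Since the request here is for the first few errors rather than the last screen, one way to gather them is to save the checker's output to a file and summarise it. The sketch below is an illustration only: it assumes the output has been captured one message per line, it recognises just the two mismatch formats quoted in this thread, and the function name `summarise` is hypothetical. It is not part of quartzcheck.

```python
import re
from collections import Counter

# The two mismatch formats that appear in the output quoted in this thread.
PATTERNS = {
    "termfreq": re.compile(r"^termfreq (\d+) != # of entries (\d+)$"),
    "collfreq": re.compile(r"^collfreq (\d+) != sum wdf (\d+)$"),
}


def summarise(lines, first_n=5):
    """Return (first messages, counts per kind, stored-minus-actual deltas)."""
    first = []
    counts = Counter()
    deltas = {"termfreq": [], "collfreq": []}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if len(first) < first_n:
            first.append(line)  # the earliest messages are the most useful
        for kind, pattern in PATTERNS.items():
            match = pattern.match(line)
            if match:
                counts[kind] += 1
                stored, actual = int(match.group(1)), int(match.group(2))
                deltas[kind].append(stored - actual)
                break
        else:
            counts["other"] += 1  # e.g. the chunk-ordering message
    return first, counts, deltas
```

Typical use would be `summarise(open("check.log"))`, where `check.log` is a hypothetical file holding the captured output. A positive delta means the stored statistic is larger than what the posting list actually contains, and a negative delta means it is smaller.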
Arjen van der Meijden
2004-May-29 10:42 UTC
[Xapian-discuss] postlist-errors due to already fixed bugs or not?
On 14-5-2004 20:20, Arjen van der Meijden wrote:
> Hi list,
>
> I discovered that our current postlist-database contains, according to
> quartzcheck, a lot of small errors. The termslist doesn't contain errors
> and I haven't checked the positionlist.
>
[snip]
>
> The "extra bytes after "... error is unique in the output, but the other
> lines are output over a few hundred/thousand times.
>
> This database has been created with a cvs-version of Apr 7 2004, so it
> may have gone corrupted due to flaws in this version, which have already
> been fixed in version 0.8.0
> The database has been created using a cvs omega/scriptindex version of
> the same date as the xapian-library.
>
> But my question is, can anyone assure me this has been fixed or is
> otherwise a result of our setup and not a present bug in xapian?

I've downloaded 0.8.0 and reindexed our database with that. Due to two system crashes, the index process had to be restarted twice, and the postlist contains errors again... Although now it is only the termfreqs and collfreqs that are incorrect:

termfreq 169578 != # of entries 169576
collfreq 1478897 != sum wdf 1478893
termfreq 108299 != # of entries 108297
collfreq 504301 != sum wdf 504297
termfreq 26206 != # of entries 26205
collfreq 152732 != sum wdf 152731
termfreq 2994 != # of entries 2993
(and a few hundred more)

But the search engine now fails to find at least one word that is known to be in a few thousand documents: it is only found in 3, while a word that is very commonly used with it is found in over 2000...

I'm not sure whether the above is related to the two system crashes or to other problems. How atomic are those write-batches of scriptindex?

Best regards,

Arjen van der Meijden