Just got this error today in my dmesg:

btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 43905798

linux % find . -inum 1483065
./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack

It's the main pack file from my git linux kernel tree:

linux % ls -l ./.git/objects/pack/
total 562848
-rw-r--r-- 1 markus markus   1891324 2008-11-29 19:49 pack-011b43fa6956667db5e67fba859e40cb4b154226.idx
-rw-r--r-- 1 markus markus  44002938 2008-11-29 19:54 pack-011b43fa6956667db5e67fba859e40cb4b154226.pack.temp
-rw-r--r-- 1 markus markus    730332 2008-11-29 19:49 pack-67be92b3fab3dab175683582dab0b719517e55a5.idx
-r--r--r-- 1 markus markus  36061684 2009-09-06 21:48 pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.idx
-r--r--r-- 1 markus markus 335202742 2009-09-06 21:48 pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
-rw------- 1 markus markus 158457856 2009-09-07 22:15 tmp_pack_OUdxER

I'm running the latest git kernel and I've been using btrfs as my root fs for the last few weeks without problems so far.

--
Markus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
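For readers who want to work with the numbers in that log line, here is a minimal Python sketch that splits it into its fields and converts the byte offset into a block index. The field names are taken from the log line itself; the 4096-byte block size and the helper name `parse_csum_failure` are assumptions made for illustration, not something stated in the report.

```python
import re

# Pattern for the log line quoted above; field names mirror the message text.
CSUM_RE = re.compile(
    r"btrfs csum failed ino (?P<ino>\d+) off (?P<off>\d+) "
    r"csum (?P<csum>\d+) private (?P<private>\d+)"
)


def parse_csum_failure(line, block_size=4096):
    """Return the numeric fields of a 'csum failed' line, or None if no match.

    block_size=4096 is an assumption; adjust it for the filesystem at hand.
    """
    match = CSUM_RE.search(line)
    if match is None:
        return None
    info = {name: int(value) for name, value in match.groupdict().items()}
    # Locate the failing block inside the file.
    info["block_index"], info["offset_in_block"] = divmod(info["off"], block_size)
    return info


line = "btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 43905798"
info = parse_csum_failure(line)
print(info["ino"], info["block_index"], info["offset_in_block"])
```

Under the 4096-byte assumption, the reported offset is block aligned: it is byte 0 of block 38692 of the file with inode 1483065.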
On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
> Just got this error today in my dmesg:
> btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 43905798
> [...]
> I'm running the latest git kernel and I've been using btrfs as my root
> fs for the last few weeks without problems so far.

Hmm, I ran into something very similar. Care to check what the corrupted block of data looks like (and how big it is)?

--
Jens Axboe
On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
> Hmm, I ran into something very similar. Care to check what the corrupted
> block of data looks like (and how big it is)?

Unfortunately, I've already deleted the file in question. On IRC, Chris decided that either bad RAM or a hard drive error was the most likely reason for this checksum mismatch.

--
Markus
On Tue, Sep 08 2009, Markus Trippelsdorf wrote:
> I've already deleted the file in question unfortunately.
> On IRC Chris decided that either bad RAM or a harddrive error was the
> most likely reason for this checksum mismatch.

Darn, that's too bad. The corruption issue I had was also in a git pack file. It was fine one day, bad the next. It turned out to be 16kb of 0xff in the file, and I blamed it on the (cheap) SSD drive that hosted the local git repo. That's still the most likely explanation given the nature of the problem, but it would have been really interesting to see what corruption you had.

--
Jens Axboe
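A pattern like "16kb of 0xff" can be looked for mechanically. The following is a minimal Python sketch, not a tool anyone in this thread used: it scans a byte string for runs of whole blocks that consist only of one fill byte. The function name `find_fill_runs` and the 4096-byte block size are assumptions for illustration.

```python
def find_fill_runs(data, fill=0xFF, block_size=4096):
    """Return (offset, length) pairs for runs of whole blocks made only of `fill`.

    Only block-aligned, full-size blocks are considered; block_size=4096 is an
    assumption chosen for this sketch.
    """
    pattern = bytes([fill]) * block_size
    runs = []
    start = None
    for off in range(0, len(data), block_size):
        if data[off:off + block_size] == pattern:
            if start is None:
                start = off  # a new run begins here
        elif start is not None:
            runs.append((start, off - start))
            start = None
    if start is not None:
        # The run extends to the end of the data.
        runs.append((start, len(data) - start))
    return runs


# Synthetic example: one good block, 16 KiB of 0xff, one more good block.
sample = b"\x00" * 4096 + b"\xff" * 16384 + b"\x12" * 4096
print(find_fill_runs(sample))
```

To check a real file, its contents would be read into `data` first (or the loop adapted to read block by block); passing `fill=0x00` looks for zeroed blocks instead.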
On Tue, Sep 08, 2009 at 10:32:14PM +0200, Jens Axboe wrote:
> Darn, that's too bad. The corruption issue I had was also in a git pack
> file. It was fine one day, bad the next. Turned out to be 16kb of 0xff
> in the file, and I blamed it on the (cheap) SSD drive that hosted the
> local git repo. It's still the most likely explanation given the nature
> of the problem, however it would have been really interesting to see
> what corruption you had.

BTW, I had a similar issue. One file on btrfs failed its csum check. I copied it using dd_rescue and, surprise, reading the new file also yields this error. How can I retrieve the block that fails the csum check from a btrfs volume?

--
Tomasz Torcz
xmpp: zdzichubg@chrome.pl
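The question above is not answered in this thread. As a partial aid, here is a minimal Python sketch that at least locates the failing blocks: it copies a file block by block and records the offsets where the read raises an error. It assumes that the kernel reports a checksum failure to user space as an I/O error on read, which the thread does not confirm; the function name and the 4096-byte block size are also assumptions. It does not recover the raw contents of a bad block.

```python
import os


def copy_readable_blocks(src_path, dst_path, block_size=4096):
    """Copy src to dst block by block; return offsets whose read raised OSError.

    Unreadable blocks are replaced by zero bytes in the copy so that the
    remaining data keeps its original offsets.
    """
    bad_offsets = []
    size = os.path.getsize(src_path)
    src = os.open(src_path, os.O_RDONLY)
    try:
        with open(dst_path, "wb") as dst:
            for off in range(0, size, block_size):
                try:
                    chunk = os.pread(src, block_size, off)
                except OSError:
                    # Assumption: a failed checksum surfaces as an I/O error here.
                    bad_offsets.append(off)
                    chunk = b"\x00" * min(block_size, size - off)
                dst.write(chunk)
    finally:
        os.close(src)
    return bad_offsets
```

Called as `copy_readable_blocks("damaged.pack", "/other/disk/copy.pack")` (both paths hypothetical), it returns the list of byte offsets that could not be read, which can then be compared with the `off` value in the kernel log.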
On Tue, Sep 08, 2009 at 10:22:11PM +0200, Markus Trippelsdorf spake thusly:
> I've already deleted the file in question unfortunately.
> On IRC Chris decided that either bad RAM or a harddrive error was the
> most likely reason for this checksum mismatch.

Which raises an interesting point: I know reiserfs had its problems, but it also turned up a lot of machines with bad RAM, which contributed to giving the fs a bad name. With more and more complicated and memory-consuming filesystem data structures being stored in RAM, larger volumes of RAM in systems, and RAM not really getting any more reliable, will we ever see a day where something like btrfs is not recommended for use in any machine that doesn't have ECC? Does the filesystem do anything to protect itself from bad hardware?

--
Tracy Reed
http://tracyreed.org
On Tue, Sep 08, 2009 at 10:32:14PM +0200, Jens Axboe wrote:
> Darn, that's too bad. The corruption issue I had was also in a git pack
> file. It was fine one day, bad the next. Turned out to be 16kb of 0xff
> in the file, and I blamed it on the (cheap) SSD drive that hosted the
> local git repo. It's still the most likely explanation given the nature
> of the problem, however it would have been really interesting to see
> what corruption you had.

If by cheap SSD drive you mean an Indilinx Barefoot based one, we might be using the same hardware (30GB Vertex in my case).

What a strange coincidence that it affected git pack files in both cases. It's almost too improbable...

--
Markus
On Wed, Sep 09 2009, Markus Trippelsdorf wrote:
> If by cheap SSD drive you mean an Indilinx Barefoot based one, we might
> be using the same hardware (30GB Vertex in my case).

Spooky, yes indeed, that's the very same drive I'm using. Also see my postings on this very issue here, top two entries:

http://axboe.livejournal.com/

So that pretty much looks like it reaffirms some of my suspicions. Is the drive in a laptop that you suspend and resume?

> What a strange coincidence that it affected git pack files in both cases.
> It's almost too improbable...

Probably more than a coincidence I think, the question is what though...

--
Jens Axboe
On Wed, Sep 09, 2009 at 09:01:41AM +0200, Jens Axboe wrote:
> So that pretty much looks like it reaffirms some of my suspicions. Is
> the drive in a laptop that you suspend and resume?

No. I use it in my workstation, which I normally never switch off.

> > What a strange coincidence that it affected git pack files in both cases.
> > It's almost too improbable...
>
> Probably more than a coincidence I think, the question is what though...

If it really was an SSD error, then it should happen randomly, messing up random files. But (contrary to your experience) I never had any issues with this SSD until this single failed checksum.

--
Markus
On Tue, Sep 8, 2009 at 5:53 PM, Tracy Reed<treed@ultraviolet.org> wrote:
> Which raises an interesting point: I know reiserfs had its problems
> but it also turned up a lot of machines with bad RAM which contributed
> to giving the fs a bad name.
> [...]
> will we ever see a day where something like btrfs is not recommended for
> use in any machine that doesn't have ECC? Does the filesystem do
> anything to protect itself from bad hardware?

Such as the checksums that started this thread? That *is* a feature that protects against bad hardware.

A large part of reiserfs' problem was a religious degree of "panic on inconsistency!", so failures of identical severity that might slip by unnoticed on other file systems were more likely to be noticed.

Sadly, shooting the messenger is still a popular sport, and the qualities of BTRFS which make it more resistant to bad hardware may well give it a bad reputation. I don't know that there is much that can be done about that.

On Wed, Sep 9, 2009 at 3:01 AM, Jens Axboe<jens.axboe@oracle.com> wrote:
> On Wed, Sep 09 2009, Markus Trippelsdorf wrote:
>> What a strange coincidence that it affected git pack files in both cases.
>> It's almost too improbable...
>
> Probably more than a coincidence I think, the question is what though...

Could this have been the same data in both cases?

Either way, if the hardware was randomly corrupting high-entropy blocks with very low probability, it's quite possible that you two would have seen it while anyone else who did chalked it up to some other problem.

I've encountered telecom equipment where particular packet data interacted poorly with the clock recovery hardware. "Any file transfers fine, except for this one. This one stalls and never finishes, but if I unzip it, it's fine!" Ugh.

Or it could be some busted ECC that always 'corrects' a particular class of perfectly valid blocks to something wrong... or it could be a million other things.

At the end of the day you just need to accept that the hardware is junk. Blacklist it, give the vendor the best black eye that you can, and move on.

I can only expect that this is going to get worse over time. I really wish that it had become the norm for drive makers to expose an optional raw interface to the flash. Alas, we're stuck with the equivalent of running Linux on a hypervisor provided by Microsoft... except the SSD makers are less experienced.
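To make the point about checksums as protection against bad hardware concrete, here is a minimal Python sketch of the idea: store a checksum when a block is written and compare it when the block is read back. It uses a plain table-driven CRC-32C (Castagnoli), the checksum family btrfs uses by default. This is the textbook algorithm only, and it is not claimed to reproduce the exact values btrfs stores on disk; all names are illustrative.

```python
def _make_crc32c_table():
    """Build the 256-entry lookup table for CRC-32C (reflected form)."""
    poly = 0x82F63B78  # reflected Castagnoli polynomial
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_crc32c_table()


def crc32c(data, crc=0):
    """Standard CRC-32C of `data` (initial value and final XOR of 0xFFFFFFFF)."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


# "Write" a block and remember its checksum.
block = bytes(range(256)) * 16  # 4096 bytes of sample data
stored = crc32c(block)

# Simulate the hardware flipping a single bit before the block is read back.
corrupted = bytearray(block)
corrupted[1000] ^= 0x01

print(crc32c(block) == stored)             # the intact block verifies
print(crc32c(bytes(corrupted)) == stored)  # the damaged block does not
```

The filesystem cannot tell from the mismatch alone whether RAM, the controller, or the flash was at fault; it only knows that what came back is not what was written, which is exactly the situation reported at the start of this thread.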
On Wed, Sep 09 2009, Markus Trippelsdorf wrote:
> No. I use it in my workstation, that I never switch off normally.

OK, so we can rule out any interactions between suspending and resuming the drive. That's at least something.

> If it really was an SSD error, then it should happen randomly, messing up
> random files. But (contrary to your experience) I never had any issues with
> this SSD until this single failed checksum.

Not necessarily, there may be some pattern to how the pack files are accessed (that propagates through to the drive). The fact is, 0xff is an extremely weird piece of corruption that just reeks of bad flash blocks. It's almost impossible that it is a software error. If it was all zeroes, or a bit flip, the likely causes would be very different.

--
Jens Axboe
On Wed, Sep 9, 2009 at 8:01 AM, Jens Axboe<jens.axboe@oracle.com> wrote:
> So that pretty much looks like it reaffirms some of my suspicions. Is
> the drive in a laptop that you suspend and resume?

If you're on firmware < 1.30, the changelog includes some fixes which may be relevant, e.g. if "block 0" is relative, or you're suspending/resuming:

- Race condition occurred during soft reset handler
- If read fail occurs during reading stamp information, firmware corrupted block 0.
- Power off recovery had bug in certain circumstances

http://www.ocztechnologyforum.com/forum/showthread.php?t=57516

--
Daniel J Blueman
On Wed, Sep 09 2009, Daniel J Blueman wrote:
> If you're on firmware < 1.30, the changelog includes some fixes which
> may be relevant, eg if "block 0" is relative, or you're
> suspending/resuming:
>
> - Race condition occurred during soft reset handler
> - If read fail occurs during reading stamp information, firmware
> corrupted block 0.
> - Power off recovery had bug in certain circumstances
>
> http://www.ocztechnologyforum.com/forum/showthread.php?t=57516

The issue is pretty much moot at this point, since OCZ support were not really interested in providing any sort of real technical support to find out what really caused this issue. My main worry was the reliability of these cheaper SSD drives, and that worry is still not resolved. If you read the blog entries, I do comment on the apparently scary basic bugs that are still being fixed on the Indilinx controllers. I do expect some basic level of data integrity from a consumer product and at least some interest in resolving weird corruption issues if things go wrong. Since OCZ cannot provide anything like that, I have a hard time recommending these drives for anything but very casual use. Fast, cheap, reliable. Pick any two.

My drive was running 1.10 at the time of the problem.

--
Jens Axboe
On Wed, Sep 9, 2009 at 9:26 AM, Jens Axboe<jens.axboe@oracle.com> wrote:
> The issue is pretty much moot at this point, since OCZ support were not
> really interested in providing any sort of real technical support to
> find out what really caused this issue. My main worry was reliability of
> these cheaper SSD drives, and that worry is still not resolved.
> [...]
> My drive was running 1.10 at the time of the problem.

It looks like we need a small tool which performs patterned block I/O to the device, updating a checksum as it goes, and performing integrity sweeps at intervals, at a lower level than fsx. Either the device can be trusted or it can't.

I had a problem like this with nVidia CK804/MCP55 chipsets corrupting data under a triple-edge case workload.

--
Daniel J Blueman
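The kind of tool described above could start out along the following lines. This is a minimal Python sketch under several assumptions, not an existing utility: it works on a regular file rather than a raw device, it regenerates the expected pattern from a seed instead of keeping a separate checksum table, and all names (`make_block`, `write_pattern`, `sweep`) are made up for illustration.

```python
import hashlib
import os
import struct

BLOCK_SIZE = 4096  # assumption; pick the granularity that should be tested


def make_block(index, seed):
    """Deterministic, high-entropy block content derived from seed and index."""
    out = bytearray()
    counter = 0
    while len(out) < BLOCK_SIZE:
        out += hashlib.sha256(struct.pack("<QQQ", seed, index, counter)).digest()
        counter += 1
    return bytes(out[:BLOCK_SIZE])


def write_pattern(path, num_blocks, seed):
    """Fill the file with the pattern and ask the OS to flush it to the device."""
    with open(path, "wb") as handle:
        for index in range(num_blocks):
            handle.write(make_block(index, seed))
        handle.flush()
        os.fsync(handle.fileno())


def sweep(path, num_blocks, seed):
    """Integrity sweep: return indices of blocks that no longer match the pattern."""
    bad = []
    with open(path, "rb") as handle:
        for index in range(num_blocks):
            if handle.read(BLOCK_SIZE) != make_block(index, seed):
                bad.append(index)
    return bad
```

A test run would call `write_pattern()` once and then `sweep()` at intervals, for example after hours of other I/O or after a power cycle. One caveat: reads may be served from the page cache rather than the device, so a serious version would need to bypass or drop the cache between writing and sweeping.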
On Wed, Sep 09, 2009 at 09:37:42AM +0100, Daniel J Blueman wrote:
> It looks like we need a small tool which performs patterned block I/O
> to the device, updating a checksum as it goes, and performing
> integrity sweeps at intervals, lower level than fsx. It must be
> trusted or not.
>
> I had a problem like this with nVidia CK804/MCP55 chipsets corrupting
> data under a triple-edge case workload.

Well, just use git ;) Apply a bunch of patches (say the mm tree) with guilt and repack in a loop.

-chris
>> What a strange coincidence that it affected git pack files in both cases.
>> It's almost too improbable...
>
> Probably more than a coincidence I think, the question is what though...

Some SSD drives (or rather the cheap wear levelling controllers in things
like USB sticks) have firmware which tries to recognise certain data
structures of common filesystems (like FAT and NTFS), and uses
information in those data structures to optimise the allocation and
erasure of blocks (for example the free space linked list in FAT).

If the data you were saving to the disk was similar to one of those data
structures, you might have triggered one of those algorithms, which would
cause data corruption.

This is common in high performance USB sticks because they want to
pre-erase blocks on file deletion for operating systems not supporting
TRIM. I imagine the same technology might carry across to cheap SSDs.

Not much BTRFS can do about it though. If the piece of data that
triggers the bug could be identified, workarounds could possibly be
introduced for the particular buggy controllers.

Oliver Mattos

(resent as I emailed the wrong recipients before)
On Wed, Sep 9, 2009 at 11:01 PM, Oliver Mattos <oliver.mattos08@imperial.ac.uk> wrote:
>>> What a strange coincidence that it affected git pack files in both cases.
>>> It's almost too improbable...
>

I had similar problems with a broken git repository about two weeks ago.
This was on a regular laptop hard drive that's never reported any errors.
Unfortunately I rm'ed the repository and cloned it again, so I can't
check exactly what caused the corruption.

Interestingly, I've just discovered a broken tar.bz2 file that shows
symptoms similar to what's been described here earlier. The first (and
by far largest) chunk of the file consists entirely of 0x01 bytes,
followed by a smaller chunk that appears to be a PNG file and then
arch/sparc/include/asm/fhc.h from the linux kernel. After this I have a
small chunk of 0x00 bytes followed by arch/sparc/include/asm/floppy.h.
This pattern is repeated several times with different include files from
the kernel sources, and the file ends with a small chunk of 0x01 bytes
again.

The hard disk in question is:

=== START OF INFORMATION SECTION ===
Model Family:     Fujitsu MHV series
Device Model:     FUJITSU MHV2080BH
Serial Number:    NW05T6425FRY
Firmware Version: 00840028
User Capacity:    80,025,280,000 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 4a
Local Time is:    Thu Sep 10 12:40:10 2009 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

As already mentioned, it's never reported any errors, and I also haven't
seen any problems like this before when using ext3 or ext4.

The broken file is available at http://omploader.org/vMmJtbg if that's
any help.

Regards,
Bryan Østergaard
On Tue, Sep 08, 2009 at 10:00:42PM +0200, Jens Axboe wrote:
> On Mon, Sep 07 2009, Markus Trippelsdorf wrote:
> > Just got this error today in my dmesg:
> > btrfs csum failed ino 1483065 off 158482432 csum 4283543305 private 43905798
> >
> > linux % find . -inum 1483065
> > ./.git/objects/pack/pack-f9251bcc6a8afe3c92193e14d1d742f2f0182ce5.pack
> >
> > It's the main pack file from my git linux kernel tree:
> > [...]
>
> Hmm, I ran into something very similar. Care to check what the corrupted
> block of data looks like (and how big it is)?

I've hit the same problem again today:

btrfs csum failed ino 1826333 off 150208512 csum 4148434891 private 1660028275

The file in question is:
./.git/objects/pack/pack-a2330b703d5a7fd62626b39a5fdfb6eecf739d0d.pack

I can't read the file directly, because of the csum mismatch:

08F3FF80 58 C8 18 3D 36 58 B0 B0 CC 35 3A 3D 72 95 8E 71 9E AA 34 14 0B C4 B4 41 5F E0 6F 66 03 B9 0B 79 X..=6X...5:=r..q..4....A_.of...y
08F3FFA0 9C 94 6B 15 F9 CA 93 AC C4 34 6E 2C FA 4C 99 31 55 35 36 3B 46 04 71 7E 2E 66 21 1C 89 FC 1B 92 ..k......4n,.L.1U56;F.q~.f!.....
08F3FFC0 90 FE B2 4D 0D 28 A9 3F CC D8 B1 9A 38 28 51 86 10 69 88 CA 46 A6 07 FE EC 0F 2B 7E 81 65 30 86 ...M.(.?....8(Q..i..F.....+~.e0.
08F3FFE0 8E 2A 37 E9 88 CC 6F 1A 8D CF 82 7C 9D 43 A5 B1 FF 2C 62 72 2E 06 E6 44 44 02 45 03 BC 12 EA 3B .*7...o....|.C...,br...DD.E....;
08F40000

where 0x8F40000=150208512.

# hdparm --fibmap /usr/src/linux/.git/objects/pack/pack-a2330b703d5a7fd62626b39a5fdfb6eecf739d0d.pack
0,13: device not found in /dev

does not work unfortunately. How do I get the LBAs of the file instead?

I did a hex-search on the raw device with hexedit for
"90FEB24D0D28A93FCCD8B19A38285186106988CA46A607FEEC0F2B7E81653086",
but there is no obvious corruption in the vicinity of the few places
that are found.

--
Markus
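[Editor's sketch] The offset arithmetic in the message above can be checked with a few lines of Python. The 4096-byte page size is an assumption made for this example, and describe_offset is a made-up helper name, not part of any btrfs tooling.

```python
PAGE_SIZE = 4096  # assumed page size; not stated in the thread


def describe_offset(offset: int, page_size: int = PAGE_SIZE) -> dict:
    """Return the hex form, page index and page alignment of a byte offset."""
    return {
        "hex": f"0x{offset:X}",
        "page": offset // page_size,
        "aligned": offset % page_size == 0,
    }


if __name__ == "__main__":
    # "off" values from the two csum failures reported in this thread
    for off in (158482432, 150208512):
        print(off, describe_offset(off))
```

Under that assumption, both reported "off" values fall exactly on a page boundary, and the hex dump stops at the start of the page whose checksum failed.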
On Thu, Sep 17 2009, Markus Trippelsdorf wrote:
> [...]
> I've hit the same problem again today:
>
> btrfs csum failed ino 1826333 off 150208512 csum 4148434891 private 1660028275
>
> The file in question is:
> ./.git/objects/pack/pack-a2330b703d5a7fd62626b39a5fdfb6eecf739d0d.pack
>
> I can't read the file directly, because of the csum mismatch:

Chris, is there a way to force reading the file? Seems like that would
be a very handy feature.

Markus, not sure if that works, but you could always try and remount
with data checksumming disabled.

mount /dev/fooX -o remount,rw,nodatasum

should do the trick.

--
Jens Axboe
On Thu, Sep 17, 2009 at 08:44:56AM +0200, Jens Axboe wrote:
> [...]
> Chris, is there a way to force reading the file? Seems like that would
> be a very handy feature.
>
> Markus, not sure if that works, but you could always try and remount
> with data checksumming disabled.
>
> mount /dev/fooX -o remount,rw,nodatasum
>
> should do the trick.

That doesn't work unfortunately, btrfs still calculates and compares the
checksums (it won't write new ones I guess).

--
Markus
On Thu, Sep 17 2009, Markus Trippelsdorf wrote:
> On Thu, Sep 17, 2009 at 08:44:56AM +0200, Jens Axboe wrote:
> > [...]
> > mount /dev/fooX -o remount,rw,nodatasum
> >
> > should do the trick.
>
> That doesn't work unfortunately, btrfs still calculates and compares the
> checksums (it won't write new ones I guess).

Ah ok, as mentioned I wasn't sure whether that would work or not. I'll
defer to Chris :-)

--
Jens Axboe
On Thu, Sep 17, 2009 at 11:05:49AM +0200, Jens Axboe wrote:
> [...]
> Ah ok, as mentioned I wasn't sure whether that would work or not. I'll
> defer to Chris :-)

Understood.

I did some further investigations and was able to reconstruct exactly
the same pack file in question by starting from an older backup copy of
my git repo and then running the same git commands as before.
Then I did a binary comparison between this reconstructed file and a
corrupted backup copy from the time before the csum errors occurred (I
automatically backup every 4h).

This is the result (first line good pack file, second line corrupted file):

vbindiff debug/.git/objects/pack/pack-a2330b703d5a7fd62626b39a5fdfb6eecf739d0d.pack debug2/.git/objects/pack/pack-a2330b703d5a7fd62626b39a5fdfb6eecf739d0d.pack

0130 9FA0: E2 3B 43 AA 63 BF 28 B3 87 B7 FD AB DA 74 2D 1C
0130 9FA0: E2 3B 43 AA 63 BF 28 B3 87 33 FD AB DA 74 2D 1C

06CD DF90: B0 22 6B 46 9F ED 6E 47 73 5E 7E EB DA 5F D6 11
06CD DF90: B0 22 6B 46 9F ED 6E 47 73 1E 7E EB DA 5F D6 11

06CD DFC0: 0D 86 2B B2 57 A4 5A CD 78 4B 08 94 C0 65 17 3A
06CD DFC0: 0D 86 2B B2 57 A4 5A CD 78 0B 08 94 C0 65 17 3A

0802 C3C0: 5C A5 E1 4A 1C BC 14 04 16 4A 29 D3 CC EF A6 80
0802 C3C0: 5C 25 E1 4A 1C BC 14 04 16 48 29 D3 CC EF A6 80

081A B3C0: 7D 7A 2C CD 20 89 E5 F2 A8 D3 32 38 04 BA 8A B5
081A B3C0: 7D 3A 2C CD 20 89 E5 F2 A8 D3 32 38 04 BA 8A B5

098E C430: FE 24 4A 19 09 F4 D5 1F 22 E8 36 FA F8 55 B2 6E
098E C430: FE 24 4A 19 09 F4 D5 1F 22 E0 36 FA F8 55 B2 6E

098E C440: 1B 3F C1 B4 BB 80 F8 5A FB EE 0D A3 3F C5 A4 DB
098E C440: 1B 3D C1 B4 BB 80 F8 5A FB EE 0D A3 3F C5 A4 DB

098E C4D0: F8 6C E2 65 18 7A 5D 33 2E 35 77 64 B2 81 BE DF
098E C4D0: F8 6C E2 65 18 7A 5D 33 2E 25 77 64 B2 81 BE DF

098E C4E0: 05 18 DE E3 00 78 D2 2C 4F 91 8F AF 0B F6 0C 31
098E C4E0: 05 1C DE E3 00 78 D2 2C 4F 91 8F AF 0B F6 0C 31

098E C500: 0A 12 D3 E7 FA B8 40 DE 0D 71 94 88 5D 4C 97 21
098E C500: 0A 12 D3 E7 FA B8 40 DE 0D 51 94 88 5D 4C 97 21

098E C540: 93 F2 58 C7 49 9A AA EB 30 3D 28 AA E3 09 4B 7B
098E C540: 93 F2 58 C7 49 9A AA EB 30 3C 28 AA E3 09 4B 7B

0FDE C420: F3 6A C2 38 76 43 9E 86 0D 9C 89 86 F1 E6 B0 F2
0FDE C420: F3 6A C2 38 76 43 9E 86 0D DC 89 86 F1 E6 B0 F2

0FDE C430: 38 E4 69 2E 22 1D E4 FF 90 A7 C6 E8 9F 08 4C 98
0FDE C430: 38 E4 69 2E 22 1D E4 FF 90 A5 C6 E8 9F 08 4C 98

1214 A4C0: 24 D6 56 AC 8B D8 D0 9B D2 62 7B 83 C7 0B 3D BE
1214 A4C0: 24 D4 56 AC 8B D8 D0 9B D2 62 7B 83 C7 0B 3D BE

1214 A500: EC 51 D3 FF C5 7D 30 DD 6D 45 50 FE E9 64 A4 FC
1214 A500: EC 11 D3 FF C5 7D 30 DD 6D 45 50 FE E9 64 A4 FC

1214 A520: D9 4D 63 EB 77 4D F0 BE 5E B3 6B DE E6 D2 28 67
1214 A520: D9 4D 63 EB 77 4D F0 BE 5E 33 6B DE E6 D2 28 67

--
Markus
On Thu, Sep 17, 2009 at 02:15:01PM +0200, Markus Trippelsdorf wrote:
> [...]
> I did some further investigations and was able to reconstruct exactly
> the same pack file in question by starting from an older backup copy of
> my git repo and then running the same git commands as before.
> Then I did a binary comparison between this reconstructed file and a
> corrupted backup copy from the time before the csum errors occurred (I
> automatically backup every 4h).
>

Thanks to Chris' patch (from IRC) I was able to compare the file with
the csum error to the reconstructed one. You'll find the results as
attachments.

--
Markus
> 0130 9FA0: E2 3B 43 AA 63 BF 28 B3 87 B7 FD AB DA 74 2D 1C
> 0130 9FA0: E2 3B 43 AA 63 BF 28 B3 87 33 FD AB DA 74 2D 1C

B7 = 10110111
33 = 00110011

> 06CD DF90: B0 22 6B 46 9F ED 6E 47 73 5E 7E EB DA 5F D6 11
> 06CD DF90: B0 22 6B 46 9F ED 6E 47 73 1E 7E EB DA 5F D6 11

5E = 01011110
1E = 00011110

> 06CD DFC0: 0D 86 2B B2 57 A4 5A CD 78 4B 08 94 C0 65 17 3A
> 06CD DFC0: 0D 86 2B B2 57 A4 5A CD 78 0B 08 94 C0 65 17 3A

4B = 01001011
0B = 00001011

And so on.

It looks like a few bits are getting flipped at the same byte offset.
One can imagine software bugs that would do this, certainly, but upset
hardware seems awfully likely too.

- z
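[Editor's sketch] The bit-level comparison above can be automated by XOR-ing the good and the corrupted byte and listing which bit positions differ. The helper names flipped_bits and diff_rows are invented for this example; the row strings are copied from the vbindiff output in this thread.

```python
from typing import List, Tuple


def flipped_bits(good: int, bad: int) -> List[int]:
    """Return the bit positions (0 = least significant) that differ between two bytes."""
    diff = good ^ bad
    return [bit for bit in range(8) if (diff >> bit) & 1]


def diff_rows(good_hex: str, bad_hex: str) -> List[Tuple[int, int, int, List[int]]]:
    """Compare two hex rows; return (column, good, bad, flipped bits) per differing byte."""
    good = bytes.fromhex(good_hex)
    bad = bytes.fromhex(bad_hex)
    return [
        (col, g, b, flipped_bits(g, b))
        for col, (g, b) in enumerate(zip(good, bad))
        if g != b
    ]


if __name__ == "__main__":
    good_row = "E2 3B 43 AA 63 BF 28 B3 87 B7 FD AB DA 74 2D 1C"
    bad_row = "E2 3B 43 AA 63 BF 28 B3 87 33 FD AB DA 74 2D 1C"
    for col, g, b, bits in diff_rows(good_row, bad_row):
        print(f"column {col}: {g:02X} -> {b:02X}, flipped bits {bits}")
```

For the first pair this reports column 9 with bits 2 and 7 flipped (B7 XOR 33 = 84), and for the 5E/1E and 4B/0B pairs a single flip of bit 6.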
On Thu, Sep 17, 2009 at 10:00:28AM -0700, Zach Brown wrote:
>
> > 0130 9FA0: E2 3B 43 AA 63 BF 28 B3 87 B7 FD AB DA 74 2D 1C
> > 0130 9FA0: E2 3B 43 AA 63 BF 28 B3 87 33 FD AB DA 74 2D 1C
>
> B7 = 10110111
> 33 = 00110011
>
> > 06CD DF90: B0 22 6B 46 9F ED 6E 47 73 5E 7E EB DA 5F D6 11
> > 06CD DF90: B0 22 6B 46 9F ED 6E 47 73 1E 7E EB DA 5F D6 11
>
> 5E = 01011110
> 1E = 00011110
>
> > 06CD DFC0: 0D 86 2B B2 57 A4 5A CD 78 4B 08 94 C0 65 17 3A
> > 06CD DFC0: 0D 86 2B B2 57 A4 5A CD 78 0B 08 94 C0 65 17 3A
>
> 4B = 01001011
> 0B = 00001011
>
> And so on.
>
> It looks like a few bits are getting flipped at the same byte offset.
> One can imagine software bugs that would do this, certainly, but upset
> hardware seems awfully likely too.

I'm afraid you're right. I did some further tests and now I'm pretty
sure that a bad RAM module was the root cause of it all...
Oh well.

--
Markus
On Thu, Sep 17, 2009 at 07:10:06PM +0200, Markus Trippelsdorf wrote:
> > > 06CD DFC0: 0D 86 2B B2 57 A4 5A CD 78 4B 08 94 C0 65 17 3A
> > > 06CD DFC0: 0D 86 2B B2 57 A4 5A CD 78 0B 08 94 C0 65 17 3A
> >
> > 4B = 01001011
> > 0B = 00001011
> >
> > And so on.
> >
> > It looks like a few bits are getting flipped at the same byte offset.
> > One can imagine software bugs that would do this, certainly, but upset
> > hardware seems awfully likely too.
>
> I'm afraid you're right. I did some further tests and now I'm pretty
> sure that a bad RAM module was the root cause of it all...
> Oh well.

On the other hand, that's what's so great about checksumming
filesystems. You found a bad module thanks to btrfs; otherwise you
wouldn't have suspected anything was wrong.
If you had had RAID-1 for data, this corruption would have been fixed
by btrfs.

--
Tomasz Torcz               72->|   80->|
xmpp: zdzichubg@chrome.pl  72->|   80->|
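[Editor's sketch] The idea behind the RAID-1 remark (if one copy fails its checksum, serve a copy that passes) can be modelled in a few lines. This is a conceptual illustration only, not btrfs code: CRC32 from Python's zlib stands in for the filesystem's checksum, read_verified is an invented name, and the sketch assumes the stored checksum was computed over good data and that at least one copy is still intact.

```python
import zlib
from typing import Optional, Sequence


def read_verified(copies: Sequence[bytes], expected_csum: int) -> Optional[bytes]:
    """Return the first copy whose CRC32 matches the stored checksum, else None."""
    for data in copies:
        if zlib.crc32(data) == expected_csum:
            return data
    return None  # every copy failed verification


if __name__ == "__main__":
    good = b"pack data" * 100
    stored_csum = zlib.crc32(good)

    # Simulate a single flipped bit in one mirror, as seen in the diffs above.
    corrupted = bytearray(good)
    corrupted[9] ^= 0x40

    result = read_verified([bytes(corrupted), good], stored_csum)
    print("recovered from mirror" if result == good else "unrecoverable")
```

With only a single copy there is nothing to fall back on, so the function returns None, which corresponds to the read error seen earlier in the thread.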