Hello list,

I was greeted by the following errors in my syslog:

Sep 2 23:06:08 laptop kernel: [ 7340.809551] btrfs: checksum error at logical 271008116736 on dev /dev/dm-0, sector 540863448, root 442, inode 1508, offset 10128658432, length 4096, links 1 (path: Werkstation/Windows 8 x64-cl1.vmdk)
Sep 2 23:06:08 laptop kernel: [ 7340.809562] btrfs: bdev /dev/dm-0 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
Sep 2 23:06:08 laptop kernel: [ 7340.809565] btrfs: unable to fixup (regular) error at logical 271008116736 on dev /dev/dm-0
Sep 2 23:06:08 laptop kernel: [ 7340.814266] btrfs: checksum error at logical 271008120832 on dev /dev/dm-0, sector 540863456, root 442, inode 1508, offset 10128662528, length 4096, links 1 (path: Werkstation/Windows 8 x64-cl1.vmdk)
Sep 2 23:06:08 laptop kernel: [ 7340.814278] btrfs: bdev /dev/dm-0 errs: wr 0, rd 0, flush 0, corrupt 2, gen 0
Sep 2 23:06:08 laptop kernel: [ 7340.814283] btrfs: unable to fixup (regular) error at logical 271008120832 on dev /dev/dm-0
Sep 2 23:06:08 laptop kernel: [ 7340.815205] btrfs: checksum error at logical 271008124928 on dev /dev/dm-0, sector 540863464, root 442, inode 1508, offset 10128666624, length 4096, links 1 (path: Werkstation/Windows 8 x64-cl1.vmdk)
Sep 2 23:06:08 laptop kernel: [ 7340.815212] btrfs: bdev /dev/dm-0 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0
Sep 2 23:06:08 laptop kernel: [ 7340.815214] btrfs: unable to fixup (regular) error at logical 271008124928 on dev /dev/dm-0
Sep 2 23:06:08 laptop kernel: [ 7340.816107] btrfs: checksum error at logical 271008129024 on dev /dev/dm-0, sector 540863472, root 442, inode 1508, offset 10128670720, length 4096, links 1 (path: Werkstation/Windows 8 x64-cl1.vmdk)
Sep 2 23:06:08 laptop kernel: [ 7340.816111] btrfs: bdev /dev/dm-0 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
Sep 2 23:06:08 laptop kernel: [ 7340.816113] btrfs: unable to fixup (regular) error at logical 271008129024 on dev /dev/dm-0
Sep 2 23:06:08 laptop kernel: [ 7340.816882] btrfs: checksum error at logical 271008133120 on dev /dev/dm-0, sector 540863480, root 442, inode 1508, offset 10128674816, length 4096, links 1 (path: Werkstation/Windows 8 x64-cl1.vmdk)
Sep 2 23:06:08 laptop kernel: [ 7340.816887] btrfs: bdev /dev/dm-0 errs: wr 0, rd 0, flush 0, corrupt 5, gen 0
Sep 2 23:06:08 laptop kernel: [ 7340.816889] btrfs: unable to fixup (regular) error at logical 271008133120 on dev /dev/dm-0
Sep 2 23:06:08 laptop kernel: [ 7340.817672] btrfs: bdev /dev/dm-0 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
Sep 2 23:06:08 laptop kernel: [ 7340.817676] btrfs: unable to fixup (regular) error at logical 271008137216 on dev /dev/dm-0

So I ran a full scrub and, luckily, it only found six csum errors (the six above). The damage therefore seems to be contained to "just" one file.
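For reference, the scrub itself was nothing special; assuming the filesystem is mounted at /mnt (mount point illustrative), roughly:

    # Start a full scrub of the mounted filesystem (runs in the background)
    btrfs scrub start /mnt
    # Poll for progress and the final error counts
    btrfs scrub status /mnt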
Now, I removed the offending file. But is there something else I should have done to recover the data in this file? Can it be recovered?

I'm running 3.11-rc7. It is a single-disk btrfs filesystem. I have several subvolumes defined, one of which is for VMware Workstation (on which the corruption took place).

I checked the SMART values; they all seem OK. The hard disks in this machine are less than a month old. I replaced them after seeing similar messages on the "old" disks.

Is the only logical explanation for this some kind of hardware failure (SATA controller, power supply...), or could there be something more to this?

Sincerely,
Roel Brook
On Mon, Sep 02, 2013 at 11:41:12PM +0200, Rain Maker wrote:
> Hello list,
>
> So I ran a full scrub and, luckily, it only found six csum errors
> (the six above). The damage therefore seems to be contained to "just"
> one file.
>
> Now, I removed the offending file. But is there something else I
> should have done to recover the data in this file? Can it be
> recovered?

No, and no. The data is failing a checksum, so it's basically broken. If you had a btrfs RAID-1 configuration, the FS would be able to recover from one broken copy using the other (good) copy.

> I'm running 3.11-rc7. It is a single-disk btrfs filesystem. I have
> several subvolumes defined, one of which is for VMware Workstation
> (on which the corruption took place).

Aaah, the VM workload could explain this. There are some (known, won't-fix) issues with (I think) direct-IO in VM guests that can cause bad checksums to be written under some circumstances.

I'm not 100% certain, but I _think_ that making your VM images nocow (create an empty file with touch; use chattr +C; extend the file to the right size) may help prevent these problems.
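Untested here, but roughly the following, with the file name and size purely illustrative:

    # chattr +C only takes effect reliably on an empty file, so set it
    # before the file gains any data blocks.
    touch Windows8.vmdk
    chattr +C Windows8.vmdk
    # Extend to the full size without writing any data (sparse file).
    truncate -s 40G Windows8.vmdk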
> I checked the SMART values; they all seem OK. The hard disks in this
> machine are less than a month old. I replaced them after seeing
> similar messages on the "old" disks.
>
> Is the only logical explanation for this some kind of hardware failure
> (SATA controller, power supply...), or could there be something more
> to this?

As above, there are some direct-IO problems with data changing in-flight that can lead to bad checksums. Fixing the issue would cause some fairly serious slow-downs in performance for that case, which is rather against what direct-IO is trying to do, so I think it's unlikely the behaviour will be changed.

Of course, I could be completely wrong about all this, and you've got bad RAM or a bad PSU or something...

   Hugo.

First of all, thanks for the quick response. Reply inline.

2013/9/3 Hugo Mills <hugo@carfax.org.uk>:
> On Mon, Sep 02, 2013 at 11:41:12PM +0200, Rain Maker wrote:
>> Now, I removed the offending file. But is there something else I
>> should have done to recover the data in this file? Can it be
>> recovered?
>
> No, and no. The data is failing a checksum, so it's basically
> broken. If you had a btrfs RAID-1 configuration, the FS would be able
> to recover from one broken copy using the other (good) copy.

Of course, this makes sense.

I know filesystem recovery in btrfs is incomplete; I'm arguing for an override for these use cases. I mean: the filesystem still knows the checksum, and there are two possibilities:
- The checksum is wrong
- The data is wrong

In case the checksum is wrong, why is there no possibility to recalculate the checksum and continue with the file (taking small corruptions for granted)? In this case (and, I believe, in many cases), it's a VM. I could have run Windows chkdsk from inside the VM to see what I could have salvaged.

In case the data is wrong, a reverse CRC32 algorithm could be implemented. Most likely it's only several bytes which got "flipped". On modern hardware, it shouldn't take that much time to brute-force the checksum, especially considering we have a good guess (the raw, corrupted data).

Now, the VM I removed did not have any special data in it (plus I make backups), but it could have been much worse.

> Aaah, the VM workload could explain this. There are some (known,
> won't-fix) issues with (I think) direct-IO in VM guests that can
> cause bad checksums to be written under some circumstances.
>
> I'm not 100% certain, but I _think_ that making your VM images
> nocow (create an empty file with touch; use chattr +C; extend the file
> to the right size) may help prevent these problems.

Hmm, I could try that. Thanks for the tip.

I could also disable the writeback cache on the VM. But VMware uses its own "vmblock" kernel module for I/O, so I'm not sure if this would do any good. Then, of course, there's the performance hit.
Rain Maker posted on Tue, 03 Sep 2013 00:28:30 +0200 as excerpted:

> 2013/9/3 Hugo Mills <hugo@carfax.org.uk>:
>> On Mon, Sep 02, 2013 at 11:41:12PM +0200, Rain Maker wrote:
>>> Now, I removed the offending file. But is there something else I
>>> should have done to recover the data in this file? Can it be
>>> recovered?
>>
>> No, and no. The data is failing a checksum, so it's basically
>> broken. If you had a btrfs RAID-1 configuration, the FS would be
>> able to recover from one broken copy using the other (good) copy.
>
> Of course, this makes sense.
>
> I know filesystem recovery in btrfs is incomplete; I'm arguing for an
> override for these use cases. I mean: the filesystem still knows the
> checksum, and there are two possibilities:
> - The checksum is wrong
> - The data is wrong
>
> In case the checksum is wrong, why is there no possibility to
> recalculate the checksum and continue with the file (taking small
> corruptions for granted)? In this case (and, I believe, in many
> cases), it's a VM. I could have run Windows chkdsk from inside the VM
> to see what I could have salvaged.

AFAIK chkdsk wouldn't have returned an error, because from its point of view the data is probably correct. The issue, as stated, is vmware (AFAIK proprietary, and thus blackbox-unpatchable from a freedomware perspective) changing data under direct-IO "in flight", which breaks the intent and rules of direct-IO, at least as defined for Linux.

The previous discussion I've seen of the problem indicates that MS allows such changes, apparently choosing to take the speed hit for doing so. So it's an impedance mismatch between the VM and physical-machine layers: one of them is proprietary and thus unfixable from a FLOSS perspective, and the other is unwilling to take a general-case slowdown just to accommodate a proprietary special case that breaks the intent of direct-IO in the first place.

It's worth noting that in the normal, non-direct-IO case there's no problem: the data is allowed to change, and the checksum is simply recalculated. But the entire purpose of direct-IO is to short-cut much of the care taken in the normal path, in the interest of performance, when the user knows it can guarantee certain conditions are met. The problem here is that direct-IO is being used, but the user is breaking the guarantee it made by choosing direct-IO in the first place: it is changing, in flight, data that is supposed to be stable once committed to the direct-IO path. (Just because that happened to work with ext3/4, etc., which don't do checksums and thus never actually relied on that guarantee, doesn't obligate other filesystems to do the same, particularly when checksummed data integrity is one of their major features, as it is with btrfs.)

So because the data under direct-IO was changed in flight, after the btrfs checksum had already been calculated, the MS side should indeed show it as correct -- only the btrfs side will show it as wrong, since the data changed after btrfs calculated its checksum, thus breaking the rules for direct-IO under Linux.

The "proper" fix would thus be in vmware, or possibly in the MS software running on top of it. It should either not change the data in flight if it's going to use direct-IO (and by doing so make the guarantee that the data won't change in flight), or not use direct-IO if it's going to be changing the data in flight and thus can't make that guarantee.
But of course that's not within the Linux/FLOSS world's control.

> In case the data is wrong, a reverse CRC32 algorithm could be
> implemented. Most likely it's only several bytes which got "flipped".
> On modern hardware, it shouldn't take that much time to brute-force
> the checksum, especially considering we have a good guess (the raw,
> corrupted data).

But... that flips the entire reason for choosing direct-IO in the first place -- performance -- on its head, incurring a **HUGE** slowdown just to fix up a broken program that can't keep the guarantees it chose to make in order to gain a bit of performance.

By analogy, normal IO might be considered surface-shipping from China to the US, with direct-IO shipping by air. The packages (data) arrive by air broken, because the packer didn't pack them with the padding the air carrier specified. But instead of proposing the problem be fixed by actually padding as the carrier specifies, or by choosing the slower but more careful surface carrier, you're now proposing we send them to Mars (!!) and back to be fixed!

> Now, the VM I removed did not have any special data in it (plus I make
> backups), but it could have been much worse.
>
>>> I have several subvolumes defined, one of which is for VMware
>>> Workstation (on which the corruption took place).
>>
>> Aaah, the VM workload could explain this. There are some (known,
>> won't-fix) issues with (I think) direct-IO in VM guests that can
>> cause bad checksums to be written under some circumstances.
>>
>> I'm not 100% certain, but I _think_ that making your VM images
>> nocow (create an empty file with touch; use chattr +C; extend the
>> file to the right size) may help prevent these problems.
>
> Hmm, I could try that. Thanks for the tip.

I'm similarly not 100% certain, but from (I believe accurate) memory, it was indeed nocow (nodatacow in terms of mount options). The feature actually wanted here is nodatasum, but AFAIK that's only available as a mount option, not as a per-file attribute. And since those mount options currently apply to the entire filesystem, not just a subvolume, and checksumming is one of the big reasons you'd use btrfs in the first place, turning it off for the entire filesystem probably isn't what you want. But since nodatacow/nocow implies nodatasum, turning off COW on the file also turns off checksumming, so it should do what you need, even if it does a bit more as well.

But nocow for a file containing a VM image is almost certainly a good idea anyway, since the file-internal write pattern of VMs is such that the file would very likely otherwise end up hugely fragmented over time. So it's probably what you want in the first place. =:^)

Of course you could look up the previous discussion in the list archives if you want the original thread.

Meanwhile, as an alternative to the touch/chattr/extend routine (ordinarily necessary since nocow won't fix data that's already written), you can set nodatacow on the subdir the file will be created in, and (based on what I've read; I'm an admin, not a developer, and thus haven't actually read the code) all new files in that subdir should automatically inherit the nocow attribute. That's what I'd probably do.
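From what I've read, that'd simply be the following, with the directory path illustrative and again untested by me:

    # Set NOCOW on the directory; newly created files underneath should
    # inherit the C attribute (files that already exist are unaffected).
    chattr +C /mnt/vmware
    # Verify the flag on the directory itself:
    lsattr -d /mnt/vmware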
> I could also disable the writeback cache on the VM. But VMware uses
> its own "vmblock" kernel module for I/O, so I'm not sure if this
> would do any good. Then, of course, there's the performance hit.

Well, considering that by analogy you've proposed after-the-fact shipping to Mars and back to fix the breakage, choosing surface shipping vs. air shipment should be entirely insignificant, performance-wise. =:^)

-- 
Duncan
On 3 September 2013 18:54, Duncan <1i5t5.duncan@cox.net> wrote:
>> In case the data is wrong, a reverse CRC32 algorithm could be
>> implemented. Most likely it's only several bytes which got "flipped".
> ...
> But... that flips the entire reason for choosing direct-IO in the
> first place -- performance -- on its head, incurring a **HUGE**
> slowdown just

Not wanting to put words in the original poster's mouth, but I read that as an offline recovery method (scrub?), rather than real-time recovery attempts. If the frequency of errors is low, then for certain purposes, accepting a few errors might be acceptable if you had a recovery option.

As mentioned, nocow is probably best for VM images anyhow, but still :)

-David
David MacKinnon posted on Tue, 03 Sep 2013 19:26:10 +1000 as excerpted:

> On 3 September 2013 18:54, Duncan <1i5t5.duncan@cox.net> wrote:
>>> In case the data is wrong, a reverse CRC32 algorithm could be
>>> implemented. Most likely it's only several bytes which got
>>> "flipped".
>>
>> But... that flips the entire reason for choosing direct-IO in the
>> first place -- performance -- on its head, incurring a **HUGE**
>> slowdown just
>
> Not wanting to put words in the original poster's mouth, but I read
> that as an offline recovery method (scrub?), rather than real-time
> recovery attempts. If the frequency of errors is low, then for
> certain purposes, accepting a few errors might be acceptable if you
> had a recovery option.

You might be right. Tho there's already scrub available... it just requires a second, hopefully valid, copy to work from. Which is what btrfs raid1 mode is all about, and why I chose to run it. =:^)

It would be nice to be able to say "accept the invalid data", if it's not deemed critical and isn't so corrupted it's entirely invalid, which was something the poster suggested. And in a way, that's what nocow does, by way of nosum; it just has to be set up before the fact. There's (currently) no way to make it work after the damage has occurred.

But I don't believe brute-forcing a correct CRC match is as feasible as the poster suggested as another alternative. And even if a proper match is found, what's to say it's the /correct/ match?

Meanwhile, even if brute-forcing a match /is/ possible, in this particular case it'd likely crash the VM, or at the very least produce invalid results if not horrible VM corruption, because the written data was very likely correct, just changed after btrfs calculated the checksum. So changing it back to the data btrfs calculated the checksum on, even if possible, would actually corrupt the data from the VM's perspective, and the VM would then be acting on that corrupt data, which would certainly have unexpected and very possibly horribly bad results.

> As mentioned, nocow is probably best for VM images anyhow, but still :)

Agreed on that. If the VM insists on breaking the rules and scribbling over its own data, just don't do the checksumming, and LET it scribble over its own data if that's what it wants to do, as long as it doesn't try to scribble over anything that's NOT its data to scribble over. If it breaks in pieces as a result, it gets to keep 'em. =:^\

-- 
Duncan
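For reference, the btrfs raid1 setup mentioned above -- where scrub can rewrite a copy that fails its checksum from the good mirror -- looks roughly like this (device names illustrative):

    # Create a two-device filesystem with both data and metadata mirrored
    mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
    btrfs device scan            # make the kernel aware of both devices
    mount /dev/sdb /mnt          # mounting either device brings up the pair
    # A scrub can now repair checksum failures from the good copy
    # instead of just reporting them:
    btrfs scrub start /mnt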