Alexandre Oliva
2011-Aug-08 01:00 UTC
“bio too big” regression and silent data corruption in 3.0
tl;dr version: 3.0 produces “bio too big” dmesg entries and silently corrupts data in “meta-raid1/data-single” configurations on disks with different max_hw_sectors, where 2.6.38 worked fine.

tl;dr side-issue: on-line removal of partitions holding “single” data attempts to create raid0 (rather than single) block groups. If it can't get enough room for raid0 across all remaining disks, it fails, leaving the reported available space incorrect (even underflowed). If it succeeds, it creates raid0 block groups and permanently (?) switches the filesystem to raid0.

I've been (more or less) happily using btrfs on various machines with internal and external disks combined into raid1(m/d) and raid1(m)/single(d) -o compress filesystems, using Freed-ora 2.6.38.8-libre.35.fc15. Once I upgraded to 2.6.40 (AKA 3.0)-libre.4.fc15 and created a ceph OSD on one of those machines, I hit some I/O errors that turned out to be related to writing out updates to the ceph journal on the external USB-connected disk (an odd choice, considering the internal disk has more I/O bandwidth, though much less space; it seems that 3.0 changed the block group allocation heuristics to avoid filling up disks too soon, I suppose, but that's another issue).

So far so good. I could split out the filesystem, or just refrain from using a journal, but at least I knew I'd get hard errors should I keep on with the split filesystem.

Except that I couldn't count on getting hard errors, as I learned the hard way yesterday. I decided to shuffle some data around on an old server with several internal SATA and PATA disks, plus one larger external USB disk I decided to install on that server to give me enough room for the shuffling. That was an unfortunate decision of mine, for a few reasons:

1. Copying (rsync) the first few hundred GBs of data from one internal-only (fast) filesystem to the internal/external filesystem was very fast, which was not unexpected given that I thought it was copying to the internal disk. But it wasn't: it ended up choosing the larger external disk for most writes, and *discarding* nearly all of the big writes with no more than “bio too big” warnings logged to dmesg, noticed only after the fact. No hard errors, just (nearly-)silent data corruption, detected by data checksums that didn't match when trying to use the newly-created copy. Oops ;-) That's Bad (TM).

A bit of investigation showed that max_hw_sectors for the USB disk was 120, much lower than for the internal SATA and PATA disks. Unfortunately, just by looking at the code in fs/btrfs, I couldn't tell how a bio exceeding max_hw_sectors could possibly be created; but this was the first time I ever looked at the btrfs kernel code, or any in-kernel filesystem code, so it doesn't surprise me that I couldn't figure it out on my own ;-) Anyway, I couldn't see changes between 2.6.38 and 3.0 that might be related to this either, so I'm at a loss as to how this extremely serious regression might have come about.
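For what it's worth, the warning itself comes from the block layer rather than from btrfs. From memory of the 2.6.38/3.0-era block/blk-core.c (an approximate excerpt, not a verbatim quote), __generic_make_request() rejects any bio larger than the target queue's max_hw_sectors:

        char b[BDEVNAME_SIZE];
        int nr_sectors = bio_sectors(bio);
        ...
        if (unlikely(nr_sectors > queue_max_hw_sectors(q))) {
                printk(KERN_ERR "bio too big device %s (%u > %u)\n",
                       bdevname(bio->bi_bdev, b),
                       bio_sectors(bio),
                       queue_max_hw_sectors(q));
                goto end_io;    /* fails the bio with -EIO, no splitting */
        }

Note that the oversized bio is failed with -EIO rather than split; since this happens during asynchronous writeback, the error presumably never reaches the writing process unless it fsync()s, which would explain why the only visible symptom is the dmesg line.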
2. Removing a partition from the filesystem (say, the external disk) didn't relocate “single” block groups as such to other disks, as I expected. Instead, raid0 block groups were created to hold the data from the single block groups and, if it couldn't create big-enough raid0 groups because *any* of the other disks was nearly full, the removal would fail. This can make it tricky to remove any partition from a filesystem that has two or more nearly-full members. I suppose rebalancing might do the trick, though it adds an unnecessary step.

Worse: after the failure, the available space, as reported by /bin/df, remains lower than before the removal request. The difference appears to be the amount of space that would have become unavailable had the removal succeeded. Repeating the same removal request doesn't make it go lower, but asking for *another* member partition to be removed (and failing in just the same way) does. Asking for one large partition to be removed, after the first failure, caused the amount of available space to underflow! Wheee, nearly-infinite storage ;-) At least until the next reboot, which fixed the reported available space.
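To be clear about what “underflow” means here: the free-space accounting presumably uses unsigned 64-bit counters, so subtracting more than is actually there wraps around to an absurdly large figure. A trivial illustration (mine, not btrfs code):

        /* u64 wraparound: 5 GiB minus 10 GiB "frees" nearly 16 EiB */
        u64 avail = 5ULL << 30;
        avail -= 10ULL << 30;    /* avail == 0xFFFFFFFEC0000000, ~16 EiB */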
3. Sometimes failure is better than success. In this case, a successful removal of a partition meant the filesystem would no longer allocate single block groups: it would only allocate raid0 groups, a very unfortunate choice for a filesystem containing disks of very different sizes. I haven't tried to fill it up to check whether it would revert to single block groups after exhausting all the space that could be devoted to raid0 block groups, but the reported available space gave me the impression that it would only create block groups while it could get an equal number of chunks from each of the remaining disks.

I could reduce the space taken up by raid0 block groups by asking for removal of the partitions holding them; the blocks would be happily relocated to available space in other pre-existing single groups. However, once those single groups filled up, any further block group allocations on that filesystem would get raid0 block groups rather than single ones. I couldn't find a way to go back, in very much the same way that it appears to be impossible to go back from RAID1 to DUP metadata once you temporarily add a second disk and any metadata block group happens to be allocated before you remove it (why couldn't it go back to DUP, rather than refusing the removal outright, which prevents even single block groups from being moved?).

4. I ended up re-creating the filesystem with single data, as intended, and using 2.6.38.8 to safely use the external disk for the copying. I decided to keep the external disk in for the time being, in part because I'm scared of attempting a removal and ending up with raid0 block groups and highly reduced available disk space. Instead of the large external disk, however, 2.6.38.8 preferred the faster but smaller internal disks, and it would happily fill them up with the large, long-term storage data that was meant to remain mostly on the external disk (as 3.0 would have arranged), leaving no room for raid1 metadata allocations. I'd get -ENOSPC errors every now and again while copying data onto this filesystem, even though there was plenty of available space, and even plenty of free space in already-allocated metadata block groups; so much so that retrying the same copies after a few seconds would succeed. Oh well... That's a 2.6.38 issue that, AFAICT, is heuristically fixed in 3.0. Too bad I can't really take advantage of the fix because of the “bio too big” problem.

5. This long message reminded me that another machine that has been running 3.0 seems to have become *much* slower recently. I thought it had to do with the 98%-full filesystem (though 40GB available for new block group allocations would seem to be plenty) and the constant metadata activity caused by ceph creating and removing snapshots all the time. It seems that the removals lagged behind for a long time and kept the disk in constant activity in spite of very little actual ceph activity. I had decided to shuffle disks around precisely to make more disk space available to that one machine. However, once I switched back to 2.6.38, the machine seems to have become much faster again, in spite of the larger ceph activity due to resyncing data to a re-created OSD. This suggests some large inefficiency in 3.0's btrfs, at least for such nearly-full disks and/or for such frequent snapshot creation and removal as done by ceph. Indeed, I had noticed a significant slowdown of the ceph cluster, which I had associated with the nearly-full disk under constant metadata activity, but after I switched back to 2.6.38 the speed of the cluster was back to normal. I'm afraid I don't have enough data to be any more specific about this issue.

6. On a more positive note, I was totally amazed by btrfs's ability to recover from a goof of mine. While shuffling disks, removing them from one filesystem and adding them to another, I accidentally added to one of the data filesystems a partition that was in use by the btrfs raid1 filesystem containing my root (I mean the stuff mounted on /, including usr, bin, lib, etc.). Oops. I promptly noticed the mistake, removed it from the data filesystem and rebooted, already reaching for the recovery disk. I didn't need it. The root filesystem mounted successfully, reporting a bunch of checksum errors and using the other raid1 copy of the data. Wow! I removed the partition I had double-used and it again reported lots of errors, but succeeded; then I added it back, and everything was fine. I even compared the root filesystem image with a recent backup: all the data was correct, and the filesystem was consistent. Great stuff, thanks!

I wonder, why can't btrfs mark at least mounted partitions as busy, in much the same way that swap, md and various filesystems do, to avoid such accidental reuse? I recall another occasion on which I attempted to add a live swap partition to a btrfs filesystem (@&@#$@&@# disks that get assigned different /dev/sd* names on each reboot!), and it refused, because the swap partition was busy. Couldn't btrfs use the same mechanism to protect its own mounted partitions from accidents?

Thanks in advance for any advice, fixes, or improvements,
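PS: a minimal sketch, assuming the blkdev_get_by_path() interface introduced around 2.6.38, of the exclusive-claim mechanism I believe swap and md rely on; whether btrfs could take such a hold in all of its device-open paths is exactly my question:

        #include <linux/fs.h>
        #include <linux/blkdev.h>
        #include <linux/err.h>

        /* Opening with FMODE_EXCL plus a holder cookie makes any later
         * exclusive open of the same device fail with -EBUSY until the
         * matching blkdev_put(); this is roughly why swapon refuses a
         * partition that is already in use. */
        struct block_device *bdev;

        bdev = blkdev_get_by_path("/dev/sdb1",
                                  FMODE_READ | FMODE_WRITE | FMODE_EXCL,
                                  holder /* any unique cookie, e.g. the sb */);
        if (IS_ERR(bdev))
                return PTR_ERR(bdev);    /* -EBUSY if someone else holds it */

        /* ... use the device ...; release the claim with: */
        blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);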
Alexandre Oliva
2011-Aug-08 22:39 UTC
Re: “bio too big” regression and silent data corruption in 3.0
On Aug 7, 2011, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:

> tl;dr version: 3.0 produces “bio too big” dmesg entries and silently
> corrupts data in “meta-raid1/data-single” configurations on disks with
> different max_hw_sectors, where 2.6.38 worked fine.

FWIW, I just got the same problem with 2.6.38. No idea how I hadn't hit it before, but it's not a 3.0 regression, just a regular (but IMHO very serious) bug.
Alexandre Oliva
2011-Aug-09 02:53 UTC
Re: “bio too big” regression and silent data corruption in 3.0
On Aug 7, 2011, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:

> 2. Removing a partition from the filesystem (say, the external disk)
> didn't relocate “single” block groups as such to other disks, as
> expected.

/me reads some code and resets expectations about RAID0 in btrfs ;-)

update_block_group_flags is what does this. It doesn't care what profile was chosen when the filesystem was created; it just forces RAID0 if more than one disk remains:

        /* turn single device chunks into raid0 */
        return stripped | BTRFS_BLOCK_GROUP_RAID0;

Is this really intended? Given my current understanding that RAID0 doesn't mean striping over all disks, but only over two disks, I guess I might even be interested in it, but... I still think the user's choice should be honored, though I don't see where that choice would be stored (if it is stored at all). (A fuller sketch of the function is at the end of this message.)

> I wonder, why can't btrfs mark at least mounted partitions as busy, in
> much the same way that swap, md and various filesystems do, to avoid
> such accidental reuses?

Heh. And *unmark* them when they're removed, too... As in, it won't let me create a new filesystem in a partition that was just removed from a filesystem, if that partition was the one listed in /etc/mtab.
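For reference, here is roughly what the whole function looked like in that era's fs/btrfs/extent-tree.c; this is reconstructed from memory, so treat it as a sketch rather than a verbatim quote:

        static u64 update_block_group_flags(struct btrfs_root *root, u64 flags)
        {
                u64 num_devices = root->fs_info->fs_devices->rw_devices;
                u64 stripped = BTRFS_BLOCK_GROUP_RAID0 |
                        BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10;

                if (num_devices == 1) {
                        stripped |= BTRFS_BLOCK_GROUP_DUP;
                        stripped = flags & ~stripped;

                        /* turn raid0 into single device chunks */
                        if (flags & BTRFS_BLOCK_GROUP_RAID0)
                                return stripped;

                        /* turn mirroring into duplication */
                        if (flags & (BTRFS_BLOCK_GROUP_RAID1 |
                                     BTRFS_BLOCK_GROUP_RAID10))
                                return stripped | BTRFS_BLOCK_GROUP_DUP;
                } else {
                        /* they already had raid on here, just return */
                        if (flags & stripped)
                                return flags;

                        stripped |= BTRFS_BLOCK_GROUP_DUP;
                        stripped = flags & ~stripped;

                        /* switch duplicated blocks with raid1 */
                        if (flags & BTRFS_BLOCK_GROUP_DUP)
                                return stripped | BTRFS_BLOCK_GROUP_RAID1;

                        /* turn single device chunks into raid0 */
                        return stripped | BTRFS_BLOCK_GROUP_RAID0;
                }
                return flags;
        }

Note that there is no case that preserves SINGLE when more than one device remains: anything unstriped falls through to the RAID0 return, which matches the behavior I observed.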
Alexandre Oliva
2011-Aug-09 04:04 UTC
Re: “bio too big” regression and silent data corruption in 3.0
On Aug 7, 2011, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:

> in very much the same way that it appears to be impossible to go
> back from RAID1 to DUP metadata once you temporarily add a second disk,
> and any metadata block group happens to be allocated before you remove
> it (why couldn't it go back to DUP, rather than refusing the removal
> outright, which prevents even single block groups from being moved?)

Which also appears to be intentional. The code to support this is right there in update_block_group_flags, but btrfs_rm_device refuses to let it do its job, denying the removal attempt right away, without any means to bypass the test. Could at least an option to bypass the test be introduced, through, say, a mount option, some /sys setting, whatever?
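The test I mean is the device-count check near the top of btrfs_rm_device, in fs/btrfs/volumes.c; quoting from memory, so possibly not verbatim:

        if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) &&
            root->fs_info->fs_devices->num_devices <= 2) {
                printk(KERN_ERR "btrfs: unable to go below two "
                       "devices on raid1\n");
                ret = -EINVAL;
                goto out;
        }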
Josef Bacik
2011-Aug-09 14:01 UTC
Re: “bio too big” regression and silent data corruption in 3.0
On 08/08/2011 10:53 PM, Alexandre Oliva wrote:

> On Aug 7, 2011, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:
>
>> 2. Removing a partition from the filesystem (say, the external disk)
>> didn't relocate “single” block groups as such to other disks, as
>> expected.
>
> /me reads some code and resets expectations about RAID0 in btrfs ;-)
>
> update_block_group_flags is what does this. It doesn't care what was
> chosen when the filesystem was created, it just forces RAID0 if more
> than 1 disk remains:
>
>         /* turn single device chunks into raid0 */
>         return stripped | BTRFS_BLOCK_GROUP_RAID0;
>
> Is this really intended? I still think the user's choice should be
> honored, but I don't see where the choice is stored (if it is at all).

Well, -m single -d single means that we only have one disk and we don't want duplication (usually one just does -m single, since metadata is the only thing duplicated by default). But if you add more disks, we want to do RAID0, as we should be striping across all the devices in the fs.

>> I wonder, why can't btrfs mark at least mounted partitions as busy, in
>> much the same way that swap, md and various filesystems do, to avoid
>> such accidental reuses?
>
> Heh. And *unmark* them when they're removed, too... As in, it won't
> let me create a new filesystem in a partition that was just removed from
> a filesystem, if that was the partition listed in /etc/mtab.

Yeah, our "what is busy" logic should be a little smarter. Thanks,

Josef
Josef Bacik
2011-Aug-09 14:02 UTC
Re: “bio too big” regression and silent data corruption in 3.0
On 08/08/2011 06:39 PM, Alexandre Oliva wrote:

> On Aug 7, 2011, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:
>
>> tl;dr version: 3.0 produces “bio too big” dmesg entries and silently
>> corrupts data in “meta-raid1/data-single” configurations on disks with
>> different max_hw_sectors, where 2.6.38 worked fine.
>
> FWIW, I just got the same problem with 2.6.38. No idea how I hadn't hit
> it before, but it's not a 3.0 regression, just a regular (but IMHO very
> serious) bug.

This is worrisome. I will try to find a USB disk with a small max_hw_sectors and see if I can reproduce it. Thanks,

Josef
Josef Bacik
2011-Aug-09 19:05 UTC
Re: “bio too big” regression and silent data corruption in 3.0
On 08/07/2011 09:00 PM, Alexandre Oliva wrote:

> tl;dr version: 3.0 produces “bio too big” dmesg entries and silently
> corrupts data in “meta-raid1/data-single” configurations on disks with
> different max_hw_sectors, where 2.6.38 worked fine.

I've reproduced this, but I'm stuck on something else at the moment; I should get to it soon. Thanks,

Josef
Alexandre Oliva
2011-Aug-16 16:56 UTC
Re: “bio too big” regression and silent data corruption in 3.0
Here's some additional information and some workarounds.

On Aug 7, 2011, Alexandre Oliva <oliva@lsd.ic.unicamp.br> wrote:

> A bit of investigation showed that max_hw_sectors for the USB disk was
> 120, much lower than the internal SATA and PATA disks.

FWIW, overriding /sys/class/block/sd*/queue/max_sectors_kb of all disks used by the filesystem to the lowest max_hw_sectors_kb among them (here, echo 120 > /sys/class/block/sdX/queue/max_sectors_kb for each member disk) works around this problem, at least as long as you don't hit it before you get a chance to change the setting.

> Raid0 block groups were created to hold data from single block groups
> and, if it couldn't create big-enough raid0 blocks because *any* of
> the other disks was nearly-full, removal would fail.

AFAICT this was my misunderstanding of the situation. Apparently btrfs can rebalance the disk space in other partitions so as to create raid0 groups during removal. However, in my case it didn't, because there was some metadata inconsistency in the partition I was trying to remove, which led to block tree checksum errors being printed when removal hit that part of the partition, aborting it. The checksum errors were likely caused by the bio-too-big problem.

> it appears to be impossible to go back from RAID1 to DUP metadata once
> you temporarily add a second disk, and any metadata block group
> happens to be allocated before you remove it (why couldn't it go back
> to DUP, rather than refusing the removal outright, which prevents even
> single block groups from being moved?)

FWIW, I disabled the test that refuses to shrink a filesystem containing RAID1 to a single disk, issued such a request while running the modified kernel, and it completed successfully and perfectly. Can we change it from a hard error to a warning? (A sketch of what I mean is at the end of this message.)

> 5. This long message reminded me that another machine that has been
> running 3.0 seems to have got *much* slower recently. I thought it had
> to do with the 98% full filesystem (though 40GB available for new block
> group allocations would seem to be plenty), and the constant metadata
> activity caused by ceph creating and removing snapshots all the time.

AFAICT it had to do with extended attributes (heavily used by ceph), which caused a large number of metadata block groups to be allocated, even though only a tiny fraction of the space in them ended up being used. I've observed this in two of the ceph object stores.

I've also noticed that rsyncing the OSDs with all extended attributes (-A -X) caused the source to use up a *lot* of CPU and take far longer than without. I don't know why that is, but getfattr --dump at the source and setfattr --restore at the target does pretty much the same job without incurring such large CPU and time costs, so there's something to be improved somewhere, in rsync and/or in btrfs.
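PS: by “change it from hard error to warning” I mean something along these lines in btrfs_rm_device. This is a hypothetical sketch of my proposal, not what I actually ran (I simply disabled the test):

        /* hypothetical: rather than ret = -EINVAL; goto out;, warn and
         * let update_block_group_flags() turn the remaining RAID1
         * metadata back into DUP during relocation */
        if ((all_avail & BTRFS_BLOCK_GROUP_RAID1) &&
            root->fs_info->fs_devices->num_devices <= 2)
                printk(KERN_WARNING "btrfs: going below two devices on "
                       "raid1; metadata will fall back to DUP\n");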