Lutz Vieweg
2013-Nov-15 11:31 UTC
"btrfs: 1 enospc errors during balance" when balancing after formerly failed raid1 device re-appeared
Hi again,

I just did another test on resilience with btrfs/raid1. This time I tested the
following scenario: One out of two raid1 devices disappears. The filesystem is
written to in degraded mode. The missing device re-appears (think of e.g. a
storage device that temporarily became unavailable due to a cable or controller
issue that is later fixed). The user issues "btrfs filesystem balance".

Alas, this scenario ends in the error "btrfs: 1 enospc errors during balance",
with the raid1 staying degraded.

Here's the test procedure in detail:

Testing was done using vanilla linux-3.12 (x86_64)
plus btrfs-progs at commit 9f0c53f574b242b0d5988db2972c8aac77ef35a9
plus "[PATCH] btrfs-progs: for mixed group check opt before default raid profile is enforced"

Preparing two 100 MB image files:

> # dd if=/dev/zero of=/tmp/img1 bs=1024k count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 0.201003 s, 522 MB/s
>
> # dd if=/dev/zero of=/tmp/img2 bs=1024k count=100
> 100+0 records in
> 100+0 records out
> 104857600 bytes (105 MB) copied, 0.185486 s, 565 MB/s

Preparing two loop devices on those images to act as the underlying block
devices for btrfs:

> # losetup /dev/loop1 /tmp/img1
> # losetup /dev/loop2 /tmp/img2

Mounting / writing to the fs:

> # mount -t btrfs /dev/loop1 /mnt/tmp
> # echo asdfasdfasdfasdf >/mnt/tmp/testfile1
> # md5sum /mnt/tmp/testfile1
> f1264d450b9feda62fec5a1e11faba1a  /mnt/tmp/testfile1
> # umount /mnt/tmp

First storage device "disappears":

> # losetup -d /dev/loop1

Mounting degraded btrfs:

> # mount -t btrfs -o degraded /dev/loop2 /mnt/tmp

Testing that testfile1 is still readable:

> # md5sum /mnt/tmp/testfile1
> f1264d450b9feda62fec5a1e11faba1a  /mnt/tmp/testfile1

Creating "testfile2" on the degraded filesystem:

> # echo qwerqwerqwerqwer >/mnt/tmp/testfile2
> # md5sum /mnt/tmp/testfile2
> 9df26d2f2657462c435d58274cc5bdf0  /mnt/tmp/testfile2
> # umount /mnt/tmp

Now we assume the issue that caused the first storage device to be unavailable
has been fixed:

> # losetup /dev/loop1 /tmp/img1
> # mount -t btrfs /dev/loop1 /mnt/tmp

Notice that at this point, I would have expected some kind of warning in the
syslog that the mounted filesystem is not balanced and thus not redundant.
But there was no such warning. This may easily lead operators into a situation
where they do not realize that a btrfs is not redundant and that losing one
storage device will lose data.

Testing that the two testfiles (one of which is not yet stored on both devices)
are still readable:

> # md5sum /mnt/tmp/testfile1
> f1264d450b9feda62fec5a1e11faba1a  /mnt/tmp/testfile1
> # md5sum /mnt/tmp/testfile2
> 9df26d2f2657462c435d58274cc5bdf0  /mnt/tmp/testfile2

So far, so good. Now, since we know the filesystem is not really redundant,
we start a "balance":

> # btrfs filesystem balance /mnt/tmp
> ERROR: error during balancing '/mnt/tmp' - No space left on device
> There may be more info in syslog - try dmesg | tail

Syslog shows:

> kernel: btrfs: relocating block group 20971520 flags 21
> kernel: btrfs: found 3 extents
> kernel: btrfs: relocating block group 4194304 flags 5
> kernel: btrfs: relocating block group 0 flags 2
> kernel: btrfs: 1 enospc errors during balance

So the raid1 remains "degraded".

BTW: I wonder why "btrfs balance" seems to require additional space for
writing data to the re-appeared disk.

I also wonder: Would btrfs try to write _two_ copies of everything to _one_
remaining device of a degraded two-disk raid1?

(If yes, then this means a raid1 would have to be planned with twice the
capacity just to be sure that one failing disk will not lead to an
out-of-diskspace situation. Not good.)

Regards,

Lutz Vieweg
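A quick way to see by hand, at this point, whether the data really is redundant
again -- this is only a sketch, not part of the transcript above, and it assumes
the filesystem is still mounted at /mnt/tmp on a kernel/progs new enough (3.3+)
to support balance filters:

# Per-device allocation and per-profile usage; chunks written while the
# filesystem was mounted degraded typically show up as "single" (or DUP)
# rather than RAID1.
btrfs filesystem show /dev/loop1
btrfs filesystem df /mnt/tmp

# If non-raid1 chunks are listed, a convert-filtered balance rewrites only
# those profiles instead of everything. (Newer progs also accept ",soft" to
# skip chunks already in the target profile.) On a filesystem this small it
# may still hit the same ENOSPC, but it narrows down what balance is doing:
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/tmp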
Hugo Mills
2013-Nov-15 12:38 UTC
Re: "btrfs: 1 enospc errors during balance" when balancing after formerly failed raid1 device re-appeared
On Fri, Nov 15, 2013 at 12:31:24PM +0100, Lutz Vieweg wrote:
> Hi again,
>
> I just did another test on resilience with btrfs/raid1, this time I tested
> the following scenario: One out of two raid1 devices disappears. The filesystem
> is written to in degraded mode. The missing device re-appears (think of e.g.
> a storage device that temporarily became unavailable due to a cable or controller
> issue that is later fixed). User issues "btrfs filesystem balance".
>
> Alas, this scenario ends in an error "btrfs: 1 enospc errors during balance",
> with the raid1 staying degraded.
>
> Here's the test procedure in detail:
>
> Testing was done using vanilla linux-3.12 (x86_64)
> plus btrfs-progs at commit 9f0c53f574b242b0d5988db2972c8aac77ef35a9
> plus "[PATCH] btrfs-progs: for mixed group check opt before default raid profile is enforced"
>
> Preparing two 100 MB image files:
> > # dd if=/dev/zero of=/tmp/img1 bs=1024k count=100
> > 100+0 records in
> > 100+0 records out
> > 104857600 bytes (105 MB) copied, 0.201003 s, 522 MB/s
> >
> > # dd if=/dev/zero of=/tmp/img2 bs=1024k count=100
> > 100+0 records in
> > 100+0 records out
> > 104857600 bytes (105 MB) copied, 0.185486 s, 565 MB/s

For btrfs, this is *tiny*. I'm not surprised you've got ENOSPC problems
here -- it's got nowhere to move the data to, even if you are using
--mixed mode.

I would recommend using larger sparse files for doing this, otherwise
you're going to keep hitting ENOSPC errors instead of triggering actual
bugs in device recovery:

$ dd if=/dev/zero of=/tmp/img1 bs=1M count=0 seek=10240

That will give you a small file with a large apparent size.

[...]

> So far, so good. Now since we know the filesystem is not really
> redundant, we start a "balance":
>
> > # btrfs filesystem balance /mnt/tmp
> > ERROR: error during balancing '/mnt/tmp' - No space left on device
> > There may be more info in syslog - try dmesg | tail
>
> Syslog shows:
> > kernel: btrfs: relocating block group 20971520 flags 21
> > kernel: btrfs: found 3 extents
> > kernel: btrfs: relocating block group 4194304 flags 5
> > kernel: btrfs: relocating block group 0 flags 2
> > kernel: btrfs: 1 enospc errors during balance
>
> So the raid1 remains "degraded".
>
> BTW: I wonder why "btrfs balance" seems to require additional space
> for writing data to the re-appeared disk.

It's a copy-on-write filesystem: *every* modification of the FS requires
additional space to write the new copy to. In your example here, the FS
is so small that I'm surprised you could write anything to it at all,
due to metadata overheads.

> I also wonder: Would btrfs try to write _two_ copies of
> everything to _one_ remaining device of a degraded two-disk raid1?

No. It would have to degrade from RAID-1 to DUP to do that (and I think
we prevent DUP data for some reason).

Hugo.

> (If yes, then this means a raid1 would have to be planned with
> twice the capacity just to be sure that one failing disk will
> not lead to an out-of-diskspace situation. Not good.)
>
> Regards,
>
> Lutz Vieweg

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
       --- Doughnut furs ache me, Omar Dorlin. ---
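For completeness, Hugo's sparse-file suggestion applied to the original
two-image setup, as a runnable sketch; the 10 GiB size and the truncate
alternative are choices made here, not taken from his mail:

# Two sparse 10 GiB images: large apparent size, almost nothing allocated.
dd if=/dev/zero of=/tmp/img1 bs=1M count=0 seek=10240
dd if=/dev/zero of=/tmp/img2 bs=1M count=0 seek=10240
# (equivalent: truncate -s 10G /tmp/img1 /tmp/img2)

# Apparent size vs. blocks actually allocated:
ls -lh /tmp/img1 /tmp/img2
du -h /tmp/img1 /tmp/img2

# Attach them as loop devices exactly as before:
losetup /dev/loop1 /tmp/img1
losetup /dev/loop2 /tmp/img2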
Duncan
2013-Nov-15 15:00 UTC
Re: "btrfs: 1 enospc errors during balance" when balancing after formerly failed raid1 device re-appeared
Hugo Mills posted on Fri, 15 Nov 2013 12:38:41 +0000 as excerpted:

>> I also wonder: Would btrfs try to write _two_ copies of everything to
>> _one_ remaining device of a degraded two-disk raid1?
>
> No. It would have to degrade from RAID-1 to DUP to do that (and I
> think we prevent DUP data for some reason).

You may be correct about DUP data, but that is unlikely to be the issue
here, because he's likely using the mixed-mode default due to the <1GB
filesystem size, and on a multi-device filesystem that should default to
RAID1 just as metadata by itself does.

However, I noticed that his outlined procedure SKIPPED the mkfs.btrfs
command part, and there's no btrfs filesystem show and btrfs filesystem
df output to verify how the kernel's actually treating the filesystem,
so...

@ LV: For further tests, please include these commands and their output:

1) your mkfs.btrfs command

[Then mount, and after mount...]

2) btrfs filesystem show <path>
3) btrfs filesystem df <path>

Thanks. These should make what btrfs is doing far clearer.

Meanwhile, I've been following your efforts with quite some interest, as
they correspond to some of the pre-deployment btrfs raid1 mode testing I
did. This was several kernels ago, however, so I had been wondering if
the behavior had changed, hopefully for the better, and your testing
looks to be headed toward the same test I did at some point.

Back then, I found a rather counterintuitive result of my own.

Basically, take a two-device raid1-mode btrfs (both data and metadata;
in my case the devices were over a gig, so mixed data+metadata wasn't
invoked and I specified -m raid1 -d raid1 when doing the mkfs.btrfs),
mount it, copy some files to it, unmount it.

Then disconnect one device (I was using actual devices, not the loop
devices you're using) and mount degraded. Make a change to the degraded
filesystem. Unmount.

Then disconnect that device and reconnect the other. Mount degraded.
Make a *DIFFERENT* change to the same file. Unmount.

The two copies have now forked in an incompatible manner.

Now reconnect both devices and remount, this time without degraded.
Like you, I expected btrfs to protest here, particularly since the two
copies were incompatible. *NO* *PROTEST*!!

OK, so check that file to see which version I got. I've now forgotten
which one it was, but it was definitely one of the two forks, not the
original version.

Now unmount and disconnect the device with the copy it said it had.
Mount the filesystem degraded with the other device. Check the file
again. !! I got the other fork! !!

Not only did btrfs not protest when I mounted a raid1 filesystem
undegraded after making incompatible changes to the file with each of
the two devices mounted degraded separately, but accessing the file on
the undegraded filesystem neither protested nor corrected the other
copy, which remained the incompatibly forked copy, as confirmed by
remounting the filesystem degraded with just that device in order to
access it.

To my way of thinking, that's downright dangerous, as well as being
entirely unintuitive.

Unfortunately, I didn't actually do a balance to see what btrfs would do
with the incompatible versions. I simply blew away that testing
filesystem with a new mkfs.btrfs (I'm on SSD, so mkfs.btrfs
automatically issues a trim/discard to clear the new filesystem space
before making it), and I've been kicking myself for not doing so ever
since, because I really would like to know what balance actually /does/
in such a case!
But I was still new enough to btrfs at that time that I didn't really
know what I was doing, so I didn't realize I'd omitted a critical part
of the test until it was too late, and I've not been interested /enough/
in the outcome to redo the test again, with a newer kernel and tools and
a balance this time.

What scrub would have done with it would be an interesting testcase as
well, but I don't know that either, because I never tried it.

At least I hadn't redone the test yet. But I keep thinking about it, and
I guess I would have, one of these days. But now it seems like you're
heading toward doing it for me. =:^)

Meanwhile, the conclusion I took from the test was that if I ever had to
mount degraded in read/write mode, I should make *VERY* sure I
CONSISTENTLY used the same device when I did so. And when I undegraded,
/hopefully/ a balance would choose the newer version.

Unfortunately, I never did actually test that, so I figured should I
actually need to use degraded, even if the degrade was only temporary,
the best way to recover was probably to trim that entire partition on
the lost device and then proceed to add it back into the filesystem as
if it were a new device, and do a balance to finally recover raid1 mode.

Meanwhile (2), what btrfs raid1 mode /has/ been providing me with is
data integrity via the checksumming and scrub features. And with raid1
providing a second copy of the data to work with, scrub really does
appear to do what it says on the tin: copy the good copy over the bad,
if there's a checksum mismatch on the bad one.

What I do NOT know is what happens, in /either/ the scrub or the balance
case, when the metadata of both copies is valid, but they differ from
each other!

FWIW, here's my original list post on the subject, tho it doesn't seem
to have generated any followup (beyond my own). IIRC I did get a reply
from another sysadmin on another thread, but I can't find it now, and
all it did was echo my concern; no reply from a dev or the like.

http://permalink.gmane.org/gmane.comp.file-systems.btrfs/26096

So anyway, yes, I'm definitely following your results with interest! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
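For anyone who wants to rerun Duncan's diverging-copies experiment without
spare physical disks, here is a rough sketch on loop devices in the style of
the earlier test. The image paths, sizes and file contents are assumptions
made for illustration; the mkfs options are the ones Duncan says he used; and
the final scrub/balance is exactly the step whose outcome the thread leaves
open:

# Build a two-device raid1 btrfs on sparse, loop-backed images.
dd if=/dev/zero of=/tmp/imgA bs=1M count=0 seek=10240
dd if=/dev/zero of=/tmp/imgB bs=1M count=0 seek=10240
losetup /dev/loop1 /tmp/imgA
losetup /dev/loop2 /tmp/imgB
mkfs.btrfs -m raid1 -d raid1 /dev/loop1 /dev/loop2

mount -t btrfs /dev/loop1 /mnt/tmp
echo original > /mnt/tmp/testfile
umount /mnt/tmp

# Fork 1: only the second device present; change the file.
losetup -d /dev/loop1
mount -t btrfs -o degraded /dev/loop2 /mnt/tmp
echo fork-on-loop2 > /mnt/tmp/testfile
umount /mnt/tmp

# Fork 2: only the first device present; make a *different* change.
losetup -d /dev/loop2
losetup /dev/loop1 /tmp/imgA
btrfs device scan        # may be needed after re-attaching a loop device
mount -t btrfs -o degraded /dev/loop1 /mnt/tmp
echo fork-on-loop1 > /mnt/tmp/testfile
umount /mnt/tmp

# Re-attach both halves, mount without -o degraded, and see which fork
# wins -- then see what scrub (or a balance) does with the losing copy.
losetup /dev/loop2 /tmp/imgB
btrfs device scan
mount -t btrfs /dev/loop1 /mnt/tmp
cat /mnt/tmp/testfile
btrfs scrub start -B /mnt/tmp    # or: btrfs filesystem balance /mnt/tmp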