I've created a test volume and copied a bulk of data to it, however the results of the space allocation are confusing at best. I've tried to capture the history of events leading up to the current state. This is all on a Debian Wheezy system using a 3.10.5 kernel package (linux-image-3.10-2-amd64) and btrfs tools v0.20-rc1 (Debian package 0.19+20130315-5). The host uses an Intel Atom 330 processor, and runs the 64-bit kernel with a 32-bit userland.

I initially created the volume as RAID1 data, then removed (hotplugged out from under the system) one of the drives while empty as a test. I then unmounted and remounted it with the degraded option and copied a small amount of data. Once I had verified that the space was used, I hotplugged the second original drive, which was detected and added back to the volume (showing up in a filesystem show instead of missing). I then tried to copy over more data than a RAID1 should be expected to hold (~650GB onto 2 x 500GB disks in RAID1), and got out-of-space reported as expected. I then deleted all data from the volume (did not recreate the filesystem), and copied just over 300GB of data onto the volume, which is the current state.

Only as I was typing this up did I notice that the mount options still show degraded from the original mount. I expected the second drive to be brought back into the mirror once it was re-added, since it showed up as part of the volume automatically (I assume because the UUIDs matched?). However, since all data appears to have been written to the first drive, I am led to believe that the second drive was present but not readded, even though it reappeared as devid 2 in the listing.

If the above is correct, then I have two questions that I haven't found any documentation on:

1. What is the expectation on hot-adding a failed drive, is an explicit 'device add' or 'replace' expected/required? In my case it appeared to be auto-added, but that may have been spurious or misleading. If an explicit readd is required, I'd expect that the device not be listed at all; however, I would be much more interested to see hotplug of a previously missing device (with older modifications from the same volume) be readded and synced automatically.

2. If initially mounted as degraded, once a new drive is added, is a remount required? I'd hope not, but since the mount flag can't be changed later on, what is the best way to confirm the health of the volume? Until this issue I'd assumed using 'filesystem show'. Since the mount flag is set at mount time only, degraded seems to mean "be degraded if needed" rather than a positive indicator that the volume is indeed degraded.

$ mount | grep btrfs
/dev/sdc on /mnt/new-store type btrfs (rw,relatime,degraded,space_cache)

$ du -hsx /mnt/new-store
305G    /mnt/new-store

$ df -h | grep new-store
/dev/sdc        932G  307G  160G  66% /mnt/new-store

$ btrfs fi show /dev/sdc
Label: 'new-store'  uuid: 14e6e9c7-b249-40ff-8be1-78fc8b26b53d
        Total devices 2 FS bytes used 540.00KB
        devid    2 size 465.76GB used 2.01GB path /dev/sdd
        devid    1 size 465.76GB used 453.03GB path /dev/sdc

$ btrfs fi df /mnt/new-store
Data, RAID1: total=1.00GB, used=997.21MB
Data: total=450.01GB, used=303.18GB
System, RAID1: total=8.00MB, used=56.00KB
System: total=4.00MB, used=0.00
Metadata, RAID1: total=1.00GB, used=617.14MB
Metadata: total=1.01GB, used=0.00

I may be missing or mis-remembering some of the order of events leading to the current state, however the space usage numbers don't reflect anything close to what I would expect.
On the data used on sdc, I've assumed that's old from when I filled the volume and hasn't been reclaimed by a balance or other operation. However, the "used=997.21MB" from fi df, as well as the "FS bytes used 540.00KB" from fi show, seem suspect based on what I'd expect given what is actually on the volume.

Thanks for help understanding the space allocation and usage patterns. I've tried to put pieces together based on man pages, the wiki and other postings, but can't seem to reconcile what I think I should be seeing based on that reading with what I'm actually seeing.

Joel
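For reference, the mixed output above (both "Data, RAID1" and plain "Data" lines) suggests new chunks were allocated with the single profile while only one device was being written to, and the old RAID1 chunks were never converted back. Assuming that is what happened, a balance with convert filters should consolidate everything back to RAID1 once both devices are genuinely participating; this is a sketch using the mount point from the output above, not a verified procedure for this exact state:

# convert all data and metadata chunks back to RAID1; this rewrites
# every chunk and can take a long time on a large volume
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/new-store
# afterwards, only RAID1 lines should remain in the profile listing
btrfs fi df /mnt/new-store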
On Aug 23, 2013, at 6:05 PM, Joel Johnson <mrjoel@lixil.net> wrote:

> What is the expectation on hot-adding a failed drive, is an explicit
> 'device add' or 'replace' expected/required?

I'd expect to have to add a device and then remove missing. There isn't a readd option in btrfs, which in md parlance is used for readding a device previously part of an array. When replacing a failed disk, I'd like btrfs to compare states between the available drives and know that it needs to catch up the newly added device, but this doesn't yet happen. It's necessary to call btrfs balance.

However, after adding a device and deleting missing, I see what you see: the btrfs volume is still mounted degraded. Further, if I use:

# mount -o remount /dev/sdb /mnt
# mount | grep btrfs
/dev/sdc on /mnt type btrfs (rw,relatime,seclabel,degraded,space_cache)

So it's still degraded. I had to unmount and mount again to clear degraded. That seems to be a problem, to have to unmount the volume in order to remove the degraded flag, which is needed to begin the rebalance. And what if btrfs is the root file system? It needs to be rebooted to clear the degraded option.

And still further, somehow the data profile has reverted to single even though the mkfs.btrfs was raid1. So even though the volume now has two devices, has been balanced, and is not degraded, the data profile is single and presumably I no longer actually have mirrored data at all on this volume. That is a huge bug. I'll try to come up with some steps to reproduce.

Chris Murphy
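For reference, the sequence being described above, spelled out end to end (device names stand in for whatever is actually present; this is a sketch of the current workflow, not a recommendation):

# bring up the surviving device without its partner
mount -o degraded /dev/sdc /mnt
# add the replacement disk and drop the record of the missing one
btrfs device add /dev/sdd /mnt
btrfs device delete missing /mnt
# on these kernels a plain remount does not clear "degraded";
# a full unmount/mount cycle appears to be needed
umount /mnt
mount /dev/sdc /mnt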
On Aug 23, 2013, at 10:24 PM, Chris Murphy <lists@colorremedies.com> wrote:

> When replacing a failed disk, I'd like btrfs to compare states between the
> available drives and know that it needs to catch up the newly added device,
> but this doesn't yet happen. It's necessary to call btrfs balance.

I can only test device replacement, not a readd. Upon 'btrfs device delete missing /mnt' there's a delay, and it's doing a balance. I don't know what happens for a readd, whether the whole volume needs balancing or whether it's able to just write the changes.

> And still further, somehow the data profile has reverted to single even
> though the mkfs.btrfs was raid1. [SNIP] That is a huge bug. I'll try to come
> up with some steps to reproduce.

If I create the file system, mount it, but do not copy any data, then upon adding new and deleting missing, the data profile is changed from raid1 to single. If I've first copied data to the volume prior to the device failure/missing, this doesn't happen; it remains raid1.

Also, again after the missing device is replaced, and the volume rebalanced, while the mount option is degraded, subsequent file copies end up on both virtual disks. So it says it's degraded but it really isn't? I'm not sure what's up here.

Chris Murphy
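A rough reproduction of the profile flip being described might look like the following; device names are placeholders and the "failure" is simulated simply by detaching the second device before the degraded mount:

# two-device raid1 for both data and metadata, with no data written
mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc
mount /dev/sdb /mnt
umount /mnt
# physically detach /dev/sdc here, then bring the volume up without it
mount -o degraded /dev/sdb /mnt
btrfs device add /dev/sdd /mnt
btrfs device delete missing /mnt
# if the bug triggers, Data now shows as "single" instead of "RAID1"
btrfs fi df /mnt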
Chris Murphy posted on Fri, 23 Aug 2013 22:58:14 -0600 as excerpted:

> So it says it's degraded but it really isn't?
> I'm not sure what's up here.

In general, this was my experience a couple months ago when I tried it then, as well. And yes, IIRC the wiki actually describes the "degraded" mount option as allowing it to mount "degraded" IF NECESSARY, NOT saying that it'll force degraded.

I found one additional quirk, however, which was rather disturbing. You may recall my thread about it then...

Background: I was actually trying to mount a dual-device raid1 root without an initr*, using rootflags=device=. However, that didn't work. The only way I could mount from the kernel commandline was using degraded. (I'm guessing the double-equals in the rootflags=device= got parsed incorrectly, probably trying to take rootflags=device as the name, which of course doesn't apply to anything, as it should be rootflags. This because it definitely took the rootflags=degraded parameter just fine. That'd be a kernel bug, but I'm not sure that's it; I just know others have reported trouble with rootflags=device=whatever on this list, as well.)

So, while I was trying to get it to work (while I was still experimenting, and before I gave up and set up a simple initramfs using dracut), I booted with root=/dev/sdaX rootflags=degraded, and it worked. I then made a change to the filesystem, and rebooted again to the other one, using root=/dev/sdbX rootflags=degraded, and that worked too. I then made a different change to the same file, thus diverging the two separately mounted devices with a different write to each one.

Then I rebooted back to my main system (not yet on btrfs at that point), and mounted the btrfs normally -- it mounted without the degraded option as both devices were found in the scan, DESPITE the fact that the two component devices now had diverged content. I checked and the later change was shown, and there were no errors or warnings with the mount or with reading the file.

I then booted degraded again, to the one with the other change, AND IT STILL HAD THE CHANGE I HAD WRITTEN!!! So the mount of both together had given me NO WARNING OR ERROR about diverged copies; it simply showed one of them. BUT THE OTHER ONE WASN'T AUTO-DETECTED OR FIXED, EITHER.

With mdraid, attempting to mount (well, re-add, in the case of mdraid) with both devices would have triggered an automatic resync. I expected at LEAST a warning with btrfs, but I didn't get it. I guess I'd have had to manually initiate a scrub to detect and fix it, but was new enough to (multi-device) btrfs at the time that trying that didn't occur to me. I'm not sure what a full rebalance would have done.

That was rather disturbing to me, to say the least. But for my usage, knowing and noting the problem to avoid in the future was enough. I blew away my (then) test btrfs with a fresh mkfs.btrfs raid1 both data and metadata, copied root from my main system over to it once again, and set it up with an initr* mount to avoid kernel commandline degraded-mounting.

Meanwhile, I don't do any more degraded tests. I do scrubs every so often (after a bad shutdown the other day I actually had a scrub find and fix some bad checksums for the first time, and files that were definitely giving problems before the scrub worked just fine afterward, so I was glad I had btrfs checksums and raid1 copies to restore; the feature worked as advertised! =:^), and...
If I ever do lose a device and have to go degraded, I know *NOT* to try degraded-mounting both devices read/write separately and then recombining them once again. If I need to degraded-mount one, fine, but I'll make sure it's only the one, and if I do mount the other, I'll either mount it read-only, or if I do end up mounting them both read/write, I'll blow one away and then add it as a new device, in order to avoid having problems figuring out which divergent copy I'm going to be dealing with once I'm using both devices again.

With that sort of ground rule, I think I should be fine. But it's certainly a lot different than how mdraid works in the same setup. Btrfs definitely has its own raid1 rules -- the mdraid raid1 rules do *NOT* apply.

Meanwhile, hopefully at some point as btrfs heads toward stable, it gets a write signature or whatever it is that mdraid has, so it can detect and warn when raid1 devices diverge, and ideally can auto-sync, at least if configured to do so.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
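For reference, the manual scrub Duncan mentions is started and monitored like this; it rewrites blocks whose checksums fail from the good mirror copy, though whether it fully resolves the diverged-writes case described above is exactly the open question in this thread:

# kick off a scrub in the background on the mounted filesystem
btrfs scrub start /mnt
# check progress and any error/correction counts
btrfs scrub status /mnt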
On Aug 23, 2013, Chris Murphy <li...@colorremedies.com> wrote:

> When replacing a failed disk, I'd like btrfs to compare states between the
> available drives and know that it needs to catch up the newly added device,
> but this doesn't yet happen. It's necessary to call btrfs balance.
>
> I can only test device replacement, not a readd. Upon 'btrfs device delete
> missing /mnt' there's a delay, and it's doing a balance. I don't know what
> happens for a readd, if the whole volume needs balancing or if it's able to
> just write the changes.

Similar to what Duncan described in his response, on a hot-remove (without doing the proper btrfs device delete), there is no opportunity for a rebalance or metadata change on the pulled drives, so I would expect there to be a signature of some sort for consistency checking before readding it. At least, btrfs shouldn't add the readded device back as an active device when it's really still inconsistent and not being used, even if it indicates the same UUID.

> And still further, somehow the data profile has reverted to single even
> though the mkfs.btrfs was raid1. [SNIP] That is a huge bug. I'll try to come
> up with some steps to reproduce.
>
> If I create the file system, mount it, but I do not copy any data, upon adding
> new and deleting missing, the data profile is changed from raid1 to single. If
> I've first copied data to the volume prior to device failure/missing, this
> doesn't happen, it remains raid1.

And yet, the tools indicate that it is still raid1, even if internally it reverts to single???

Based on my experience with this and Duncan's feedback, I'd like to see the wiki have some warnings about dealing with multidevice filesystems, especially surrounding the degraded mount option. Specifically, it sounds like a reasonable practice in the current state is that after a device is removed from a filesystem which receives any subsequent changes, the removed device should be cleared (dd if=/dev/zero, at least the superblocks - see the sketch at the end of this message) before being readded, in order to remove any ambiguity about state. Adding such a note to the wiki now would communicate the potential pitfalls (especially given the difficulty of determining what the state is, due to the bugs below), and also allow updating once things are improved in this area.

Looking again at the wiki Gotchas page, it does say:

    On a multi device btrfs filesystem, mistakingly re-adding a block
    device that is already part of the btrfs fs with btrfs device add
    results in an error, and brings btrfs in an inconsistent state. In
    striping mode, this causes data loss and kernel oops. The btrfs
    userland tools need to do more checking to prevent these easy
    mistakes.

which seems very related, however perhaps it could add the workaround and clarify that even hotplugged devices may be added and bitten by this.

Should I file a few bugs to capture the related issues? Here are the discrete issues that seem to be present from a user point of view:

1. btrfs filesystem show - shouldn't list devices as present unless they're in use and in a consistent state. If an explicit add is needed to add a new device, don't auto-add, even for devices previously part of the filesystem.
Although I would claim that auto-adding and being consistent is most desirable (when it makes sense with an existing signature; an empty drive gives no indication of where it should be added), it should be all or nothing, instead of showing the device as being added (or at least that's how I interpret it being present in the 'fi show' listing) but internally being untracked or inconsistent. As Chris said, "There isn't a readd option in btrfs, which in md parlance is used for readding a device previously part of an array." However, when I hotplugged the drive and it reappeared in the 'fi show' output, I assumed exactly the md semantics had occurred, with the drive having been readded and made consistent - it didn't take any time, but I hadn't copied data yet and knew btrfs may only sync the used data and metadata blocks. In other words, I never ran a device add or remove, but still saw what appeared to be consistent behavior.

2. data profile shouldn't revert to single if adding/deleting before copying data

3. degraded mount option should mean "allow degraded if needed", allowing non-degraded operation when it becomes available. It shouldn't force degraded, especially after adding (either manually or automatically) sufficient devices to operate in non-degraded mode. As soon as devices are added, a rebalance should be done to bring the new device(s) into a consistent state.

This then drives the question: how does one check the degraded state of a filesystem, if not via the mount flag? I (quite likely with an md-raid bias) expected to use the 'filesystem show' output, listing the devices as well as a status flag of fully-consistent or rebalance-in-progress. If that's not the correct or intended location, then provide documentation on how to properly check the consistency state and degraded state of a filesystem.

Joel
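The clearing step mentioned above (so a previously used member cannot be silently picked up again) might look roughly like this; /dev/sdd stands in for the pulled drive, and this is a sketch rather than a tested procedure:

# erase the known filesystem signatures, including the btrfs magic,
# so the device no longer identifies itself as part of the old volume
wipefs -a /dev/sdd
# cruder alternative: zero the region holding the primary btrfs
# superblock (which sits at the 64KiB offset)
dd if=/dev/zero of=/dev/sdd bs=64K seek=1 count=1
# then add it back as if it were a brand new device and rebalance
btrfs device add /dev/sdd /mnt
btrfs balance start /mnt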
On 2013-08-24 11:24, Joel Johnson wrote:

> Should I file a few bugs to capture the related issues? Here are the
> discrete issues that seem to be present from a user point of view:

After writing this, I figured I'd experiment with the current state and try to properly delete and add the sdd device. However, I was surprised to find that I'm not allowed to remove one of the two drives in a RAID1. The kernel message is "btrfs: unable to go below two devices on raid1". Not allowing it by default makes some sense, however a --force flag or something would be beneficial.

I understand that the preferred method is to add the replacement device first and then delete the old one (or do a direct replace), however the system I'm using for testing only has three SATA ports: the first is used for the system drive, and the second and third have the two drives I'm using for my btrfs testing - I have no way to add a third drive to the filesystem before removing one first. This is still with the filesystem mounted with the degraded mount option set.

As it stands, I don't see how I can run btrfs reliably on this system, since I now know I want to properly remove and add devices, but I'm unable to do so with the limited number of drive interfaces, requiring hot removal of a drive and either readding it, or using another system to clear the drive before readding it. Is there a flag I'm missing?

If not, I'd add a fourth bug item:

4. allow (forced only if appropriate) removal of redundant drives (e.g. second drive in RAID1)

Joel
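One workaround that might apply given the port constraint is to drop the redundancy explicitly before removing the device, so the RAID1 two-device minimum no longer applies. This is a sketch only, with placeholder device names, and has not been verified on these kernel/progs versions:

# convert data and metadata to single-device profiles first
btrfs balance start -dconvert=single -mconvert=single /mnt
# the raid1 two-device minimum no longer applies, so the delete works
btrfs device delete /dev/sdd /mnt
# after attaching the replacement, add it and convert back to raid1
btrfs device add /dev/sde /mnt
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt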
On Sat, Aug 24, 2013 at 2:02 PM, Joel Johnson <mrjoel@lixil.net> wrote:
> On 2013-08-24 11:24, Joel Johnson wrote:
>>
>> Should I file a few bugs to capture the related issues? Here are the
>> discrete issues that seem to be present from a user point of view:
>
> After writing this, I figured I'd experiment with the current state and try
> to properly delete and add the sdd device. However, I was surprised to find
> that I'm not allowed to remove one of the two drives in a RAID1. Kernel
> message is "btrfs: unable to go below two devices on raid1". Not allowing it
> by default makes some sense, however a --force flag or something would be
> beneficial. I understand that the preferred method is to add the replacement
> device first and then delete the old one (or do a direct replace), however
> the system I'm using for testing only has three SATA ports, the first is
> used for the system drive, and the second and third have the two drives I'm
> using for my btrfs testing - I have no way to add a third drive for the
> filesystem before removing one first. This is still with the filesystem
> mounted with the degraded mount option set.

If you have a USB enclosure you could connect another drive via USB temporarily. That said, my biggest btrfs problem happened while I had some drives on SATA and others on USB.

-- 
Sandy McArthur
On 2013-08-24 14:30, Sandy McArthur wrote:

>> I was surprised to find that I'm not allowed to remove one of the two
>> drives in a RAID1. Kernel message is "btrfs: unable to go below two
>> devices on raid1". Not allowing it by default makes some sense, however
>> a --force flag or something would be beneficial.
>
> If you have a USB enclosure you could connect another drive via USB
> temporarily. That said my biggest btrfs problem happened while I had
> some drives on SATA and others on USB.

Ah, indeed, a close look at my device names reveals that I claimed to only have three SATA ports, but sdc and sdd are used for my btrfs test. The sdb device is in fact an external USB enclosure that I'm using as my data source for the testing.

I'm really just wanting to test what I see as some common use cases for a RAID1 btrfs filesystem. A very common one that seems to fail currently is wishing to preemptively remove a drive from the mirror (SMART errors on the drive, many other reasons) in order to add a replacement device - and keep things online during the entire process. As I see it now, I'd need to umount, forcibly/uncleanly pull the drive, mount degraded, add the new device, umount, mount cleanly, rebalance, and move forward.

I'd like to use btrfs for the data checksumming, however based on my investigations it doesn't appear to be ready for the set of RAID1 support I'd expect. Based on that, I'm re-looking at options. Are there any known gotchas or limitations with a btrfs filesystem with single data and metadata, running on top of an md-raid RAID1 mirror?

Joel
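For reference, the alternative being considered would be set up roughly like this; device names are placeholders. The trade-off versus native btrfs raid1 is that btrfs then sees only one device, so its checksums can still detect corruption but there is no second copy for it to read and self-heal from:

# mirror the two disks with md, then put a single-device btrfs on top
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
mkfs.btrfs /dev/md0
mount /dev/md0 /mnt/new-store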
On Aug 24, 2013, at 11:24 AM, Joel Johnson <mrjoel@lixil.net> wrote:
>
> Similar to what Duncan described in his response, on a hot-remove (without
> doing the proper btrfs device delete), there is no opportunity for a
> rebalance or metadata change on the pulled drives, so I would expect there
> to be a signature of some sort for consistency checking before readding it.
> At least, btrfs shouldn't add the readded device back as an active device
> when it's really still inconsistent and not being used, even if it indicates
> the same UUID.

Question: on hot-remove, does 'mount' show the volume as degraded?

I find the degraded mount option confusing. What does it mean to use -o degraded when mounting a volume for which all devices are present and functioning?

>> If I create the file system, mount it, but I do not copy any data, upon adding
>> new and deleting missing, the data profile is changed from raid1 to single. If
>> I've first copied data to the volume prior to device failure/missing, this
>> doesn't happen, it remains raid1.
>
> And yet, the tools indicate that it is still raid1, even if internally it
> reverts to single???

No. btrfs fi df <mp> does reflect that the data profile has flipped from raid1 to single. As I mention later, this is reproducible only if the volume has had no data written to it. If I first write a file, then the reversion from raid1 to single doesn't happen upon 'btrfs device delete'.

> Based on my experience with this and Duncan's feedback, I'd like to see the
> wiki have some warnings about dealing with multidevice filesystems,
> especially surrounding the degraded mount option.

To me, degraded is an array or volume state, not up to the user to set as an option. So I'd like to know if the option is temporary, to more easily handle a particular problem for now, with the intention to handle it better (differently) in the future.

> Looking again at the wiki Gotchas page, it does say
>     On a multi device btrfs filesystem, mistakingly re-adding
>     a block device that is already part of the btrfs fs with
>     btrfs device add results in an error, and brings btrfs in
>     an inconsistent state.

For raid1 and raid10 this seems a problem for a file system that can become very large. The devices have enough information to determine exactly how far behind temporarily kicked devices are; it seems they effectively have an mdraid write-intent bitmap.

> 1. btrfs filesystem show - shouldn't list devices as present unless they're
> in use and in a consistent state.

Or mark them as being inconsistent/unavailable.

> As Chris said, "There isn't a readd option in btrfs, which in md parlance is
> used for readding a device previously part of an array." However, when I
> hotplugged the drive and it reappeared in the 'fi show' output, I assumed
> exactly the md semantics had occurred, with the drive having been readded
> and made consistent - it didn't take any time, but I hadn't copied data yet
> and knew btrfs may only sync the used data and metadata blocks.

The md semantics are that there is no auto add or readd. You must tell it to do this once the dropped device is made available again. If there's a write-intent bitmap, the readded device is caught up very quickly. I think it's a problem if there isn't a write-intent bitmap equivalent for btrfs raid1/raid10, and right now there doesn't seem to be one.
A compulsory rebalance means hours or days of rebalancing just because one drive was dropped for a short while.

> In other words, I never ran a device add or remove, but still saw what
> appeared to be consistent behavior.
>
> 2. data profile shouldn't revert to single if adding/deleting before copying
> data

Yes, I think it's a bug too, but it's probably benign.

> This then drives the question, how does one check the degraded state of a
> filesystem if not the mount flag. I (quite likely with an md-raid bias)
> expected to use the 'filesystem show' output, listing the devices as well as
> a status flag of fully-consistent or rebalance in progress. If that's not
> the correct or intended location, then provide documentation on how to
> properly check the consistency state and degraded state of a filesystem.

Yeah, I think something functionally equivalent to a combination of mdadm -D and -E. mdadm distinguishes between array status/metadata and member device status/metadata with those two commands.

Chris Murphy
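Until something like that exists, the closest approximation with the current tools seems to be stitching a few commands together; this assumes 'btrfs device stats' is available with these kernel/progs versions:

# list member devices; a truly absent device should show as missing
btrfs filesystem show /dev/sdc
# chunk profiles; unexpected "single" lines hint at degraded-time allocations
btrfs filesystem df /mnt
# per-device I/O and corruption error counters
btrfs device stats /mnt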
Chris Murphy posted on Sat, 24 Aug 2013 23:18:26 -0600 as excerpted:

> On Aug 24, 2013, at 11:24 AM, Joel Johnson <mrjoel@lixil.net> wrote:
>>
>> Similar to what Duncan described in his response, on a hot-remove
>> (without doing the proper btrfs device delete), there is no opportunity
>> for a rebalance or metadata change on the pulled drives, so I would
>> expect there to be a signature of some sort for consistency checking
>> before readding it. At least, btrfs shouldn't add the readded device
>> back as an active device when it's really still inconsistent and not
>> being used, even if it indicates the same UUID.
>
> Question: On hot-remove, does 'mount' show the volume as degraded?
>
> I find the degraded mount option confusing. What does it mean to use -o
> degraded when mounting a volume for which all devices are present and
> functioning?

The degraded mount option does indeed simply ALLOW mounting without all devices. If all devices can be found, btrfs will still integrate them all, regardless of the mount option.

Looked at in that way, therefore, having the degraded option remain when all devices were found and integrated makes sense. It's simply denoting the historical fact at that point, that the degraded option was included when mounting, and thus that it WOULD have mounted without all devices, if it couldn't find them all, regardless of whether it found and integrated all devices or not.

And hot-remove won't change the options used to mount, either, so degraded won't (or shouldn't, I don't think it does but didn't actually check that case personally) magically appear in the options due to the hot-remove.

However, I /believe/ btrfs filesystem show should display MISSING when a device has been hot-removed, until it's added again. That's what I understand Joel to be saying, at least, and it's consistent with my understanding of the situation. (I would have tested that when I did my original testing, except I didn't know my way around multi-device btrfs well enough to properly grok either the commands I really should be running or their output. I did run the commands, but I had the other device still attached even tho I'd originally mounted degraded, so it didn't show up as missing, and I didn't understand the significance of what I was seeing, except to the extent that I knew the results I got from the separate degraded writes followed by a non-degraded mount were NOT what I expected, and I simply resolved to steer well clear of degraded mounting in the first place, if I could help it, and to take steps to wipe and clean-add in the event something happened and I really NEEDED that degraded.)

>> Based on my experience with this and Duncan's feedback, I'd like to see
>> the wiki have some warnings about dealing with multidevice filesystems,
>> especially surrounding the degraded mount option.

Well, as I got told at one point, it's a wiki, knock yourself out. =:^/

Tho... in fairness, while I intend to register and do some of these changes at some point, in practice I'm far more comfortable on newsgroups and mailing lists than in web forums or editing wikis, so unfortunately I've not gotten "the properly rounded tuit" yet. =:^(

But seriously, Joel, I agree it needs done, and if you get to it before I do... there'll be less I need to do. So if you have the time and motivation to do it, please do so!
=:^) Plus you appear to be doing a bit more thorough testing with it than I did, so you're arguably better placed to do it anyway.

> To me, degraded is an array or volume state, not up to the user to set
> as an option. So I'd like to know if the option is temporary, to more
> easily handle a particular problem for now, but the intention is to
> handle it better (differently) in the future.

Hopefully the above helped with that. AFAIK the degraded mount option will remain more or less as it is -- simply allowing the filesystem to start instead of erroring out if it can't find all devices, but effectively doing nothing if it does find all devices.

Meanwhile, I'm used to running beta and at times alpha software, and what we have now is clearly classic alpha: not all primary features implemented yet, let alone all the sharp edges removed and the chrome polished up. Classic beta has all the baseline features, and we are getting close, but still has sharp edges/bugs that can hurt if one isn't careful around them. I honestly expect btrfs should be hitting that by end of year or certainly early next, as it really is getting close now.

What that means in context is that I expect and hope that once the last few primary features get added (finish up raid5/6 mode, get full N-way mirroring, not just the 2-way referred to as raid1 currently, possibly dedup, finish up send/receive -- it's there but rather too buggy to be entirely practical at present)... AFAIK, that's about it on the primary features list. Then it's beta, with the full focus turning to debugging and getting rid of those sharp corners, and I expect THAT is when we'll see some of these really bare and sharp-cornered features such as multi-device raidN get rounded out in userspace, with the tools actually turning into something reasonably usable, not the bare-bones alpha proof-of-concept userspace tools we have for the multi-device features at present.

> For raid1 and raid10 this seems a problem for a file system that can
> become very large. The devices have enough information to determine
> exactly how far behind temporarily kicked devices are; it seems they
> effectively have an mdraid write-intent bitmap.

With atomic tree updates not taking effect until the root node is finally written, and with btrfs keeping a list of the last several root nodes as it has actually been doing for several versions now (since 3.0 at least, I believe), I /believe/ it's even better than a 1-deep write-intent bitmap, as it's effectively an N-deep stack of such bitmaps. =:^)

The problem is, as I explained above, btrfs is still effectively alpha, and the tools we are using to work with it are effectively bare-bones proof-of-concept alpha-level tools, since not all features have yet been fully implemented, let alone having had time to flesh anything out properly. It'll take time...

> I think it's a problem if there isn't a write-intent bitmap equivalent
> for btrfs raid1/raid10, and right now there doesn't seem to be one.

As I explained, I believe btrfs has even better. It's simply that there are no proper tools available to use it yet...

> A compulsory rebalance means hours or days of rebalance just because one
> drive was dropped for a short while.

I think I consider myself lucky.
One thing I learned with my years of playing with mdraid is how to make proper use of partitions, with only the ones I actually needed active and the filesystems mounted, and activating/mounting read-only where possible, so if a device did drop out for whatever reason, between the split-up mounts meaning relatively little actual data was affected, and the write-intent bitmaps, I was back online right away.

While btrfs doesn't YET expose its root-node stack as a stack of write-intent bitmaps as it COULD, and I believe eventually WILL, unlike back when I was running mdraid and dealing with the write-intent bitmaps there, I'm on SSD for my btrfs filesystems today, and they're MUCH faster. *So* much so that between the multiple relatively small partitions (fully independent, I don't want all my eggs in one filesystem tree basket, so no subvolumes, they're fully independent filesystems/partitions) and the fact that they're on ssd...

Here, a full filesystem balance typically takes on the order of seconds to a minute, depending on the filesystem/partition. That's rewriting ALL data and metadata on the filesystem! So while I understand the concept of a full multi-terabyte filesystem rebalance taking on the order of days, the contrast between that concept, and the reality of a few gigabytes of data in its own dedicated filesystem on ssd rebalancing in a few tens of seconds here... makes a world of difference! Let's just say I'm glad it isn't the other way around! =:^)

>> This then drives the question, how does one check the degraded state of
>> a filesystem if not the mount flag. I (quite likely with an md-raid
>> bias) expected to use the 'filesystem show' output, listing the devices
>> as well as a status flag of fully-consistent or rebalance in progress.
>> If that's not the correct or intended location, then provide
>> documentation on how to properly check the consistency state and
>> degraded state of a filesystem.
>
> Yeah I think something functionally equivalent to a combination of mdadm
> -D and -E. mdadm distinguishes between array status/metadata vs member
> device status/metadata with those two commands.

While the bare-bones alpha tools state explains the current situation, nevertheless I believe these sorts of conversations are important, as they will very possibly help drive the shaping of the tools as they flesh out. And yes, I do hope that the btrfs tools eventually get something comparable to mdadm -D and -E.

But I think it's equally important to realize that mdadm is actually a second-generation solution, that what we're looking at in mdadm is the end product of several years of it maturing plus the raid-tools solution before that, and that even then, those were already patterned after commercial and other raid product administration tools from before that. Meanwhile, while there are some analogies between btrfs and md, and others between btrfs and zfs, really, this whole field of having filesystems do all that btrfs is attempting to do is relatively new ground, and we cannot and should not expect to directly compare the state of btrfs tools, even after it's first declared stable, with the state of either mdadm or zfs tools today. It'll take some time to get there. But get there I believe it will eventually do. =:^)

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
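As a rough illustration of the rebalance timing Duncan describes, a full rebalance can be kicked off and watched like this; the mount point is a placeholder, and duration scales with the amount of allocated data and the speed of the underlying devices:

# rewrite every allocated chunk on the filesystem
btrfs balance start /mnt
# from another shell, check how far along it is
btrfs balance status /mnt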
On Aug 25, 2013, at 6:12 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>
> The degraded mount option does indeed simply ALLOW mounting without all
> devices. If all devices can be found, btrfs will still integrate them
> all, regardless of the mount option.

I understand btrfs handling is necessarily different because array assembly and mounting aren't distinguished as they are with md and hardware raid. An md device won't come up on its own if all members aren't available; you have to tell mdadm to assemble what it can, and if successful the array is degraded but not mounted, then you mount the degraded array. So what takes two steps with md raid is a single step with btrfs, and in that context it makes sense that intent is indicated with the degraded mount option.

Aside: I think a more conservative approach would be for -o degraded to also imply, by default, ro. Presently, specifying -o degraded, I still get a rw mount.

> Looked at in that way, therefore, having the degraded option remain when
> all devices were found and integrated makes sense. It's simply denoting
> the historical fact at that point, that the degraded option was included
> when mounting, and thus that it WOULD have mounted without all devices,
> if it couldn't find them all, regardless of whether it found and
> integrated all devices or not.

As far as I'm aware, nothing else in the mount line works based on history, though. If I use -o rw, the line says rw. But if for some reason the kernel finds an inconsistency and drops the filesystem to ro, the mount line immediately says ro, not the rw used when mounting. If I -o remount,rw and the operation is successful, again the mount line reflects this. But with a btrfs mount, even -o remount doesn't clear degraded once all devices are available. That's confusing.

> And hot-remove won't change the options used to mount, either, so
> degraded won't (or shouldn't, I don't think it does but didn't actually
> check that case personally) magically appear in the options due to the
> hot-remove.

Even btrfs, when it detects certain problems, will change the mount state from rw to ro. There's every reason it could do the same thing when the volume becomes degraded during use. If the kernel doesn't export both volume and device state somehow, how does e.g. udisks know the volume is degraded, and which device is the source of the problem, so that the user can be informed in the desktop UI? And also, when the problem is rectified, that the volume is no longer degraded?

>> To me, degraded is an array or volume state, not up to the user to set
>> as an option. So I'd like to know if the option is temporary, to more
>> easily handle a particular problem for now, but the intention is to
>> handle it better (differently) in the future.
>
> Hopefully the above helped with that.

Yes, I agree it's both a state and a mount option.

>> I think it's a problem if there isn't a write-intent bitmap equivalent
>> for btrfs raid1/raid10, and right now there doesn't seem to be one.
>
> As I explained I believe btrfs has even better. It's simply that there's
> no proper tools available to use it yet...

Understood.

Chris Murphy
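Until behavior like that exists, the conservative variant Chris describes can be approximated by asking for it explicitly; /dev/sdc and /mnt are placeholders:

# mount the incomplete volume read-only, so nothing new gets
# allocated with single profiles while a device is missing
mount -o degraded,ro /dev/sdc /mnt
# once the replacement device is attached and you're ready to repair,
# switch to read-write and proceed with device add / delete missing
mount -o remount,rw /mnt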