Dear list,

Is there a way to find out what "btrfs device delete" is doing, and how
far it has come? I know there's no "status" command, but I'm thinking it
might be possible to obtain something via some other channel, such as
"btrfs fi df" and "btrfs fi show", which I have been using to try to
figure out what is happening.

In this context, I've been trying to remove two drives from a filesystem
of four, using the following command:

  btrfs dev del /dev/sdg1 /dev/sdh1 /home/pub

It has been running for almost 36 hours by now, however, and I'm kinda
wondering what's happening. :)

I've been trying to monitor it using "btrfs fi df /home/pub" and
"btrfs fi show". Before starting to remove the devices, they gave the
following output:

  $ sudo btrfs fi df /home/pub
  Data, RAID0: total=10.9TB, used=3.35TB
  System, RAID1: total=32.00MB, used=200.00KB
  Metadata, RAID1: total=5.00GB, used=3.68GB

  $ sudo btrfs fi show
  Label: none  uuid: 40d346bb-2c77-4a78-8803-1e441bf0aff7
          Total devices 4 FS bytes used 3.35TB
          devid    5 size 2.73TB used 2.71TB path /dev/sdh1
          devid    4 size 2.73TB used 2.71TB path /dev/sdg1
          devid    3 size 2.73TB used 2.71TB path /dev/sdd1
          devid    2 size 2.73TB used 2.71TB path /dev/sde1

The Data part of the "df" output has since been decreasing (which I have
been using as a sign of progress), but only until it hit 3.36TB:

  $ sudo btrfs fi df /home/pub/video/
  Data, RAID0: total=3.36TB, used=3.35TB
  System, RAID1: total=32.00MB, used=200.00KB
  Metadata, RAID1: total=5.00GB, used=3.68GB

It has been sitting there for quite a few hours now. "show", for its
part, displays the following:

  $ sudo btrfs fi show
  Label: none  uuid: 40d346bb-2c77-4a78-8803-1e441bf0aff7
          Total devices 4 FS bytes used 3.35TB
          devid    5 size 2.73TB used 965.00GB path /dev/sdh1
          devid    4 size 2.73TB used 2.71TB path /dev/sdg1
          devid    3 size 2.73TB used 968.03GB path /dev/sdd1
          devid    2 size 2.73TB used 969.03GB path /dev/sde1

This I find quite weird. Why is the usage of sdd1 and sde1 decreasing,
when those are not the disks I'm trying to remove, while sdg1 sits there
at its original usage, when it is one of the disks I have requested to
have removed?

By the way, since the Data part hit 3.36TB, the usages of sdd1, sde1 and
sdh1 have been fluctuating up and down between around 850GB and roughly
the values shown right now.

Is there any way I can find out what's going on?

-- 
Fredrik Tolf
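For reference, there is no dedicated progress interface for device
delete; about the closest thing is polling the same two commands and
watching the per-device "used" figures shrink toward zero on the devices
being removed. A rough sketch of doing that automatically (assuming the
/home/pub mount point above) might be:

  $ sudo watch -n 60 'btrfs fi show; btrfs fi df /home/pub'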
On Sep 29, 2013, at 1:13 AM, Fredrik Tolf <fredrik@dolda2000.com> wrote:
> 
> Is there any way I can find out what's going on?

For whatever reason, it started out with every drive practically full,
in terms of chunk allocation, e.g.:

  devid 5 size 2.73TB used 2.71TB path /dev/sdh1

I don't know if the code works this way, but it needs to do a balance
(or a partial one) to make room before it can start migrating actual
complete chunks from 4 disks to 2 disks. And my guess is that it's not
yet done balancing sdg1.

Post your kernel version, btrfs-progs version, and the dmesg output from
the time the delete command was initiated.

Chris Murphy
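A minimal way to collect that information might look like the following
(the grep pattern is just a guess at how the relevant kernel messages
are tagged; adjust as needed):

  $ uname -r
  $ btrfs --version
  $ dmesg | grep -i btrfs

During a balance or device delete the kernel also tends to log
"relocating block group" messages, so the tail of that dmesg output can
give a rough hint of forward progress.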
On Sep 30, 2013, at 8:27 AM, Chris Murphy <lists@colorremedies.com> wrote:
> 
> On Sep 29, 2013, at 1:13 AM, Fredrik Tolf <fredrik@dolda2000.com> wrote:
>> 
>> Is there any way I can find out what's going on?
> 
> For whatever reason, it started out with every drive practically full,
> in terms of chunk allocation, e.g.:
> 
>   devid 5 size 2.73TB used 2.71TB path /dev/sdh1
> 
> I don't know if the code works this way, but it needs to do a balance
> (or a partial one) to make room before it can start migrating actual
> complete chunks from 4 disks to 2 disks. And my guess is that it's not
> yet done balancing sdg1.
> 
> Post your kernel version, btrfs-progs version, and the dmesg output
> from the time the delete command was initiated.

Without knowing more about how it's expected to behave (now and in the
near future), I think if I were you I would have added a couple of
drives to the volume first, so that it had more maneuvering room. It
probably seems weird to add drives in order to remove drives, but
sometimes (always?) Btrfs really gets a bit piggish about allocating a
lot more chunks than there is data, or maybe it's not deallocating space
as aggressively as it could. So it can get to a point where, even though
there isn't that much data in the volume (in your case 1.5x the drive
size, and you have 4 drives), all of it is effectively allocated.
Backing out of that takes free space. Then, once the chunks are better
allocated, you'd have been able to remove the drives. It's an open
question whether I would have removed 2 drives, then another 2, or
removed 4. And for all I know, adding one drive might be enough even
though it's raid0. *shrug* New territory.

Chris Murphy
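To make the add-first idea concrete, a minimal sketch of that workflow,
assuming a spare drive that shows up as /dev/sdi1 (a hypothetical
device) and the mount point from the original post, would be roughly:

  $ sudo btrfs device add /dev/sdi1 /home/pub
  $ sudo btrfs balance start /home/pub
  $ sudo btrfs device delete /dev/sdg1 /dev/sdh1 /home/pub

The balance in the middle is what compacts the mostly-empty chunks and
frees up unallocated space; whether a single added drive is enough
headroom with a raid0 data profile is exactly the open question above.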
Chris Murphy posted on Mon, 30 Sep 2013 19:05:36 -0600 as excerpted:

> It probably seems weird to add drives in order to remove drives, but
> sometimes (always?) Btrfs really gets a bit piggish about allocating a
> lot more chunks than there is data, or maybe it's not deallocating
> space as aggressively as it could. So it can get to a point where, even
> though there isn't that much data in the volume (in your case 1.5x the
> drive size, and you have 4 drives), all of it is effectively allocated.
> Backing out of that takes free space. Then, once the chunks are better
> allocated, you'd have been able to remove the drives.

As I understand things, and from what I've actually observed here, btrfs
only allocates chunks on demand, but doesn't normally DEallocate them at
all, except during a balance and the like, when it rewrites all the
(meta)data that matches the filters, compacting in the process all those
"data holes" that were opened up through deletion, by filling chunks as
it rewrites the (meta)data.

So effectively, allocated chunks should always be at the high-water mark
(rounded up to the nearest chunk size) of usage since the last balance
compacted chunk usage, because chunk allocation is automatic but chunk
deallocation requires a balance.

This is actually a fairly reasonable approach in the normal case, since
it's reasonable to assume that even if the size of the data has shrunk
substantially, if it once reached a particular size it's likely to reach
it again, and in particular the deallocation process has a serious time
cost, since it rewrites the remaining active data into other chunks. So
it's best to just let it be, unless an admin decides it's worth eating
that cost to get the lower chunk allocation and invokes a balance to
effect that.

So, as you were saying, the most efficient way to delete a device could
be to add one first if chunk allocation is almost maxed out and well
above the actual (meta)data size, then do a balance to rewrite all those
nearly empty chunks into nearly full ones and thereby shrink the number
of allocated chunks to something reasonable, and only THEN, when there's
some reasonable amount of unallocated space available, attempt the
device delete.

Meanwhile, I really do have to question the use case where the risks of
a single dead device killing a raid0 (or, for that matter, of running
still-experimental btrfs) are fine, but spending days doing data
maintenance on data not valuable enough to put on anything but
experimental btrfs raid0 is warranted over simply blowing the data away
and starting with brand new mkfs-ed filesystems. That's a strong hint to
me that either the raid0 use case is wrong, or the days of data move and
reshape instead of blowing it away and recreating brand new filesystems
is wrong, and that one or the other should be reevaluated.

However, I'm sure there must be use cases for which it's appropriate,
and I simply don't have a sufficiently creative imagination, so I'll
admit I could be wildly wrong on that. If a sysadmin is sure he's on
solid ground with his use case, for him, he very well could be. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
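As an illustration of the balance-to-deallocate point: on kernels recent
enough to support balance filters, a filtered balance can reclaim the
mostly-empty chunks without rewriting everything. A sketch, using the
mount point from the original post; the 20% threshold is only an
arbitrary example, not a recommendation:

  $ sudo btrfs balance start -dusage=20 -musage=20 /home/pub

That only rewrites data and metadata chunks which are at most 20% full,
which is usually far cheaper than a full balance while still returning
the freed chunks to the unallocated pool.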
On Sep 30, 2013, at 10:43 PM, Duncan <1i5t5.duncan@cox.net> wrote:
> 
> Meanwhile, I really do have to question the use case where the risks of
> a single dead device killing a raid0 (or, for that matter, of running
> still-experimental btrfs) are fine, but spending days doing data
> maintenance on data not valuable enough to put on anything but
> experimental btrfs raid0 is warranted over simply blowing the data away
> and starting with brand new mkfs-ed filesystems.

Yes, of course. It must be a test case, and I think that for
non-experimental, stable Btrfs it's reasonable to expect device delete
to be reliable regardless of the raid level, because it's offered. And
after all, maybe the use case involves enterprise SSDs, each of which
should have a less than 1% chance of failing during its service life.
(Naturally, that's going to go a lot faster than days.)

> That's a strong hint to me that either the raid0 use case is wrong, or
> the days of data move and reshape instead of blowing it away and
> recreating brand new filesystems is wrong, and that one or the other
> should be reevaluated.

I think it's the wrong use case today, except for testing it. It's legit
to try to blow things up, simply because it's offered functionality, so
long as the idea is "I really would like to have this workflow actually
work in 2-5 years". Otherwise it is sort of a rat hole.

The other thing: clearly the OP is surprised it's taking anywhere near
this long. Had he known in advance, he probably would have made a
different choice.

Chris Murphy
Chris Murphy posted on Mon, 30 Sep 2013 23:26:16 -0600 as excerpted:

> The other thing: clearly the OP is surprised it's taking anywhere near
> this long. Had he known in advance, he probably would have made a
> different choice.

I had a longer version that I wrote first, but decided it was /too/ long
to post as it was. In it I compared the seconds to a couple of minutes
it takes to do a full btrfs balance here with the days I'm seeing
reported on-list. That's down to two reasons, AFAIK: the fact that I'm
running SSDs, and the fact that I partition things up so that even for
my backups on spinning rust, I'm looking at perhaps a couple of hours,
not days, for a full balance, or, pre-btrfs, for an mdraid (raid1)
return from degraded mode.

The point there was that when a balance or raid rebuild takes seconds,
minutes or hours, it's feasible and likely to do it as a test or as part
of routine maintenance, before things get as bad as terabytes of
over-allocation. As a result, I actually know what my normal balance
times look like, since I do them reasonably often, something that
someone on a system where it's going to take days isn't likely to have
the option or luxury of doing.

So there's a benefit in partitioning if possible, even if SSDs aren't a
current option, or in otherwise managing the data scale so that
maintenance times are at very worst on the scale of hours, not days, and
preferably at the low end of that range (based on my mdraid days, a
practical limit for me is a couple of hours; beyond that it's simply too
long to do routinely enough to stay familiar with the typical times). Of
course that can't be done for all use cases, but if it's at all
possible, simply managing the scale helps a *LOT*.

As I said, the practical wall before things go off the not-routinely-
manageable end for me is about two hours. A pre-deployment test can give
one an idea of the time scale, and from there... at least I'd have some
idea whether it'd take a day or longer, and if it did, that something
was definitely wrong.

Of course, if this /is/ part of that pre-deployment testing, or even
btrfs development testing, as is appropriate at this stage of btrfs
development, then points to him for doing that testing and finding,
before actual deployment, that either something is badly wrong or he's
simply off the practical end of the size scale. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
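For what it's worth, that kind of baseline is cheap to establish before
deployment; a rough sketch (the mount point and data set are
hypothetical) is simply to time a full balance on a test filesystem
loaded with a representative amount of data:

  $ time sudo btrfs balance start /mnt/test

If that already takes more than an hour or two at the intended scale,
the maintenance operations discussed in this thread (balance, device
delete) can be expected to be at least as painful on the production
volume.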