Thomas Schmidt
2011-Nov-17 00:27 UTC
[RFC] improve space utilization on off-sized raid devices
I wrote a small patch to improve allocation on differently sized raid devices.

With 2.6.38 I frequently ran into a no-space-left error that I attribute to
this, but I'm not entirely sure. The fs was an 8-device -d raid0 -m raid10.
The used space was the same across all devices; 5 were full and the 3 bigger
ones still had plenty of space. I was unable to use the remaining space, and
a balance did not fix it for long.

Now I am trying to avoid getting there again.

The basic idea is to not allocate space on the devices with the least free
space. The number of devices to leave out is calculated on each allocation to
adjust to changing circumstances. It leaves out the minimum number that can
still achieve full space usage.

Additionally, I thought leaving at least one out might be of use in device
removal.

Please take extra care with this. I'm new to btrfs, the kernel and C in
general. It was written and tested with 3.0.0.

--- volumes.c.orig	2011-10-07 16:50:04.000000000 +0200
+++ volumes.c	2011-11-16 23:49:08.097085568 +0100
@@ -2329,6 +2329,8 @@ static int __btrfs_alloc_chunk(struct bt
 	u64 stripe_size;
 	u64 num_bytes;
 	int ndevs;
+	u64 fs_total_avail;
+	int opt_ndevs;
 	int i;
 	int j;
 
@@ -2404,6 +2406,7 @@ static int __btrfs_alloc_chunk(struct bt
 	 * about the available holes on each device.
 	 */
 	ndevs = 0;
+	fs_total_avail = 0;
 	while (cur != &fs_devices->alloc_list) {
 		struct btrfs_device *device;
 		u64 max_avail;
@@ -2448,6 +2451,7 @@ static int __btrfs_alloc_chunk(struct bt
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
 		++ndevs;
+		fs_total_avail += total_avail;
 	}
 
 	/*
@@ -2456,6 +2460,20 @@ static int __btrfs_alloc_chunk(struct bt
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
+	/*
+	 * do not allocate space on all devices
+	 * instead balance free space to maximise space utilization
+	 * (this needs tweaking if parity raid gets implemented
+	 * for n parity ignore the n first (after sort) devs in the sum and division)
+	 */
+	opt_ndevs = fs_total_avail / devices_info[0].total_avail;
+	if (opt_ndevs >= ndevs)
+		opt_ndevs = ndevs - 1; //optional, might be used for faster dev remove?
+	if (opt_ndevs < devs_min)
+		opt_ndevs = devs_min;
+	if (ndevs > opt_ndevs)
+		ndevs = opt_ndevs;
+
 	/* round down to number of usable stripes */
 	ndevs -= ndevs % devs_increment;
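For anyone who wants to play with the selection logic outside the kernel, here
is a minimal userspace sketch of the calculation the patch performs after the
sort. The free-space numbers and the devs_min/devs_increment values are
made-up examples, not taken from a real filesystem, and the sketch only
illustrates the idea; it is not the kernel code path.

#include <stdio.h>
#include <stdlib.h>

/* sort free space largest first, standing in for the kernel's sort() of
 * devices_info via btrfs_cmp_device_info */
static int cmp_desc(const void *a, const void *b)
{
	unsigned long long x = *(const unsigned long long *)a;
	unsigned long long y = *(const unsigned long long *)b;
	return (x < y) - (x > y);
}

int main(void)
{
	/* free space per device in GB: 3 larger devices, 5 nearly full ones */
	unsigned long long avail[] = { 600, 500, 400, 100, 100, 100, 100, 100 };
	int ndevs = 8;
	int devs_min = 2, devs_increment = 1;	/* assumed values for -d raid0 */
	unsigned long long fs_total_avail = 0;
	int i, opt_ndevs;

	for (i = 0; i < ndevs; i++)
		fs_total_avail += avail[i];
	qsort(avail, ndevs, sizeof(avail[0]), cmp_desc);

	/* widest stripe that can still drain all of the free space over time */
	opt_ndevs = fs_total_avail / avail[0];
	if (opt_ndevs >= ndevs)
		opt_ndevs = ndevs - 1;	/* the optional "leave one out" clamp */
	if (opt_ndevs < devs_min)
		opt_ndevs = devs_min;
	if (ndevs > opt_ndevs)
		ndevs = opt_ndevs;
	/* round down to a usable number of stripes, as the existing code does */
	ndevs -= ndevs % devs_increment;

	printf("stripe the next chunk across %d devices\n", ndevs);
	return 0;
}

With these numbers it prints 3, i.e. the five nearly full devices are left out
of the next chunk.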
Arne Jansen
2011-Nov-17 07:42 UTC
Re: [RFC] improve space utilization on off-sized raid devices
On 17.11.2011 01:27, Thomas Schmidt wrote:
> I wrote a small patch to improve allocation on differently sized raid
> devices.
>
> With 2.6.38 I frequently ran into a no-space-left error that I attribute
> to this, but I'm not entirely sure. The fs was an 8-device -d raid0
> -m raid10. The used space was the same across all devices; 5 were full
> and the 3 bigger ones still had plenty of space. I was unable to use the
> remaining space, and a balance did not fix it for long.

Did you also test with 3.0? In 3.0, the allocation strategy changed vastly.
In your setup, it should stripe to all 8 devices until the 5 smaller ones
are full, and from then on stripe to the 3 remaining devices. See commit

commit 73c5de0051533cbdf2bb656586c3eb21a475aa7d
Author: Arne Jansen <sensille@gmx.net>
Date:   Tue Apr 12 12:07:57 2011 +0200

    btrfs: quasi-round-robin for chunk allocation

Also, using raid1 instead of raid10 will yield better space utilization.

-Arne
Thomas Schmidt
2011-Nov-17 11:53 UTC
Re: [RFC] improve space utilization on off-sized raid devices
-------- Original Message --------
> Date: Thu, 17 Nov 2011 08:42:48 +0100
> From: Arne Jansen <sensille@gmx.net>
> To: Thomas Schmidt <Schmidt-T@gmx.de>
> CC: linux-btrfs@vger.kernel.org
> Subject: Re: [RFC] improve space utilization on off-sized raid devices

> On 17.11.2011 01:27, Thomas Schmidt wrote:
> > With 2.6.38 I frequently ran into a no-space-left error
> Did you also test with 3.0? In 3.0, the allocation strategy changed
> vastly. In your setup, it should stripe to all 8 devices until the
> 5 smaller ones are full, and from then on stripe to the 3 remaining
> devices. See commit
>
> commit 73c5de0051533cbdf2bb656586c3eb21a475aa7d
> Author: Arne Jansen <sensille@gmx.net>
> Date:   Tue Apr 12 12:07:57 2011 +0200
>
>     btrfs: quasi-round-robin for chunk allocation
>
> Also, using raid1 instead of raid10 will yield better space utilization.

No, I did not test whether the problem occurred with vanilla 3.0.0. I only
compared the code and saw no reason why the behavior should have changed
(for my case). The sorting is the basis of my idea, but the order does not
matter if you allocate on all devices anyway (as with an even number of
devs).

Afaik the behavior you describe is exactly the problem: it wants to
continue with 3 devices, but according to the code raid10 requires 4.

I can't use the actual fs (or devs) the problem happened on, but I will try
a small-scale test on some files. As I currently have my patch in use, I
will have to wait until I can reboot.
Arne Jansen
2011-Nov-17 12:59 UTC
Re: [RFC] improve space utilization on off-sized raid devices
On 17.11.2011 12:53, Thomas Schmidt wrote:
>
>> On 17.11.2011 01:27, Thomas Schmidt wrote:
>> In your setup, it should stripe to all 8 devices until the 5 smaller
>> ones are full, and from then on stripe to the 3 remaining devices.
>
> Afaik the behavior you describe is exactly the problem: it wants to
> continue with 3 devices, but according to the code raid10 requires 4.

Right you are. So you want to sacrifice stripe size for space efficiency.
Why don't you just use RAID1?

Instead of reducing the stripe size for the majority of writes, I'd prefer
to allow RAID10 to go down to 2 disks. This should also solve it.
Thomas Schmidt
2011-Nov-17 14:06 UTC
Re: [RFC] improve space utilization on off-sized raid devices
-------- Original Message --------
> Date: Thu, 17 Nov 2011 13:59:26 +0100
> From: Arne Jansen <sensille@gmx.net>
> To: Thomas Schmidt <Schmidt-T@gmx.de>
> CC: linux-btrfs@vger.kernel.org
> Subject: Re: [RFC] improve space utilization on off-sized raid devices

> Right you are. So you want to sacrifice stripe size for space efficiency.
> Why don't you just use RAID1?
> Instead of reducing the stripe size for the majority of writes, I'd
> prefer to allow RAID10 to go down to 2 disks. This should also solve it.

Yes, that's my trade. With my patch I still have striping across 6 devices
for metadata (6-7 for data), which is faster than the 2 that raid1 would
give me. Since 6 drives already saturate my bus, it's a very good trade in
my case.

While implementing your "degenerate raid0/10" would somewhat lessen the
problem (and fix it for me), it would not fix it in general. But
implementing it might still be a good idea.

Consider a 4-dev setup, 3 1TB drives and 1 2TB drive, using -m raid1:
-d raid0: 80% capacity, striped 4 ways.
-d single: 100%, but no striping.
My patch: 100%, striped 3 ways, a good trade imho.

I don't think such a setup is unlikely enough to ignore: a home user will
simply buy the drive with the best space/cost ratio whenever he needs
space, leading exactly to the described situation. Adding a newly bought
2T drive to my 3x1T setup, only to see that only half of it can be used,
would really piss me off.

Note that if the (optional) first "if" is removed, I only reduce the width
if it is required to reach 100% capacity. At least that's the intention;
it might need some tweaking.

According to the (hackish) simulator I used to test this, the average
stripe width sacrificed on setups of 5+ unmatched devices is typically
below 2.
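To make the 80% vs. 100% figures in that 4-device example concrete, here is a
rough userspace model of the two strategies: stripe to every device that still
has free space (as the 3.0 allocator is described to do for raid0 earlier in
the thread) versus limit the stripe width as in the first version of the
patch. It is only a sketch under simplified assumptions (1 GB-wide stripes,
devs_min = 2 for raid0), not the kernel allocator, so the exact numbers it
prints are approximate.

#include <stdio.h>
#include <stdlib.h>

#define NDEVS		4
#define DEVS_MIN	2	/* assumed minimum stripe count for raid0 */

static int cmp_desc(const void *a, const void *b)
{
	long long x = *(const long long *)a;
	long long y = *(const long long *)b;
	return (x < y) - (x > y);
}

/* keep allocating 1 GB-wide stripes until fewer than DEVS_MIN devices have
 * free space left; if use_patch is set, limit the stripe width like the
 * first version of the patch does */
static long long fill(int use_patch)
{
	long long avail[NDEVS] = { 2000, 1000, 1000, 1000 };	/* GB free */
	long long used = 0;

	for (;;) {
		long long total = 0;
		int i, ndevs = 0, width, opt;

		qsort(avail, NDEVS, sizeof(avail[0]), cmp_desc);
		for (i = 0; i < NDEVS; i++) {
			if (avail[i] > 0)
				ndevs++;
			total += avail[i];
		}
		if (ndevs < DEVS_MIN)
			break;
		width = ndevs;
		if (use_patch) {
			opt = total / avail[0];
			if (opt >= ndevs)
				opt = ndevs - 1;	/* the optional clamp */
			if (opt < DEVS_MIN)
				opt = DEVS_MIN;
			if (width > opt)
				width = opt;
		}
		/* one 1 GB stripe on each of the selected devices */
		for (i = 0; i < width; i++)
			avail[i] -= 1;
		used += width;
	}
	return used;
}

int main(void)
{
	printf("stripe to all devices: %lld of 5000 GB usable\n", fill(0));
	printf("width-limited (patch): %lld of 5000 GB usable\n", fill(1));
	return 0;
}

On this model the first strategy strands most of the second terabyte of the
big drive, while the width-limited one uses (almost) everything.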
Phillip Susi
2011-Nov-17 18:27 UTC
Re: [RFC] improve space utilization on off-sized raid devices
On 11/17/2011 7:59 AM, Arne Jansen wrote:
> Right you are. So you want to sacrifice stripe size for space efficiency.
> Why don't you just use RAID1?
> Instead of reducing the stripe size for the majority of writes, I'd
> prefer to allow RAID10 to go down to 2 disks. This should also solve it.

Yes, it appears that btrfs's current idea of raid10 is actually raid0+1,
not raid10. If it were proper raid10, it could use the remaining space on
the 3 larger disks for a raid10 metadata chunk.
Arne Jansen
2011-Dec-01 08:55 UTC
Re: [RFC] improve space utilization on off-sized raid devices
On 17.11.2011 15:06, Thomas Schmidt wrote:
> -------- Original Message --------
>> Date: Thu, 17 Nov 2011 13:59:26 +0100
>> From: Arne Jansen <sensille@gmx.net>
>> To: Thomas Schmidt <Schmidt-T@gmx.de>
>> CC: linux-btrfs@vger.kernel.org
>> Subject: Re: [RFC] improve space utilization on off-sized raid devices
>
> Consider a 4-dev setup, 3 1TB drives and 1 2TB drive, using -m raid1:
> -d raid0: 80% capacity, striped 4 ways.
> -d single: 100%, but no striping.
> My patch: 100%, striped 3 ways, a good trade imho.
>
> I don't think such a setup is unlikely enough to ignore: a home user will
> simply buy the drive with the best space/cost ratio whenever he needs
> space, leading exactly to the described situation. Adding a newly bought
> 2T drive to my 3x1T setup, only to see that only half of it can be used,
> would really piss me off.
>
> Note that if the (optional) first "if" is removed, I only reduce the
> width if it is required to reach 100% capacity. At least that's the
> intention; it might need some tweaking.
>
> According to the (hackish) simulator I used to test this, the average
> stripe width sacrificed on setups of 5+ unmatched devices is typically
> below 2.

As RAID0 is already not a strict "all disks or none", I like the idea of
making it even more dynamic to reach full optimization. But I'd like to see
some properties conserved:

a) In case of evenly sized disks, the stripes should always be full size,
   not n - 1.
b) Minor variations in the used space per disk due to metadata chunks
   should not lead to deviation from a).
c) The algorithm should not give weird results under unconventional
   setups. Some theoretical background would be nice :)

It might well be that your algorithm is already close :)

-Arne
Thomas Schmidt
2012-Jan-24 17:15 UTC
Re: [RFC] improve space utilization on off-sized raid devices
On Thursday 01 December 2011 09:55:27 Arne Jansen wrote:
> As RAID0 is already not a strict "all disks or none", I like the idea of
> making it even more dynamic to reach full optimization. But I'd like to
> see some properties conserved:
> a) In case of evenly sized disks, the stripes should always be full size,
>    not n - 1.
> b) Minor variations in the used space per disk due to metadata chunks
>    should not lead to deviation from a).
> c) The algorithm should not give weird results under unconventional
>    setups. Some theoretical background would be nice :)

Sorry to only get back to you now; I must have missed your mail somehow.

The problem is the shrinking stripe width with unmatched devices. Once it
hits devs_min - 1, it's over. My solution is to try to keep the stripe
width constant. The sorting then takes care of selecting the right devices.

It's simply: space / min-height = max-width

a) is dictated by the math.

Since circumstances change (add, rm devs, rounding, ...), it is calculated
again at every allocation. The result is then rounded to the nearest
multiple of devs_increment. This takes care of b).

The code may look weird, but considered together with the round-down
already present in the line after my patch, it should be identical to the
mathematical floor(space / min-height + increment/2).

The two ifs should safeguard against weird stuff by limiting the result to
sane values.

I include an updated patch below. It's again written for and tested with
3.0.0, but diff3 worked nicely for applying it to 3.3-rc1.

--- volumes.c.orig	2012-01-20 16:59:31.000000000 +0100
+++ volumes.c	2012-01-24 11:24:07.261401805 +0100
@@ -2329,6 +2329,8 @@
 	u64 stripe_size;
 	u64 num_bytes;
 	int ndevs;
+	u64 fs_total_avail;
+	int opt_ndevs;
 	int i;
 	int j;
 
@@ -2448,6 +2450,7 @@
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
 		++ndevs;
+		fs_total_avail += total_avail;
 	}
 
 	/*
@@ -2456,6 +2459,16 @@
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
+	/*
+	 * do not allocate space on all devices
+	 * instead balance free space to maximise space utilization
+	 */
+	opt_ndevs = (fs_total_avail*2 + devs_increment*devices_info[0].total_avail) / (devices_info[0].total_avail*2);
+	if (opt_ndevs < devs_min)
+		opt_ndevs = devs_min;
+	if (ndevs > opt_ndevs)
+		ndevs = opt_ndevs;
+
 	/* round down to number of usable stripes */
 	ndevs -= ndevs % devs_increment;
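For what it's worth, the claimed equivalence between the integer expression and
floor(space / min-height + increment/2) is easy to check in userspace. The
sample values below are arbitrary; note also that this sketch zeroes its
running total before summing, which the first version of the patch did for
fs_total_avail in a separate hunk but the updated diff as quoted here does not
show.

#include <stdio.h>
#include <math.h>

/* the integer form used in the patch */
static unsigned long long int_form(unsigned long long total,
				   unsigned long long max, int increment)
{
	return (total * 2 + (unsigned long long)increment * max) / (max * 2);
}

int main(void)
{
	unsigned long long avail[] = { 2000, 1000, 1000, 1000 };  /* GB, sorted */
	unsigned long long total = 0;	/* must start at zero before summing */
	unsigned long long max;
	int increment = 1;		/* raid0; raid10 would use 2 */
	int i;

	for (i = 0; i < 4; i++)
		total += avail[i];
	max = avail[0];			/* largest free space after the sort */

	printf("integer form: %llu\n", int_form(total, max, increment));
	printf("floor form:   %.0f\n",
	       floor((double)total / max + increment / 2.0));
	return 0;
}

Both print 3 for these values (5000 / 2000 = 2.5, rounded to the nearest
integer), which matches the 3-way striping from the 4-device example earlier
in the thread.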
Thomas Schmidt
2012-Jan-24 21:01 UTC
Re: [RFC] improve space utilization on off-sized raid devices
On Thursday 01 December 2011 09:55:27 Arne Jansen wrote:
> As RAID0 is already not a strict "all disks or none", I like the idea of
> making it even more dynamic to reach full optimization. But I'd like to
> see some properties conserved:
> a) In case of evenly sized disks, the stripes should always be full size,
>    not n - 1.
> b) Minor variations in the used space per disk due to metadata chunks
>    should not lead to deviation from a).
> c) The algorithm should not give weird results under unconventional
>    setups. Some theoretical background would be nice :)

Resent because it did not appear on the ML for about 4 hours. KMail's
acting up.

Sorry to only get back to you now; I must have missed your mail somehow.

The problem is the shrinking stripe width with unmatched devices. Once it
hits devs_min - 1, it's over. My solution is to try to keep the stripe
width constant. The sorting then takes care of selecting the right devices.

It's simply: space / min-height = max-width

a) is dictated by the math.

Since circumstances change (add, rm devs, rounding, ...), it is calculated
again at every allocation. The result is then rounded to the nearest
multiple of devs_increment. This takes care of b).

The code may look weird, but considered together with the round-down
already present in the line after my patch, it should be identical to the
mathematical floor(space / min-height + increment/2).

The two ifs should safeguard against weird stuff by limiting the result to
sane values.

I include an updated patch below. It's again written for and tested with
3.0.0, but diff3 worked nicely for applying it to 3.3-rc1.

--- volumes.c.orig	2012-01-20 16:59:31.000000000 +0100
+++ volumes.c	2012-01-24 11:24:07.261401805 +0100
@@ -2329,6 +2329,8 @@
 	u64 stripe_size;
 	u64 num_bytes;
 	int ndevs;
+	u64 fs_total_avail;
+	int opt_ndevs;
 	int i;
 	int j;
 
@@ -2448,6 +2450,7 @@
 		devices_info[ndevs].total_avail = total_avail;
 		devices_info[ndevs].dev = device;
 		++ndevs;
+		fs_total_avail += total_avail;
 	}
 
 	/*
@@ -2456,6 +2459,16 @@
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
+	/*
+	 * do not allocate space on all devices
+	 * instead balance free space to maximise space utilization
+	 */
+	opt_ndevs = (fs_total_avail*2 + devs_increment*devices_info[0].total_avail) / (devices_info[0].total_avail*2);
+	if (opt_ndevs < devs_min)
+		opt_ndevs = devs_min;
+	if (ndevs > opt_ndevs)
+		ndevs = opt_ndevs;
+
 	/* round down to number of usable stripes */
 	ndevs -= ndevs % devs_increment;