I had a 4-drive RAID10 btrfs setup that I added a fifth drive to with
the "btrfs device add" command. Once the device was added, I used the
balance command to distribute the data across the drives. This resulted
in an apparently infinite run of the btrfs tool, with data moving back
and forth across the drives over and over again. When using the "btrfs
filesystem show" command, I could see the same pattern repeated in the
byte counts on each of the drives.

It would probably add more complexity to the code, but adding a check
for loops like this may be handy. While a 5-drive RAID10 array is a
weird configuration (I'm waiting for a case with 6 bays), it _should_
be possible with filesystems like BTRFS. In my head, the distribution
of data would be uneven across drives, but the duplicate and stripe
count should be even at the end. I'd imagine it to look something like
this:

D1: A1 B1 C1 D1
D2: A1 B1 C1 E1
D3: A2 B2 D1 E1
D4: A2 C2 D2 E2
D5: B2 C2 D2 E2

This is obviously oversimplified, but the general idea is the same. I
haven't looked into the way the "RAID"ing of objects works in BTRFS
yet, but because it's a filesystem and not a block-based system, it
should be smart enough to care only about the duplication and striping
of data, and not the actual block-level or extent-level balancing.
Thoughts?

Thanks in advance!
Tom
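As a rough illustration of the end state sketched above, here is a toy
Python model of a RAID10-style chunk allocator. It is not the actual
btrfs allocator; the device sizes, chunk count, and the "pick the four
emptiest devices" rule are all assumptions made purely for
illustration. Each chunk is two copies of two stripes, placed on the
four devices with the most free space.

# Toy model of RAID10-style chunk allocation on 5 equal devices.
# Each chunk needs 4 devices: 2 stripes x 2 copies. This is NOT the
# real btrfs allocator, just a simplified sketch.
def allocate(num_devices=5, num_chunks=5, dev_size=10):
    free = {d: dev_size for d in range(1, num_devices + 1)}
    layout = {d: [] for d in free}
    for c in range(num_chunks):
        name = chr(ord('A') + c)
        # Pick the 4 devices with the most free space (ties broken by id).
        targets = sorted(free, key=lambda d: (-free[d], d))[:4]
        for i, dev in enumerate(targets):
            stripe = 1 + i // 2      # first two devices hold stripe 1, next two stripe 2
            layout[dev].append(f"{name}{stripe}")
            free[dev] -= 1
    for dev in sorted(layout):
        print(f"D{dev}:", " ".join(layout[dev]))

allocate()

On five equal drives this settles into an even per-device chunk count,
much like the D1..D5 table above, even though no two devices hold
exactly the same chunks.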
I've noticed similar behavior even when RAID0'ing an odd number of
devices, which should be an even more trivial case in practice. You
would expect something like:

sda A1 B1
sdb A2 B2
sdc A3 B3

or at least, if BTRFS can only handle block pairs:

sda A1 B2
sdb A2 C1
sdc B1 C2

But the end result was that disk usage and reporting went all out of
whack, allocation reporting got confused and started returning
impossible values, and very shortly afterwards the entire FS was
corrupted. Rebalancing messed everything up royally, and in the end I
concluded to simply not use an odd number of drives with BTRFS.

I also tried RAID1 with an odd number of drives, expecting to have 2
redundant mirrors. Instead, the end result was that the blocks were
still only allocated in pairs, and since they were allocated
round-robin on the drives, I completely lost the ability to remove any
single drive from the array without data loss.

i.e. instead of:

sda A1 B1
sdb A1 B1
sdc A1 B1

it ended up doing:

sda A1 B1
sdb A1 C1
sdc B1 C1

meaning removing any 1 drive would result in lost data.

I was told by a dev at Linuxconf that this issue should have been
resolved a while ago; however, this test of mine was only about 2
months ago.
Sorry, I meant "removing 2 drives" in the RAID1-with-3-drives example.
On Tue, Feb 21, 2012 at 11:45:51AM +1100, Wes wrote:
> I've noticed similar behavior even when RAID0'ing an odd number of
> devices, which should be an even more trivial case in practice. You
> would expect something like:
> sda A1 B1
> sdb A2 B2
> sdc A3 B3

This is what it should do -- it'll use as many disks as it can find to
put stripes across at the time the allocator is asked to make another
block group.

> or at least, if BTRFS can only handle block pairs:
> [...]
> But the end result was that disk usage and reporting went all out of
> whack, allocation reporting got confused and started returning
> impossible values, and very shortly afterwards the entire FS was
> corrupted. Rebalancing messed everything up royally, and in the end
> I concluded to simply not use an odd number of drives with BTRFS.

I can't see why that should have happened. What kernel were you doing
this with?

> I also tried RAID1 with an odd number of drives, expecting to have 2
> redundant mirrors.

This isn't a valid expectation. Or rather, you can expect it, but it's
not what btrfs is designed to deliver. Btrfs's RAID-1 implementation is
*precisely two* copies. Hence it isn't really much like RAID-1, as
you've found out.

> Instead, the end result was that the blocks were still only allocated
> in pairs, and since they were allocated round-robin on the drives, I
> completely lost the ability to remove any single drive from the array
> without data loss.
> [...]
> meaning removing any 1 drive would result in lost data.

(Any 2 drives, as you corrected in your subsequent email.)

However, you can remove any one drive, and your data is fine, which is
what btrfs's RAID-1 guarantee is. I understand that there will be
additional features coming along Real Soon Now (possibly at the same
time that RAID-5 and -6 are integrated) which will allow the selection
of larger numbers of copies.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
     --- People are too unreliable to be replaced by machines. ---
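To make the "precisely two copies" point concrete, here is a small
hypothetical Python sketch (again, not btrfs code; drive names, chunk
names, and the placement rule are assumptions) that places each chunk
on two of three drives round-robin, the pattern Wes observed, and then
checks which drive failures still leave a surviving copy of every
chunk.

from itertools import combinations

# Toy model: btrfs-style "RAID-1" keeps exactly two copies of each
# chunk, placed on the two devices with the most free space.
drives = ["sda", "sdb", "sdc"]
chunks = ["A", "B", "C", "D", "E", "F"]

free = {d: 10 for d in drives}
placement = {}                     # chunk -> set of drives holding a copy
for c in chunks:
    targets = sorted(drives, key=lambda d: (-free[d], d))[:2]
    placement[c] = set(targets)
    for d in targets:
        free[d] -= 1

for lost in (1, 2):
    for failed in combinations(drives, lost):
        survivors = set(drives) - set(failed)
        ok = all(placement[c] & survivors for c in placement)
        print(f"lose {'+'.join(failed)}: {'data intact' if ok else 'data lost'}")

Running this prints "data intact" for every single-drive failure and
"data lost" for every two-drive failure, which is exactly the guarantee
Hugo describes.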
I figured you meant that.

Using RAID1 on N drives would normally mean all drives have a copy of
the object. The upshot of this is that you can lose N-1 drives and
still access the data. In systems like ZFS or BTRFS you would also
expect a read speed of N*, since you could theoretically read from all
drives in parallel as long as the checksum is valid.

It seems from the BTRFS documentation that the RAID1 profile is
actually "mirror", i.e. store 2 copies of the object. Perhaps when
Oracle makes BTRFS a production option they should spell that out more
clearly.

So, if the fixes were done at Linuxconf, would we be looking at a 3.3
or a 3.4 release?
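A quick back-of-the-envelope comparison of the two interpretations, as
a hypothetical sketch (the drive count and sizes are made up for
illustration):

# Compare classic "mirror across all N drives" RAID1 with btrfs-style
# "exactly two copies" on the same hardware: 3 x 1000 GB drives.
n_drives, drive_gb = 3, 1000

classic_usable = drive_gb                  # every drive holds a full copy
classic_failures = n_drives - 1            # survive losing all but one drive

btrfs_usable = n_drives * drive_gb // 2    # each chunk stored exactly twice
btrfs_failures = 1                         # only one copy may be lost safely

print(f"classic RAID1 : {classic_usable} GB usable, tolerates {classic_failures} failed drives")
print(f"btrfs 'RAID1' : {btrfs_usable} GB usable, tolerates {btrfs_failures} failed drive")

The trade-off is plain: the btrfs profile gives more usable space, but
only single-drive fault tolerance, however many drives are in the pool.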
On Mon, Feb 20, 2012 at 07:35:18PM -0500, Tom Cameron wrote:
> I had a 4-drive RAID10 btrfs setup that I added a fifth drive to with
> the "btrfs device add" command. Once the device was added, I used the
> balance command to distribute the data across the drives. This
> resulted in an apparently infinite run of the btrfs tool, with data
> moving back and forth across the drives over and over again. When
> using the "btrfs filesystem show" command, I could see the same
> pattern repeated in the byte counts on each of the drives.

The balance operation should be guaranteed to complete. At least, it
does these days (back in the 2.6.35 days, it didn't always complete).
Having a repeating pattern of byte counts isn't necessarily a sign that
it's stuck in an infinite loop. It was probably just taking a very long
time.

If you use 3.3-rc4, and apply the restriper patches to the userspace
tools, you can use the new restriper code, which adds (amongst many
other things) a progress counter to balances.

> It would probably add more complexity to the code, but adding a check
> for loops like this may be handy. While a 5-drive RAID10 array is a
> weird configuration (I'm waiting for a case with 6 bays), it _should_
> be possible with filesystems like BTRFS.

Indeed it should. I've not tested it yet myself, though.

> In my head, the distribution of data would be uneven across drives,
> but the duplicate and stripe count should be even at the end. I'd
> imagine it to look something like this:
>
> D1: A1 B1 C1 D1
> D2: A1 B1 C1 E1
> D3: A2 B2 D1 E1
> D4: A2 C2 D2 E2
> D5: B2 C2 D2 E2

Yup, that's about right. Except that the empty spaces aren't there, so
it'll look more like this:

D1: A1 B1 C1 D1
D2: A1 B1 C1 E1
D3: A2 B2 D1 E1
D4: A2 C2 D2 E2
D5: B2 C2 D2 E2

> This is obviously oversimplified, but the general idea is the same. I
> haven't looked into the way the "RAID"ing of objects works in BTRFS
> yet,

See the "SysadminGuide" on the wiki[1] for a fuller explanation. I
should probably expand the example to show the case with odd numbers of
drives (and possibly with unbalanced disk sizes too).

> but because it's a filesystem and not a block-based system, it should
> be smart enough to care only about the duplication and striping of
> data, and not the actual block-level or extent-level balancing.

Hugo.

[1] http://btrfs.ipv5.de/index.php?title=SysadminGuide

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- I'd make a joke about UDP, but I don't know if ---
                 anyone's actually listening...
On Mon, Feb 20, 2012 at 8:07 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
>
> However, you can remove any one drive, and your data is fine, which
> is what btrfs's RAID-1 guarantee is. I understand that there will be
> additional features coming along Real Soon Now (possibly at the same
> time that RAID-5 and -6 are integrated) which will allow the
> selection of larger numbers of copies.

Is there a projected timeframe for RAID5/6? I understand it's currently
not the development focus of the BTRFS team, and most organizations
want performance over capacity, making RAID10 the clear choice. But
there are still some situations where RAID6 is better suited (large
pools of archive storage).

Also, do we know whether the RAID5/6 implementation will simply break
data into two data objects and one or two parity objects, or will it
work with an arbitrary number of devices? Meaning, if I have a RAID6
pool of 12 drives, will I get 10 data objects and two parity objects?

Thanks all for your replies!
Tom
On 02/21/2012 08:45 AM, Wes wrote:
> [...]
> I also tried RAID1 with an odd number of drives, expecting to have 2
> redundant mirrors. Instead, the end result was that the blocks were
> still only allocated in pairs, and since they were allocated
> round-robin on the drives, I completely lost the ability to remove
> any single drive from the array without data loss.
> [...]
> meaning removing any 1 drive would result in lost data.

Removing any disk will not lose data, because btrfs ensures that all
the data on the removed disk is safely placed elsewhere first. And if
there is not enough remaining space for the data, the remove operation
will fail. Or what am I missing?

thanks,
liubo
On Mon, Feb 20, 2012 at 08:13:43PM -0500, Tom Cameron wrote:
> Is there a projected timeframe for RAID5/6? I understand it's
> currently not the development focus of the BTRFS team, and most
> organizations want performance over capacity, making RAID10 the clear
> choice. But there are still some situations where RAID6 is better
> suited (large pools of archive storage).

Rumour has it that it's the next major thing after btrfsck is out of
the door. I don't know how accurate that is. I'm just some bloke on the
Internet. :)

> Also, do we know whether the RAID5/6 implementation will simply break
> data into two data objects and one or two parity objects, or will it
> work with an arbitrary number of devices? Meaning, if I have a RAID6
> pool of 12 drives, will I get 10 data objects and two parity objects?

AFAIK, the original implementation looked something like the RAID-0
code, so if you have n drives with space for the next block group,
it'll take all n drives to use for the block group. Parity is then
allocated out of those n (with the distribution of the parity blocks
across different drives, as RAID-5 and -6 should do).

So, allocating a RAID-6 block group of width 1G on your example
12-drive machine, you will indeed end up with 10G of space in that
block group, and 2G of parity data spread across all 12 drives.

I don't know if the code that will be delivered will allow you to set a
smaller fixed-size stripe width (e.g. 4 data + 2 parity over 8 drives).
If the 3-copies RAID-1 code rumour is also true, I would hope so.
Again, I'm just some bloke on the Internet...

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- I'd make a joke about UDP, but I don't know if ---
                 anyone's actually listening...
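As a quick sanity check of the arithmetic above, a hypothetical helper
(not btrfs code; the function name and stripe width are invented for
the example) computing usable data versus parity in a block group that
stripes across every available device:

def raid_block_group(n_drives, stripe_gb=1, parity=2):
    """Usable data and parity in one block group that stripes a
    stripe_gb-wide slice across every drive, RAID-5/6 style."""
    data_gb = (n_drives - parity) * stripe_gb
    parity_gb = parity * stripe_gb
    return data_gb, parity_gb

data, par = raid_block_group(12, stripe_gb=1, parity=2)
print(f"12-drive RAID-6 block group: {data} GB data + {par} GB parity")
# -> 12-drive RAID-6 block group: 10 GB data + 2 GB parity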
On Tue, Feb 21, 2012 at 09:16:40AM +0800, Liu Bo wrote:
> On 02/21/2012 08:45 AM, Wes wrote:
> > meaning removing any 1 drive would result in lost data.
>
> Removing any disk will not lose data, because btrfs ensures that all
> the data on the removed disk is safely placed elsewhere first. And if
> there is not enough remaining space for the data, the remove
> operation will fail. Or what am I missing?

The typo. :) He said he meant "removing any 2 drives" in the follow-up
mail.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- I'd make a joke about UDP, but I don't know if ---
                 anyone's actually listening...
@hugo

IIRC that was on ~3.0.8, but it might have been 3.0.0. I'll revisit the
RAID0 setup on a newer kernel series and test before making any more
claims. :)
On Tue, Feb 21, 2012 at 12:27:56PM +1100, Wes wrote:
> @hugo
>
> IIRC that was on ~3.0.8, but it might have been 3.0.0. I'll revisit
> the RAID0 setup on a newer kernel series and test before making any
> more claims. :)

There's a repeating pattern of three log messages that comes out in
your syslogs. It's something like two "found n extents" messages, and
then "moving block group yyyyyyyyyyyy". As long as you keep getting the
latter message with different numbers, it's still working OK. The block
group numbers are monotonically decreasing (if they go up again,
there's a problem we need to know about), but they aren't necessarily
linearly spaced, particularly if you've done a balance or partial
balance before. i.e. they're an indication that something's happening,
but not of how much more of it there is to go.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- I'd make a joke about UDP, but I don't know if ---
                 anyone's actually listening...
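A small hypothetical sketch of the kind of check described above: scan
the syslog for the block-group messages and warn if the numbers ever go
back up. The exact message wording varies between kernel versions, so
both the regular expression and the log file path below are assumptions
that may need adjusting for your system.

import re
import sys

# Assumed message format -- adjust the pattern to whatever your kernel
# actually prints (e.g. "moving block group 1234" or
# "relocating block group 1234").
BLOCK_GROUP_RE = re.compile(r"(?:moving|relocating) block group (\d+)")

def check_balance_progress(log_lines):
    """Return True if block group numbers only ever decrease."""
    last = None
    for line in log_lines:
        m = BLOCK_GROUP_RE.search(line)
        if not m:
            continue
        bg = int(m.group(1))
        if last is not None and bg > last:
            print(f"block group went up again: {last} -> {bg}", file=sys.stderr)
            return False
        last = bg
    return True

if __name__ == "__main__":
    with open("/var/log/syslog") as f:      # path is an assumption
        print("looks OK" if check_balance_progress(f) else "possible problem")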
Gareth,

I would completely agree. I only use the RAID vernacular here because,
well, it's the unfortunate de facto standard way to talk about data
protection.

I'd go a step beyond saying dupe or dupe + stripe, because future
modifications could conceivably see the addition of multiple duplicated
sets. The case of 4 disks in a BTRFS filesystem with dupe running
across all of them would be a clear extension I could see. So that
would be something like 4D. I'm not really sure what you'd use for the
terminology, but something completely different from RAID-like terms is
almost certainly best. Just look at the ZFS documentation to see how
carefully they have to spell out what RAID-Z, Z2, and Z3 do because
they used the RAID acronym.

On Mon, Feb 20, 2012 at 8:47 PM, Gareth Pye <gareth@cerberos.id.au> wrote:
> On Tue, Feb 21, 2012 at 12:07 PM, Tom Cameron <tomc603@gmail.com> wrote:
>>
>> It seems from the BTRFS documentation that the RAID1 profile is
>> actually "mirror", i.e. store 2 copies of the object. Perhaps when
>> Oracle makes BTRFS a production option they should spell that out
>> more clearly.
>
> I'd really like BTRFS to not use RAID level terminology anywhere
> (other than maybe in parentheses, along the lines of: "this is
> similar to RAIDX") and use less ambiguous options as the recommended
> way to talk about things, as there is good reason to talk about Dup
> and RAID1 differently: they aren't the same on more than 2 drives.
> Doing it that way will make people understand what is going on more
> often, which should be good.
>
> It also makes things much easier to remember. Like how much data can
> you fit on a 6-drive RAID10? I dunno, but I can more intuitively
> answer that same question when it is phrased as just simply 'dup', or
> maybe 'dup + stripe'.
>
> Is there a difference in BTRFS between dup and raid10?
>
> --
> Gareth Pye
> Level 2 Judge, Melbourne, Australia
> Australian MTG Forum: mtgau.com
> gareth@cerberos.id.au - www.rockpaperdynamite.wordpress.com
> "Dear God, I would like to file a bug report"
I'd probably want to use DupeX to refer to what was classically RAID1
(duplicate across all disks), where Dupe is an alias for Dupe2 but one
can also choose Dupe3 through Dupe99.

And I keep forgetting to post to the list in plain text, so many of you
may not have noticed my original email, which only exists on the
mailing list in the quotation in Tom's email.

-- 
Gareth Pye
Level 2 Judge, Melbourne, Australia
Australian MTG Forum: mtgau.com
gareth@cerberos.id.au - www.rockpaperdynamite.wordpress.com
"Dear God, I would like to file a bug report"
On Mon, Feb 20, 2012 at 08:59:05PM -0500, Tom Cameron wrote:
> I'd go a step beyond saying dupe or dupe + stripe, because future
> modifications could conceivably see the addition of multiple
> duplicated sets. [...] I'm not really sure what you'd use for the
> terminology, but something completely different from RAID-like terms
> is almost certainly best. Just look at the ZFS documentation to see
> how carefully they have to spell out what RAID-Z, Z2, and Z3 do
> because they used the RAID acronym.

/me opens a plate to put the can of worms on.

Some time ago, I proposed the following scheme:

<n>C<m>S<p>P

where n is the number of copies (suffixed by C), m is the number of
stripes for that data (suffixed by S), and p is the number of parity
blocks (suffixed by P). Values of zero are omitted.

So btrfs's RAID-1 would be 2C, RAID-0 would be 1CnS, RAID-5 would be
1CnS1P, and RAID-6 would be 1CnS2P. DUP would need a special indicator
to show that it wasn't redundant in the face of a whole-disk failure:
2CN.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- Great oxymorons of the world, no. 4: Future Perfect ---
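A hypothetical helper showing how the proposed notation maps onto the
familiar profiles. The mapping follows the examples given above; the
function itself and its parameter names are just an illustration, not
part of any btrfs tooling.

def replication_label(copies=1, stripes=0, parity=0, same_device=False):
    """Build an <n>C<m>S<p>P label. Pass stripes="n" for "as many
    stripes as there are devices"; same_device=True marks DUP-style
    copies that may live on a single disk."""
    label = f"{copies}C"
    if stripes:
        label += f"{stripes}S"
    if parity:
        label += f"{parity}P"
    if same_device:
        label += "N"
    return label

examples = {
    "RAID-1 (btrfs)": replication_label(copies=2),
    "RAID-0":         replication_label(copies=1, stripes="n"),
    "RAID-5":         replication_label(copies=1, stripes="n", parity=1),
    "RAID-6":         replication_label(copies=1, stripes="n", parity=2),
    "DUP":            replication_label(copies=2, same_device=True),
}
for name, label in examples.items():
    print(f"{name:>15}: {label}")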
On 21 February 2012 at 07:54, Hugo Mills wrote:
> Some time ago, I proposed the following scheme:
>
> <n>C<m>S<p>P
>
> where n is the number of copies (suffixed by C), m is the number of
> stripes for that data (suffixed by S), and p is the number of parity
> blocks (suffixed by P). Values of zero are omitted.
>
> So btrfs's RAID-1 would be 2C, RAID-0 would be 1CnS, RAID-5 would be
> 1CnS1P, and RAID-6 would be 1CnS2P. DUP would need a special
> indicator to show that it wasn't redundant in the face of a
> whole-disk failure: 2CN.

Seems clear. However, is the S really relevant? It would be simpler
without it, wouldn't it?

-- 
Xavier Nicollet
On Wednesday, 22 February 2012 09:56:27 Xavier Nicollet wrote:
> On 21 February 2012 at 07:54, Hugo Mills wrote:
> > Some time ago, I proposed the following scheme:
> > <n>C<m>S<p>P
> > [...]
>
> Seems clear. However, is the S really relevant? It would be simpler
> without it, wouldn't it?

It depends on how striping will be implemented. Generally it provides
information on how many spindles the data is using. With a static
configuration it will be useless, but when you start changing the
number of drives in the set, it's necessary to know whether you're
under- or over-utilising the disks.

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
On Wed, Feb 22, 2012 at 11:22:08AM +0100, Hubert Kario wrote:
> On Wednesday, 22 February 2012 09:56:27 Xavier Nicollet wrote:
> > Seems clear. However, is the S really relevant? It would be simpler
> > without it, wouldn't it?
>
> It depends on how striping will be implemented. Generally it provides
> information on how many spindles the data is using. With a static
> configuration it will be useless, but when you start changing the
> number of drives in the set, it's necessary to know whether you're
> under- or over-utilising the disks.

Indeed. If the implementation always uses the largest number of devices
possible, then we'll always have nS. If it allows you to set a fixed
number of devices for a stripe, then the n will be a fixed number, and
it becomes useful.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
        --- Happiness is mandatory. Are you happy? ---
Hugo Mills posted on Tue, 21 Feb 2012 01:21:48 +0000 as excerpted:

> On Mon, Feb 20, 2012 at 08:13:43PM -0500, Tom Cameron wrote:
>> Is there a projected timeframe for RAID5/6? I understand it's
>> currently not the development focus of the BTRFS team, and most
>> organizations want performance over capacity, making RAID10 the
>> clear choice. But there are still some situations where RAID6 is
>> better suited (large pools of archive storage).
>
> Rumour has it that it's the next major thing after btrfsck is out of
> the door. I don't know how accurate that is. I'm just some bloke on
> the Internet. :)

The report I read (on Phoronix; YMMV, but it was supposed to be from a
talk at SCALE, IIRC) said RAID-5/6 was planned for kernel 3.4 or 3.5,
with triple-copy mirroring said to piggyback on some of that code, so
presumably 3.5 or 3.6.

Triple-copy mirroring as a special case doesn't really make sense to
me, though. The first implementation as two-copy (dup) only makes
sense, but in generalizing that to allow triple copies, I'd think/hope
they'd generalize it to N-copy, IOW traditional RAID-1 style, instead.
I guess we'll see.

FWIW, I'm running on an older 4-spindle md-raid1 setup now, and I had
_hoped_ to convert that to 4-copy btrfs-raid1, but that's simply not
possible ATM, though a hybrid 2-copy btrfs on top of dual dual-spindle
md/raid1s is possible, if a bit complex. Given that the disks are older
300-gig SATA Seagates nearing half their rated run-hours according to
SMART (great on power and spin-up cycles, though), now's not the time
to switch them to dual-copy only! I'd think about triple-copy, but no
less!

Thus, I'm eagerly awaiting the introduction of tri- or preferably
N-copy raid1 mode, in 3.5-ish. But the various articles had led me to
believe that btrfs was almost ready to have the experimental label
removed, and it turns out not to be quite that far along, maybe
end-of-year if things go well, so letting btrfs continue to stabilize
in general while I wait certainly won't hurt. =:^)

Meanwhile, I'm staying on-list so as to keep informed of what else is
going on, btrfs-wise, while I wait for triple-copy mode, minimum.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman