I had a 4-drive RAID10 btrfs setup that I added a fifth drive to with
the "btrfs device add" command. Once the device was added, I used the
balance command to distribute the data across the drives. This resulted
in an apparently infinite run of the btrfs tool, with data moving back
and forth across the drives over and over again. When using the "btrfs
filesystem show" command, I could see the same pattern repeated in the
byte counts on each of the drives.

It would probably add more complexity to the code, but adding a check
for loops like this may be handy. While a 5-drive RAID10 array is a
weird configuration (I'm waiting for a case with 6 bays), it _should_
be possible with filesystems like BTRFS. In my head, the distribution
of data would be uneven across drives, but the duplicate and stripe
count should be even at the end. I'd imagine it to look something like
this:

D1: A1 B1 C1 D1
D2: A1 B1 C1 E1
D3: A2 B2 D1 E1
D4: A2 C2 D2 E2
D5: B2 C2 D2 E2

This is obviously oversimplified, but the general idea is the same. I
haven't looked into the way the "RAID"ing of objects works in BTRFS
yet, but because it's a filesystem and not a block-based system, it
should be smart enough to care only about the duplication and striping
of data, and not the actual block-level or extent-level balancing.
Thoughts?

Thanks in advance!
Tom
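As a rough illustration of the end state sketched above, here is a toy
Python model of a RAID10-style chunk allocator. It is not the actual
btrfs allocator; the device sizes, chunk count, and the "pick the four
emptiest devices" rule are all assumptions made purely for
illustration. Each chunk is two copies of two stripes, placed on the
four devices with the most free space.

# Toy model of RAID10-style chunk allocation on 5 equal devices.
# Each chunk needs 4 devices: 2 stripes x 2 copies. This is NOT the
# real btrfs allocator, just a simplified sketch.
def allocate(num_devices=5, num_chunks=5, dev_size=10):
    free = {d: dev_size for d in range(1, num_devices + 1)}
    layout = {d: [] for d in free}
    for c in range(num_chunks):
        name = chr(ord('A') + c)
        # Pick the 4 devices with the most free space (ties broken by id).
        targets = sorted(free, key=lambda d: (-free[d], d))[:4]
        for i, dev in enumerate(targets):
            stripe = 1 + i // 2      # first two devices hold stripe 1, next two stripe 2
            layout[dev].append(f"{name}{stripe}")
            free[dev] -= 1
    for dev in sorted(layout):
        print(f"D{dev}:", " ".join(layout[dev]))

allocate()

On five equal drives this settles into an even per-device chunk count,
much like the D1..D5 table above, even though no two devices hold
exactly the same chunks.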
I've noticed similar behavior even when RAID0'ing an odd number of
devices, which should be an even more trivial case in practice. You
would expect something like:

sda A1 B1
sdb A2 B2
sdc A3 B3

or at least, if BTRFS can only handle block pairs:

sda A1 B2
sdb A2 C1
sdc B1 C2

But the end result was that disk usage and reporting went all out of
whack, allocation reporting got confused and started returning
impossible values, and very shortly afterwards the entire FS was
corrupted. Rebalancing messed everything up royally, and in the end I
concluded to simply not use an odd number of drives with BTRFS.

I also tried RAID1 with an odd number of drives, expecting to have 2
redundant mirrors. Instead, the end result was that the blocks were
still only allocated in pairs, and since they were allocated
round-robin on the drives, I completely lost the ability to remove any
single drive from the array without data loss.

i.e. instead of:

sda A1 B1
sdb A1 B1
sdc A1 B1

it ended up doing:

sda A1 B1
sdb A1 C1
sdc B1 C1

meaning removing any 1 drive would result in lost data.

I was told by a dev at Linuxconf that this issue should have been
resolved a while ago; however, this test of mine was only about 2
months ago.
Sorry, I meant "removing 2 drives" in the RAID1-with-3-drives example.
On Tue, Feb 21, 2012 at 11:45:51AM +1100, Wes wrote:
> I've noticed similar behavior even when RAID0'ing an odd number of
> devices, which should be an even more trivial case in practice. You
> would expect something like:
> sda A1 B1
> sdb A2 B2
> sdc A3 B3

This is what it should do -- it'll use as many disks as it can find to
put stripes across at the time the allocator is asked to make another
block group.

> or at least, if BTRFS can only handle block pairs:
> [...]
> But the end result was that disk usage and reporting went all out of
> whack, allocation reporting got confused and started returning
> impossible values, and very shortly afterwards the entire FS was
> corrupted. Rebalancing messed everything up royally, and in the end
> I concluded to simply not use an odd number of drives with BTRFS.

I can't see why that should have happened. What kernel were you doing
this with?

> I also tried RAID1 with an odd number of drives, expecting to have 2
> redundant mirrors.

This isn't a valid expectation. Or rather, you can expect it, but it's
not what btrfs is designed to deliver. Btrfs's RAID-1 implementation is
*precisely two* copies. Hence it isn't really much like RAID-1, as
you've found out.

> Instead, the end result was that the blocks were still only allocated
> in pairs, and since they were allocated round-robin on the drives, I
> completely lost the ability to remove any single drive from the array
> without data loss.
> [...]
> meaning removing any 1 drive would result in lost data.

(Any 2 drives, as you corrected in your subsequent email.)

However, you can remove any one drive, and your data is fine, which is
what btrfs's RAID-1 guarantee is. I understand that there will be
additional features coming along Real Soon Now (possibly at the same
time that RAID-5 and -6 are integrated) which will allow the selection
of larger numbers of copies.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
     --- People are too unreliable to be replaced by machines. ---
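To make the "precisely two copies" point concrete, here is a small
hypothetical Python sketch (again, not btrfs code; drive names, chunk
names, and the placement rule are assumptions) that places each chunk
on two of three drives round-robin, the pattern Wes observed, and then
checks which drive failures still leave a surviving copy of every
chunk.

from itertools import combinations

# Toy model: btrfs-style "RAID-1" keeps exactly two copies of each
# chunk, placed on the two devices with the most free space.
drives = ["sda", "sdb", "sdc"]
chunks = ["A", "B", "C", "D", "E", "F"]

free = {d: 10 for d in drives}
placement = {}                     # chunk -> set of drives holding a copy
for c in chunks:
    targets = sorted(drives, key=lambda d: (-free[d], d))[:2]
    placement[c] = set(targets)
    for d in targets:
        free[d] -= 1

for lost in (1, 2):
    for failed in combinations(drives, lost):
        survivors = set(drives) - set(failed)
        ok = all(placement[c] & survivors for c in placement)
        print(f"lose {'+'.join(failed)}: {'data intact' if ok else 'data lost'}")

Running this prints "data intact" for every single-drive failure and
"data lost" for every two-drive failure, which is exactly the guarantee
Hugo describes.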
I figured you meant that.

Using RAID1 on N drives would normally mean all drives have a copy of
the object. The upshot of this is that you can lose N-1 drives and
still access the data. In systems like ZFS or BTRFS you would also
expect a read speed of N*, since you could theoretically read from all
drives in parallel as long as the checksum is valid.

It seems from the BTRFS documentation that the RAID1 profile is
actually "mirror", i.e. store 2 copies of the object. Perhaps when
Oracle makes BTRFS a production option they should spell that out more
clearly.

So, if the fixes were done at Linuxconf, would we be looking at a 3.3
or a 3.4 release?
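A quick back-of-the-envelope comparison of the two interpretations, as
a hypothetical sketch (the drive count and sizes are made up for
illustration):

# Compare classic "mirror across all N drives" RAID1 with btrfs-style
# "exactly two copies" on the same hardware: 3 x 1000 GB drives.
n_drives, drive_gb = 3, 1000

classic_usable = drive_gb                  # every drive holds a full copy
classic_failures = n_drives - 1            # survive losing all but one drive

btrfs_usable = n_drives * drive_gb // 2    # each chunk stored exactly twice
btrfs_failures = 1                         # only one copy may be lost safely

print(f"classic RAID1 : {classic_usable} GB usable, tolerates {classic_failures} failed drives")
print(f"btrfs 'RAID1' : {btrfs_usable} GB usable, tolerates {btrfs_failures} failed drive")

The trade-off is plain: the btrfs profile gives more usable space, but
only single-drive fault tolerance, however many drives are in the pool.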
On Mon, Feb 20, 2012 at 07:35:18PM -0500, Tom Cameron wrote:
> I had a 4-drive RAID10 btrfs setup that I added a fifth drive to with
> the "btrfs device add" command. Once the device was added, I used the
> balance command to distribute the data across the drives. This
> resulted in an apparently infinite run of the btrfs tool, with data
> moving back and forth across the drives over and over again. When
> using the "btrfs filesystem show" command, I could see the same
> pattern repeated in the byte counts on each of the drives.

The balance operation should be guaranteed to complete. At least, it
does these days (back in the 2.6.35 days, it didn't always complete).
Having a repeating pattern of byte counts isn't necessarily a sign that
it's stuck in an infinite loop. It was probably just taking a very long
time.

If you use 3.3-rc4, and apply the restriper patches to the userspace
tools, you can use the new restriper code, which adds (amongst many
other things) a progress counter to balances.

> It would probably add more complexity to the code, but adding a check
> for loops like this may be handy. While a 5-drive RAID10 array is a
> weird configuration (I'm waiting for a case with 6 bays), it _should_
> be possible with filesystems like BTRFS.

Indeed it should. I've not tested it yet myself, though.

> In my head, the distribution of data would be uneven across drives,
> but the duplicate and stripe count should be even at the end. I'd
> imagine it to look something like this:
>
> D1: A1 B1 C1 D1
> D2: A1 B1 C1 E1
> D3: A2 B2 D1 E1
> D4: A2 C2 D2 E2
> D5: B2 C2 D2 E2

Yup, that's about right. Except that the empty spaces aren't there, so
it'll look more like this:

D1: A1 B1 C1 D1
D2: A1 B1 C1 E1
D3: A2 B2 D1 E1
D4: A2 C2 D2 E2
D5: B2 C2 D2 E2

> This is obviously oversimplified, but the general idea is the same. I
> haven't looked into the way the "RAID"ing of objects works in BTRFS
> yet,

See the "SysadminGuide" on the wiki[1] for a fuller explanation. I
should probably expand the example to show the case with odd numbers of
drives (and possibly with unbalanced disk sizes too).

> but because it's a filesystem and not a block-based system, it should
> be smart enough to care only about the duplication and striping of
> data, and not the actual block-level or extent-level balancing.

Hugo.

[1] http://btrfs.ipv5.de/index.php?title=SysadminGuide

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- I'd make a joke about UDP, but I don't know if ---
                 anyone's actually listening...
On Mon, Feb 20, 2012 at 8:07 PM, Hugo Mills <hugo@carfax.org.uk> wrote:
>
> However, you can remove any one drive, and your data is fine, which
> is what btrfs's RAID-1 guarantee is. I understand that there will be
> additional features coming along Real Soon Now (possibly at the same
> time that RAID-5 and -6 are integrated) which will allow the
> selection of larger numbers of copies.

Is there a projected timeframe for RAID5/6? I understand it's currently
not the development focus of the BTRFS team, and most organizations
want performance over capacity, making RAID10 the clear choice. But
there are still some situations where RAID6 is better suited (large
pools of archive storage).

Also, do we know whether the RAID5/6 implementation will simply break
data into two data objects and one or two parity objects, or will it
work with an arbitrary number of devices? Meaning, if I have a RAID6
pool of 12 drives, will I get 10 data objects and two parity objects?

Thanks all for your replies!
Tom
On 02/21/2012 08:45 AM, Wes wrote:
> [...]
> I also tried RAID1 with an odd number of drives, expecting to have 2
> redundant mirrors. Instead, the end result was that the blocks were
> still only allocated in pairs, and since they were allocated
> round-robin on the drives, I completely lost the ability to remove
> any single drive from the array without data loss.
> [...]
> meaning removing any 1 drive would result in lost data.

Removing any disk will not lose data, because btrfs ensures that all
the data on the removed disk is safely placed elsewhere first. And if
there is not enough remaining space for the data, the remove operation
will fail. Or what am I missing?

thanks,
liubo
On Mon, Feb 20, 2012 at 08:13:43PM -0500, Tom Cameron wrote:
> Is there a projected timeframe for RAID5/6? I understand it's
> currently not the development focus of the BTRFS team, and most
> organizations want performance over capacity, making RAID10 the clear
> choice. But there are still some situations where RAID6 is better
> suited (large pools of archive storage).

Rumour has it that it's the next major thing after btrfsck is out of
the door. I don't know how accurate that is. I'm just some bloke on the
Internet. :)

> Also, do we know whether the RAID5/6 implementation will simply break
> data into two data objects and one or two parity objects, or will it
> work with an arbitrary number of devices? Meaning, if I have a RAID6
> pool of 12 drives, will I get 10 data objects and two parity objects?

AFAIK, the original implementation looked something like the RAID-0
code, so if you have n drives with space for the next block group,
it'll take all n drives to use for the block group. Parity is then
allocated out of those n (with the distribution of the parity blocks
across different drives, as RAID-5 and -6 should do).

So, allocating a RAID-6 block group of width 1G on your example
12-drive machine, you will indeed end up with 10G of space in that
block group, and 2G of parity data spread across all 12 drives.

I don't know if the code that will be delivered will allow you to set a
smaller fixed-size stripe width (e.g. 4 data + 2 parity over 8 drives).
If the 3-copies RAID-1 code rumour is also true, I would hope so.
Again, I'm just some bloke on the Internet...

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- I'd make a joke about UDP, but I don't know if ---
                 anyone's actually listening...
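As a quick sanity check of the arithmetic above, a hypothetical helper
(not btrfs code; the function name and stripe width are invented for
the example) computing usable data versus parity in a block group that
stripes across every available device:

def raid_block_group(n_drives, stripe_gb=1, parity=2):
    """Usable data and parity in one block group that stripes a
    stripe_gb-wide slice across every drive, RAID-5/6 style."""
    data_gb = (n_drives - parity) * stripe_gb
    parity_gb = parity * stripe_gb
    return data_gb, parity_gb

data, par = raid_block_group(12, stripe_gb=1, parity=2)
print(f"12-drive RAID-6 block group: {data} GB data + {par} GB parity")
# -> 12-drive RAID-6 block group: 10 GB data + 2 GB parity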
On Tue, Feb 21, 2012 at 09:16:40AM +0800, Liu Bo wrote:
> On 02/21/2012 08:45 AM, Wes wrote:
> > meaning removing any 1 drive would result in lost data.
>
> Removing any disk will not lose data, because btrfs ensures that all
> the data on the removed disk is safely placed elsewhere first. And if
> there is not enough remaining space for the data, the remove
> operation will fail. Or what am I missing?

The typo. :) He said he meant "removing any 2 drives" in the follow-up
mail.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- I'd make a joke about UDP, but I don't know if ---
                 anyone's actually listening...
@hugo

IIRC that was on ~3.0.8, but it might have been 3.0.0. I'll revisit the
RAID0 setup on a newer kernel series and test before making any more
claims. :)
On Tue, Feb 21, 2012 at 12:27:56PM +1100, Wes wrote:
> @hugo
>
> IIRC that was on ~3.0.8, but it might have been 3.0.0. I'll revisit
> the RAID0 setup on a newer kernel series and test before making any
> more claims. :)

There's a repeating pattern of three log messages that comes out in
your syslogs. It's something like two "found n extents" messages, and
then "moving block group yyyyyyyyyyyy". As long as you keep getting the
latter message with different numbers, it's still working OK. The block
group numbers are monotonically decreasing (if they go up again,
there's a problem we need to know about), but they aren't necessarily
linearly spaced, particularly if you've done a balance or partial
balance before. i.e. they're an indication that something's happening,
but not of how much more of it there is to go.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- I'd make a joke about UDP, but I don't know if ---
                 anyone's actually listening...
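A small hypothetical sketch of the kind of check described above: scan
the syslog for the block-group messages and warn if the numbers ever go
back up. The exact message wording varies between kernel versions, so
both the regular expression and the log file path below are assumptions
that may need adjusting for your system.

import re
import sys

# Assumed message format -- adjust the pattern to whatever your kernel
# actually prints (e.g. "moving block group 1234" or
# "relocating block group 1234").
BLOCK_GROUP_RE = re.compile(r"(?:moving|relocating) block group (\d+)")

def check_balance_progress(log_lines):
    """Return True if block group numbers only ever decrease."""
    last = None
    for line in log_lines:
        m = BLOCK_GROUP_RE.search(line)
        if not m:
            continue
        bg = int(m.group(1))
        if last is not None and bg > last:
            print(f"block group went up again: {last} -> {bg}", file=sys.stderr)
            return False
        last = bg
    return True

if __name__ == "__main__":
    with open("/var/log/syslog") as f:      # path is an assumption
        print("looks OK" if check_balance_progress(f) else "possible problem")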
Gareth,

I would completely agree. I only use the RAID vernacular here because,
well, it's the unfortunate de facto standard way to talk about data
protection.

I'd go a step beyond saying dupe or dupe + stripe, because future
modifications could conceivably see the addition of multiple duplicated
sets. The case of 4 disks in a BTRFS filesystem with dupe running
across all of them would be a clear extension I could see. So that
would be something like 4D. I'm not really sure what you'd use for the
terminology, but something completely different from RAID-like terms is
almost certainly best. Just look at the ZFS documentation to see how
carefully they have to spell out what RAID-Z, Z2, and Z3 do because
they used the RAID acronym.

On Mon, Feb 20, 2012 at 8:47 PM, Gareth Pye <gareth@cerberos.id.au> wrote:
> On Tue, Feb 21, 2012 at 12:07 PM, Tom Cameron <tomc603@gmail.com> wrote:
>>
>> It seems from the BTRFS documentation that the RAID1 profile is
>> actually "mirror", i.e. store 2 copies of the object. Perhaps when
>> Oracle makes BTRFS a production option they should spell that out
>> more clearly.
>
> I'd really like BTRFS to not use RAID level terminology anywhere
> (other than maybe in parentheses, along the lines of: "this is
> similar to RAIDX") and use less ambiguous options as the recommended
> way to talk about things, as there is good reason to talk about Dup
> and RAID1 differently: they aren't the same on more than 2 drives.
> Doing it that way will make people understand what is going on more
> often, which should be good.
>
> It also makes things much easier to remember. Like how much data can
> you fit on a 6-drive RAID10? I dunno, but I can more intuitively
> answer that same question when it is phrased as just simply 'dup', or
> maybe 'dup + stripe'.
>
> Is there a difference in BTRFS between dup and raid10?
>
> --
> Gareth Pye
> Level 2 Judge, Melbourne, Australia
> Australian MTG Forum: mtgau.com
> gareth@cerberos.id.au - www.rockpaperdynamite.wordpress.com
> "Dear God, I would like to file a bug report"
I'd probably want to use DupeX to refer to what was classically RAID1
(duplicate across all disks), where Dupe is an alias for Dupe2 but one
can also choose Dupe3 through Dupe99.

And I keep forgetting to post to the list in plain text, so many of you
may not have noticed my original email, which only exists on the
mailing list in the quotation in Tom's email.

-- 
Gareth Pye
Level 2 Judge, Melbourne, Australia
Australian MTG Forum: mtgau.com
gareth@cerberos.id.au - www.rockpaperdynamite.wordpress.com
"Dear God, I would like to file a bug report"
On Mon, Feb 20, 2012 at 08:59:05PM -0500, Tom Cameron wrote:
> I'd go a step beyond saying dupe or dupe + stripe, because future
> modifications could conceivably see the addition of multiple
> duplicated sets. [...] I'm not really sure what you'd use for the
> terminology, but something completely different from RAID-like terms
> is almost certainly best. Just look at the ZFS documentation to see
> how carefully they have to spell out what RAID-Z, Z2, and Z3 do
> because they used the RAID acronym.

/me opens a plate to put the can of worms on.

Some time ago, I proposed the following scheme:

<n>C<m>S<p>P

where n is the number of copies (suffixed by C), m is the number of
stripes for that data (suffixed by S), and p is the number of parity
blocks (suffixed by P). Values of zero are omitted.

So btrfs's RAID-1 would be 2C, RAID-0 would be 1CnS, RAID-5 would be
1CnS1P, and RAID-6 would be 1CnS2P. DUP would need a special indicator
to show that it wasn't redundant in the face of a whole-disk failure:
2CN.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
    --- Great oxymorons of the world, no. 4: Future Perfect ---
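A hypothetical helper showing how the proposed notation maps onto the
familiar profiles. The mapping follows the examples given above; the
function itself and its parameter names are just an illustration, not
part of any btrfs tooling.

def replication_label(copies=1, stripes=0, parity=0, same_device=False):
    """Build an <n>C<m>S<p>P label. Pass stripes="n" for "as many
    stripes as there are devices"; same_device=True marks DUP-style
    copies that may live on a single disk."""
    label = f"{copies}C"
    if stripes:
        label += f"{stripes}S"
    if parity:
        label += f"{parity}P"
    if same_device:
        label += "N"
    return label

examples = {
    "RAID-1 (btrfs)": replication_label(copies=2),
    "RAID-0":         replication_label(copies=1, stripes="n"),
    "RAID-5":         replication_label(copies=1, stripes="n", parity=1),
    "RAID-6":         replication_label(copies=1, stripes="n", parity=2),
    "DUP":            replication_label(copies=2, same_device=True),
}
for name, label in examples.items():
    print(f"{name:>15}: {label}")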
On 21 February 2012 at 07:54, Hugo Mills wrote:
> Some time ago, I proposed the following scheme:
>
> <n>C<m>S<p>P
>
> where n is the number of copies (suffixed by C), m is the number of
> stripes for that data (suffixed by S), and p is the number of parity
> blocks (suffixed by P). Values of zero are omitted.
>
> So btrfs's RAID-1 would be 2C, RAID-0 would be 1CnS, RAID-5 would be
> 1CnS1P, and RAID-6 would be 1CnS2P. DUP would need a special
> indicator to show that it wasn't redundant in the face of a
> whole-disk failure: 2CN.

Seems clear. However, is the S really relevant? It would be simpler
without it, wouldn't it?

-- 
Xavier Nicollet
On Wednesday, 22 February 2012 09:56:27 Xavier Nicollet wrote:
> On 21 February 2012 at 07:54, Hugo Mills wrote:
> > Some time ago, I proposed the following scheme:
> > <n>C<m>S<p>P
> > [...]
>
> Seems clear. However, is the S really relevant? It would be simpler
> without it, wouldn't it?

It depends on how striping will be implemented. Generally it provides
information on how many spindles the data is using. With a static
configuration it will be useless, but when you start changing the
number of drives in the set, it's necessary to know whether you're
under- or over-utilising the disks.

-- 
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
On Wed, Feb 22, 2012 at 11:22:08AM +0100, Hubert Kario wrote:
> On Wednesday, 22 February 2012 09:56:27 Xavier Nicollet wrote:
> > Seems clear. However, is the S really relevant? It would be simpler
> > without it, wouldn't it?
>
> It depends on how striping will be implemented. Generally it provides
> information on how many spindles the data is using. With a static
> configuration it will be useless, but when you start changing the
> number of drives in the set, it's necessary to know whether you're
> under- or over-utilising the disks.

Indeed. If the implementation always uses the largest number of devices
possible, then we'll always have nS. If it allows you to set a fixed
number of devices for a stripe, then the n will be a fixed number, and
it becomes useful.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
        --- Happiness is mandatory. Are you happy? ---
Hugo Mills posted on Tue, 21 Feb 2012 01:21:48 +0000 as excerpted:

> On Mon, Feb 20, 2012 at 08:13:43PM -0500, Tom Cameron wrote:
>> Is there a projected timeframe for RAID5/6? I understand it's
>> currently not the development focus of the BTRFS team, and most
>> organizations want performance over capacity, making RAID10 the
>> clear choice. But there are still some situations where RAID6 is
>> better suited (large pools of archive storage).
>
> Rumour has it that it's the next major thing after btrfsck is out of
> the door. I don't know how accurate that is. I'm just some bloke on
> the Internet. :)

The report I read (on Phoronix; YMMV, but it was supposed to be from a
talk at SCALE, IIRC) said RAID-5/6 was planned for kernel 3.4 or 3.5,
with triple-copy mirroring said to piggyback on some of that code, so
presumably 3.5 or 3.6.

Triple-copy mirroring as a special case doesn't really make sense to
me, though. The first implementation as two-copy (dup) only makes
sense, but in generalizing that to allow triple copies, I'd think/hope
they'd generalize it to N-copy, IOW traditional RAID-1 style, instead.
I guess we'll see.

FWIW, I'm running on an older 4-spindle md-raid1 setup now, and I had
_hoped_ to convert that to 4-copy btrfs-raid1, but that's simply not
possible ATM, though a hybrid 2-copy btrfs on top of dual dual-spindle
md/raid1s is possible, if a bit complex. Given that the disks are older
300-gig SATA Seagates nearing half their rated run-hours according to
SMART (great on power and spin-up cycles, though), now's not the time
to switch them to dual-copy only! I'd think about triple-copy, but no
less!

Thus, I'm eagerly awaiting the introduction of tri- or preferably
N-copy raid1 mode, in 3.5-ish. But the various articles had led me to
believe that btrfs was almost ready to have the experimental label
removed, and it turns out not to be quite that far along, maybe
end-of-year if things go well, so letting btrfs continue to stabilize
in general while I wait certainly won't hurt. =:^)

Meanwhile, I'm staying on-list so as to keep informed of what else is
going on, btrfs-wise, while I wait for triple-copy mode, minimum.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman