Don't hear about triple-parity RAID that often:

> Author: Adam Leventhal
> Repository: /hg/onnv/onnv-gate
> Latest revision: 17811c723fb4f9fce50616cb740a92c8f6f97651
> Total changesets: 1
> Log message:
> 6854612 triple-parity RAID-Z

http://mail.opensolaris.org/pipermail/onnv-notify/2009-July/009872.html
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6854612

(Via Blog O' Matty.)

Would be curious to see performance characteristics.
> Don't hear about triple-parity RAID that often:

I agree completely. In fact, I have wondered (probably in these forums) why we don't bite the bullet and make a generic raidzN, where N is any number >= 0.

In fact, get rid of mirroring, because it clearly is a variant of raidz with two devices. Want three-way mirroring? Call that raidz2 with three devices. The truth is that a generic raidzN would roll up everything: striping, mirroring, parity RAID, double parity, etc. into a single format with one parameter. If memory serves, the second parity is calculated using Reed-Solomon, which implies that any number of parity devices is possible.

Let's not stop there, though. Once we have any number of parity devices, why can't I add a parity device to an array? That should be simple enough with a scrub to set the parity. In fact, what is to stop me from removing a parity device? Once again, I think the code would make this rather easy.

Once we can add and remove parity devices at will, it might not be a stretch to convert a parity device to data and vice versa. If you have four data drives and two parity drives but need more space, in a pinch just convert one parity drive to data and get more storage. The flip side would work as well: if I have six data drives and a single parity drive but have, over the years, replaced them all with vastly larger drives and have space to burn, I might want to convert a data drive to parity. I may sleep better at night.

If we had a generic raidzN, the ability to add/remove parity devices, and the ability to convert a data device from/to a parity device, then what happens? Total freedom. Add devices to the array, or take them away. Choose the blend of performance and redundancy that meets YOUR needs, then change it later when the technology and business needs change, all without interruption.

Ok, back to the real world. The one downside to triple parity is that I recall the code discovered the corrupt block by excluding it from the stripe, reconstructing the stripe and comparing that with the checksum. In other words, for a given cost of X to compute a stripe and a number P of corrupt blocks, the cost of reading a stripe is approximately X^P. More corrupt blocks would radically slow down the system. With raidz2, the maximum number of corrupt blocks would be two, putting a cap on how costly the read can be.

Standard disclaimers apply: I could be wrong, I am often wrong, etc.
--
This message posted from opensolaris.org
On Sat, 18 Jul 2009, Martin wrote:

> In fact, get rid of mirroring, because it clearly is a variant of
> raidz with two devices. Want three way mirroring? Call that raidz2

I don't see much similarity between mirroring and raidz other than that they both support redundancy.

> Let's not stop there, though. Once we have any number of parity
> devices, why can't I add a parity device to an array? That should
> be simple enough with a scrub to set the parity. In fact, what is
> to stop me from removing a parity device? Once again, I think the
> code would make this rather easy.

A RAID system with distributed parity (like raidz) does not have a "parity device". Instead, all disks are treated as equal. Without distributed parity you have a bottleneck and it becomes difficult to scale the array to different stripe sizes.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> I don't see much similarity between mirroring and raidz other than
> that they both support redundancy.

A single parity device against a single data device is, in essence, mirroring. For all intents and purposes, raid and mirroring with this configuration are one and the same.

> A RAID system with distributed parity (like raidz) does not have a
> "parity device". Instead, all disks are treated as equal. Without
> distributed parity you have a bottleneck and it becomes difficult to
> scale the array to different stripe sizes.

Agreed. Distributed parity is the way to go. Nonetheless, if I have an array with a single parity, then I still have one device dedicated to parity, even if the actual device which holds the parity information will vary from stripe to stripe.

The point simply was that it might be straightforward to add a device and convert a raidz array into a raidz2 array, which effectively would be adding a parity device. An extension of that is to convert a raidz2 array back into a raidz array and increase its size without adding a device.
--
This message posted from opensolaris.org
On Sun, 19 Jul 2009, Martin wrote:

>> I don't see much similarity between mirroring and raidz other than
>> that they both support redundancy.
>
> A single parity device against a single data device is, in essence,
> mirroring. For all intents and purposes, raid and mirroring with
> this configuration are one and the same.

Try creating a raidz pool with two drives (or files), pull one of the drives, and see what happens. Then try the same with mirroring. Do they behave the same? I expect not ... I am not sure if raidz even allows you to create a pool with just two drives.

> The point simply was that it might be straightforward to add a
> device and convert a raidz array into a raidz2 array, which
> effectively would be adding a parity device. An extension of that
> is to convert a raidz2 array back into a raidz array and increase
> its size without adding a device.

That would be nice. Before developers worry about such exotic features, I would rather that they attend to the gross performance issues so that zfs performs at least as well as Windows NTFS or Linux XFS in all common cases.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
In response to:

>> I don't see much similarity between mirroring and raidz other than
>> that they both support redundancy.

Martin wrote:

> A single parity device against a single data device is, in essence, mirroring.
> For all intents and purposes, raid and mirroring with this configuration are
> one and the same.

I would have to disagree with this. Mirrored data will have multiple copies of the actual data. Any copy is a valid source for data access. Lose one disk and the other is a complete "original". A RAID 3/4/5/6/z/z2 configuration will generate a mathematical value to restore a portion of the lost data from one of the storage units in the stripe. A 2-disk raidz will have 1/2 of each disk's used space holding primary data interlaced with the other 1/2 holding a parity "reflection" of the data. Any time we access the parity representation, some computation will be needed to render the live data. This would have to add *some* overhead to the I/O.

Craig Cory
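To make that distinction concrete, here is a minimal sketch in plain C -- not ZFS source; the stripe width and "sector" size are invented for illustration -- of single-parity reconstruction by XOR. In the degenerate one-data-plus-one-parity case the parity column is a byte-for-byte copy of the data, which is why a two-disk raidz resembles a mirror, yet a degraded read still runs the reconstruction loop below. The second and third parity columns of raidz2/raidz3 use Reed-Solomon-style coefficients rather than plain XOR, so this only models the single-parity case.

    /*
     * Minimal sketch, not ZFS code: single-parity reconstruction by XOR.
     * Stripe width and sector size are made up for illustration.
     */
    #include <stdio.h>
    #include <string.h>

    #define COLS    4       /* 3 data columns + 1 parity column */
    #define SECT    8       /* tiny "sector" for the example */

    int
    main(void)
    {
            unsigned char col[COLS][SECT] = { "dataAAA", "dataBBB", "dataCCC" };
            unsigned char lost[SECT];
            int c, b;

            /* Parity is the XOR of all data columns. */
            for (c = 0; c < COLS - 1; c++)
                    for (b = 0; b < SECT; b++)
                            col[COLS - 1][b] ^= col[c][b];

            /* Simulate losing data column 1, then rebuild it from the rest. */
            memcpy(lost, col[1], SECT);
            memset(col[1], 0, SECT);
            for (c = 0; c < COLS; c++)
                    if (c != 1)
                            for (b = 0; b < SECT; b++)
                                    col[1][b] ^= col[c][b];

            printf("rebuilt column matches original: %s\n",
                memcmp(lost, col[1], SECT) == 0 ? "yes" : "no");
            return (0);
    }

With COLS set to 2 (one data column, one parity column), the parity loop simply copies the data verbatim, which is Martin's point; the XOR work on the degraded read path is Craig's.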
which gap?
--
This message posted from opensolaris.org
> which gap?
>
> 'RAID-Z should mind the gap on writes'?
>
> Message was edited by: thometal

I believe this is in reference to the RAID-5 write hole, described here:

http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5_performance

RAIDZ should avoid this via its copy-on-write model:

http://en.wikipedia.org/wiki/Zfs#Copy-on-write_transactional_model

So I'm not sure what the 'RAID-Z should mind the gap on writes' comment is getting at either. Clarification?

-Scott
--
This message posted from opensolaris.org
http://mail.opensolaris.org/pipermail/onnv-notify/2009-July/009872.html

The second bug; it's the same link as in the first post.
--
This message posted from opensolaris.org
> That would be nice. Before developers worry about such exotic
> features, I would rather that they attend to the gross performance
> issues so that zfs performs at least as well as Windows NTFS or Linux
> XFS in all common cases.

To each their own.

A FS that calculates and writes parity onto disks will have difficulty being as fast as a FS that just dumps data. A FS that verifies the parity of the data it reads will have difficulty being as fast as a FS that just returns whatever it reads. I cannot see how this can happen. That's no reason not to aim for low overhead, but one has to make choices here. Mine is data safety and ease of use, so I'd love the "elastic" zpool idea.

Of course, others will have different needs. Enterprises will not care about ease so much, as they have dedicated professionals to pamper their arrays. They can also address speed issues with more spindles.

ZFS+RAIDZ provides data integrity that no RAID level can match, thanks to its checksumming. That's worth a speed sacrifice in my book.

Anything I missed?
--
This message posted from opensolaris.org
On Mon, 20 Jul 2009, chris wrote:

>> That would be nice. Before developers worry about such exotic
>> features, I would rather that they attend to the gross performance
>> issues so that zfs performs at least as well as Windows NTFS or Linux
>> XFS in all common cases.
>
> To each their own.

I was referring to gripes about performance in another discussion thread, and not due to RAID-Z3. I don't think that adding another parity disk will make much difference to performance. Adding another parity disk has a similar performance impact as making the stripe one disk wider.

MTTDL analysis shows that given normal environmental conditions, the MTTDL of RAID-Z2 is already much longer than the life of the computer or the attendant human. Of course sometimes one encounters unusual conditions where additional redundancy is desired.

I do think that it is worthwhile to be able to add another parity disk to an existing raidz vdev, but I don't know how much work that entails.

Zfs development seems to be overwhelmed with marketing-driven requirements lately and it is time to get back to brass tacks and make sure that the parts already developed are truly enterprise-grade.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
> Enterprises will not care about ease so much as they
> have dedicated professionals to pamper their arrays.

Enterprises can afford the professionals. I work for a fairly large bank which can, and does, afford a dedicated storage team. On the other hand, no enterprise can afford downtime. Where I work, a planned outage is a major event and any solution which allows flexibility without an outage is most welcome. While I am unfamiliar with the innards of VxFS, I have seen several critical production VxFS mount points expanded with little or no interruption.

ZFS is so close on so many levels.
--
This message posted from opensolaris.org
Hey Bob,

> MTTDL analysis shows that given normal environmental conditions, the
> MTTDL of RAID-Z2 is already much longer than the life of the
> computer or the attendant human. Of course sometimes one encounters
> unusual conditions where additional redundancy is desired.

To what analysis are you referring? Today the absolute fastest you can resilver a 1TB drive is about 4 hours. Real-world speeds might be half that. In 2010 we'll have 3TB drives, meaning it may take a full day to resilver. The odds of hitting a latent bit error are already reasonably high, especially with a large pool that's infrequently scrubbed. What then are the odds of a second drive failing in the 24 hours it takes to resilver?

> I do think that it is worthwhile to be able to add another parity
> disk to an existing raidz vdev but I don't know how much work that
> entails.

It entails a bunch of work:

http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

Matt Ahrens is working on a key component after which it should all be possible.

> Zfs development seems to be overwhelmed with marketing-driven
> requirements lately and it is time to get back to brass tacks and
> make sure that the parts already developed are truly enterprise-
> grade.

While I don't disagree that the focus for ZFS should be ensuring enterprise-class reliability and performance, let me assure you that requirements are driven by the market and not by marketing.

Adam

--
Adam Leventhal, Fishworks                        http://blogs.sun.com/ahl
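For what it's worth, those resilver estimates fall out of simple throughput arithmetic. A back-of-the-envelope sketch in C, with the ~70 MB/s and ~35 MB/s figures chosen as assumptions to match the 4-hour and one-day estimates rather than taken from measurements:

    /*
     * Back-of-the-envelope resilver time, assuming the resilver is limited
     * only by sustained drive throughput.  The MB/s figures are assumptions.
     */
    #include <stdio.h>

    static double
    resilver_hours(double capacity_tb, double mb_per_sec)
    {
            return (capacity_tb * 1e12 / (mb_per_sec * 1e6) / 3600.0);
    }

    int
    main(void)
    {
            /* ~70 MB/s sustained gives the "about 4 hours" best case for 1TB. */
            printf("1TB @ 70 MB/s: %.1f hours\n", resilver_hours(1.0, 70.0));
            /* Half that rate on a 3TB drive is roughly a full day. */
            printf("3TB @ 35 MB/s: %.1f hours\n", resilver_hours(3.0, 35.0));
            return (0);
    }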
>> which gap?
>>
>> 'RAID-Z should mind the gap on writes'?
>>
>> Message was edited by: thometal
>
> I believe this is in reference to the raid 5 write hole, described
> here:
> http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5_performance

It's not.

> So I'm not sure what the 'RAID-Z should mind the gap on writes'
> comment is getting at either.
>
> Clarification?

I'm planning to write a blog post describing this, but the basic problem is that RAID-Z, by virtue of supporting variable stripe writes (the insight that allows us to avoid the RAID-5 write hole), must round the number of sectors up to a multiple of nparity+1. This means that we may have sectors that are effectively skipped. ZFS generally lays down data in large contiguous streams, but these skipped sectors can stymie both ZFS's write aggregation as well as the hard drive's ability to group I/Os and write them quickly.

Jeff Bonwick added some code to mind these gaps on reads. The key insight there is that if we're going to read 64K, say, with a 512 byte hole in the middle, we might as well do one big read rather than two smaller reads and just throw out the data that we don't care about.

Of course, doing this for writes is a bit trickier since we can't just blithely write over gaps as those might contain live data on the disk. To solve this we push the knowledge of those skipped sectors down to the I/O aggregation layer in the form of 'optional' I/Os purely for the purpose of coalescing writes into larger chunks.

I hope that's clear; if it's not, stay tuned for the aforementioned blog post.

Adam

--
Adam Leventhal, Fishworks                        http://blogs.sun.com/ahl
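A rough illustration of the rounding described above, as arithmetic only -- this is not the ZFS allocation code, and it assumes the block fits in a single row so the parity cost is simply nparity sectors:

    /*
     * Sketch of RAID-Z sector rounding: the allocation (data + parity) is
     * rounded up to a multiple of nparity + 1, and the difference appears
     * as skipped "gap" sectors.  Assumes a single-row write for simplicity.
     */
    #include <stdio.h>

    int
    main(void)
    {
            int nparity = 1;                        /* raidz1 */
            int sizes[] = { 1, 2, 3, 5, 8 };        /* data sectors in a block */
            int i;

            for (i = 0; i < 5; i++) {
                    int total = sizes[i] + nparity;
                    int unit = nparity + 1;
                    int rounded = ((total + unit - 1) / unit) * unit;

                    printf("%d data sectors: allocate %d, skip %d\n",
                        sizes[i], rounded, rounded - total);
            }
            return (0);
    }

For raidz1 this pads every odd-sized allocation by one sector; those padding sectors are the gaps that the 'optional' I/Os let the aggregation layer write over.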
> Don't hear about triple-parity RAID that often:
>
>> Author: Adam Leventhal
>> Repository: /hg/onnv/onnv-gate
>> Latest revision: 17811c723fb4f9fce50616cb740a92c8f6f97651
>> Total changesets: 1
>> Log message:
>> 6854612 triple-parity RAID-Z
>
> http://mail.opensolaris.org/pipermail/onnv-notify/2009-July/009872.html
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6854612
>
> (Via Blog O' Matty.)
>
> Would be curious to see performance characteristics.

I just blogged about triple-parity RAID-Z (raidz3):

http://blogs.sun.com/ahl/entry/triple_parity_raid_z

As for performance, on the system I was using (a max config Sun Storage 7410), I saw about a 25% improvement to 1GB/s for a streaming write workload. YMMV, but I'd be interested in hearing your results.

Adam

--
Adam Leventhal, Fishworks                        http://blogs.sun.com/ahl
>> Don't hear about triple-parity RAID that often:
>
> I agree completely. In fact, I have wondered (probably in these
> forums), why we don't bite the bullet and make a generic raidzN,
> where N is any number >= 0.

I agree, but raidzN isn't simple to implement and it's potentially difficult to get it to perform well. That said, it's something I intend to bring to ZFS in the next year or so.

> If memory serves, the second parity is calculated using Reed-Solomon
> which implies that any number of parity devices is possible.

True; it's a degenerate case.

> In fact, get rid of mirroring, because it clearly is a variant of
> raidz with two devices. Want three way mirroring? Call that raidz2
> with three devices. The truth is that a generic raidzN would roll
> up everything: striping, mirroring, parity raid, double parity, etc.
> into a single format with one parameter.

That's an interesting thought, but there are some advantages to calling out mirroring, for example, as its own vdev type. As has been pointed out, reading from either side of a mirror involves no computation, whereas reading from a RAID-Z 1+2, for example, would involve more computation. This would complicate the calculus of balancing read operations over the mirror devices.

> Let's not stop there, though. Once we have any number of parity
> devices, why can't I add a parity device to an array? That should
> be simple enough with a scrub to set the parity. In fact, what is
> to stop me from removing a parity device? Once again, I think the
> code would make this rather easy.

With RAID-Z, stripes can be of variable width, meaning that, say, a single row in a 4+2 configuration might have two stripes of 1+2. In other words, there might not be enough space in the new parity device. I did write up the steps that would be needed to support RAID-Z expansion; you can find it here:

http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

> Ok, back to the real world. The one downside to triple parity is
> that I recall the code discovered the corrupt block by excluding it
> from the stripe, reconstructing the stripe and comparing that with
> the checksum. In other words, for a given cost of X to compute a
> stripe and a number P of corrupt blocks, the cost of reading a
> stripe is approximately X^P. More corrupt blocks would radically
> slow down the system. With raidz2, the maximum number of corrupt
> blocks would be two, putting a cap on how costly the read can be.

Computing the additional parity of triple-parity RAID-Z is slightly more expensive, but not much -- it's just bitwise operations. Recovering from a read failure is identical (and performs identically) to raidz1 or raidz2 until you actually have sustained three failures. In that case, performance is slower as more computation is involved -- but aren't you just happy to get your data back?

If there is silent data corruption, then and only then can you encounter the O(n^3) algorithm that you alluded to, but only as a last resort. If we don't know which drives failed, we try to reconstruct your data by assuming that one drive, then two drives, then three drives are returning bad data. For raidz1, this was a linear operation; raidz2, quadratic; now raidz3 is N-cubed. There's really no way around it. Fortunately, with proper scrubbing, encountering data corruption in one stripe on three different drives is highly unlikely.

Adam

--
Adam Leventhal, Fishworks                        http://blogs.sun.com/ahl
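To put rough numbers on the linear/quadratic/cubic growth described above, here is a small sketch that counts how many candidate drive combinations a combinatorial reconstruction might have to test after silent corruption. The stripe width of 12 is an assumption for illustration; the real reconstruction logic lives in ZFS's vdev_raidz.c.

    /*
     * Sketch only: count the combinations of suspect drives that
     * combinatorial reconstruction may need to test when no drive reports
     * an error.  For nparity = 1, 2, 3 this grows roughly as n, n^2, n^3.
     */
    #include <stdio.h>

    static unsigned long
    choose(unsigned n, unsigned k)
    {
            unsigned long r = 1;
            unsigned i;

            for (i = 1; i <= k; i++)
                    r = r * (n - k + i) / i;        /* exact at each step */
            return (r);
    }

    int
    main(void)
    {
            unsigned width = 12;    /* assumed number of children in the vdev */
            unsigned long total = 0;
            unsigned p;

            for (p = 1; p <= 3; p++) {
                    total += choose(width, p);
                    printf("raidz%u: up to %lu combinations to test\n",
                        p, total);
            }
            return (0);
    }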
On 22.07.09 10:45, Adam Leventhal wrote:

>>> which gap?
>>>
>>> 'RAID-Z should mind the gap on writes'?
>>>
>>> Message was edited by: thometal
>>
>> I believe this is in reference to the raid 5 write hole, described here:
>> http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5_performance
>
> It's not.
>
>> So I'm not sure what the 'RAID-Z should mind the gap on writes'
>> comment is getting at either.
>>
>> Clarification?
>
> I'm planning to write a blog post describing this, but the basic problem
> is that RAID-Z, by virtue of supporting variable stripe writes (the
> insight that allows us to avoid the RAID-5 write hole), must round the
> number of sectors up to a multiple of nparity+1. This means that we may
> have sectors that are effectively skipped. ZFS generally lays down data
> in large contiguous streams, but these skipped sectors can stymie both
> ZFS's write aggregation as well as the hard drive's ability to group
> I/Os and write them quickly.
>
> Jeff Bonwick added some code to mind these gaps on reads. The key
> insight there is that if we're going to read 64K, say, with a 512 byte
> hole in the middle, we might as well do one big read rather than two
> smaller reads and just throw out the data that we don't care about.
>
> Of course, doing this for writes is a bit trickier since we can't just
> blithely write over gaps as those might contain live data on the disk.
> To solve this we push the knowledge of those skipped sectors down to the
> I/O aggregation layer in the form of 'optional' I/Os purely for the
> purpose of coalescing writes into larger chunks.

This exact issue was discussed here almost three years ago:

http://www.opensolaris.org/jive/thread.jspa?messageID=60241
Adam Leventhal wrote:

> Hey Bob,
>
>> MTTDL analysis shows that given normal environmental conditions, the
>> MTTDL of RAID-Z2 is already much longer than the life of the computer
>> or the attendant human. Of course sometimes one encounters unusual
>> conditions where additional redundancy is desired.
>
> To what analysis are you referring? Today the absolute fastest you can
> resilver a 1TB drive is about 4 hours. Real-world speeds might be half
> that. In 2010 we'll have 3TB drives meaning it may take a full day to
> resilver. The odds of hitting a latent bit error are already
> reasonably high, especially with a large pool that's infrequently
> scrubbed. What then are the odds of a second drive failing in
> the 24 hours it takes to resilver?

I wish it was so good with raid-zN. In real life, at least from my experience, it can take several days to resilver a disk for vdevs in raid-z2 made of 11x SATA disk drives with real data. While the way zfs synchronizes data is way faster under some circumstances, it is also much slower under others. IIRC some builds ago there were some fixes integrated, so maybe it is different now.

>> I do think that it is worthwhile to be able to add another parity
>> disk to an existing raidz vdev but I don't know how much work that
>> entails.
>
> It entails a bunch of work:
>
> http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z
>
> Matt Ahrens is working on a key component after which it should all be
> possible.

A lot of people are waiting for it! :) :) :)

ps. thank you for raid-z3!

--
Robert Milkowski
http://milek.blogspot.com
Adam Leventhal wrote:

> I just blogged about triple-parity RAID-Z (raidz3):
>
> http://blogs.sun.com/ahl/entry/triple_parity_raid_z
>
> As for performance, on the system I was using (a max config Sun Storage
> 7410), I saw about a 25% improvement to 1GB/s for a streaming write
> workload. YMMV, but I'd be interested in hearing your results.

A 25% improvement when comparing what exactly to what?

--
Robert Milkowski
http://milek.blogspot.com
Robert,

On Fri, Jul 24, 2009 at 12:59:01AM +0100, Robert Milkowski wrote:

>> To what analysis are you referring? Today the absolute fastest you can
>> resilver a 1TB drive is about 4 hours. Real-world speeds might be half
>> that. In 2010 we'll have 3TB drives meaning it may take a full day to
>> resilver. The odds of hitting a latent bit error are already reasonably
>> high, especially with a large pool that's infrequently scrubbed.
>> What then are the odds of a second drive failing in the 24 hours it
>> takes to resilver?
>
> I wish it was so good with raid-zN.
> In real life, at least from my experience, it can take several days to
> resilver a disk for vdevs in raid-z2 made of 11x SATA disk drives with
> real data.
> While the way zfs synchronizes data is way faster under some
> circumstances, it is also much slower under others.
> IIRC some builds ago there were some fixes integrated, so maybe it is
> different now.

Absolutely. I was talking more or less about optimal timing. I realize that due to the priorities within ZFS and real-world loads it can take far longer.

Adam

--
Adam Leventhal, Fishworks                        http://blogs.sun.com/ahl
Interesting, so the more drive failures you have, the slower the array gets? Would I be right in assuming that the slowdown only lasts up to the point where FMA / ZFS marks the drive as faulted?
--
This message posted from opensolaris.org
> With RAID-Z stripes can be of variable width meaning that, say, a
> single row in a 4+2 configuration might have two stripes of 1+2.
> In other words, there might not be enough space in the new parity
> device.

Wow -- I totally missed that scenario. Excellent point.

> I did write up the steps that would be needed to support RAID-Z
> expansion

Good write-up. If I understand it, the basic approach is to add the device to each row and leave the unusable fragments there. New stripes will take advantage of the wider row but old stripes will not.

It would seem that the mythical bp_rewrite() that I see mentioned here and there could relocate a stripe to another set of rows without altering the transaction_id (or whatever it's called), critical for tracking snapshots. I suspect this function would allow background defrag/coalesce (a needed feature IMHO) and deduplication. With background defrag, the extra space on existing stripes would not immediately be usable, but would appear over time.

Many thanks for the insight and thoughts. Bluntly, how can I help? I have cut a lifetime of C code in a past life.

Cheers,
Marty
--
This message posted from opensolaris.org