Jason Ozolins
2005-Dec-06 05:39 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
To the Sun ZFS folks:

I was just thinking: During the ZFS design phase, was a 2-D block address format ever considered to allow expansion of RAID-Z vdevs after their initial creation? I know that RAID-Z vdev components don't need to be zero'd before use, thanks to ZFS's nifty full-stripe writes, but if you allocated a few bits from that hefty 128-bit device virtual address for a component number, then after the initial creation of a RAID-Z vdev you could add a new component to the vdev by zeroing it. As it stands, though, it sounds like the DVA geometry is set in stone by the number of components at the time the RAID-Z vdev is created, which doesn't allow this expansion.

ISTR reading that Network Appliance filers allow this sort of expansion. Probably means it's patented... :-(

-Jason
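[A rough sketch of the idea Jason is describing, purely illustrative and not the actual ZFS DVA format: reserve a few bits of a block address for a component index, with the remaining bits addressing an offset within that component. The 64-bit width, the 6-bit component field, and the names are assumptions made for brevity.]

/*
 * Hypothetical "2-D address": (component, offset) packed into one word.
 * Bit widths and names are illustrative only -- not the ZFS DVA layout.
 */
#include <stdint.h>
#include <stdio.h>

#define COMPONENT_BITS  6                       /* up to 64 components */
#define OFFSET_BITS     (64 - COMPONENT_BITS)
#define OFFSET_MASK     ((UINT64_C(1) << OFFSET_BITS) - 1)

static uint64_t
dva_make(unsigned component, uint64_t offset)
{
        return (((uint64_t)component << OFFSET_BITS) | (offset & OFFSET_MASK));
}

static unsigned
dva_component(uint64_t dva)
{
        return ((unsigned)(dva >> OFFSET_BITS));
}

static uint64_t
dva_offset(uint64_t dva)
{
        return (dva & OFFSET_MASK);
}

int
main(void)
{
        /* Address block offset 12345 on the 4th component of a vdev. */
        uint64_t dva = dva_make(3, 12345);

        printf("component=%u offset=%llu\n",
            dva_component(dva), (unsigned long long)dva_offset(dva));
        return (0);
}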
Jeff Bonwick
2005-Dec-06 06:28 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
> I was just thinking: During the ZFS design phase, was a 2-D block
> address format ever considered to allow expansion of RAID-Z vdevs after
> their initial creation?

Yes!

You're correct that each RAID-Z group is a fixed number of devices. If you have a 5-disk RAID-Z group, you can add *another* 5-disk RAID-Z group (or 4-disk, or whatever size you like); but you can't add a single disk and convert an existing 5-disk RAID-Z into a 6-disk RAID-Z.

The solution you suggested is one option. Another option we've considered is time-dependent geometry. That is, suppose we allowed a RAID-Z config change from 5 to 6 disks as part of transaction group 37. Then all blocks born in txg 37 or before would remain striped across the original 5 disks, and all blocks born in txg 38 or later would be striped across all 6 disks. We could determine how each block was allocated based solely on its birth time and the (presumably tiny) history of RAID-Z config changes.

That said, as tempting as this is on the basis of technical coolness, I question whether the added complexity is really worth it. As the old saying goes, what you leave out is as important as what you put in.

Jeff
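[A minimal sketch of the time-dependent geometry lookup described above, using invented data structures -- this is not ZFS code. The idea: keep a small history of (first txg, width) entries and resolve each block's stripe width from its birth txg.]

/*
 * Toy illustration of time-dependent RAID-Z geometry.  history[] must be
 * sorted by first_txg ascending; history[0] is the geometry at creation.
 */
#include <stddef.h>
#include <stdint.h>

typedef struct raidz_geom_change {
        uint64_t first_txg;     /* first txg that uses this geometry */
        int      ndisks;        /* stripe width from that txg onward */
} raidz_geom_change_t;

static int
raidz_width_for_block(const raidz_geom_change_t *history, size_t nentries,
    uint64_t birth_txg)
{
        size_t i;

        /* Scan backwards for the newest change at or before birth_txg. */
        for (i = nentries; i > 0; i--) {
                if (birth_txg >= history[i - 1].first_txg)
                        return (history[i - 1].ndisks);
        }
        return (history[0].ndisks);
}

/* Example matching the post: vdev widened from 5 to 6 disks at txg 38. */
static const raidz_geom_change_t example_history[] = {
        { 0,  5 },
        { 38, 6 },
};

int
example(void)
{
        int w_old = raidz_width_for_block(example_history, 2, 37);  /* 5 */
        int w_new = raidz_width_for_block(example_history, 2, 40);  /* 6 */

        return (w_old + w_new);
}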
Al Hopper
2005-Dec-06 14:36 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
On Mon, 5 Dec 2005, Jeff Bonwick wrote:

> > I was just thinking: During the ZFS design phase, was a 2-D block
> > address format ever considered to allow expansion of RAID-Z vdevs after
> > their initial creation?
>
> Yes!
>
> You're correct that each RAID-Z group is a fixed number of devices.
> If you have a 5-disk RAID-Z group, you can add *another* 5-disk RAID-Z
> group (or 4-disk, or whatever size you like); but you can't add a
> single disk and convert an existing 5-disk RAID-Z into a 6-disk RAID-Z.
>
> The solution you suggested is one option. Another option we've
> considered is time-dependent geometry. That is, suppose we allowed a
> RAID-Z config change from 5 to 6 disks as part of transaction group 37.
> Then all blocks born in txg 37 or before would remain striped across
> the original 5 disks, and all blocks born in txg 38 or later would be
> striped across all 6 disks. We could determine how each block was
> allocated based solely on its birth time and the (presumably tiny)
> history of RAID-Z config changes.
>
> That said, as tempting as this is on the basis of technical coolness,
> I question whether the added complexity is really worth it. As the old
> saying goes, what you leave out is as important as what you put in.

Having seen a number of posts suggesting/requesting feature adds and feature enhancement, I'd rather see the current implementation/feature-set be solidified and the code performance/efficiency increased.

The danger of adding more features is, IMHO, that it'll take more host CPU cycles per block of data read/written to implement them. So when I say code efficiency, I'm really saying to minimize the number of CPU cycles it takes to read/write data and do real (world) disk I/O.

Just my $0.02 - I know others will have different priorities.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
Bill Sommerfeld
2005-Dec-06 14:50 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
On Tue, 2005-12-06 at 16:39 +1100, Jason Ozolins wrote:

> ISTR reading that Network Appliance filers allow this sort of expansion.

actually, Solaris volume manager's raid-5 allows for this as well; you can metattach new columns to a raid metadevice.

that said, it would seem that adding a single extra column to a raid-z group which was mostly full wouldn't immediately add significant usable space -- most of the new blocks would be opposite allocated blocks and wouldn't be useable until churn had freed up the rest of the row.

- Bill
Richard Elling
2005-Dec-06 18:01 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
Al Hopper wrote:

> Having seen a number of posts suggesting/requesting feature adds and
> feature enhancement, I'd rather see the current implementation/feature-set
> be solidified and the code performance/efficiency increased.

Agree.

> The danger of adding more features is, IMHO, that it'll take more host CPU
> cycles per block of data read/written, to implement them. So when I say
> code efficiency, I'm really saying to minimize the number of CPU cycles it
> takes to read/write data and do real (world) disk I/O.

Not sure I agree. More likely I disagree. Data integrity is the paramount requirement. Any code which enhances data integrity should not be subject to removal in order to enhance performance. If you just want performance, then there are dozens and dozens of other technologies you can use which prefer performance over data integrity.

I've been looking at a lot of the source lately, and am pleasantly surprised at the clarity and attention to data integrity. Comparatively, UFS is a flaming PODS.

Further, I'll predict that improving performance will require more code, and CPU cycles, rather than less. The good news is that CPU cycles are approaching free at a rapid rate.
http://www.sun.com/processors/UltraSPARC-T1/index.xml

> Just my $0.02 - I know others will have different priorities.

... and requirements :-)
 -- richard
Jason Ozolins
2005-Dec-06 22:52 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
Jeff Bonwick wrote:

>> I was just thinking: During the ZFS design phase, was a 2-D block
>> address format ever considered to allow expansion of RAID-Z vdevs after
>> their initial creation?

[...]

> The solution you suggested is one option. Another option we've
> considered is time-dependent geometry. That is, suppose we allowed a
> RAID-Z config change from 5 to 6 disks as part of transaction group 37.
> Then all blocks born in txg 37 or before would remain striped across
> the original 5 disks, and all blocks born in txg 38 or later would be
> striped across all 6 disks. We could determine how each block was
> allocated based solely on its birth time and the (presumably tiny)
> history of RAID-Z config changes.
>
> That said, as tempting as this is on the basis of technical coolness,
> I question whether the added complexity is really worth it. As the old
> saying goes, what you leave out is as important as what you put in.

It wasn't a feature request (relax, Al), rather a question about the design history.

As for whether it's worth the hassle: I'd say that being able to migrate data off a vdev without unmounting the datasets on that pool would remove most of the need to reconfigure RAID-Z vdevs on the fly. I think this was already mentioned by Sun folks as a desired enhancement to ZFS. Home users would probably like the ability to slap another drive on the side of a RAID-Z vdev, because it means they only have to buy one drive at a time to expand... but they're not a big revenue source. ;-)

As for cross-fs mv and special case file concatenation APIs and other weird stuff of marginal appeal, my favourite feature that looks like it will never get implemented on UNIX would be a decent API for mapping out allocated data extents in sparse files. Even Windows NT has one that doesn't suck. If you ever try to use user-level backup tools like rsync to backup 64-bit sparse files, you'll spend an awfully long time reading zeros to work out where the holes are. "zfs backup" won't help you here if the destination file system isn't ZFS...

-Jason  =:^)
Neil Perrin
2005-Dec-07 00:04 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
> As for cross-fs mv and special case file concatenation APIs and other
> weird stuff of marginal appeal, my favourite feature that looks like it
> will never get implemented on UNIX would be a decent API for mapping out
> allocated data extents in sparse files. Even Windows NT has one that
> doesn't suck. If you ever try to use user-level backup tools like rsync
> to backup 64-bit sparse files, you'll spend an awfully long time reading
> zeros to work out where the holes are. "zfs backup" won't help you here
> if the destination file system isn't ZFS...

Take a look at lseek(2). It was enhanced (8 months ago) to seek to the next hole (SEEK_HOLE) or data (SEEK_DATA). This allows a simple program to be written to find/copy/backup/etc. a sparse file. Both UFS and ZFS support this, although ZFS is more efficient. We ought to make cp use it, though.

Neil
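[A minimal sketch of the kind of simple program Neil mentions, assuming a platform whose headers define SEEK_HOLE and SEEK_DATA. It walks a sparse file and prints the byte ranges that contain data, never reading the holes; error handling is kept to the bare minimum.]

#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        int fd;
        off_t data, hole;

        if (argc != 2) {
                (void) fprintf(stderr, "usage: %s <sparse-file>\n", argv[0]);
                return (1);
        }
        if ((fd = open(argv[1], O_RDONLY)) == -1) {
                perror("open");
                return (1);
        }

        data = 0;
        for (;;) {
                /* Find the start of the next data region... */
                data = lseek(fd, data, SEEK_DATA);
                if (data == -1)
                        break;          /* ENXIO: no more data */
                /* ...and the hole that terminates it. */
                hole = lseek(fd, data, SEEK_HOLE);
                if (hole == -1)
                        break;
                (void) printf("data: %lld .. %lld\n",
                    (long long)data, (long long)hole);
                data = hole;
        }
        if (errno != ENXIO)
                perror("lseek");

        (void) close(fd);
        return (0);
}

(The loop relies on the guarantee, quoted in the next message, that every data region is followed by a hole, virtual or otherwise.)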
Jeff Bonwick
2005-Dec-07 00:19 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
> As for cross-fs mv and special case file concatenation APIs and other
> weird stuff of marginal appeal, my favourite feature that looks like it
> will never get implemented on UNIX would be a decent API for mapping out
> allocated data extents in sparse files.

Hah! You're in luck! As part of the ZFS project, we introduced two general extensions to lseek(): SEEK_HOLE and SEEK_DATA. These allow you to quickly discover the non-zero regions of holey files. Quoting from the updated man page:

     o  If whence is SEEK_HOLE, the offset of the start of the next
        hole greater than or equal to the supplied offset is returned.
        The definition of a hole is provided near the end of the
        DESCRIPTION.

     o  If whence is SEEK_DATA, the file pointer is set to the start
        of the next non-hole file region greater than or equal to the
        supplied offset.

     [...]

     A "hole" is defined as a contiguous range of bytes in a file, all
     having the value of zero, but not all zeros in a file are
     guaranteed to be represented as holes returned with SEEK_HOLE.
     Filesystems are allowed to expose ranges of zeros with SEEK_HOLE,
     but not required to. Applications can use SEEK_HOLE to optimise
     their behavior for ranges of zeros, but must not depend on it to
     find all such ranges in a file. The existence of a hole at the
     end of every data region allows for easy programming and implies
     that a virtual hole exists at the end of the file.

     Applications should use fpathconf(_PC_MIN_HOLE_SIZE) or
     pathconf(_PC_MIN_HOLE_SIZE) to determine if a filesystem supports
     SEEK_HOLE. See fpathconf(2).

     For filesystems that do not supply information about holes, the
     file will be represented as one entire data region.

Any filesystem can support SEEK_HOLE / SEEK_DATA. Even a filesystem like UFS, which has no special support for sparseness, can walk its block pointers a lot faster than it can copy out a bunch of zeroes. For ZFS, we made sparse file navigation a first-class operation -- no linear searches. Each block pointer contains a 'fill count' describing the number of blocks beneath it in the tree. This allows ZFS to find holes and non-holes (real data) in logarithmic time. I'll say more about the implementation in an upcoming blog entry.

Jeff
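[To make the 'fill count' idea concrete, here is a toy tree walk -- emphatically not the real ZFS code, whose on-disk structures differ -- showing how recording the number of allocated blocks beneath each pointer lets a search skip all-hole subtrees and find the next data block with a single root-to-leaf descent.]

#include <stddef.h>
#include <stdint.h>

#define FANOUT  128                     /* children per indirect block */

typedef struct blkptr {
        uint64_t        fill;           /* allocated blocks below here */
        struct blkptr   *child[FANOUT]; /* NULL below the leaf level */
        uint64_t        first_blkid;    /* first block id covered */
        uint64_t        span;           /* number of block ids covered */
} blkptr_t;

/*
 * Return the id of the first allocated block >= blkid, or UINT64_MAX if
 * the remainder of this subtree is entirely hole.
 */
static uint64_t
next_data_block(const blkptr_t *bp, uint64_t blkid)
{
        int i;

        if (bp == NULL || bp->fill == 0)
                return (UINT64_MAX);            /* all hole: skip it */

        if (bp->span == 1)                      /* leaf: one data block */
                return (bp->first_blkid >= blkid ?
                    bp->first_blkid : UINT64_MAX);

        for (i = 0; i < FANOUT; i++) {
                const blkptr_t *c = bp->child[i];
                uint64_t r;

                if (c == NULL || c->first_blkid + c->span <= blkid)
                        continue;               /* entirely before blkid */
                r = next_data_block(c, blkid);
                if (r != UINT64_MAX)
                        return (r);
        }
        return (UINT64_MAX);
}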
Al Hopper
2005-Dec-07 03:58 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
On Tue, 6 Dec 2005, Richard Elling wrote:

> Al Hopper wrote:
> > Having seen a number of posts suggesting/requesting feature adds and
> > feature enhancement, I'd rather see the current implementation/feature-set
> > be solidified and the code performance/efficiency increased.
>
> Agree.
>
> > The danger of adding more features is, IMHO, that it'll take more host CPU
> > cycles per block of data read/written, to implement them. So when I say
> > code efficiency, I'm really saying to minimize the number of CPU cycles it
> > takes to read/write data and do real (world) disk I/O.

.... reformatted/hacked up/but not modified ...

> Not sure I agree. More likely I disagree.
> Data integrity is the paramount requirement.

Agreed.

> Any code which enhances data integrity should not be subject
> to removal in order to enhance performance.

Agreed. My position is that the current implementation provides such a step forward, in terms of data integrity, over existing file system implementations, that enhancing it with additional ECC should wait until some time in the future. IOW - I'd prefer to see the current implementation optimized, before adding additional features like ECC that would only increase the CPU cycles per block of I/O.

> If you just want performance, then there are dozens and dozens of other
> technologies you can use which prefer performance over data integrity.

Agreed. In this respect, ZFS is revolutionary, not evolutionary, and a testament to the ZFS team's ability to think outside the box.

> I've been looking at a lot of the source lately, and am pleasantly
> surprised at the clarity and attention to data integrity.
> Comparatively, UFS is a flaming PODS.

Agreed.

> Further, I'll predict that improving performance will require more code,
> and CPU cycles, rather than less. The good news is that CPU cycles are
> approaching free at a rapid rate.
> http://www.sun.com/processors/UltraSPARC-T1/index.xml

Agreed 100%. I often draw a pseudo realistic graph on a white board that tries to show 3 curves over time. a) CPU "horsepower" (which reflects CPU/memory latency), b) Avg system memory size and c) system I/O performance. Obviously (preaching to the converted) the CPU & memory size curves climb dramatically while (disk) I/O rates climb at a dramatically lower rate. In fact, the main challenge in building high performance computer systems is dealing with the disparity between the available CPU cycles (with attendant large system memory size) and I/O rates, both in terms of I/O Operations per Second and overall (disk) I/O bandwidth. Having recognized that trend years ago, I shifted most of our development towards Java - to "ride" the CPU & memory size curves!

> > Just my $0.02 - I know others will have different priorities.
>
> ... and requirements :-)

:)

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
Richard Elling
2005-Dec-07 05:05 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
Al Hopper wrote:

> On Tue, 6 Dec 2005, Richard Elling wrote:
> > Any code which enhances data integrity should not be subject
> > to removal in order to enhance performance.
>
> Agreed. My position is that the current implementation provides such a
> step forward, in terms of data integrity, over existing file system
> implementations, that enhancing it with additional ECC should wait until
> some time in the future. IOW - I'd prefer to see the current
> implementation optimized, before adding additional features like ECC that
> would only increase the CPU cycles per block of I/O.

Yes! At this time I would say that we really don't know what the real rates of faults are along the path, since until now, we really didn't have a way of detecting them in a reasonable amount of time. Until we know the nature of the faults we expect to see, we shouldn't go off half-cocked and shoot the wrong target. Or, to put it another way, I'm very interested in finding real failure rates detected by ZFS users. Let me know if you see any blips.
 -- richard
Jeff Bonwick
2005-Dec-07 05:23 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
> Agreed 100%. I often draw a pseudo realistic graph on a white board that
> tries to show 3 curves over time. a) CPU "horsepower" (which reflects
> CPU/memory latency), b) Avg system memory size and c) system I/O
> performance. Obviously (preaching to the converted) the CPU & memory size
> curves climb dramatically while (disk) I/O rates climb at a dramatically
> lower rate. In fact, the main challenge in building high performance
> computer systems is dealing with the disparity between the available CPU
> cycles (with attendant large system memory size) and I/O rates, both in
> terms of I/O Operations per Second and overall (disk) I/O bandwidth.

Exactly! Few people fully appreciate this. Over the past 15 years, disks have gotten about 10x faster, memory about 100x faster, and CPUs 1000x faster. As you point out, it is the change in *relative* performance, not absolute performance, that moves the design center.

Jeff
Nathan Kroenert
2005-Dec-07 05:27 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
Silly question time:

What sort of statistics do we keep on these flips / detected corruptions?
 - Per zpool
 - per device
 - per lun
 etc.

Is it something we'll be able to collect from an explorer at some time, for example?

Thanks! :)

Nathan.

Richard Elling wrote:

> Al Hopper wrote:
>
>> On Tue, 6 Dec 2005, Richard Elling wrote:
>>
>>> Any code which enhances data integrity should not be subject
>>> to removal in order to enhance performance.
>>
>> Agreed. My position is that the current implementation provides such a
>> step forward, in terms of data integrity, over existing file system
>> implementations, that enhancing it with additional ECC should wait until
>> some time in the future. IOW - I'd prefer to see the current
>> implementation optimized, before adding additional features like ECC that
>> would only increase the CPU cycles per block of I/O.
>
> Yes! At this time I would say that we really don't know what the real
> rates of faults are along the path, since until now, we really didn't have
> a way of detecting them in a reasonable amount of time. Until we know
> the nature of the faults we expect to see, we shouldn't go off half-cocked
> and shoot the wrong target. Or, to put it another way, I'm very interested
> in finding real failure rates detected by ZFS users. Let me know if you
> see any blips.
> -- richard
Jeff Bonwick
2005-Dec-07 05:29 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
> Yes! At this time I would say that we really don't know what the real
> rates of faults are along the path, since until now, we really didn't have
> a way of detecting them in a reasonable amount of time. Until we know
> the nature of the faults we expect to see, we shouldn't go off half-cocked
> and shoot the wrong target. Or, to put it another way, I'm very interested
> in finding real failure rates detected by ZFS users. Let me know if you
> see any blips.

Well said. If we're not careful, we may learn something.

Jeff
Jeff Bonwick
2005-Dec-07 05:44 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
> Silly question time:
>
> What sort of statistics do we keep on these flips / detected corruptions?
>  - Per zpool
>  - per device
>  - per lun
>  etc.
>
> Is it something we'll be able to collect from an explorer at some time,
> for example?

On a live system, running 'zpool status' will tell you what errors we've seen. If you're using mdb on a crash dump, "::spa -ve" will give you the same information.

At present these are just bean counters; over the next few months we're going to add detailed FMA telemetry, and also maintain both an error log and a zpool/zfs command log on persistent storage.

Jeff
Darren Dunham
2005-Dec-09 01:51 UTC
[zfs-discuss] Re: were 2-D block addresses ever considered to allow RAID-Z expansion?
> > ISTR reading that Network Appliance filers allow this sort of expansion.
>
> actually, Solaris volume manager's raid-5 allows for this as well; you
> can metattach new columns to a raid metadevice.

With some significant differences. Since the filesystem can't recognize the underlying storage, the original blocks remain fixed and data on the new column is not really striped anywhere. It's just like tacking a concat device onto the end of the RAID-5 (unless I've completely misunderstood how it works).

As opposed to something like a VxVM relayout. It can add a column, but it takes the time penalty (and requires the temporary storage) to rewrite virtually every block on the volume so the stripes (and parity blocks) span all of the new columns.

> that said, it would seem that adding a single extra column to a raid-z
> group which was mostly full wouldn't immediately add significant usable
> space -- most of the new blocks would be opposite allocated blocks and
> wouldn't be useable until churn had freed up the rest of the row.

Certainly that's one of the well-known ways of causing performance issues on a NetApp. Obviously they have some way of accomplishing the writes, even if it can't use the space on the other disks. I'm not sure how they do that.

-- 
Darren
Ben Lazarus
2005-Dec-12 03:25 UTC
[zfs-discuss] Re: were 2-D block addresses ever considered to allow RAID-Z expansion?
I think RAID-Z column expansion would be a pretty valuable feature, even if a lengthy rebuild (even an offline rebuild) were necessary. As someone else pointed out, this would probably be most useful for home users, but it'd also be useful in the 'enterprise'.

As that poster also pointed out, home users don't represent much if any direct revenue for Sun, but I think it's been a big mistake in the past to underestimate the indirect influence of the home/hobbyist crowd on the corporate landscape, as kids grow up, get jobs, and start recommending that their employers adopt the stuff they've grown up using at home (e.g. Linux), regardless of whether or not there's something better out there (e.g. Solaris) that they may just never have had a reason to get familiar with. This is partly why I'm so pleased to see the beginnings of a revitalization effort on x86 and consumer hardware for Solaris. Make it easy and attractive for advanced home users and hobbyists to try Solaris, and I really believe this will have an eventual impact on corporate adoption and revenue.

I think ZFS has the potential to be a huge 'hook' to make people who might otherwise have installed Linux consider Solaris. I think dtrace probably does the same for the education market, and I think it will have a similar result on Sun's bottom line - get people used to Solaris, whether it's hobbyists at home or college students taking CS courses, and this can only help Sun in the long run. So, a long-winded way of saying: don't discount the home users. Obviously, you don't want to make design decisions that cater to home users at the expense of your corporate installed base, but I think you know what I'm saying.

Anyway, even if adding a column had some downside (e.g. a lengthy rebuild), I think it'd still be an excellent feature to have the option of either exercising or ignoring. It'd probably be the top or only feature request on my personal list - you guys seem to have already done everything else I could have imagined, and plenty of things I couldn't have. Kudos.