Jason Ozolins
2005-Dec-06 05:39 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
To the Sun ZFS folks:

I was just thinking: During the ZFS design phase, was a 2-D block address format ever considered to allow expansion of RAID-Z vdevs after their initial creation? I know that RAID-Z vdev components don't need to be zero'd before use, thanks to ZFS's nifty full-stripe writes, but if you allocated a few bits from that hefty 128-bit device virtual address for a component number, then after the initial creation of a RAID-Z vdev you could add a new component to the vdev by zeroing it. As it stands, though, it sounds like the DVA geometry is set in stone by the number of components at the time the RAID-Z vdev is created, which doesn't allow this expansion.

ISTR reading that Network Appliance filers allow this sort of expansion. Probably means it's patented... :-(

-Jason
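[A rough sketch of the idea Jason is describing, purely illustrative and not the actual ZFS DVA format: reserve a few bits of a block address for a component index, with the remaining bits addressing an offset within that component. The 64-bit width, the 6-bit component field, and the names are assumptions made for brevity.]

/*
 * Hypothetical "2-D address": (component, offset) packed into one word.
 * Bit widths and names are illustrative only -- not the ZFS DVA layout.
 */
#include <stdint.h>
#include <stdio.h>

#define COMPONENT_BITS  6                       /* up to 64 components */
#define OFFSET_BITS     (64 - COMPONENT_BITS)
#define OFFSET_MASK     ((UINT64_C(1) << OFFSET_BITS) - 1)

static uint64_t
dva_make(unsigned component, uint64_t offset)
{
        return (((uint64_t)component << OFFSET_BITS) | (offset & OFFSET_MASK));
}

static unsigned
dva_component(uint64_t dva)
{
        return ((unsigned)(dva >> OFFSET_BITS));
}

static uint64_t
dva_offset(uint64_t dva)
{
        return (dva & OFFSET_MASK);
}

int
main(void)
{
        /* Address block offset 12345 on the 4th component of a vdev. */
        uint64_t dva = dva_make(3, 12345);

        printf("component=%u offset=%llu\n",
            dva_component(dva), (unsigned long long)dva_offset(dva));
        return (0);
}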
Jeff Bonwick
2005-Dec-06 06:28 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
> I was just thinking: During the ZFS design phase, was a 2-D block
> address format ever considered to allow expansion of RAID-Z vdevs after
> their initial creation?

Yes!

You're correct that each RAID-Z group is a fixed number of devices. If you have a 5-disk RAID-Z group, you can add *another* 5-disk RAID-Z group (or 4-disk, or whatever size you like); but you can't add a single disk and convert an existing 5-disk RAID-Z into a 6-disk RAID-Z.

The solution you suggested is one option. Another option we've considered is time-dependent geometry. That is, suppose we allowed a RAID-Z config change from 5 to 6 disks as part of transaction group 37. Then all blocks born in txg 37 or before would remain striped across the original 5 disks, and all blocks born in txg 38 or later would be striped across all 6 disks. We could determine how each block was allocated based solely on its birth time and the (presumably tiny) history of RAID-Z config changes.

That said, as tempting as this is on the basis of technical coolness, I question whether the added complexity is really worth it. As the old saying goes, what you leave out is as important as what you put in.

Jeff
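[A minimal sketch of the time-dependent geometry lookup described above, using invented data structures -- this is not ZFS code. The idea: keep a small history of (first txg, width) entries and resolve each block's stripe width from its birth txg.]

/*
 * Toy illustration of time-dependent RAID-Z geometry.  history[] must be
 * sorted by first_txg ascending; history[0] is the geometry at creation.
 */
#include <stddef.h>
#include <stdint.h>

typedef struct raidz_geom_change {
        uint64_t first_txg;     /* first txg that uses this geometry */
        int      ndisks;        /* stripe width from that txg onward */
} raidz_geom_change_t;

static int
raidz_width_for_block(const raidz_geom_change_t *history, size_t nentries,
    uint64_t birth_txg)
{
        size_t i;

        /* Scan backwards for the newest change at or before birth_txg. */
        for (i = nentries; i > 0; i--) {
                if (birth_txg >= history[i - 1].first_txg)
                        return (history[i - 1].ndisks);
        }
        return (history[0].ndisks);
}

/* Example matching the post: vdev widened from 5 to 6 disks at txg 38. */
static const raidz_geom_change_t example_history[] = {
        { 0,  5 },
        { 38, 6 },
};

int
example(void)
{
        int w_old = raidz_width_for_block(example_history, 2, 37);  /* 5 */
        int w_new = raidz_width_for_block(example_history, 2, 40);  /* 6 */

        return (w_old + w_new);
}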
Al Hopper
2005-Dec-06 14:36 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
On Mon, 5 Dec 2005, Jeff Bonwick wrote:

> > I was just thinking: During the ZFS design phase, was a 2-D block
> > address format ever considered to allow expansion of RAID-Z vdevs after
> > their initial creation?
>
> Yes!
>
> You're correct that each RAID-Z group is a fixed number of devices.
> If you have a 5-disk RAID-Z group, you can add *another* 5-disk RAID-Z
> group (or 4-disk, or whatever size you like); but you can't add a
> single disk and convert an existing 5-disk RAID-Z into a 6-disk RAID-Z.
>
> The solution you suggested is one option. Another option we've
> considered is time-dependent geometry. That is, suppose we allowed a
> RAID-Z config change from 5 to 6 disks as part of transaction group 37.
> Then all blocks born in txg 37 or before would remain striped across
> the original 5 disks, and all blocks born in txg 38 or later would be
> striped across all 6 disks. We could determine how each block was
> allocated based solely on its birth time and the (presumably tiny)
> history of RAID-Z config changes.
>
> That said, as tempting as this is on the basis of technical coolness,
> I question whether the added complexity is really worth it. As the old
> saying goes, what you leave out is as important as what you put in.

Having seen a number of posts suggesting/requesting feature adds and feature enhancement, I'd rather see the current implementation/feature-set be solidified and the code performance/efficiency increased.

The danger of adding more features is, IMHO, that it'll take more host CPU cycles per block of data read/written to implement them. So when I say code efficiency, I'm really saying to minimize the number of CPU cycles it takes to read/write data and do real (world) disk I/O.

Just my $0.02 - I know others will have different priorities.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
Bill Sommerfeld
2005-Dec-06 14:50 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
On Tue, 2005-12-06 at 16:39 +1100, Jason Ozolins wrote:

> ISTR reading that Network Appliance filers allow this sort of expansion.

actually, Solaris volume manager's raid-5 allows for this as well; you can metattach new columns to a raid metadevice.

that said, it would seem that adding a single extra column to a raid-z group which was mostly full wouldn't immediately add significant usable space -- most of the new blocks would be opposite allocated blocks and wouldn't be useable until churn had freed up the rest of the row.

- Bill
Richard Elling
2005-Dec-06 18:01 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
Al Hopper wrote:

> Having seen a number of posts suggesting/requesting feature adds and
> feature enhancement, I'd rather see the current implementation/feature-set
> be solidified and the code performance/efficiency increased.

Agree.

> The danger of adding more features is, IMHO, that it'll take more host CPU
> cycles per block of data read/written, to implement them. So when I say
> code efficiency, I'm really saying to minimize the number of CPU cycles it
> takes to read/write data and do real (world) disk I/O.

Not sure I agree. More likely I disagree. Data integrity is the paramount requirement. Any code which enhances data integrity should not be subject to removal in order to enhance performance. If you just want performance, then there are dozens and dozens of other technologies you can use which prefer performance over data integrity.

I've been looking at a lot of the source lately, and am pleasantly surprised at the clarity and attention to data integrity. Comparatively, UFS is a flaming PODS.

Further, I'll predict that improving performance will require more code, and CPU cycles, rather than less. The good news is that CPU cycles are approaching free at a rapid rate.
http://www.sun.com/processors/UltraSPARC-T1/index.xml

> Just my $0.02 - I know others will have different priorities.

... and requirements :-)
 -- richard
Jason Ozolins
2005-Dec-06 22:52 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
Jeff Bonwick wrote:

>> I was just thinking: During the ZFS design phase, was a 2-D block
>> address format ever considered to allow expansion of RAID-Z vdevs after
>> their initial creation?

[...]

> The solution you suggested is one option. Another option we've
> considered is time-dependent geometry. That is, suppose we allowed a
> RAID-Z config change from 5 to 6 disks as part of transaction group 37.
> Then all blocks born in txg 37 or before would remain striped across
> the original 5 disks, and all blocks born in txg 38 or later would be
> striped across all 6 disks. We could determine how each block was
> allocated based solely on its birth time and the (presumably tiny)
> history of RAID-Z config changes.
>
> That said, as tempting as this is on the basis of technical coolness,
> I question whether the added complexity is really worth it. As the old
> saying goes, what you leave out is as important as what you put in.

It wasn't a feature request (relax, Al), rather a question about the design history.

As for whether it's worth the hassle: I'd say that being able to migrate data off a vdev without unmounting the datasets on that pool would remove most of the need to reconfigure RAID-Z vdevs on the fly. I think this was already mentioned by Sun folks as a desired enhancement to ZFS. Home users would probably like the ability to slap another drive on the side of a RAID-Z vdev, because it means they only have to buy one drive at a time to expand... but they're not a big revenue source. ;-)

As for cross-fs mv and special case file concatenation APIs and other weird stuff of marginal appeal, my favourite feature that looks like it will never get implemented on UNIX would be a decent API for mapping out allocated data extents in sparse files. Even Windows NT has one that doesn't suck. If you ever try to use user-level backup tools like rsync to backup 64-bit sparse files, you'll spend an awfully long time reading zeros to work out where the holes are. "zfs backup" won't help you here if the destination file system isn't ZFS...

-Jason  =:^)
Neil Perrin
2005-Dec-07 00:04 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
> As for cross-fs mv and special case file concatenation APIs and other
> weird stuff of marginal appeal, my favourite feature that looks like it
> will never get implemented on UNIX would be a decent API for mapping out
> allocated data extents in sparse files. Even Windows NT has one that
> doesn't suck. If you ever try to use user-level backup tools like rsync
> to backup 64-bit sparse files, you'll spend an awfully long time reading
> zeros to work out where the holes are. "zfs backup" won't help you here
> if the destination file system isn't ZFS...

Take a look at lseek(2). It was enhanced (8 months ago) to seek to the next hole (SEEK_HOLE) or data (SEEK_DATA). This allows a simple program to be written to find/copy/backup/etc. a sparse file. Both UFS and ZFS support this, although ZFS is more efficient. We ought to make cp use it, though.

Neil
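[A minimal sketch of the kind of simple program Neil mentions, assuming a platform whose headers define SEEK_HOLE and SEEK_DATA. It walks a sparse file and prints the byte ranges that contain data, never reading the holes; error handling is kept to the bare minimum.]

#include <sys/types.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        int fd;
        off_t data, hole;

        if (argc != 2) {
                (void) fprintf(stderr, "usage: %s <sparse-file>\n", argv[0]);
                return (1);
        }
        if ((fd = open(argv[1], O_RDONLY)) == -1) {
                perror("open");
                return (1);
        }

        data = 0;
        for (;;) {
                /* Find the start of the next data region... */
                data = lseek(fd, data, SEEK_DATA);
                if (data == -1)
                        break;          /* ENXIO: no more data */
                /* ...and the hole that terminates it. */
                hole = lseek(fd, data, SEEK_HOLE);
                if (hole == -1)
                        break;
                (void) printf("data: %lld .. %lld\n",
                    (long long)data, (long long)hole);
                data = hole;
        }
        if (errno != ENXIO)
                perror("lseek");

        (void) close(fd);
        return (0);
}

(The loop relies on the guarantee, quoted in the next message, that every data region is followed by a hole, virtual or otherwise.)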
Jeff Bonwick
2005-Dec-07 00:19 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
> As for cross-fs mv and special case file concatenation APIs and other
> weird stuff of marginal appeal, my favourite feature that looks like it
> will never get implemented on UNIX would be a decent API for mapping out
> allocated data extents in sparse files.

Hah! You're in luck! As part of the ZFS project, we introduced two general extensions to lseek(): SEEK_HOLE and SEEK_DATA. These allow you to quickly discover the non-zero regions of holey files. Quoting from the updated man page:

     o  If whence is SEEK_HOLE, the offset of the start of the next
        hole greater than or equal to the supplied offset is returned.
        The definition of a hole is provided near the end of the
        DESCRIPTION.

     o  If whence is SEEK_DATA, the file pointer is set to the start
        of the next non-hole file region greater than or equal to the
        supplied offset.

     [...]

     A "hole" is defined as a contiguous range of bytes in a file, all
     having the value of zero, but not all zeros in a file are
     guaranteed to be represented as holes returned with SEEK_HOLE.
     Filesystems are allowed to expose ranges of zeros with SEEK_HOLE,
     but not required to. Applications can use SEEK_HOLE to optimise
     their behavior for ranges of zeros, but must not depend on it to
     find all such ranges in a file. The existence of a hole at the
     end of every data region allows for easy programming and implies
     that a virtual hole exists at the end of the file.

     Applications should use fpathconf(_PC_MIN_HOLE_SIZE) or
     pathconf(_PC_MIN_HOLE_SIZE) to determine if a filesystem supports
     SEEK_HOLE. See fpathconf(2).

     For filesystems that do not supply information about holes, the
     file will be represented as one entire data region.

Any filesystem can support SEEK_HOLE / SEEK_DATA. Even a filesystem like UFS, which has no special support for sparseness, can walk its block pointers a lot faster than it can copy out a bunch of zeroes. For ZFS, we made sparse file navigation a first-class operation -- no linear searches. Each block pointer contains a 'fill count' describing the number of blocks beneath it in the tree. This allows ZFS to find holes and non-holes (real data) in logarithmic time. I'll say more about the implementation in an upcoming blog entry.

Jeff
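[To make the 'fill count' idea concrete, here is a toy tree walk -- emphatically not the real ZFS code, whose on-disk structures differ -- showing how recording the number of allocated blocks beneath each pointer lets a search skip all-hole subtrees and find the next data block with a single root-to-leaf descent.]

#include <stddef.h>
#include <stdint.h>

#define FANOUT  128                     /* children per indirect block */

typedef struct blkptr {
        uint64_t        fill;           /* allocated blocks below here */
        struct blkptr   *child[FANOUT]; /* NULL below the leaf level */
        uint64_t        first_blkid;    /* first block id covered */
        uint64_t        span;           /* number of block ids covered */
} blkptr_t;

/*
 * Return the id of the first allocated block >= blkid, or UINT64_MAX if
 * the remainder of this subtree is entirely hole.
 */
static uint64_t
next_data_block(const blkptr_t *bp, uint64_t blkid)
{
        int i;

        if (bp == NULL || bp->fill == 0)
                return (UINT64_MAX);            /* all hole: skip it */

        if (bp->span == 1)                      /* leaf: one data block */
                return (bp->first_blkid >= blkid ?
                    bp->first_blkid : UINT64_MAX);

        for (i = 0; i < FANOUT; i++) {
                const blkptr_t *c = bp->child[i];
                uint64_t r;

                if (c == NULL || c->first_blkid + c->span <= blkid)
                        continue;               /* entirely before blkid */
                r = next_data_block(c, blkid);
                if (r != UINT64_MAX)
                        return (r);
        }
        return (UINT64_MAX);
}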
Al Hopper
2005-Dec-07 03:58 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
On Tue, 6 Dec 2005, Richard Elling wrote:

> Al Hopper wrote:
> > Having seen a number of posts suggesting/requesting feature adds and
> > feature enhancement, I'd rather see the current implementation/feature-set
> > be solidified and the code performance/efficiency increased.
>
> Agree.
>
> > The danger of adding more features is, IMHO, that it'll take more host CPU
> > cycles per block of data read/written, to implement them. So when I say
> > code efficiency, I'm really saying to minimize the number of CPU cycles it
> > takes to read/write data and do real (world) disk I/O.

.... reformatted/hacked up/but not modified ...

> Not sure I agree. More likely I disagree.
> Data integrity is the paramount requirement.

Agreed.

> Any code which enhances data integrity should not be subject
> to removal in order to enhance performance.

Agreed. My position is that the current implementation provides such a step forward, in terms of data integrity, over existing file system implementations, that enhancing it with additional ECC should wait until some time in the future. IOW - I'd prefer to see the current implementation optimized, before adding additional features like ECC that would only increase the CPU cycles per block of I/O.

> If you just want performance, then there are dozens and dozens of other
> technologies you can use which prefer performance over data integrity.

Agreed. In this respect, ZFS is revolutionary, not evolutionary, and a testament to the ZFS team's ability to think outside the box.

> I've been looking at a lot of the source lately, and am pleasantly
> surprised at the clarity and attention to data integrity.
> Comparatively, UFS is a flaming PODS.

Agreed.

> Further, I'll predict that improving performance will require more code,
> and CPU cycles, rather than less. The good news is that CPU cycles are
> approaching free at a rapid rate.
> http://www.sun.com/processors/UltraSPARC-T1/index.xml

Agreed 100%. I often draw a pseudo realistic graph on a white board that tries to show 3 curves over time. a) CPU "horsepower" (which reflects CPU/memory latency), b) Avg system memory size and c) system I/O performance. Obviously (preaching to the converted) the CPU & memory size curves climb dramatically while (disk) I/O rates climb at a dramatically lower rate. In fact, the main challenge in building high performance computer systems is dealing with the disparity between the available CPU cycles (with attendant large system memory size) and I/O rates, both in terms of I/O Operations per Second and overall (disk) I/O bandwidth. Having recognized that trend years ago, I shifted most of our development towards Java - to "ride" the CPU & memory size curves!

> > Just my $0.02 - I know others will have different priorities.
>
> ... and requirements :-)

:)

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133  Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
Richard Elling
2005-Dec-07 05:05 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
Al Hopper wrote:

> On Tue, 6 Dec 2005, Richard Elling wrote:
> > Any code which enhances data integrity should not be subject
> > to removal in order to enhance performance.
>
> Agreed. My position is that the current implementation provides such a
> step forward, in terms of data integrity, over existing file system
> implementations, that enhancing it with additional ECC should wait until
> some time in the future. IOW - I'd prefer to see the current
> implementation optimized, before adding additional features like ECC that
> would only increase the CPU cycles per block of I/O.

Yes! At this time I would say that we really don't know what the real rates of faults are along the path, since until now, we really didn't have a way of detecting them in a reasonable amount of time. Until we know the nature of the faults we expect to see, we shouldn't go off half-cocked and shoot the wrong target. Or, to put it another way, I'm very interested in finding real failure rates detected by ZFS users. Let me know if you see any blips.
 -- richard
Jeff Bonwick
2005-Dec-07 05:23 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
> Agreed 100%. I often draw a pseudo realistic graph on a white board that
> tries to show 3 curves over time. a) CPU "horsepower" (which reflects
> CPU/memory latency), b) Avg system memory size and c) system I/O
> performance. Obviously (preaching to the converted) the CPU & memory size
> curves climb dramatically while (disk) I/O rates climb at a dramatically
> lower rate. In fact, the main challenge in building high performance
> computer systems is dealing with the disparity between the available CPU
> cycles (with attendant large system memory size) and I/O rates, both in
> terms of I/O Operations per Second and overall (disk) I/O bandwidth.

Exactly! Few people fully appreciate this. Over the past 15 years, disks have gotten about 10x faster, memory about 100x faster, and CPUs 1000x faster. As you point out, it is the change in *relative* performance, not absolute performance, that moves the design center.

Jeff
Nathan Kroenert
2005-Dec-07 05:27 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
Silly question time:

What sort of statistics do we keep on these flips / detected corruptions?
 - Per zpool
 - per device
 - per lun
 etc.

Is it something we'll be able to collect from an explorer at some time, for example?

Thanks! :)

Nathan.

Richard Elling wrote:

> Al Hopper wrote:
>
>> On Tue, 6 Dec 2005, Richard Elling wrote:
>>
>>> Any code which enhances data integrity should not be subject
>>> to removal in order to enhance performance.
>>
>> Agreed. My position is that the current implementation provides such a
>> step forward, in terms of data integrity, over existing file system
>> implementations, that enhancing it with additional ECC should wait until
>> some time in the future. IOW - I'd prefer to see the current
>> implementation optimized, before adding additional features like ECC that
>> would only increase the CPU cycles per block of I/O.
>
> Yes! At this time I would say that we really don't know what the real
> rates of faults are along the path, since until now, we really didn't have
> a way of detecting them in a reasonable amount of time. Until we know
> the nature of the faults we expect to see, we shouldn't go off half-cocked
> and shoot the wrong target. Or, to put it another way, I'm very interested
> in finding real failure rates detected by ZFS users. Let me know if you
> see any blips.
> -- richard
Jeff Bonwick
2005-Dec-07 05:29 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
> Yes! At this time I would say that we really don't know what the real
> rates of faults are along the path, since until now, we really didn't have
> a way of detecting them in a reasonable amount of time. Until we know
> the nature of the faults we expect to see, we shouldn't go off half-cocked
> and shoot the wrong target. Or, to put it another way, I'm very interested
> in finding real failure rates detected by ZFS users. Let me know if you
> see any blips.

Well said. If we're not careful, we may learn something.

Jeff
Jeff Bonwick
2005-Dec-07 05:44 UTC
[zfs-discuss] were 2-D block addresses ever considered to allow RAID-Z expansion?
> Silly question time:
>
> What sort of statistics do we keep on these flips / detected corruptions?
>  - Per zpool
>  - per device
>  - per lun
>  etc.
>
> Is it something we'll be able to collect from an explorer at some time,
> for example?

On a live system, running 'zpool status' will tell you what errors we've seen. If you're using mdb on a crash dump, "::spa -ve" will give you the same information.

At present these are just bean counters; over the next few months we're going to add detailed FMA telemetry, and also maintain both an error log and a zpool/zfs command log on persistent storage.

Jeff
Darren Dunham
2005-Dec-09 01:51 UTC
[zfs-discuss] Re: were 2-D block addresses ever considered to allow RAID-Z expansion?
> > ISTR reading that Network Appliance filers allow this sort of expansion.
>
> actually, Solaris volume manager's raid-5 allows for this as well; you
> can metattach new columns to a raid metadevice.

With some significant differences. Since the filesystem can't recognize the underlying storage, the original blocks remain fixed and data on the new column is not really striped anywhere. It's just like tacking a concat device onto the end of the RAID-5 (unless I've completely misunderstood how it works).

As opposed to something like a VxVM relayout. It can add a column, but it takes the time penalty (and requires the temporary storage) to rewrite virtually every block on the volume so the stripes (and parity blocks) span all of the new columns.

> that said, it would seem that adding a single extra column to a raid-z
> group which was mostly full wouldn't immediately add significant usable
> space -- most of the new blocks would be opposite allocated blocks and
> wouldn't be useable until churn had freed up the rest of the row.

Certainly that's one of the well-known ways of causing performance issues on a NetApp. Obviously they have some way of accomplishing the writes, even if it can't use the space on the other disks. I'm not sure how they do that.

-- 
Darren
Ben Lazarus
2005-Dec-12 03:25 UTC
[zfs-discuss] Re: were 2-D block addresses ever considered to allow RAID-Z expansion?
I think RAID-Z column expansion would be a pretty valuable feature, even if a lengthy rebuild (even an offline rebuild) were necessary. As someone else pointed out, this would probably be most useful for home users, but it'd also be useful in the 'enterprise'.

As that poster also pointed out, home users don't represent much if any direct revenue for Sun, but I think it's been a big mistake in the past to underestimate the indirect influence of the home/hobbyist crowd on the corporate landscape, as kids grow up, get jobs, and start recommending that their employers adopt the stuff they've grown up using at home (e.g. Linux), regardless of whether or not there's something better out there (e.g. Solaris) that they may just never have had a reason to get familiar with. This is partly why I'm so pleased to see the beginnings of a revitalization effort on x86 and consumer hardware for Solaris. Make it easy and attractive for advanced home users and hobbyists to try Solaris, and I really believe this will have an eventual impact on corporate adoption and revenue.

I think ZFS has the potential to be a huge 'hook' to make people who might otherwise have installed Linux consider Solaris. I think dtrace probably does the same for the education market, and I think it will have a similar result on Sun's bottom line - get people used to Solaris, whether it's hobbyists at home or college students taking CS courses, and this can only help Sun in the long run. So, a long-winded way of saying: don't discount the home users. Obviously, you don't want to make design decisions that cater to home users at the expense of your corporate installed base, but I think you know what I'm saying.

Anyway, even if adding a column had some downside (e.g. a lengthy rebuild), I think it'd still be an excellent feature to have the option of either exercising or ignoring. It'd probably be the top or only feature request on my personal list - you guys seem to have already done everything else I could have imagined, and plenty of things I couldn't have. Kudos.