When using an array's RAID schemes, what higher-level ZFS features are not in play when ZFS stripes/concats are used without using any ZFS RaidZ or mirrors?

I understand from http://www.opensolaris.org/os/community/zfs/faq/#hardwareraid that ZFS can only report errors but not correct them. I think ZFS still does copy-on-write, and rollback on error - these are separate from RAID.

Does ZFS round-robin the writes across the LUNs when there is no ZFS RaidZ or mirrors in play? Or do all the writes go to the first LUN until it is full?

What other ZFS features depend on ZFS RAID?

Thanks,
John
On 03 August, 2007 - John.Karwoski at Sun.COM sent me these 1,6K bytes:

> When using an array's RAID schemes, what higher-level ZFS
> features are not in play when ZFS stripes/concats are used without using
> any ZFS RaidZ or mirrors?
>
> I understand from
> http://www.opensolaris.org/os/community/zfs/faq/#hardwareraid
>
> that ZFS can only report errors but not correct them. I think ZFS still
> does copy-on-write, and rollback on error - these are separate from
> RAID.
>
> Does ZFS round-robin the writes across the LUNs when there is no ZFS
> RaidZ or mirrors in play? Or do all the writes go to the first
> LUN until it is full?

Round-robin across all vdevs (single disk, mirror, raidz, raidz2) in a
pool.

> What other ZFS features depend on ZFS RAID?

Mostly the self-healing stuff. But if it's not zfs-redundant and a
device experiences write errors, the machine will currently panic.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
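To see the round-robin behaviour for yourself, a minimal sketch (pool and LUN names below are placeholders, not anything from this thread) is to build a plain stripe of array-backed LUNs and watch the per-vdev allocation:

  # Non-redundant stripe of two SAN LUNs (device names are placeholders);
  # ZFS spreads writes across both top-level vdevs rather than filling
  # the first LUN before touching the second.
  zpool create tank c4t<LUN_A>d0 c4t<LUN_B>d0

  # Watch per-vdev capacity and write activity every 5 seconds:
  zpool iostat -v tank 5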
John Martinez
2007-Aug-03 17:37 UTC
[zfs-discuss] ZFS Features when Using Enterprise Arrays
On Aug 3, 2007, at 8:29 AM, Tomas Ögren wrote:

> On 03 August, 2007 - John.Karwoski at Sun.COM sent me these 1,6K bytes:
>
>> When using an array's RAID schemes, what higher-level ZFS
>> features are not in play when ZFS stripes/concats are used without
>> using
>> any ZFS RaidZ or mirrors?
>>
>> I understand from
>> http://www.opensolaris.org/os/community/zfs/faq/#hardwareraid
>>
>> that ZFS can only report errors but not correct them. I think ZFS
>> still
>> does copy-on-write, and rollback on error - these are separate from
>> RAID.
>>
>> Does ZFS round-robin the writes across the LUNs when there is no ZFS
>> RaidZ or mirrors in play? Or do all the writes go to the first
>> LUN until it is full?
>
> Round-robin across all vdevs (single disk, mirror, raidz, raidz2) in a
> pool.

That's good to hear.

>> What other ZFS features depend on ZFS RAID?
>
> Mostly the self-healing stuff. But if it's not zfs-redundant and a
> device experiences write errors, the machine will currently panic.

Wow, this is certainly worse than the current VxVM/VxFS implementation. At least there I get I/O errors and disk groups get failed or disabled.

-john
Alderman, Sean
2007-Aug-03 18:35 UTC
[zfs-discuss] ZFS Features when Using Enterprise Arrays
The OP here is posting the "Z"illion dollar question ... and apologies in advance for the verbal diarrhea.

Most of the Enterprise Level systems people here (my company) look at ZFS and say, "Wow, that's really cool...but..." What comes after the "but..." is a host of questions that ultimately come down to: how much does ZFS cost?

What is the cost of running ZFS RAIDZ on top of an Enterprise Storage System that's already RAID 1+0 or RAID 5?

As a "manager" of systems, how do I justify switching from tried and true SVM/UFS or VxFS/VxVM on high speed redundant storage? The nice features - never needing to go offline to manage the storage the server sees, snapshots, clones, etc. - don't seem to make up for the loss of the ability to repair in the event of a failure. We monitor heavily, we can schedule a maintenance window when necessary, and we can cope with an outage (however painful) that requires an fsck or even a tape restore...for as often as it happens (once in the last 5 years, I believe). Does this mean that my environment is too low on the totem pole for ZFS? I'm pretty sure we subscribe to a five 9's uptime SLA.

A gentleman yesterday posted the zpool status below that used SAN devices. Suppose each device is 100GB of RAID 1+0 storage.

  pool: ms2
 state: ONLINE
 scrub: scrub completed with 0 errors on Sun Jul 22 00:47:51 2007
config:

        NAME                                       STATE     READ WRITE CKSUM
        ms2                                        ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600C0FF0000000000A7E0A0E6F8A1000d0  ONLINE       0     0     0
            c4t600C0FF0000000000A7E8D1EA7178800d0  ONLINE       0     0     0
          mirror                                   ONLINE       0     0     0
            c4t600C0FF0000000000A7E0A7219D78100d0  ONLINE       0     0     0
            c4t600C0FF0000000000A7E8D7B3709D800d0  ONLINE       0     0     0

errors: No known data errors

This configuration shows 400GB of LUNs (800GB actual behind the SAN), and my usable space is 200GB. That's 25% usable storage capacity alone, and I'm sure there are other costs in RAID X over RAID Y that are less tangible. So, is that worth it?

Am I supposed to suggest that we go double the capacity in our RAID 1+0 CLARiiON so that I can implement ZFS 1+0 and not sacrifice any storage capacity? Am I supposed to suggest that the storage crew abandon RAID 1+0 on their devices in order for ZFS to provide fault tolerance? Either way, this makes ZFS a very tough sell. How would I historically show that the investment was worth it when ZFS probably never sees a checksum error because the Storage System hides failures so well?

On the other hand, if I were to configure my pool like this...

  pool: ms2
 state: ONLINE
 scrub: scrub completed with 0 errors on Sun Jul 22 00:47:51 2007
config:

        NAME                                     STATE     READ WRITE CKSUM
        ms2                                      ONLINE       0     0     0
          c4t600C0FF0000000000A7E0A0E6F8A1000d0  ONLINE       0     0     0
          c4t600C0FF0000000000A7E8D1EA7178800d0  ONLINE       0     0     0
          c4t600C0FF0000000000A7E0A7219D78100d0  ONLINE       0     0     0
          c4t600C0FF0000000000A7E8D7B3709D800d0  ONLINE       0     0     0

errors: No known data errors

I'd have a nice 400GB (800GB actual) pool. I'd still have my hard RAID 1+0, but now a single checksum error on any one LUN would render the entire file system unusable. There is NO way to replace or repair without destroying the entire pool. What is the likelihood of that happening? And what would cause such a thing? I have run the Self Healing Demo against both of the above pool configurations; the latter is not pretty.

With Storage Systems providing their own snap/clone facilities (like BCVs with EMC), it only gets more difficult as Storage and Server teams work largely independent of each other.
I'd really like to push ZFS for data storage on all of our new hardware going forward, but unless I can justify overruling the Storage System's RAID 1+0 or dropping my capacity utilization from 50% to 25%, I haven't got much ground to stand on. Is anyone else paddling in my canoe?

--
Sean
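For reference, the two layouts being compared would be created roughly like this (a sketch only, reusing the LUN names from the zpool status output above):

  # ZFS mirror on top of array RAID 1+0: 200GB usable of 800GB raw (25%)
  zpool create ms2 \
      mirror c4t600C0FF0000000000A7E0A0E6F8A1000d0 c4t600C0FF0000000000A7E8D1EA7178800d0 \
      mirror c4t600C0FF0000000000A7E0A7219D78100d0 c4t600C0FF0000000000A7E8D7B3709D800d0

  # Plain stripe on top of array RAID 1+0: 400GB usable of 800GB raw (50%),
  # but no ZFS-level redundancy for self-healing of data blocks
  zpool create ms2 \
      c4t600C0FF0000000000A7E0A0E6F8A1000d0 \
      c4t600C0FF0000000000A7E8D1EA7178800d0 \
      c4t600C0FF0000000000A7E0A7219D78100d0 \
      c4t600C0FF0000000000A7E8D7B3709D800d0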
Mario Goebbels
2007-Aug-03 18:36 UTC
[zfs-discuss] ZFS Features when Using Enterprise Arrays
>>> What other ZFS features depend on ZFS RAID?
>>
>> Mostly the self-healing stuff. But if it's not zfs-redundant and a
>> device experiences write errors, the machine will currently panic.
>
> Wow, this is certainly worse than the current VxVM/VxFS
> implementation. At least there I get I/O errors and disk groups get
> failed or disabled.

Yeah, this is strange behavior. Depending on when/how the error shows up, a server could be sent on a reboot party.

I haven't run into I/O error issues yet, but by the time that happens, I hope ZFS will generate an event that can be easily trapped by an application that sends warning emails or automagically IMs. Same for the desktop scenario, where Gnome would be notified and pops up a system modal error dialog or something.

-mg
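Until something like that exists, a crude cron-driven check can stand in for it - a sketch only, assuming mailx is configured and the alert address is adjusted for the local site:

  #!/bin/sh
  # Cron this every few minutes: "zpool status -x" prints
  # "all pools are healthy" when nothing is wrong, so mail
  # anything else to the admins.
  STATUS=`/usr/sbin/zpool status -x`
  if [ "$STATUS" != "all pools are healthy" ]; then
      echo "$STATUS" | mailx -s "ZFS pool problem on `hostname`" admin@example.com
  fi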
Wade.Stuart at fallon.com
2007-Aug-03 20:46 UTC
[zfs-discuss] ZFS Features when Using Enterprise Arrays
zfs-discuss-bounces at opensolaris.org wrote on 08/03/2007 01:35:04 PM:

> The OP here is posting the "Z"illion dollar question ... and
> apologies in advance for the verbal diarrhea.
>
> Most of the Enterprise Level systems people here (my company) look
> at ZFS and say, "Wow, that's really cool...but..." What comes after
> the "but..." is a host of questions that ultimately come down to:
> how much does ZFS cost?
>
> What is the cost of running ZFS RAIDZ on top of an Enterprise Storage
> System that's already RAID 1+0 or RAID 5?

I really don't get what you are asking. Versus vxfs/vxvm and svm? Then the additional cost is none (or negative, vs licensing vxfs/vxvm/snap).

> As a "manager" of systems, how do I justify switching from tried and
> true SVM/UFS or VxFS/VxVM on high speed redundant storage?
> The nice features - never needing to go offline to manage the
> storage the server sees, snapshots, clones, etc. - don't seem to make
> up for the loss of the ability to repair in the event of a failure.
> We monitor heavily, we can schedule a maintenance window when
> necessary, and we can cope with an outage (however painful) that
> requires an fsck or even a tape restore...for as often as it happens
> (once in the last 5 years, I believe). Does this mean that my
> environment is too low on the totem pole for ZFS? I'm pretty sure
> we subscribe to a five 9's uptime SLA.

Yet you have no way to know if your uptime includes spewing out invalid data.

> A gentleman yesterday posted the zpool status below that used SAN
> devices. Suppose each device is 100GB of RAID 1+0 storage.
>
>   pool: ms2
>  state: ONLINE
>  scrub: scrub completed with 0 errors on Sun Jul 22 00:47:51 2007
> config:
>
>         NAME                                       STATE     READ WRITE CKSUM
>         ms2                                        ONLINE       0     0     0
>           mirror                                   ONLINE       0     0     0
>             c4t600C0FF0000000000A7E0A0E6F8A1000d0  ONLINE       0     0     0
>             c4t600C0FF0000000000A7E8D1EA7178800d0  ONLINE       0     0     0
>           mirror                                   ONLINE       0     0     0
>             c4t600C0FF0000000000A7E0A7219D78100d0  ONLINE       0     0     0
>             c4t600C0FF0000000000A7E8D7B3709D800d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> This configuration shows 400GB of LUNs (800GB actual behind the SAN),
> and my usable space is 200GB. That's 25% usable storage capacity
> alone, and I'm sure there are other costs in RAID X over RAID Y that
> are less tangible. So, is that worth it?

25% vs 25% for vxfs/vxvm and svm in similar configurations. Striped config (which is what you are using in vxvm and svm now, right?) has no additional penalty.

> Am I supposed to suggest that we go double the capacity in our RAID
> 1+0 CLARiiON so that I can implement ZFS 1+0 and not sacrifice any
> storage capacity? Am I supposed to suggest that the storage crew
> abandon RAID 1+0 on their devices in order for ZFS to provide fault
> tolerance? Either way, this makes ZFS a very tough sell.

Well, the easy sell is to use ZFS as you use vxfs/vxvm and svm now (stripe) -- you still gain checksumming of data (but not self-heal of data -- only metadata), snaps (free vs license), compression, pooling, etc...

> How would
> I historically show that the investment was worth it when ZFS
> probably never sees a checksum error because the Storage System
> hides failures so well?

You can't. Maybe show the last time your emc had a failed disk on a RAID lun group -- emc went to scrub before replace and showed yet another disk in the same lun group that was bad, and you had to restore from tape, or emc fibbed and replaced the disk with suspect parity data.

> On the other hand, if I were to configure my pool like this...
>
>   pool: ms2
>  state: ONLINE
>  scrub: scrub completed with 0 errors on Sun Jul 22 00:47:51 2007
> config:
>
>         NAME                                     STATE     READ WRITE CKSUM
>         ms2                                      ONLINE       0     0     0
>           c4t600C0FF0000000000A7E0A0E6F8A1000d0  ONLINE       0     0     0
>           c4t600C0FF0000000000A7E8D1EA7178800d0  ONLINE       0     0     0
>           c4t600C0FF0000000000A7E0A7219D78100d0  ONLINE       0     0     0
>           c4t600C0FF0000000000A7E8D7B3709D800d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> I'd have a nice 400GB (800GB actual) pool. I'd still have my hard
> RAID 1+0, but now a single checksum error on any one LUN would
> render the entire file system unusable.

No. Copied from another thread:

To clarify - ditto blocks are used: 3 copies for pool metadata, each copy on a different lun if possible, and 2 copies for each file system's metadata, with each copy on a different lun. This means that file system metadata corruptions should self-heal in a non-redundant config (symlinks being an exception now, but there's an RFE to fix it).

> There is NO way to replace
> or repair without destroying the entire pool.

Delete the file and restore -- also you may want to call EMC and ask why your host is being fed corrupted data without any failures showing on the EMC. If the checksum error overlaps the zfs metadata ditto blocks and makes metadata self-heal fail, then you restore from tape. If you lose access to a lun in the stripe, you go down -- just like with vxvm and svm. How is this not better than vxvm and svm?

> What is the
> likelihood of that happening? And what would cause such a thing? I
> have run the Self Healing Demo against both of the above pool
> configurations; the latter is not pretty.

Depends what opensolaris/solaris bits you are on. Newer bits handle this better and should keep you up and heal the metadata if metadata dittos are available. Either way, how the heck do svm and vxvm handle this for you currently? =)

> With Storage Systems providing their own snap/clone facilities (like
> BCVs with EMC), it only gets more difficult as Storage and Server
> teams work largely independent of each other.

Hmm, in most environments I have seen, BCVs have been used on the os/app side after the admin quiesces the machine -- what good are random snaps of an unknown state? Sure, the storage guys grant them to you (BCV space), but do you really not own the snap side too?

> I'd really like to
> push ZFS for data storage on all of our new hardware going forward,
> but unless I can justify overruling the Storage System's RAID 1+0
> or dropping my capacity utilization from 50% to 25%, I haven't got
> much ground to stand on. Is anyone else paddling in my canoe?

It may help if you don't sabotage your own arguments for ZFS. Bottom line is ZFS (even in stripe mode) buys you more than vxvm or svm for less cost ($$ and time). Try to compare ZFS stripe to vxvm stripe, ZFS raidz to vxvm raid. ZFS should come out ahead, except for a few places such as user quotas and evacuating luns. Those are coming sometime.
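Worth noting alongside the ditto-block point: the same mechanism can be extended to user data on a non-redundant pool with the copies property (a sketch only; the dataset name is hypothetical, and the pool must be on bits recent enough to support copies):

  # Keep two copies of user data (in addition to the metadata ditto
  # blocks) on a pool with no ZFS-level redundancy; roughly halves the
  # effective capacity for this file system.
  zfs set copies=2 ms2/important

  # A periodic scrub lets ZFS find and repair latent checksum errors
  # from the extra copies before they are actually needed.
  zpool scrub ms2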
Richard Elling
2007-Aug-03 22:46 UTC
[zfs-discuss] ZFS Features when Using Enterprise Arrays
Alderman, Sean wrote:

> I'd have a nice 400GB (800GB actual) pool. I'd still have my hard RAID 1+0, but now a single checksum error on any one LUN would render the entire file system unusable.

No, this is not a correct assumption. The "panic when ZFS sees an error" case is for *writes* where ZFS has no other option to write the data correctly (copies=1 *and* the zpool has no redundancy). Some other file systems will patiently wait, perhaps forever. The real solution to this requires changes beyond ZFS, which is perhaps why it is not already finished (I don't work directly on this code, so I can't say for sure.)

For errors on reads, only the affected file is impacted.
 -- richard
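In practice, that means a read-side corruption on a non-redundant pool surfaces as an I/O error for the file in question, and the damaged paths can be listed and restored individually - a sketch, using the pool name from earlier in the thread:

  # -v lists any files with permanent (unrecoverable) errors
  zpool status -v ms2

  # ...then restore just those files from backup, rather than
  # rebuilding the whole pool.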