Hi Folks,

I am currently in the midst of setting up a completely new file server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) connected to an Engenio 6994 product (I work for LSI Logic so Engenio is a no brainer). I have configured a couple of zpools from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I then created sub zfs systems below that and set quotas and sharenfs'd them so that it appears that these "file systems" are dynamically shrinkable and growable. It looks very good... I can see the correct file system sizes on all types of machines (Linux 32/64bit and of course Solaris boxes) and if I resize the quota it's picked up in NFS right away. But I would be the first in our organization to use this in an enterprise system so I definitely have some concerns that I'm hoping someone here can address.

1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares and the write cache (8GB) has battery backup, so I'm not too concerned from a hardware side. I'm looking for an idea of how stable ZFS itself is in terms of corruptibility, uptime and OS stability.

2. Recommended config. Above, I have a fairly simple setup. In many of the examples the granularity is home directory level and when you have many many users that could get to be a bit of a nightmare administratively. I am really only looking for high level dynamic size adjustability and am not interested in its built in RAID features. But given that, any real world recommendations?

3. Caveats? Anything I'm missing that isn't in the docs that could turn into a BIG gotcha?

4. Since all data access is via NFS we are concerned that 32 bit systems (Mainly Linux and Windows via Samba) will not be able to access all the data areas of a 2TB+ zpool even if the zfs quota on a particular share is less than that. Can anyone comment?

The bottom line is that with anything new there is cause for concern. Especially if it hasn't been tested within our organization. But the convenience/functionality factors are way too hard to ignore.

Thanks,

Jeff

This message posted from opensolaris.org
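For readers who want to reproduce this kind of layout, a minimal sketch of the commands involved follows; the pool name, device name and quota sizes are hypothetical, not taken from Jeff's actual configuration:

    # create a pool on a LUN presented by the array (device name is illustrative)
    zpool create tank c4t600A0B800012345Ad0

    # child filesystems, each capped by a quota and shared over NFS
    zfs create tank/projects
    zfs set quota=500g tank/projects
    zfs set sharenfs=on tank/projects

    # "resizing" the file system later is just a quota change,
    # which NFS clients pick up on the fly
    zfs set quota=750g tank/projects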
Hello Jeffery,

Friday, January 26, 2007, 3:16:44 PM, you wrote:

JM> Hi Folks,
JM> I am currently in the midst of setting up a completely new file
JM> server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM)
JM> connected to an Engenio 6994 product (I work for LSI Logic so
JM> Engenio is a no brainer). I have configured a couple of zpools
JM> from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I
JM> then created sub zfs systems below that and set quotas and
JM> sharenfs'd them so that it appears that these "file systems" are
JM> dynamically shrinkable and growable. It looks very good... I can
JM> see the correct file system sizes on all types of machines (Linux
JM> 32/64bit and of course Solaris boxes) and if I resize the quota
JM> it's picked up in NFS right away. But I would be the first in our
JM> organization to use this in an enterprise system so I definitely
JM> have some concerns that I'm hoping someone here can address.

JM> 1. How stable is ZFS? The Engenio box is completely configured
JM> for RAID5 with hot spares and the write cache (8GB) has battery backup,
JM> so I'm not too concerned from a hardware side. I'm looking for an
JM> idea of how stable ZFS itself is in terms of corruptibility, uptime and OS stability.

When it comes to uptime, OS stability or corruptibility - no problems here. However, if you give ZFS entire LUNs on Engenio devices, IIRC with those arrays, when ZFS issues a write-cache flush to the array it actually flushes, and this can hurt performance. There's a way to set up the array to ignore flush commands, or you can put ZFS on SMI-labelled slices. You have to check whether this problem was actually with Engenio - I'm not sure.

However, depending on the workload, consider doing RAID in ZFS instead of on the array, especially because you then get self-healing from ZFS. At the least, doing a stripe across several RAID5 LUNs would be a good idea.

JM> 2. Recommended config. Above, I have a fairly simple setup. In
JM> many of the examples the granularity is home directory level and
JM> when you have many many users that could get to be a bit of a
JM> nightmare administratively. I am really only looking for high
JM> level dynamic size adjustability and am not interested in its
JM> built in RAID features. But given that, any real world recommendations?

Depending on how many users you have, consider creating a file system for each user, or at least for a group of users if you can group them.

JM> 3. Caveats? Anything I'm missing that isn't in the docs that could turn into a BIG gotcha?

The WRITE CACHE issue I mentioned above - but check whether it was really Engenio - anyway, there are simple workarounds. There are some performance issues in corner cases; I hope you won't hit one. Use at least S10U3 or Nevada (there are some people using Nevada in production :)).

JM> 4. Since all data access is via NFS we are concerned that 32 bit
JM> systems (Mainly Linux and Windows via Samba) will not be able to
JM> access all the data areas of a 2TB+ zpool even if the zfs quota on
JM> a particular share is less than that. Can anyone comment?

If there's a quota on a file system then the NFS client will see that quota as the file system size, IIRC, so it shouldn't be a problem. But that means a file system for each user.

JM> The bottom line is that with anything new there is cause for
JM> concern. Especially if it hasn't been tested within our
JM> organization. But the convenience/functionality factors are way too hard to ignore.

ZFS is new, that's right.
There are some problems, mostly related to performance and hot-spare support (when doing RAID in ZFS). Other than that you should be OK. Quite a lot of people are using ZFS in production. I myself have had ZFS in production for years, right now with well over 100TB of data on it using different storage arrays, and I'm still migrating more and more data. Never lost any data on ZFS - at least I don't know about it :)))))

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                http://milek.blogspot.com
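A rough sketch of the two suggestions above (device names, user names and quota sizes are made up for illustration): stripe the pool across several array-side RAID5 LUNs, and give each user (or group) a filesystem of their own:

    # stripe across several RAID5 LUNs exported by the array
    zpool create tank c4t0d0 c4t1d0 c4t2d0

    # one filesystem per user, each with its own quota, shared over NFS
    zfs create tank/home
    for u in alice bob carol; do
        zfs create tank/home/$u
        zfs set quota=10g tank/home/$u
        zfs set sharenfs=on tank/home/$u
    done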
On Fri, 2007-01-26 at 06:16 -0800, Jeffery Malloch wrote:
> Hi Folks,
>
> I am currently in the midst of setting up a completely new file server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) connected to an Engenio 6994 product (I work for LSI Logic so Engenio is a no brainer). I have configured a couple of zpools from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I then created sub zfs systems below that and set quotas and sharenfs'd them so that it appears that these "file systems" are dynamically shrinkable and growable. It looks very good... I can see the correct file system sizes on all types of machines (Linux 32/64bit and of course Solaris boxes) and if I resize the quota it's picked up in NFS right away. But I would be the first in our organization to use this in an enterprise system so I definitely have some concerns that I'm hoping someone here can address.
>
> 1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares

That partly defeats the purpose of ZFS. ZFS offers raid-z and raid-z2 (double parity) with all the advantages of raid-5 or raid-6 but without several of the raid-5 issues. It also has features that a raid-5 controller could never provide: ensuring data integrity from the kernel to the disk, and self-correction.

> and the write cache (8GB) has battery backup, so I'm not too concerned from a hardware side.

Whereas the cache/battery backup is a requirement if you run raid-5, it is not for ZFS.

> I'm looking for an idea of how stable ZFS itself is in terms of corruptibility, uptime and OS stability.

Since Solaris 10 U3, it is rock solid. No issue here. 1.3TB or so currently assigned in FC drives, in production without any issues. We switched after losing some data with hardware mirroring. Our sysadmin is ecstatic with ZFS. Some of the filesystems have compression enabled, and that even increases throughput if you have the CPU/RAM available.

> 2. Recommended config.

The most reliable setup is a JBOD + zfs. But if you have cache on your box, there might be some magic setup you have to do for that box, and I'm sure somebody on the list will help you with that. I don't have an Engenio.

Francois
On Fri, 2007-01-26 at 06:16 -0800, Jeffery Malloch wrote:
> 2. Recommended config.

1) Since this is a system that many users will depend on, use zfs-managed redundancy, either mirroring or raid-z, between the LUNs exported by the storage system. You may think your storage system is perfect, but are you sure? With a non-redundant zfs, over time, you'll know for sure - but you might find this out at a very inconvenient time. With zfs-managed redundancy, if bit rot happens, you have an excellent chance of slogging through without any application-visible impact.

2) Enable compression. For the software development workloads I'm seeing, this generally recovers the space lost to redundancy.

					- Bill
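As an illustration of both points, here is a sketch assuming three LUNs exported by the array (the pool and device names are hypothetical):

    # zfs-managed redundancy across the array's LUNs: raid-z here,
    # "mirror" would work the same way
    zpool create tank raidz c5t0d0 c5t1d0 c5t2d0

    # compression often wins back the space given up to redundancy
    zfs set compression=on tank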
I've used ZFS since July/August 2006, when Sol 10 Update 2 came out (the first release to integrate ZFS). I've used it extensively on three servers (an E25K domain and 2 E2900s); two of them are production. I've had over 3TB of storage from an EMC SAN under ZFS management for no less than 6 months. Like your configuration, we've deferred data redundancy to the SAN. My observations are:

1. ZFS is stable to a very large extent. There are two known issues that I'm aware of:

   a. You can end up in an endless 'reboot' cycle when you have a corrupt zpool. I came across this when I had data corruption due to an HBA mismatch with the EMC SAN. This mismatch injected data corruption in transit and the EMC faithfully wrote the bad data; upon reading this bad data, ZFS threw up all over the floor for that pool. There is a documented workaround to snap out of the 'reboot' cycle; I've not checked if this is fixed in 11/06 update 3.

   b. Your server will hang when one of the underlying disks disappears. In our case we had a T2000 running 11/06 and had a mirrored zpool against two internal drives. When we pulled one of the drives abruptly the server simply hung. I believe this is a known bug; workaround?

2. When you have I/O operations that either request fsync or open files with the O_DSYNC option, coupled with high I/O, ZFS will choke. It won't crash, but the filesystem I/O runs like molasses on a cold morning.

All my feedback is based on Solaris 10 Update 2 (aka 06/06) and I've no comments on NFS. I strongly recommend that you use ZFS data redundancy (z1, z2, or mirror) and simply delegate the Engenio to stripe the data for performance.

This message posted from opensolaris.org
On Fri, Jan 26, 2007 at 08:06:46AM -0800, Anantha N. Srirama wrote:
>
> b. Your server will hang when one of the underlying disks disappears. In our case we had a T2000 running 11/06 and had a mirrored zpool against two internal drives. When we pulled one of the drives abruptly the server simply hung. I believe this is a known bug; workaround?

This was just covered here, and it looks like the fix will make it into U4 (I think it's in snv_48?). The workaround is to do a 'zpool offline' whenever possible before removing a disk. Yes, this is not always possible (in the case of disk death), but it will help in some situations.

I can't wait for U4. :)

-brian
--
"The reason I don't use Gnome: every single other window manager I know of is very powerfully extensible, where you can switch actions to different mouse buttons. Guess which one is not, because it might confuse the poor users? Here's a hint: it's not the small and fast one."  --Linus
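The workaround looks roughly like this (the pool and device names are made up):

    # take the disk out of service before physically pulling it
    zpool offline tank c1t3d0

    # after swapping in a new drive at the same location, resilver it
    zpool replace tank c1t3d0
    zpool status tank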
ZFS Rule #0: You gotta have redundancy
ZFS Rule #1: Redundancy shall be managed by zfs, and zfs alone.

Whatever you have, junk it. Let ZFS manage mirroring and redundancy. ZFS doesn't forgive even single bit errors!

This message posted from opensolaris.org
Oh yep, I know that "churning" feeling in the stomach that there's got to be a GOTCHA somewhere... it can't be *that* simple!

This message posted from opensolaris.org
On Fri, Jan 26, 2007 at 09:33:40AM -0800, Akhilesh Mritunjai wrote:
> ZFS Rule #0: You gotta have redundancy
> ZFS Rule #1: Redundancy shall be managed by zfs, and zfs alone.
>
> Whatever you have, junk it. Let ZFS manage mirroring and redundancy. ZFS doesn't forgive even single bit errors!

How does this work in an environment with storage that's centrally-managed and shared between many servers? I'm putting together a new IMAP server that will eventually use 3TB of space from our Netapp via an iSCSI SAN. The Netapp provides all of the disk management and redundancy that I'll ever need. The server will only see a virtual disk (a LUN). I want to use ZFS on that LUN because it's superior to UFS in this application, even without the redundancy. There's no way to get the Netapp to behave like a JBOD. Are you saying that this configuration isn't going to work?

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
On Fri, 26 Jan 2007, Gary Mills wrote:
> no way to get the Netapp to behave like a JBOD. Are you saying that
> this configuration isn't going to work?

It'll work, but it may not be optimal.

--
Rich Teer, SCSA, SCNA, SCSECA, OpenSolaris CAB member

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich
On Jan 26, 2007, at 9:42, Gary Mills wrote:
> How does this work in an environment with storage that's centrally-
> managed and shared between many servers? I'm putting together a new
> IMAP server that will eventually use 3TB of space from our Netapp via
> an iSCSI SAN. The Netapp provides all of the disk management and
> redundancy that I'll ever need. The server will only see a virtual
> disk (a LUN). I want to use ZFS on that LUN because it's superior
> to UFS in this application, even without the redundancy. There's
> no way to get the Netapp to behave like a JBOD. Are you saying that
> this configuration isn't going to work?

It will work, but if the storage system corrupts the data, ZFS will be unable to correct it. It will detect the error.

A number that I've been quoting, albeit without a good reference, comes from Jim Gray, who has been around the data-management industry for longer than I have (and I've been in this business since 1970); he's currently at Microsoft. Jim says that the controller/drive subsystem writes data to the wrong sector of the drive without notice about once per drive per year. In a 400-drive array, that's once a day. ZFS will detect this error when the file is read (one of the blocks' checksum will not match). But it can only correct the error if it manages the redundancy.

I would suggest exporting two LUNs from your central storage and let ZFS mirror them. You can get a wider range of space/performance tradeoffs if you give ZFS a JBOD, but that doesn't sound like an option.

--Ed
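Ed's suggestion in command form, as a sketch only (the pool and LUN names are hypothetical); each side of the mirror is a LUN exported by the array:

    # mirror two array LUNs; when a block fails its checksum, ZFS can
    # repair it from the other side of the mirror
    zpool create imappool mirror c2t0d0 c2t1d0
    zpool status imappool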
On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
> On Jan 26, 2007, at 9:42, Gary Mills wrote:
> > How does this work in an environment with storage that's centrally-
> > managed and shared between many servers?
>
> It will work, but if the storage system corrupts the data, ZFS will be
> unable to correct it. It will detect the error.
>
> A number that I've been quoting, albeit without a good reference, comes
> from Jim Gray, who has been around the data-management industry for
> longer than I have (and I've been in this business since 1970); he's
> currently at Microsoft. Jim says that the controller/drive subsystem
> writes data to the wrong sector of the drive without notice about once
> per drive per year. In a 400-drive array, that's once a day. ZFS will
> detect this error when the file is read (one of the blocks' checksum
> will not match). But it can only correct the error if it manages the
> redundancy.

Our Netapp does double-parity RAID. In fact, the filesystem design is remarkably similar to that of ZFS. Wouldn't that also detect the error? I suppose it depends if the 'wrong sector without notice' error is repeated each time. Or is it random?

> I would suggest exporting two LUNs from your central storage and let
> ZFS mirror them. You can get a wider range of space/performance
> tradeoffs if you give ZFS a JBOD, but that doesn't sound like an
> option.

That would double the amount of disk that we'd require. I am actually planning on using two iSCSI LUNs and letting ZFS stripe across them. When we need to expand the ZFS pool, I'd like to just expand the two LUNs on the Netapp. If ZFS won't accommodate that, I can just add a couple more LUNs. This is all convenient and easily manageable.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Gary Mills wrote:
> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>> On Jan 26, 2007, at 9:42, Gary Mills wrote:
>>> How does this work in an environment with storage that's centrally-
>>> managed and shared between many servers?
>> It will work, but if the storage system corrupts the data, ZFS will be
>> unable to correct it. It will detect the error.
>>
>> A number that I've been quoting, albeit without a good reference, comes
>> from Jim Gray, who has been around the data-management industry for
>> longer than I have (and I've been in this business since 1970); he's
>> currently at Microsoft. Jim says that the controller/drive subsystem
>> writes data to the wrong sector of the drive without notice about once
>> per drive per year. In a 400-drive array, that's once a day. ZFS will
>> detect this error when the file is read (one of the blocks' checksum
>> will not match). But it can only correct the error if it manages the
>> redundancy.

The quote from Jim seems to be related to the leaves of the tree (disks). Anecdotally, now that we have ZFS at the trunk, we're seeing that the branches are also corrupting data. We've speculated that it would occur, but now we can measure it, and it is non-zero. See Anantha's post for one such anecdote.

> Our Netapp does double-parity RAID. In fact, the filesystem design is
> remarkably similar to that of ZFS. Wouldn't that also detect the
> error? I suppose it depends if the 'wrong sector without notice'
> error is repeated each time. Or is it random?

We're having a debate related to this, data would be appreciated :-)
Do you get small, random read performance equivalent to N-2 spindles for an N-way double-parity volume?

>> I would suggest exporting two LUNs from your central storage and let
>> ZFS mirror them. You can get a wider range of space/performance
>> tradeoffs if you give ZFS a JBOD, but that doesn't sound like an
>> option.
>
> That would double the amount of disk that we'd require. I am actually
> planning on using two iSCSI LUNs and letting ZFS stripe across them.
> When we need to expand the ZFS pool, I'd like to just expand the two
> LUNs on the Netapp. If ZFS won't accommodate that, I can just add a
> couple more LUNs. This is all convenient and easily manageable.

Sounds reasonable to me :-)
 -- richard
Gary Mills wrote:
> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>> On Jan 26, 2007, at 9:42, Gary Mills wrote:
>>> How does this work in an environment with storage that's centrally-
>>> managed and shared between many servers?
>>
>> It will work, but if the storage system corrupts the data, ZFS will be
>> unable to correct it. It will detect the error.
>>
>> A number that I've been quoting, albeit without a good reference, comes
>> from Jim Gray, who has been around the data-management industry for
>> longer than I have (and I've been in this business since 1970); he's
>> currently at Microsoft. Jim says that the controller/drive subsystem
>> writes data to the wrong sector of the drive without notice about once
>> per drive per year. In a 400-drive array, that's once a day. ZFS will
>> detect this error when the file is read (one of the blocks' checksum
>> will not match). But it can only correct the error if it manages the
>> redundancy.
>
> Our Netapp does double-parity RAID. In fact, the filesystem design is
> remarkably similar to that of ZFS. Wouldn't that also detect the
> error? I suppose it depends if the 'wrong sector without notice'
> error is repeated each time.

If the wrong block is written by the controller then you're out of luck. The filesystem would read the incorrect block and ... who knows. That's why the ZFS checksums are important.
Wade.Stuart at fallon.com
2007-Jan-26 20:20 UTC
[zfs-discuss] Re: ZFS or UFS - what to do?
zfs-discuss-bounces at opensolaris.org wrote on 01/26/2007 01:43:35 PM:

> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
> > On Jan 26, 2007, at 9:42, Gary Mills wrote:
> > > How does this work in an environment with storage that's centrally-
> > > managed and shared between many servers?
> >
> > It will work, but if the storage system corrupts the data, ZFS will be
> > unable to correct it. It will detect the error.
> >
> > A number that I've been quoting, albeit without a good reference, comes
> > from Jim Gray, who has been around the data-management industry for
> > longer than I have (and I've been in this business since 1970); he's
> > currently at Microsoft. Jim says that the controller/drive subsystem
> > writes data to the wrong sector of the drive without notice about once
> > per drive per year. In a 400-drive array, that's once a day. ZFS will
> > detect this error when the file is read (one of the blocks' checksum
> > will not match). But it can only correct the error if it manages the
> > redundancy.
>
> Our Netapp does double-parity RAID. In fact, the filesystem design is
> remarkably similar to that of ZFS. Wouldn't that also detect the
> error? I suppose it depends if the 'wrong sector without notice'
> error is repeated each time. Or is it random?

I do not know; WAFL and the other portions of the NetApp backend are never really described in much technical detail -- even getting real IOPS numbers from them seems to be a hassle. Much magic -- little meat. To me, ZFS has very well defined behavior and methodology (you can even read the source to verify specifics), and this allows you to _know_ what the weak points are. NetApp, EMC and other disk vendors may have financial incentives for allowing edge cases such as the write hole or bit rot (x errors per disk are acceptable losses; after x errors, consider replacing the disk - a cost/benefit analysis - will customers actually know a bit is flipped?). In EMC's case it is very common for a disk to have multiple read/write errors before EMC will swap it out; they even use a substantial portion of the disk for replacement and parity bits (outside of RAID), so they offset or postpone the replacement costs onto the customer.

The most detailed description of WAFL I was able to find last time I looked was:
http://www.netapp.com/library/tr/3002.pdf

> > I would suggest exporting two LUNs from your central storage and let
> > ZFS mirror them. You can get a wider range of space/performance
> > tradeoffs if you give ZFS a JBOD, but that doesn't sound like an
> > option.
>
> That would double the amount of disk that we'd require. I am actually
> planning on using two iSCSI LUNs and letting ZFS stripe across them.
> When we need to expand the ZFS pool, I'd like to just expand the two
> LUNs on the Netapp. If ZFS won't accommodate that, I can just add a
> couple more LUNs. This is all convenient and easily manageable.

If you do have bit errors coming from the NetApp, ZFS will find them but will not be able to correct them in this case.

> --
> -Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
On Jan 26, 2007, at 12:13, Richard Elling wrote:
> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>> A number that I've been quoting, albeit without a good reference,
>> comes from Jim Gray, who has been around the data-management industry
>> for longer than I have (and I've been in this business since 1970);
>> he's currently at Microsoft. Jim says that the controller/drive
>> subsystem writes data to the wrong sector of the drive without notice
>> about once per drive per year. In a 400-drive array, that's once a
>> day. ZFS will detect this error when the file is read (one of the
>> blocks' checksum will not match). But it can only correct the error
>> if it manages the redundancy.
>
> The quote from Jim seems to be related to the leaves of the tree (disks).
> Anecdotally, now that we have ZFS at the trunk, we're seeing that the
> branches are also corrupting data. We've speculated that it would occur,
> but now we can measure it, and it is non-zero. See Anantha's post for
> one such anecdote.

Actually, Jim was referring to everything but the trunk. He didn't specify where from the HBA to the drive the error actually occurs. I don't think it really matters. I saw him give a talk a few years ago at the Usenix FAST conference; that's where I got this information.

--Ed
Chad Leigh -- Shire.Net LLC
2007-Jan-26 20:48 UTC
[zfs-discuss] Re: ZFS or UFS - what to do?
On Jan 26, 2007, at 12:05 PM, Ed Gould wrote:
> I would suggest exporting two LUNs from your central storage and
> let ZFS mirror them. You can get a wider range of space/
> performance tradeoffs if you give ZFS a JBOD, but that doesn't
> sound like an option.

I am doing something similar on a lower-end scale. I am using 2 Areca RAID-6 controllers, each with an 8-disk raid plus 1 hot spare, equal to 1.7TB. ZFS is being used to mirror them. Battery backed, with ECC on controller cache of at least 1GB. I am in the process of building this now.

Chad

---
Chad Leigh -- Shire.Net LLC
Your Web App and Email hosting provider
chad at shire.net
Ed Gould wrote:
> On Jan 26, 2007, at 12:13, Richard Elling wrote:
>> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>>> A number that I've been quoting, albeit without a good reference,
>>> comes from Jim Gray, who has been around the data-management industry
>>> for longer than I have (and I've been in this business since 1970);
>>> he's currently at Microsoft. Jim says that the controller/drive
>>> subsystem writes data to the wrong sector of the drive without notice
>>> about once per drive per year. In a 400-drive array, that's once a
>>> day. ZFS will detect this error when the file is read (one of the
>>> blocks' checksum will not match). But it can only correct the error
>>> if it manages the redundancy.
>
> Actually, Jim was referring to everything but the trunk. He didn't
> specify where from the HBA to the drive the error actually occurs. I
> don't think it really matters. I saw him give a talk a few years ago at
> the Usenix FAST conference; that's where I got this information.

So this leaves me wondering how often the controller/drive subsystem reads data from the wrong sector of the drive without notice; is it symmetrical with respect to writing, and thus about once a drive/year, or are there factors which change this?

Dana
On Jan 26, 2007, at 12:52, Dana H. Myers wrote:
> So this leaves me wondering how often the controller/drive subsystem
> reads data from the wrong sector of the drive without notice; is it
> symmetrical with respect to writing, and thus about once a drive/year,
> or are there factors which change this?

My guess is that it would be symmetric, but I don't really know.

--Ed
Dana H. Myers wrote:
> Ed Gould wrote:
>
>> On Jan 26, 2007, at 12:13, Richard Elling wrote:
>>
>>> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>>>
>>>> A number that I've been quoting, albeit without a good reference,
>>>> comes from Jim Gray, who has been around the data-management industry
>>>> for longer than I have (and I've been in this business since 1970);
>>>> he's currently at Microsoft. Jim says that the controller/drive
>>>> subsystem writes data to the wrong sector of the drive without notice
>>>> about once per drive per year. In a 400-drive array, that's once a
>>>> day. ZFS will detect this error when the file is read (one of the
>>>> blocks' checksum will not match). But it can only correct the error
>>>> if it manages the redundancy.
>
>> Actually, Jim was referring to everything but the trunk. He didn't
>> specify where from the HBA to the drive the error actually occurs. I
>> don't think it really matters. I saw him give a talk a few years ago at
>> the Usenix FAST conference; that's where I got this information.
>
> So this leaves me wondering how often the controller/drive subsystem
> reads data from the wrong sector of the drive without notice; is it
> symmetrical with respect to writing, and thus about once a drive/year,
> or are there factors which change this?

It's not symmetrical. Often it's a firmware bug. Other times a spurious event causes one block to be read/written instead of another one. (Alpha particles, anyone?)
Torrey McMahon wrote:
> Dana H. Myers wrote:
>> Ed Gould wrote:
>>
>>> On Jan 26, 2007, at 12:13, Richard Elling wrote:
>>>
>>>> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>>>>
>>>>> A number that I've been quoting, albeit without a good reference,
>>>>> comes from Jim Gray, who has been around the data-management industry
>>>>> for longer than I have (and I've been in this business since 1970);
>>>>> he's currently at Microsoft. Jim says that the controller/drive
>>>>> subsystem writes data to the wrong sector of the drive without notice
>>>>> about once per drive per year. In a 400-drive array, that's once a
>>>>> day. ZFS will detect this error when the file is read (one of the
>>>>> blocks' checksum will not match). But it can only correct the error
>>>>> if it manages the redundancy.
>>
>>> Actually, Jim was referring to everything but the trunk. He didn't
>>> specify where from the HBA to the drive the error actually occurs. I
>>> don't think it really matters. I saw him give a talk a few years ago at
>>> the Usenix FAST conference; that's where I got this information.
>>
>> So this leaves me wondering how often the controller/drive subsystem
>> reads data from the wrong sector of the drive without notice; is it
>> symmetrical with respect to writing, and thus about once a drive/year,
>> or are there factors which change this?
>
> It's not symmetrical. Often it's a firmware bug. Other times a spurious
> event causes one block to be read/written instead of another one. (Alpha
> particles, anyone?)

I would tend to expect these spurious events to impact reads and writes equally; more specifically, the chance of any one read or write being mis-addressed is about the same. Since, AFAIK, there are many more reads from a disk typically than writes, this would seem to suggest that there would be more mis-addressed reads in a drive/year than mis-addressed writes. Is this the reason for the asymmetry?

(I'm sure waving my hands here)

Dana
On Jan 26, 2007, at 13:16, Dana H. Myers wrote:
> I would tend to expect these spurious events to impact reads and writes
> equally; more specifically, the chance of any one read or write being
> mis-addressed is about the same. Since, AFAIK, there are many more reads
> from a disk typically than writes, this would seem to suggest that there
> would be more mis-addressed reads in a drive/year than mis-addressed
> writes. Is this the reason for the asymmetry?

Jim's "once per drive per year" number was not very precise. I took it to be just one significant digit. I don't recall if he distinguished reads from writes.

--Ed
it would be good to have real data and not only guesses or anecdotes

this story about wrong blocks being written by RAID controllers sounds like the anti-terrorism propaganda we are living in: exaggerate the facts to catch everyone's attention. It's going to take more than that to prove RAID ctrls have been doing a bad job for the last 30 years. Let's make up real stories with hard facts first.

s.

On 1/26/07, Ed Gould <Ed.Gould at sun.com> wrote:
> On Jan 26, 2007, at 12:52, Dana H. Myers wrote:
> > So this leaves me wondering how often the controller/drive subsystem
> > reads data from the wrong sector of the drive without notice; is it
> > symmetrical with respect to writing, and thus about once a drive/year,
> > or are there factors which change this?
>
> My guess is that it would be symmetric, but I don't really know.
>
> --Ed
On Jan 26, 2007, at 13:29, Selim Daoud wrote:
> it would be good to have real data and not only guesses or anecdotes

Yes, I agree. I'm sorry I don't have the data that Jim presented at FAST, but he did present actual data. Richard Elling (I believe it was Richard) has also posted some related data from ZFS experience to this list. There is more than just anecdotal evidence for this.

--Ed
> From: zfs-discuss-bounces at opensolaris.org
> [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Ed Gould
> Sent: Friday, January 26, 2007 3:38 PM
>
> Yes, I agree. I'm sorry I don't have the data that Jim presented at
> FAST, but he did present actual data. Richard Elling (I believe it
> was Richard) has also posted some related data from ZFS experience to this
> list.

This seems to be from Jim and on point:

http://www.usenix.org/event/fast05/tech/gray.pdf

paul
On Jan 26, 2007, at 13:53, Paul Fisher wrote:
> This seems to be from Jim and on point:
>
> http://www.usenix.org/event/fast05/tech/gray.pdf

Yes, thanks. That's the talk I was referring to. There's a reference in it to a Microsoft tech report with measurement data.

--Ed
Dana H. Myers wrote:
> Torrey McMahon wrote:
>> Dana H. Myers wrote:
>>> Ed Gould wrote:
>>>> On Jan 26, 2007, at 12:13, Richard Elling wrote:
>>>>> On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>>>>>> A number that I've been quoting, albeit without a good reference,
>>>>>> comes from Jim Gray, who has been around the data-management industry
>>>>>> for longer than I have (and I've been in this business since 1970);
>>>>>> he's currently at Microsoft. Jim says that the controller/drive
>>>>>> subsystem writes data to the wrong sector of the drive without notice
>>>>>> about once per drive per year. In a 400-drive array, that's once a
>>>>>> day. ZFS will detect this error when the file is read (one of the
>>>>>> blocks' checksum will not match). But it can only correct the error
>>>>>> if it manages the redundancy.
>>>
>>>> Actually, Jim was referring to everything but the trunk. He didn't
>>>> specify where from the HBA to the drive the error actually occurs. I
>>>> don't think it really matters. I saw him give a talk a few years ago at
>>>> the Usenix FAST conference; that's where I got this information.
>>>
>>> So this leaves me wondering how often the controller/drive subsystem
>>> reads data from the wrong sector of the drive without notice; is it
>>> symmetrical with respect to writing, and thus about once a drive/year,
>>> or are there factors which change this?
>>
>> It's not symmetrical. Often it's a firmware bug. Other times a spurious
>> event causes one block to be read/written instead of another one. (Alpha
>> particles, anyone?)
>
> I would tend to expect these spurious events to impact reads and writes
> equally; more specifically, the chance of any one read or write being
> mis-addressed is about the same. Since, AFAIK, there are many more reads
> from a disk typically than writes, this would seem to suggest that there
> would be more mis-addressed reads in a drive/year than mis-addressed
> writes. Is this the reason for the asymmetry?
>
> (I'm sure waving my hands here)

For the spurious events, yes, I would expect things to be impacted symmetrically when it comes to errors during reads and errors during writes - that is, if you could figure out which spurious event occurred. In most cases the spurious errors are caught only at read time and you're left wondering. Was it an incorrect read? Was the data written incorrectly? You end up throwing your hands up and saying, "Let's hope that doesn't happen again." It's much easier to unearth a firmware bug in a particular disk drive operating in certain conditions and fix it.

Now that we're checksumming things I'd expect to find more errors, and hopefully be in a position to fix them, than we have in the past. We will also start getting customer complaints like, "We moved to ZFS and now we are seeing media errors more often. Why is ZFS broken?" This is similar to the StorADE issues we had in NWS - ahhh, the good old days - when we started doing a much better job of discovering issues and reporting them, where in the past we were blissfully silent. We used to have some data on that, with nice graphs, but I can't find them lying about.
Hi Jeff,

We're running a FLX210, which I believe is an Engenio 2884. In our case it also is attached to a T2000. ZFS has run VERY stably for us with data integrity issues at all.

We did have a significant latency problem caused by ZFS flushing the write cache on the array after every write, but that can be fixed by configuring your array to ignore cache flushes. The instructions for Engenio products are here: http://blogs.digitar.com/jjww/?itemid=44

We use the config for a production database, so I can't speak to the NFS issues. All I would mention is to watch the RAM consumption by ZFS.

Does anyone on the list have a recommendation for ARC sizing with NFS?

Best Regards,
Jason
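On the ARC-sizing question: one commonly mentioned approach is to cap the ARC with the zfs_arc_max tunable in /etc/system; whether the tunable is available, and what value makes sense, depends on the Solaris release, and the 4 GB figure below is only an illustration, not a recommendation:

    * /etc/system -- cap the ZFS ARC at 4 GB (value in bytes; takes effect at boot)
    set zfs:zfs_arc_max = 0x100000000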
Correction: "ZFS has run VERY stably for us with data integrity issues at all." should read "ZFS has run VERY stably for us with NO data integrity issues at all." On 1/26/07, Jason J. W. Williams <jasonjwwilliams at gmail.com> wrote:> Hi Jeff, > > We''re running a FLX210 which I believe is an Engenio 2884. In our case > it also is attached to a T2000. ZFS has run VERY stably for us with > data integrity issues at all. > > We did have a significant latency problem caused by ZFS flushing the > write cache on the array after every write, but that can be fixed by > configuring your array to ignore cache flushes. The instructions for > Engenio products are here: http://blogs.digitar.com/jjww/?itemid=44 > > We use the config for a production database, so I can''t speak to the > NFS issues. All I would mention is to watch the RAM consumption by > ZFS. > > Does anyone on the list have a recommendation for ARC sizing with NFS? > > Best Regards, > Jason > > > On 1/26/07, Jeffery Malloch <jeffery.malloch at lsi.com> wrote: > > Hi Folks, > > > > I am currently in the midst of setting up a completely new file server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) connected to an Engenio 6994 product (I work for LSI Logic so Engenio is a no brainer). I have configured a couple of zpools from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I then created sub zfs systems below that and set quotas and sharenfs''d them so that it appears that these "file systems" are dynamically shrinkable and growable. It looks very good... I can see the correct file system sizes on all types of machines (Linux 32/64bit and of course Solaris boxes) and if I resize the quota it''s picked up in NFS right away. But I would be the first in our organization to use this in an enterprise system so I definitely have some concerns that I''m hoping someone here can address. > > > > 1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares and write cache (8GB) has battery backup so I''m not too concerned from a hardware side. I''m looking for an idea of how stable ZFS itself is in terms of corruptability, uptime and OS stability. > > > > 2. Recommended config. Above, I have a fairly simple setup. In many of the examples the granularity is home directory level and when you have many many users that could get to be a bit of a nightmare administratively. I am really only looking for high level dynamic size adjustability and am not interested in its built in RAID features. But given that, any real world recommendations? > > > > 3. Caveats? Anything I''m missing that isn''t in the docs that could turn into a BIG gotchya? > > > > 4. Since all data access is via NFS we are concerned that 32 bit systems (Mainly Linux and Windows via Samba) will not be able to access all the data areas of a 2TB+ zpool even if the zfs quota on a particular share is less then that. Can anyone comment? > > > > The bottom line is that with anything new there is cause for concern. Especially if it hasn''t been tested within our organization. But the convenience/functionality factors are way too hard to ignore. > > > > Thanks, > > > > Jeff > > > > > > This message posted from opensolaris.org > > _______________________________________________ > > zfs-discuss mailing list > > zfs-discuss at opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > >
On Fri, Jan 26, 2007 at 11:05:17AM -0800, Ed Gould wrote:
>
> A number that I've been quoting, albeit without a good reference, comes
> from Jim Gray, who has been around the data-management industry for
> longer than I have (and I've been in this business since 1970); he's
> currently at Microsoft. Jim says that the controller/drive subsystem
> writes data to the wrong sector of the drive without notice about once
> per drive per year. In a 400-drive array, that's once a day. ZFS will
> detect this error when the file is read (one of the blocks' checksum
> will not match). But it can only correct the error if it manages the
> redundancy.

My only qualification to enter this discussion is that I once wrote a floppy disk format program for Minix. I recollect, however, that each sector on the disk is accompanied by a block that contains the sector address and a CRC. In order to write to the wrong sector, both of these items would have to be read incorrectly. Otherwise, the controller would never find the wrong sector. Are we just talking about a CRC failure here? That would be random, but the frequency of CRC errors would depend on the signal quality.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
On 26-Jan-07, at 7:29 PM, Selim Daoud wrote:
> it would be good to have real data and not only guesses or anecdotes
>
> this story about wrong blocks being written by RAID controllers
> sounds like the anti-terrorism propaganda we are living in: exaggerate
> the facts to catch everyone's attention. It's going to take more than
> that to prove RAID ctrls have been doing a bad job for the last 30 years

It does happen. Hard numbers are available if you look. This sounds a bit like the "RAID expert" I bumped into who just couldn't see the paradigm had shifted under him -- the implications of "end to end".

> Let's make up real stories with hard facts first.
> s.

Related links:

https://www.gelato.unsw.edu.au/archives/comp-arch/2006-September/003008.html

http://www.lockss.org/locksswiki/files/3/30/Eurosys2006.pdf
[A Fresh Look at the Reliability of Long-term Digital Storage, 2006]

http://www.ecsl.cs.sunysb.edu/tr/rpe19.pdf
[Challenges of Long-Term Digital Archiving: A Survey, 2006]

http://www.cs.wisc.edu/~vijayan/vijayan-thesis.pdf
[IRON File Systems, 2006]

http://www.tcs.hut.fi/~hhk/phd/phd_Hannu_H_Kari.pdf
[Latent Sector Faults and Reliability of Disk Arrays, 1997]

--T

> On 1/26/07, Ed Gould <Ed.Gould at sun.com> wrote:
>> On Jan 26, 2007, at 12:52, Dana H. Myers wrote:
>>> So this leaves me wondering how often the controller/drive subsystem
>>> reads data from the wrong sector of the drive without notice; is it
>>> symmetrical with respect to writing, and thus about once a drive/year,
>>> or are there factors which change this?
>>
>> My guess is that it would be symmetric, but I don't really know.
>>
>> --Ed
> My only qualification to enter this discussion is that I once wrote a
> floppy disk format program for minix. I recollect, however, that each
> sector on the disk is accompanied by a block that contains the sector
> address and a CRC.

You'd have to define the layer you're talking about. I presume something like this occurs between a dumb disk and an intelligent controller, or even within the encoding parameters of a disk, but I don't think it does between, say, a SCSI/FC controller and a disk.

So if the drive itself put the head in the wrong sector, maybe it could figure that out. But perhaps the SCSI controller had a bug and sent the wrong address to the drive. I don't think there's anything at that layer that would notice (unless the application/file system is encoding intent into the data).

Corrections about my assumption with SCSI/FC/ATA appreciated.

--
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
Toby Thain wrote:
>
> On 26-Jan-07, at 7:29 PM, Selim Daoud wrote:
>
>> it would be good to have real data and not only guesses or anecdotes
>>
>> this story about wrong blocks being written by RAID controllers
>> sounds like the anti-terrorism propaganda we are living in: exaggerate
>> the facts to catch everyone's attention. It's going to take more than
>> that to prove RAID ctrls have been doing a bad job for the last 30 years
>
> It does happen. Hard numbers are available if you look. This sounds a
> bit like the "RAID expert" I bumped into who just couldn't see the
> paradigm had shifted under him -- the implications of "end to end".

It happens. As long as we look at the numbers in context and don't run around going, "Hey... have you seen these numbers? What have we been doing for the last 35 years!?!?" we're OK.
> A number that I've been quoting, albeit without a good reference, comes
> from Jim Gray, who has been around the data-management industry for
> longer than I have (and I've been in this business since 1970); he's
> currently at Microsoft. Jim says that the controller/drive subsystem
> writes data to the wrong sector of the drive without notice about once
> per drive per year. In a 400-drive array, that's once a day. ZFS will
> detect this error when the file is read (one of the blocks' checksum
> will not match). But it can only correct the error if it manages the
> redundancy.

So now with ZFS, can anyone with a 400-drive array confirm that a "scrub" has to fix roughly one problem a day? (Or modify appropriately for whatever number of drives.)

--
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller at infidyne.com>'
Key retrieval: Send an E-Mail to getpgpkey at scode.org
E-Mail: peter.schuller at infidyne.com Web: http://www.scode.org
> 1. How stable is ZFS?

It's a new file system; there will be bugs. It appears to be well-tested, though. There are a few known issues; for instance, a write failure can panic the system under some circumstances. UFS has known issues too....

> 2. Recommended config. Above, I have a fairly
> simple setup. In many of the examples the
> granularity is home directory level and when you have
> many many users that could get to be a bit of a
> nightmare administratively.

Do you need user quotas? If so, you need a file system per user with ZFS. That may be an argument against it in some environments, but in my experience it tends to be more important in academic settings than corporations.

> 4. Since all data access is via NFS we are concerned
> that 32 bit systems (Mainly Linux and Windows via
> Samba) will not be able to access all the data areas
> of a 2TB+ zpool even if the zfs quota on a particular
> share is less than that. Can anyone comment?

Not a problem. NFS doesn't really deal with volumes, just files, so the offsets are always file-relative and the volume can be as large as desired.

Anton

This message posted from opensolaris.org
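One way to sanity-check the 32-bit-client concern, as a rough sketch (the server, share and mount-point names are hypothetical):

    # on the server: cap the shared filesystem well below the pool size
    zfs set quota=1t tank/projects
    zfs set sharenfs=on tank/projects

    # on a 32-bit Linux client: the mount should report the quota,
    # not the multi-terabyte pool, as the filesystem size
    mount -t nfs server:/tank/projects /mnt/projects
    df -h /mnt/projects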
Selim Daoud wrote:
> it would be good to have real data and not only guesses or anecdotes
>
> this story about wrong blocks being written by RAID controllers
> sounds like the anti-terrorism propaganda we are living in: exaggerate
> the facts to catch everyone's attention. It's going to take more than
> that to prove RAID ctrls have been doing a bad job for the last 30 years.
> Let's make up real stories with hard facts first.

I have actual hard data, and bitter experience (from support calls), to back up the allegations that RAID controllers can and do write bad blocks. No, I cannot and will not provide specifics - I signed an NDA which expressly deals with confidentiality of customer information.

What I can say is that if we'd had ZFS to manage the filesystems in question, not only would we have detected the problem much earlier, but the flow-on effect to the end-users would have been much more easily managed.

James C. McPherson
--
Solaris kernel software engineer, system admin and troubleshooter
              http://www.jmcp.homeunix.com/blog
Find me on LinkedIn @ http://www.linkedin.com/in/jamescmcpherson
On Jan 26, 2007, at 14:05, Ed Gould wrote:
> It will work, but if the storage system corrupts the data, ZFS will
> be unable to correct it. It will detect the error.

Unless you turn checksumming off. From zfs(1M):

     checksum=on | off | fletcher2 | fletcher4 | sha256

         Controls the checksum used to verify data integrity. The
         default value is "on", which automatically selects an
         appropriate algorithm (currently, fletcher2, but this may
         change in future releases). The value "off" disables
         integrity checking on user data. Disabling checksums is
         NOT a recommended practice.
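For reference, a sketch of how the checksum setting and a verification pass are driven from the command line (the pool name is hypothetical); a scrub re-reads every allocated block and checks it against its checksum:

    # checksums are on by default; sha256 is a stronger (more expensive) choice
    zfs set checksum=sha256 tank

    # walk the whole pool and verify every block against its checksum
    zpool scrub tank
    zpool status -v tank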
On Jan 26, 2007, at 14:43, Gary Mills wrote:
> Our Netapp does double-parity RAID. In fact, the filesystem design is
> remarkably similar to that of ZFS. Wouldn't that also detect the
> error? I suppose it depends if the 'wrong sector without notice'
> error is repeated each time. Or is it random?

On most (all?) other systems the parity only comes into effect when a drive fails. When all the drives are reporting "OK", most (all?) RAID systems don't use the parity data at all. ZFS is the first (only?) system that actively checks the data returned from disk, regardless of whether the drives are reporting they're okay or not.

I'm sure I'll be corrected if I'm wrong. :)
I'm not sure what benefit you foresee in running a COW filesystem (ZFS) on a COW array (NetApp).

Back to regularly scheduled programming: I still say you should let ZFS manage JBOD-type storage. I can personally recount the horror of relying upon an intelligent storage array (an EMC DMX3500 in our case). We had in-flight data corruption that EMC faithfully wrote, just like NetApp would in your case. Everybody is assuming that corruption or data loss occurs only on disks; it can happen everywhere. In a datacenter SAN you've so many more paths that can introduce data corruption. Hence the need for ensuring data integrity closest to the use of data, namely ZFS.

ZFS will not stop alpha particle induced memory corruption after data has been received by the server and verified to be correct. Sadly I've been hit with that as well.

This message posted from opensolaris.org
On 27-Jan-07, at 10:15 PM, Anantha N. Srirama wrote:
> We had in-flight data corruption that EMC faithfully wrote, just
> like NetApp would in your case. Everybody is assuming that
> corruption or data loss occurs only on disks; it can happen
> everywhere. In a datacenter SAN you've so many more paths that can
> introduce data corruption. Hence the need for ensuring data
> integrity closest to the use of data, namely ZFS.

Now how do we get this message out there and understood, fellow evangelicals? :)

--Toby

> ZFS will not stop alpha particle induced memory corruption after
> data has been received by the server and verified to be correct. Sadly
> I've been hit with that as well.
On 27-Jan-07, at 10:15 PM, Anantha N. Srirama wrote:
> ... ZFS will not stop alpha particle induced memory corruption
> after data has been received by the server and verified to be correct.
> Sadly I've been hit with that as well.

My brother points out that you can use a rad hardened CPU. ECC should take care of the RAM. :-)

I wonder when the former will become data centre best practice?

--Toby
On Sat, Jan 27, 2007 at 04:15:30PM -0800, Anantha N. Srirama wrote:
>
> I'm not sure what benefit you foresee in running a COW filesystem
> (ZFS) on a COW array (NetApp).

Assuming that that question was addressed to me, the primary feature that I need from ZFS is snapshots. The Netapp has snapshots too, but they are done by disk blocks since, for an iSCSI LUN, the Netapp has no concept of files. ZFS snapshots allow restore of individual files when users accidentally delete them. As well, I do need a filesystem of some sort on the iSCSI LUN. If ZFS is superior to UFS in this application, I'd like to use it.

--
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
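A sketch of that single-file restore workflow (the filesystem, snapshot and file names here are hypothetical):

    # periodic snapshot of the mail filesystem
    zfs snapshot tank/imap@2007-01-28

    # a deleted file can be copied back out of the read-only snapshot,
    # which is reachable under the hidden .zfs directory
    cp /tank/imap/.zfs/snapshot/2007-01-28/users/jdoe/mbox \
       /tank/imap/users/jdoe/mbox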
Casper.Dik at Sun.COM
2007-Jan-28 09:59 UTC
[zfs-discuss] Re: Re: ZFS or UFS - what to do?
>On 27-Jan-07, at 10:15 PM, Anantha N. Srirama wrote:
>
>> ... ZFS will not stop alpha particle induced memory corruption
>> after data has been received by the server and verified to be correct.
>> Sadly I've been hit with that as well.
>
>My brother points out that you can use a rad hardened CPU. ECC should
>take care of the RAM. :-)
>
>I wonder when the former will become data centre best practice?

Alpha particles which "hit" CPUs must have their origin inside said CPU.

(Alpha particles do not penetrate skin or paper, let alone system cases or CPU packaging.)

Casper
Casper.Dik at Sun.COM wrote:
>> On 27-Jan-07, at 10:15 PM, Anantha N. Srirama wrote:
>>
>>> ... ZFS will not stop alpha particle induced memory corruption
>>> after data has been received by the server and verified to be correct.
>>> Sadly I've been hit with that as well.
>>
>> My brother points out that you can use a rad hardened CPU. ECC should
>> take care of the RAM. :-)
>>
>> I wonder when the former will become data centre best practice?
>
> Alpha particles which "hit" CPUs must have their origin inside said CPU.
>
> (Alpha particles do not penetrate skin or paper, let alone system cases
> or CPU packaging.)
>
> Casper

But, but, but, they'll get my brain without this nice shiny aluminum cap I made!

Cosmic (aka Gamma) Radiation, folks.

And, I think we've jumped the shark.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Casper.Dik at Sun.COM wrote:

> Alpha particles which "hit" CPUs must have their origin inside said CPU.
>
> (Alpha particles do not penetrate skin or paper, let alone system cases or CPU packaging.)

Gamma rays cannot be shielded in any sensible way.

Jörg

-- EMail: joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin js at cs.tu-berlin.de (uni) schilling at fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
On 28-Jan-07, at 7:59 AM, Casper.Dik at Sun.COM wrote:> >> >> On 27-Jan-07, at 10:15 PM, Anantha N. Srirama wrote: >> >>> ... ZFS will not stop alpha particle induced memory corruption >>> after data has been received by server and verified to be correct. >>> Sadly I''ve been hit with that as well. >> >> >> My brother points out that you can use a rad hardened CPU. ECC should >> take care of the RAM. :-) >> >> I wonder when the former will become data centre best practice? > > Alpha particles which "hit" CPUs must have their origin inside said > CPU. > > (Alpha particles do not penentrate skin, paper, let alone system cases > or CPU packagaging)Thanks. But what about cosmic rays? --T> > Casper
Casper.Dik at Sun.COM
2007-Jan-28 13:50 UTC
[zfs-discuss] Re: Re: ZFS or UFS - what to do?
>On 28-Jan-07, at 7:59 AM, Casper.Dik at Sun.COM wrote:
>
>> Alpha particles which "hit" CPUs must have their origin inside said CPU.
>>
>> (Alpha particles do not penetrate skin or paper, let alone system cases or CPU packaging.)
>
>Thanks. But what about cosmic rays?

I was just in pedantic mode; "cosmic rays" is the term covering all the different particles, including alpha, beta and gamma rays.

Alpha rays don't reach us from the "cosmos"; they are caught long before they can do any harm. Ditto beta rays. Both have an electrical charge that makes passing through magnetic fields or through materials difficult. Both do exist "in the free" but are commonly caused by slow radioactive decay of our natural environment.

Gamma rays are photons with high energy; they are not captured by magnetic fields (such as those in atoms, from electrons and protons). They need to take a direct hit before they're stopped; they can only be stopped by dense materials, such as lead. Unfortunately, naturally occurring lead is polluted by polonium and uranium and is an alpha/beta source in its own right. That's why 100-year-old lead from roofs is worth more money than new lead: its radioisotopes have been depleted.

Casper
Anantha N. Srirama
2007-Jan-28 14:19 UTC
[zfs-discuss] Re: Re: Re: ZFS or UFS - what to do?
You're right that storage-level snapshots are filesystem agnostic. I'm not sure why you believe you won't be able to restore individual files by using a NetApp snapshot. In the case of ZFS you'd take a periodic snapshot and use it to restore files; in the case of NetApp you can do the same (of course you have the additional step of mounting the new snapshot volume). Is this convenience tipping the scales for you to pursue ZFS?
On Sat, Jan 27, 2007 at 04:15:30PM -0800, Anantha N. Srirama wrote:

> I'm not sure what benefit you foresee by running a COW filesystem (ZFS) on a COW array (NetApp).

The application requires a filesystem with POSIX semantics. My first choice would be NFS from the Netapp, but this won't work in this case. My next choice is an iSCSI LUN with a local filesystem on it. I'm assuming that since ZFS is more modern than UFS, ZFS would be the better of the two, even though the JBOD-oriented features of ZFS will not be used.

ZFS does seem to be more manageable than UFS. Filesystems that draw their space from a common pool are ideal for our application. The ability to expand a pool by adding another device, or by extending an existing device, is also ideal. Another feature is snapshots, which I've mentioned earlier.

-- -Gary Mills- -Unix Support- -U of M Academic Computing and Networking-
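For what it's worth, growing a pool by adding another LUN later is a one-liner; a rough sketch with a made-up pool and device name:

  # zpool add tank c6t20d0     (the new LUN becomes another top-level stripe in the pool)
  # zpool list tank            (the extra capacity shows up immediately)

Every filesystem in the pool sees the new space right away, since they all draw from the common pool.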
On Sun, Jan 28, 2007 at 06:19:25AM -0800, Anantha N. Srirama wrote:

> You're right that storage-level snapshots are filesystem agnostic. I'm not sure why you believe you won't be able to restore individual files by using a NetApp snapshot. In the case of ZFS you'd take a periodic snapshot and use it to restore files; in the case of NetApp you can do the same (of course you have the additional step of mounting the new snapshot volume). Is this convenience tipping the scales for you to pursue ZFS?

Yes, we'd run out of LUNs. We're talking about two weeks of daily snapshots on six filesystems. Each snapshot on the Netapp would become a separate iSCSI LUN. They need to be mounted on the server so that our admins can locate and restore missing files when necessary.

-- -Gary Mills- -Unix Support- -U of M Academic Computing and Networking-
Hello Anantha, Friday, January 26, 2007, 5:06:46 PM, you wrote:

ANS> All my feedback is based on Solaris 10 Update 2 (aka 06/06) and I've no comments on NFS. I strongly recommend that you use ZFS data redundancy (z1, z2, or mirror) and simply delegate the Engenio to stripe the data for performance.

Striping on an array and then doing redundancy with ZFS has at least one drawback - what happens if one of the disks fails? You've got to replace the bad disk, re-create the stripe on the array, and resilver in ZFS (or stay on the hot spare). A lot of hassle.

-- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
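To put that in concrete terms, once the array has rebuilt the stripe (or you are ready to move off the hot spare), the ZFS side of the dance is roughly this (pool and device names made up):

  # zpool status -x                (shows the pool DEGRADED and which vdev is affected)
  # zpool replace tank c6t20d0     (resilvers onto the repaired or replacement LUN)
  # zpool status tank              (watch the resilver progress)

It works, it is just more steps than letting a single layer own the redundancy.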
Hello Francois, Friday, January 26, 2007, 4:09:43 PM, you wrote:

FD> On Fri, 2007-01-26 at 06:16 -0800, Jeffery Malloch wrote:
>> 1. How stable is ZFS? The Engenio box is completely configured for RAID5 with hot spares
FD> That partly defeats the purpose of ZFS. ZFS offers raid-z and raid-z2 (double parity) with all the advantages of raid-5 or raid-6 but without several of the raid-5 issues. It also has features that a raid-5 controller could never do: ensure data integrity from the kernel to the disk, and self correction.

Not always true. Actually you can get much more performance for some workloads doing raid-5 in HW than raid-z. Also, with some other entry-level arrays there are limits on how many LUNs can be presented, and you actually can't expose every disk as its own LUN because of that limit (yes, Sun's 3510).

>> and write cache (8GB) has battery backup so I'm not too concerned from a hardware side.
FD> Whereas the cache/battery backup is a requirement if you run raid-5, it is not for zfs.

Still, it doesn't mean it won't help for some workloads.

>> 2. Recommended config.
FD> The most reliable setup is a JBOD + zfs. But if you have cache, on your box, there might be some magic setup you have to do for that box, and I'm sure somebody on the list will help you with that. I dont have an Engenio.

I would argue with this. No matter what, you still get a less reliable setup using ZFS on top of a simple JBOD than with a Symmetrix box. It's just that in many cases the simple JBOD can be good enough. There's a workaround for Engenio devices.

-- Best regards, Robert mailto:rmilkowski at task.gda.pl http://milek.blogspot.com
Agreed, I guess I didn't articulate my point/thought very well. The best config is to present JBODs and let ZFS provide the data protection. This has been a very stimulating conversation thread; it is shedding new light on how best to use ZFS.
On January 28, 2007 7:57:31 PM -0800 "Anantha N. Srirama" <anantha.srirama at cdc.hhs.gov> wrote:> Agreed, I guess I didn''t articulate my point/thought very well. The best > config is to present JBoDs and let ZFS provide the data protection. This > has been a very stimulating conversation thread; it is shedding new light > into how to best use ZFS.Actually it depends on the workload. Best is a very loaded word. -frank
Anantha N. Srirama writes:

> Agreed, I guess I didn't articulate my point/thought very well. The best config is to present JBODs and let ZFS provide the data protection. This has been a very stimulating conversation thread; it is shedding new light on how best to use ZFS.

I would say: to enable the unique ZFS feature of self-healing, ZFS must be allowed to manage a level of redundancy: mirroring or raid-z. The type of LUNs used (JBOD/Raid-*/iSCSI) is not relevant to this statement.

Now, if one also relies on ZFS to reconstruct data in the face of disk failures (as opposed to storage-based reconstruction), better make sure that single/double disk failures do not bring down multiple LUNs at once. So better protection is achieved by configuring LUNs that map to segregated sets of physical things (disks & controllers).

-r
Hi All, In my test setup I have one zpool of size 1000 MB which has only 30 MB of free space (970 MB is used for something else). On this zpool I created one file (using the open() call) and attempted to write 2 MB of data to it (with the write() call), but it failed: only 1.3 MB was written (the return value of write()), because of "No space left on device". After that I tried to truncate this file to 1.3 MB, but that is failing too. Any clues on this? -Masthan
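A minimal way to poke at this kind of corner case is a throwaway pool backed by a file; a rough sketch (paths and sizes made up):

  # mkfile 128m /var/tmp/vdev0
  # zpool create testpool /var/tmp/vdev0
  # dd if=/dev/zero of=/testpool/fill bs=1024k    (runs until write() starts failing with ENOSPC)

One thing to keep in mind with a copy-on-write filesystem is that shrinking or deleting a file still has to write new metadata before the old blocks are freed, so a truncate on a completely full pool can itself fail with ENOSPC; whether that is what is happening here would need a truss of the failing ftruncate() to confirm.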
> > Our Netapp does double-parity RAID. In fact, the filesystem design is remarkably similar to that of ZFS. Wouldn't that also detect the error? I suppose it depends if the `wrong sector without notice' error is repeated each time. Or is it random?
>
> On most (all?) other systems the parity only comes into effect when a drive fails. When all the drives are reporting "OK" most (all?) RAID systems don't use the parity data at all. ZFS is the first (only?) system that actively checks the data returned from disk, regardless of whether the drives are reporting they're okay or not.
>
> I'm sure I'll be corrected if I'm wrong. :)

Netapp/OnTAP does do read verification, but it does it outside the raid-4/raid-dp protection (just like ZFS does it outside the raidz protection). So it's correct that the parity data is not read at all in either OnTAP or ZFS, but both attempt to do verification of the data on all reads.

See also: http://blogs.sun.com/bonwick/entry/zfs_end_to_end_data for a few more specifics on it and the differences from the ZFS data check.

-- Darren Dunham ddunham at taos.com Senior Technical Consultant TAOS http://www.taos.com/ Got some Dr Pepper? San Francisco, CA bay area < This line left intentionally blank to confuse you. >
On Jan 26, 2007, at 09:16, Jeffery Malloch wrote:> Hi Folks, > > I am currently in the midst of setting up a completely new file > server using a pretty well loaded Sun T2000 (8x1GHz, 16GB RAM) > connected to an Engenio 6994 product (I work for LSI Logic so > Engenio is a no brainer). I have configured a couple of zpools > from Volume groups on the Engenio box - 1x2.5TB and 1x3.75TB. I > then created sub zfs systems below that and set quotas and > sharenfs''d them so that it appears that these "file systems" are > dynamically shrinkable and growable.ah - the 6994 is the controller we use in the 6140/6540 if i''m not mistaken .. i guess this thread will go down in a flaming JBOD vs RAID controller religious war again .. oops, too late :P yes - the dynamic LUN expansion bits in ZFS is quite nice and handy for managing dynamic growth of a pool or file system. so going back to Jeffery''s original questions:> > 1. How stable is ZFS? The Engenio box is completely configured > for RAID5 with hot spares and write cache (8GB) has battery backup > so I''m not too concerned from a hardware side. I''m looking for an > idea of how stable ZFS itself is in terms of corruptability, uptime > and OS stability.I think the stability issue has already been answered pretty well .. 8GB battery backed cache is nice .. performance wise you might find some odd interactions with the ZFS adaptive cache integration and the way in which the intent log operates (O_DSYNC writes can potentially impose a lot of in flight commands for relatively little work) - there''s a max blocksize of 128KB (also maxphys), so you might want to experiment with tuning back the stripe width .. i seem to recall the the 6994 controller seemed to perform best with 256KB or 512KB stripe width .. so there may be additional tuning on the read-ahead or write- behind algorithms.> 2. Recommended config. Above, I have a fairly simple setup. In > many of the examples the granularity is home directory level and > when you have many many users that could get to be a bit of a > nightmare administratively. I am really only looking for high > level dynamic size adjustability and am not interested in its built > in RAID features. But given that, any real world recommendations?Not being interested in the RAID functionality as Roch points out eliminates the self-healing functionality and reconstruction bits in ZFS .. but you still get other nice benefits like dynamic LUN expansion As i see it, since we seem to have excess CPU and bus capacity on newer systems (most applications haven''t quite caught up to impose enough of a load yet) .. we''re back to the mid ''90s where host based volume management and caching makes sense and is being proposed again. Being proactive, we might want to consider putting an embedded Solaris/ZFS on a RAID controller to see if we''ve really got something novel in the caching and RAID algorithms for when the application load really does catch up and impose more of a load on the host. Additionally - we''re seeing that there''s a big benefit in moving the filesystem closer to the storage array since most users care more about their consistency of their data (upper level) than the reliability of the disk subsystem or RAID controller. Implementing a RAID controller that''s more intimately aware of the upper data levels seems like the next logical evolutionary step.> 3. Caveats? 
> Anything I'm missing that isn't in the docs that could turn into a BIG gotchya?

I would say be careful of the ease with which you can destroy file systems and pools .. while convenient - there's typically no warning if you or an administrator does a zfs or zpool destroy .. so i could see that turning into an issue. Also if a LUN goes offline, you may not see this right away and you would have the potential to corrupt your pool or panic your system. Hence the self-healing and scrub options to detect and repair failures a little bit faster. People on this forum have been finding RAID controller inconsistencies .. hence the religious JBOD vs RAID ctlr "disruptive paradigm shift"

> 4. Since all data access is via NFS we are concerned that 32 bit systems (Mainly Linux and Windows via Samba) will not be able to access all the data areas of a 2TB+ zpool even if the zfs quota on a particular share is less than that. Can anyone comment?

Doing 2TB+ shouldn't be a problem for the NFS or Samba mounted filesystem regardless of whether the host is 32bit or not. The only place where you can run into a problem is if the size of an individual file crosses 2 or 4TB on a 32bit system. I know we've implemented file systems (QFS in this case) that were samba-shared to 32bit windows hosts in excess of 40-100TB without any major issues. I'm sure there are similar cases with ZFS and thumper .. i just don't have that data.

a little late to the discussion, but hth

--- .je
Hi Guys,

SO...

From what I can tell from this thread, ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect. So if you use the hardware that comes with a given solution (in my case an Engenio 6994) to manage failures you risk a) bad writes that don't get picked up due to corruption from write cache to disk, and b) failures due to data changes that ZFS is unaware of, which the hardware imposes when it tries to fix itself.

So now I have a $70K+ lump that's useless for what it was designed for. I should have spent $20K on a JBOD. But since I didn't do that, it sounds like a traditional model works best (ie. UFS et al) for the type of hardware I have. No sense paying for something and not using it. And by using ZFS just as a method for ease of file system growth and management, I risk much more corruption.

The other thing I haven't heard is why NOT to use ZFS. Or people who don't like it for some reason or another.

Comments?

Thanks,

Jeff

PS - the responses so far have been great and are much appreciated! Keep 'em coming...
Hi Jeff,

Maybe I mis-read this thread, but I don't think anyone was saying that using ZFS on top of an intelligent array risks more corruption. Given my experience, I wouldn't run ZFS without some level of redundancy, since it will panic your kernel in a RAID-0 scenario where it detects a LUN is missing and can't fix it. That being said, I wouldn't run anything but ZFS anymore. When we had some database corruption issues awhile back, ZFS made it very simple to prove it was the DB. Just did a scrub and boom, verification that the data was laid down correctly.

RAID-5 will have better random read performance than RAID-Z for reasons Robert had to beat into my head. ;-) But if you really need that performance, perhaps RAID-10 is what you should be looking at? Someone smarter than I can probably give a better idea.

Regarding the failure detection, has anyone on the list had the ZFS/FMA traps fed into a network management app yet? I'm curious what the experience with it is?

Best Regards, Jason

On 1/29/07, Jeffery Malloch <jeffery.malloch at lsi.com> wrote:
> From what I can tell from this thread, ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect. [...]
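For reference, the verification Jason describes is just a scrub plus a status check (pool name made up):

  # zpool scrub tank
  # zpool status -v tank     (shows scrub progress and, afterwards, any checksum errors found)

On a redundant pool the scrub repairs what it can; on a non-redundant one it at least tells you what is damaged.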
On Jan 29, 2007, at 14:17, Jeffery Malloch wrote:
> The other thing I haven't heard is why NOT to use ZFS. Or people who don't like it for some reason or another.
>
> Comments?

I put together this chart a while back .. i should probably update it for RAID6 and RAIDZ2

 #  ZFS  ARRAY HW     CAPACITY   COMMENTS
--  ---  --------     --------   --------
 1  R0   R1           N/2        hw mirror - no zfs healing
 2  R0   R5           N-1        hw R5 - no zfs healing
 3  R1   2 x R0       N/2        flexible, redundant, good perf
 4  R1   2 x R5       (N/2)-1    flexible, more redundant, decent perf
 5  R1   1 x R5       (N-1)/2    parity and mirror on same drives (XXX)
 6  RZ   R0           N-1        standard RAID-Z no mirroring
 7  RZ   R1 (tray)    (N/2)-1    RAIDZ+1
 8  RZ   R1 (drives)  (N/2)-1    RAID1+Z (highest redundancy)
 9  RZ   3 x R5       N-4        triple parity calculations (XXX)
10  RZ   1 x R5       N-2        double parity calculations (XXX)

(note: I included the cases where you have multiple arrays with a single lun per vdisk (say) and where you only have a single array split into multiple LUNs.)

The way I see it, you're better off picking either controller parity or zfs parity .. there's no sense in computing parity multiple times unless you have cycles to spare and don't mind the performance hit .. so the questions you should really answer before you choose the hardware are: what level of redundancy-to-capacity balance do you want? and do you want to compute RAID in ZFS host memory or out on a dedicated blackbox controller?

I would say something about double caching too, but I think that's moot since you'll always cache in the ARC if you use ZFS the way it's currently written.

Other feasible filesystem options for Solaris - UFS, QFS, or vxfs with SVM or VxVM for volume mgmt if you're so inclined .. all depends on your budget and application. There's currently tradeoffs in each one, and contrary to some opinions, the death of any of these has been grossly exaggerated.

--- .je
On Mon, Jan 29, 2007 at 11:17:05AM -0800, Jeffery Malloch wrote:> From what I can tell from this thread ZFS if VERY fussy about > managing writes,reads and failures. It wants to be bit perfect. So > if you use the hardware that comes with a given solution (in my case > an Engenio 6994) to manage failures you risk a) bad writes that > don''t get picked up due to corruption from write cache to disk b) > failures due to data changes that ZFS is unaware of that the > hardware imposes when it tries to fix itself. > > So now I have a $70K+ lump that''s useless for what it was designed > for. I should have spent $20K on a JBOD. But since I didn''t do > that, it sounds like a traditional model works best (ie. UFS et al) > for the type of hardware I have. No sense paying for something and > not using it. And by using ZFS just as a method for ease of file > system growth and management I risk much more corruption.Well, ZFS with HW RAID makes sense in some cases. However, it seems that if you are unwilling to lose 50% disk space to RAID 10 or two mirrored HW RAID arrays, you either use RAID 0 on the array with ZFS RAIDZ/RAIDZ2 on top of that or a JBOD with ZFS RAIDZ/RAIDZ2 on top of that. -- albert chin (china at thewrittenword.com)
On January 29, 2007 11:17:05 AM -0800 Jeffery Malloch <jeffery.malloch at lsi.com> wrote:

>> From what I can tell from this thread ZFS is VERY fussy about managing writes, reads and failures. It wants to be bit perfect.

It's funny to call that "fussy". All filesystems WANT to be bit perfect; zfs actually does something to ensure it.

>> So if you use the hardware that comes with a given solution (in my case an Engenio 6994) to manage failures you risk a) bad writes that don't get picked up due to corruption from write cache to disk

You would always have that problem, JBOD or RAID. There are many places data can get corrupted, not just in the RAID write cache. zfs will correct it, or at least detect it, depending on your configuration.

>> b) failures due to data changes that ZFS is unaware of that the hardware imposes when it tries to fix itself.

If that happens, you will be lucky to have ZFS to fix it. If the array changes data, it is broken. This is not the same thing as correcting data.

> The other thing I haven't heard is why NOT to use ZFS. Or people who don't like it for some reason or another.

If you need per-user quotas, zfs might not be a good fit. (In many cases per-filesystem quotas can be used effectively, though.) If you need NFS clients to traverse mount points on the server (eg /home/foo), then this won't work yet. Then again, does this work with UFS either? Seems to me it wouldn't. The difference is that zfs encourages you to create more filesystems. But you don't have to.

If you have an application that is very highly tuned for a specific filesystem (e.g. UFS with directio), you might not want to replace it with zfs. If you need incremental restore, you might need to stick with UFS. (snapshots might be enough for you though)

-frank
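For what it's worth, the per-filesystem-quota route Frank mentions is only a few commands per user, so it scripts easily; a rough sketch with made-up names:

  # zfs create tank/home/alice
  # zfs set quota=10G tank/home/alice
  # zfs set sharenfs=on tank/home/alice

The administrative pain is less about creating the filesystems and more about the client-side mount handling, which is where the mount-point traversal limitation above comes in.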
Albert Chin said:

> Well, ZFS with HW RAID makes sense in some cases. However, it seems that if you are unwilling to lose 50% disk space to RAID 10 or two mirrored HW RAID arrays, you either use RAID 0 on the array with ZFS RAIDZ/RAIDZ2 on top of that or a JBOD with ZFS RAIDZ/RAIDZ2 on top of that.

I've been re-evaluating our local decision on this question (how to lay out ZFS on pre-existing RAID hardware). In our case, the array does not allow RAID-0 of any type, and we're unwilling to give up the expensive disk space to a mirrored configuration. In fact, in our last decision, we came to the conclusion that we didn't want to layer RAID-Z on top of HW RAID-5, thinking that the added loss of space is too high, given any of the "XXX" layouts in Jonathan Edwards' chart:

>  #  ZFS  ARRAY HW  CAPACITY  COMMENTS
> --  ---  --------  --------  --------
>  . . .
>  5  R1   1 x R5    (N-1)/2   parity and mirror on same drives (XXX)
>  9  RZ   3 x R5    N-4       triple parity calculations (XXX)
>  . . .
> 10  RZ   1 x R5    N-2       double parity calculations (XXX)

So, we ended up (some months ago) deciding to go with only HW RAID-5, using ZFS to stripe together large-ish LUNs made up of independent HW RAID-5 groups. We'd have no ZFS redundancy, but at least ZFS would catch any corruption that may come along. We can restore individual corrupted files from tape backups (which we're already doing anyway), if necessary.

However, given that the default behavior of ZFS (as of Solaris-10U3) is to panic/halt when it encounters a corrupted block that it can't repair, I'm re-thinking our options, weighing against the possibility of a significant downtime caused by a single-block corruption.

Today I've been pondering a variant of #10 above, the variation being to slice a RAID-5 volume into more than N LUNs, i.e. LUNs smaller than the size of the individual disks that make up the HW R5 volume. A larger number of small LUNs results in less space given up to ZFS parity, which is nice when overall disk space is important to us. We're not expecting RAID-Z across these LUNs to make it possible to survive failure of a whole disk; rather, we only "need" RAID-Z to repair the occasional block corruption, in the hope that this might head off the need to restore a whole multi-TB pool. We'll rely on the HW RAID-5 to protect against whole-disk failure.

Just thinking out loud here. Now I'm off to see what kind of performance cost there is, comparing (with 400GB disks):

  Simple ZFS stripe on one 2198GB LUN from a 6+1 HW RAID5 volume
  8+1 RAID-Z on 9 244.2GB LUNs from a 6+1 HW RAID5 volume

Regards, Marion
On 29/01/2007, at 12:50 AM, Casper.Dik at Sun.COM wrote:> >> >> On 28-Jan-07, at 7:59 AM, Casper.Dik at Sun.COM wrote: >> >>> >>>> >>>> On 27-Jan-07, at 10:15 PM, Anantha N. Srirama wrote: >>>> >>>>> ... ZFS will not stop alpha particle induced memory corruption >>>>> after data has been received by server and verified to be correct. >>>>> Sadly I''ve been hit with that as well. >>>> >>>> >>>> My brother points out that you can use a rad hardened CPU. ECC >>>> should >>>> take care of the RAM. :-) >>>> >>>> I wonder when the former will become data centre best practice? >>> >>> Alpha particles which "hit" CPUs must have their origin inside said >>> CPU. >>> >>> (Alpha particles do not penentrate skin, paper, let alone system >>> cases >>> or CPU packagaging) >> >> Thanks. But what about cosmic rays? > > > I was just in pedantic mode; "cosmic rays" is the term covering > all different particles, including alpha, beta and gamma rays. > > Alpha rays don''t reach us from the "cosmos"; they are caught > long before they can do any harm. Ditto beta rays. Both have > an electrical charge that makes passing magnetic fields or passing > through materials difficult. Both do exist "in the free" but are > commonly caused by slow radioactive decay of our natural environment. > > Gamma rays are photons with high energy; they are not capture by > magnetic fields (such as those existing in atoms: electons, protons). > They need to take a direct hit before they''re stopped; they can only > be stopped by dense materials, such as lead. Unfortunately, natural > occuring lead is polluted by pollonium and uranium and is an alpha/ > beta > source in its own right. That''s why 100 year old lead from roofs is > worth more money than new lead: it''s radioisotopes have been depleted.<ludicrous_topic_drift> Ok, I''ll bite. It''s been a long day, so that may be why I can''t see why the radioisotopes in lead that was dug up 100 years ago would be any more depleted than the lead that sat in the ground for the intervening 100 years. Half-life is half-life, no? Now if it were something about the modern extraction process that added contaminants, then I can see it. </ludicrous_topic_drift>
Casper.Dik at Sun.COM
2007-Jan-30 08:22 UTC
[zfs-discuss] Re: Re: ZFS or UFS - what to do?
>Ok, I'll bite. It's been a long day, so that may be why I can't see why the radioisotopes in lead that was dug up 100 years ago would be any more depleted than the lead that sat in the ground for the intervening 100 years. Half-life is half-life, no?
>
>Now if it were something about the modern extraction process that added contaminants, then I can see it.

In nature, lead is found in deposits with trace amounts of other heavy radionuclides (U235/238/Th232). These are removed in processing, but one of their decay products is Pb-210. Pb-210 cannot be chemically removed from lead (lead consists mostly of the stable isotopes Pb-204/206/207/208). New lead may also contain trace amounts of Polonium-210.

So lead, when mined, has trace amounts of radioactive Pb-210; as the half-life of Pb-210 is only 22 years, it's fairly radioactive but also decays rapidly (1/32 of the radiation left after 100 years, 1/1000th after 200).

Casper
I wrote:
> Just thinking out loud here. Now I'm off to see what kind of performance cost there is, comparing (with 400GB disks):
>   Simple ZFS stripe on one 2198GB LUN from a 6+1 HW RAID5 volume
>   8+1 RAID-Z on 9 244.2GB LUNs from a 6+1 HW RAID5 volume

Richard.Elling at Sun.COM said:
> Interesting idea. Please post back to let us know how the performance looks.

The short story is, performance is not bad with the raidz arrangement until you get to doing reads, at which point it looks much worse than the 1-LUN setup.

Please bear in mind that I'm not a storage nor benchmarking expert, though I'd say I'm not a neophyte either.

Some specifics:

The array is a low-end Hitachi, 9520V. My two test subjects are a pair of RAID-5 groups in the same shelf, each consisting of 6D+1P 400GB SATA drives. The test host is a Sun T2000, 16GB RAM, connected via 2Gb FC links through a pair of switches (the array/mpxio combination do not support load-balancing, so only one 2Gb channel is in use at a time). It is running Solaris-10U3, patches current as of 12-Jan-2007.

The array was mostly idle except for my tests, although some light I/O to other shelves may have come from another host on occasion. The test host wasn't doing anything else during these tests.

One RAID-5 group was configured as a single 2048GB LUN (with about 150GB left unallocated; the array has a max LUN size). The second RAID-5 group was set up as nine 244.3GB LUNs.

Here are the zpool configurations I used for these tests:

# zpool status -v
  pool: bulk_sp1
 state: ONLINE
 scrub: none requested
config:

        NAME                                            STATE   READ WRITE CKSUM
        bulk_sp1                                        ONLINE     0     0     0
          c6t4849544143484920443630303133323230303230d0 ONLINE     0     0     0

errors: No known data errors

  pool: bulk_zp2
 state: ONLINE
 scrub: none requested
config:

        NAME                                              STATE   READ WRITE CKSUM
        bulk_zp2                                          ONLINE     0     0     0
          raidz1                                          ONLINE     0     0     0
            c6t4849544143484920443630303133323230303330d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303331d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303332d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303333d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303334d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303335d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303336d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303337d0 ONLINE     0     0     0
            c6t4849544143484920443630303133323230303338d0 ONLINE     0     0     0

errors: No known data errors

# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
bulk_sp1    83K  1.95T  24.5K  /sp1
bulk_zp2  73.8K  1.87T  2.67K  /zp2

I used two benchmarks. One was a "bunzip2 | tar" extract of the Sun Studio-11 SPARC distribution tarball, extracting from the T2000's internal drives onto the test zpools. For this benchmark, both zpools gave similar results:

pool sp1 (single-LUN stripe):
  du -s -k:  1155141
  time -p:   real 713.67  user 614.42  sys 7.56
  1.6MB/sec overall

pool zp2 (8+1-LUN raidz1):
  du -s -k:  1169020
  time -p:   real 714.96  user 614.78  sys 7.56
  1.6MB/sec overall

The 2nd benchmark was bonnie++ v1.03, run single-threaded with default arguments, which means a 32GB dataset made up of 1GB files. Observations of "vmstat" and "mpstat" during the tests showed that bonnie++ is CPU-limited on the T2000, especially for the getc()/putc() tests, so I later ran 3x bonnie++'s simultaneously (13GB dataset each), and got the same results in total throughput for the block read/write tests on the single-LUN zpool (I was not patient enough to sit through the getc/putc tests again :-).

pool sp1 (single-LUN stripe):

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
filer1          32G 15497  99 66245  84 16652  30 15210  90 106600 59 322.3   3
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  5204 100 +++++ +++  8076 100  4551 100 +++++ +++  7509 100
filer1,32G,15497,99,66245,84,16652,30,15210,90,106600,59,322.3,3,16,5204,100,+++++,+++,8076,100,4551,100,+++++,+++,7509,100

pool zp2 (8+1-LUN raidz1):

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
filer1          32G 16118 100 29702  40  7416  13 15828  94 30204  20  25.0   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  5215 100 +++++ +++  8527 100  4453 100 +++++ +++  8918 100
filer1,32G,16118,100,29702,40,7416,13,15828,94,30204,20,25.0,0,16,5215,100,+++++,+++,8527,100,4453,100,+++++,+++,8918,100

I'm not sure what to add in the way of comments. It seems clear from the results, and from watching "iostat -xn", "vmstat", "mpstat", etc. during the tests, that the raidz pool apparently suffers from not being able to make as good use of the array's 1GB cache (the sequential block read test seems to match well with Hitachi's read prefetch algorithms, I guess). There's also the potential of too much seeking going on for the raidz pool, since there are 9 LUNs on top of 7 physical disk drives (though how Hitachi divides/stripes those LUNs is not clear to me).

One thing I noticed which puzzles me is that in both configurations, though more so in the divided-up raidz pool, there were long periods of time where the LUNs showed in "iostat -xn" output at 100% busy but with no I/Os happening at all. No paging, CPU 100% idle, no less than 2GB of free RAM, for as long as 20-30 seconds. Sure puts a dent in the throughput.

I'm doing some more testing of NFS throughput over these two zpools, since the test machine will eventually become an NFS and samba server. I've got some questions about the performance issues in the NFS scenario, but will address those in a separate message.

Questions, observations, and/or suggestions are welcome.

Regards,

Marion
fishy smell way below... Marion Hakanson wrote:> I wrote: >> Just thinking out loud here. Now I''m off to see what kind of performance >> cost there is, comparing (with 400GB disks): >> Simple ZFS stripe on one 2198GB LUN from a 6+1 HW RAID5 volume >> 8+1 RAID-Z on 9 244.2GB LUN''s from a 6+1 HW RAID5 volume > > > Richard.Elling at Sun.COM said: >> Interesting idea. Please post back to let us know how the performance looks. > > > The short story is, performance is not bad with the raidz arrangement, until > you get to doing reads, at which point it looks much worse than the 1-LUN setup. > > Please bear in mind that I''m not a storage nor benchmarking expert, though > I''d say I''m not a neophyte either. > > Some specifics: > > The array is a low-end Hitachi, 9520V. My two test subjects are a pair > of RAID-5 groups in the same shelf, each consisting of 6D+1P 400GB SATA > drives. The test host is a Sun T2000, 16GB RAM, connected via 2Gb FC > links through a pair of switches (the array/mpxio combination do not > support load-balancing, so only one 2Gb channel is in use at a time). > It is running Solaris-10U3, patches current as of 12-Jan-2007. > > The array was mostly idle except for my tests, although some light > I/O to other shelves may have come from another host on occasion. > The test host wasn''t doing anything else during these tests. > > One RAID-5 group was configured as a single 2048GB LUN (with about 150GB > left unallocated, the array has a max LUN size); The second RAID-5 group > was setup as nine 244.3GB LUN''s. > > Here are the zpool configurations I used for these tests: > # zpool status -v > pool: bulk_sp1 > state: ONLINE > scrub: none requested > config: > > NAME STATE READ WRITE CKSUM > bulk_sp1 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303230d0 ONLINE 0 0 0 > > errors: No known data errors > > pool: bulk_zp2 > state: ONLINE > scrub: none requested > config: > > NAME STATE READ WRITE CKSUM > bulk_zp2 ONLINE 0 0 0 > raidz1 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303330d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303331d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303332d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303333d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303334d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303335d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303336d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303337d0 ONLINE 0 0 0 > c6t4849544143484920443630303133323230303338d0 ONLINE 0 0 0 > > errors: No known data errors > # zfs list > NAME USED AVAIL REFER MOUNTPOINT > bulk_sp1 83K 1.95T 24.5K /sp1 > bulk_zp2 73.8K 1.87T 2.67K /zp2 > > > I used two benchmarks: One was a "bunzip2 | tar" extract of the Sun > Studio-11 SPARC distribution tarball, extracting from the T2000''s > internal drives onto the test zpools. For this benchmark, both zpools > gave similar results: > > pool sp1 (single-LUN stripe): > du -s -k: > 1155141 > time -p: > real 713.67 > user 614.42 > sys 7.56 > 1.6MB/sec overall > > pool zp2 (8+1-LUN raidz1): > du -s -k: > 1169020 > time -p: > real 714.96 > user 614.78 > sys 7.56 > 1.6MB/sec overall > > > > The 2nd benchmark was bonnie++ v1.03, run single-threaded with default > arguments, which means a 32GB dataset made of up 1GB files. 
Observations of > "vmstat" and "mpstat" during the tests showed that bonnie++ is CPU-limited > on the T2000, especially for the getc()/putc() tests, so I later ran 3x > bonnie++''s simultaneously (13GB dataset each), and got the same results > in total throughput for the block read/write tests on the single-LUN zpool > (I was not patient enough to sit through the getc/putc tests again :-). > > pool sp1 (single-LUN stripe): > Version 1.03 ------Sequential Output------ --Sequential Input- --Random- > -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- > Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP > filer1 32G 15497 99 66245 84 16652 30 15210 90 106600 59 322.3 3 > ------Sequential Create------ --------Random Create-------- > -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- > files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP > 16 5204 100 +++++ +++ 8076 100 4551 100 +++++ +++ 7509 100 > filer1,32G,15497,99,66245,84,16652,30,15210,90,106600,59,322.3,3,16,5204,100,+++++,+++,8076,100,4551,100,+++++,+++,7509,100 > > pool zp2 (8+1-LUN raidz1): > Version 1.03 ------Sequential Output------ --Sequential Input- --Random- > -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- > Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP > filer1 32G 16118 100 29702 40 7416 13 15828 94 30204 20 25.0 0 > ------Sequential Create------ --------Random Create-------- > -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- > files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP > 16 5215 100 +++++ +++ 8527 100 4453 100 +++++ +++ 8918 100 > filer1,32G,16118,100,29702,40,7416,13,15828,94,30204,20,25.0,0,16,5215,100,+++++,+++,8527,100,4453,100,+++++,+++,8918,100 >[is it just me, or can anybody understand Bonnie++ output? Maybe they just lose too much info trying to cram the results onto a VT-52 screen...]> I''m not sure what to add in the way of comments. It seems clear from the > results, and from watching "iostat -xn", "vmstat", "mpstat", etc. during > the tests, that the raidz pool apparently suffers from not being able to > make as good use of the array''s 1GB cache (the sequential block read test > seems to match well with Hitachi''s read prefetch algorithms, I guess). > There''s also the potential of too much seeking going on for the raidz pool, > since there are 9 LUN''s on top of 7 physical disk drives (though how Hitachi > divides/stripes those LUN''s is not clear to me). > > One thing I noticed which puzzles me is that in both configurations, though > more so in the divided-up raidz pool, there were long periods of time where > the LUN''s showed in "iostat -xn" output at 100% busy but with no I/O''s > happening at all. No paging, CPU 100% idle, no less than 2GB of free RAM, > for as long as 20-30 seconds. Sure puts a dent in the throughput.IIRC, the calculation for %busy is the amount of time that an I/O is on the device. These symptoms would occur if an I/O is dropped somewhere along the way or at the array. Eventually, we''ll timeout and retry, though by default that should be after 60 seconds. I think we need to figure out what is going on here before accepting the results. It could be that we''re overrunning the queue on the Hitachi. By default, ZFS will send 35 concurrent commands per vdev and the ssd driver will send up to 256 to a target. IIRC, Hitachi has a formula for calculating sdd_max_throttle to avoid such overruns, but I''m not sure if that applies to this specific array. 
-- richard

> I'm doing some more testing of NFS throughput over these two zpools, since the test machine will eventually become an NFS and samba server. I've got some questions about the performance issues in the NFS scenario, but will address those in a separate message.
>
> Questions, observations, and/or suggestions are welcome.
>
> Regards,
>
> Marion
On 2/1/07, Marion Hakanson <hakansom at ohsu.edu> wrote:

> There's also the potential of too much seeking going on for the raidz pool, since there are 9 LUNs on top of 7 physical disk drives (though how Hitachi divides/stripes those LUNs is not clear to me).

Marion,

That is the part of your setup that puzzled me. You took the same 7-disk raid5 set and split it into 9 LUNs. The Hitachi likely splits the "virtual disk" into 9 contiguous partitions, so each LUN maps back to a different part of the 7 disks. I speculate that ZFS thinks it is talking to 9 different disks and spreads out the writes accordingly. What ZFS thinks are sequential writes become widely spaced writes across the entire disk and send your seek times through the roof.

I'm interested in how it looks from the Hitachi end. If you can, repeat the test with the Hitachi presenting all 7 disks directly to ZFS as LUNs?

> One thing I noticed which puzzles me is that in both configurations, though more so in the divided-up raidz pool, there were long periods of time where the LUNs showed in "iostat -xn" output at 100% busy but with no I/Os happening at all. No paging, CPU 100% idle, no less than 2GB of free RAM, for as long as 20-30 seconds. Sure puts a dent in the throughput.

Interesting... what you are suggesting is that %b is 100% when w/s and r/s are 0?

-- Just me, Wire ...
weeyeh at gmail.com said:> That is the part of your setup that puzzled me. You took the same 7 disk > raid5 set and split them into 9 LUNS. The Hitachi likely splits the "virtual > disk" into 9 continuous partitions so each LUN maps back to different parts > of the 7 disks. I speculate that ZFS thinks it is talking to 9 different > disks so spreads out the writes accordingly. What ZFS thinks is sequential > writes becomes well spaced writes across the entire disk & blows your seek > time off the roof.That''s what I thought might happen before I even tried this, although it''s also possible the Hitachi "stripes" each LUN across all 7 disks. Either way, one could be getting too many seeks. Note that I''m just trying to see if it was so bad that the self-healing capability wasn''t worth the cost. I do realize these are 7200rpm SATA disks, so seeking isn''t what they do best.> I''m interested how it looks like from the Hitachi end. If you can, > repeat the test with the Hitachi presenting all 7 disks directly to > ZFS as LUNs?The array doesn''t give us that capability.> Interesting... what you are suggesting is that %b is 100% when w/s and r/s is > 0?Correct. Sometimes all "iostat -xn" columns are 0 except %b; Sometimes the asvc_t column stays at "4.0" for the duration of the quiet period. I''ve also observed times where all columns were 0, including %b. Sure is puzzling. Richard.Elling at Sun.COM said:> IIRC, the calculation for %busy is the amount of time that an I/O is on the > device. These symptoms would occur if an I/O is dropped somewhere along the > way or at the array. Eventually, we''ll timeout and retry, though by default > that should be after 60 seconds. I think we need to figure out what is going > on here before accepting the results. It could be that we''re overrunning the > queue on the Hitachi. By default, ZFS will send 35 concurrent commands per > vdev and the ssd driver will send up to 256 to a target. IIRC, Hitachi has a > formula for calculating sdd_max_throttle to avoid such overruns, but I''m not > sure if that applies to this specific array.Hmm, it''s true that I have made no tuning changes on the T2000 side. It would make sense if the array just stopped responding. I''ll have to poke at the array and see if it has any diagnostics logged somewhere. I recall that the Hitachi docs do have some recommendations on max-throttle settings, so I''ll go dig those up and see what I can find out. Thanks for the comments, Marion
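If the queue-overrun theory pans out, the knob Richard is referring to is the FC disk driver's max throttle, set in /etc/system and read at boot; a rough sketch (the value 32 is only a placeholder -- the right number comes from the array vendor's guidance, typically something like the controller's per-port queue depth divided by the number of LUNs):

  # echo 'set ssd:ssd_max_throttle=32' >> /etc/system
  # init 6     (the setting only takes effect after a reboot)

On configurations where the disks show up under the sd driver instead, the equivalent tunable is sd:sd_max_throttle.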
Marion Hakanson wrote:

> However, given the default behavior of ZFS (as of Solaris-10U3) is to panic/halt when it encounters a corrupted block that it can't repair, I'm re-thinking our options, weighing against the possibility of a significant downtime caused by a single-block corruption.

Guess what happens when UFS finds an inconsistency it can't fix either? The issue is that ZFS has the chance to fix the inconsistency if the zpool is a mirror or raidz, not that it finds the inconsistency in the first place. ZFS will just find more of them, for a given set of errors, than other filesystems do.