I find it baffling that RaidZ(2,3) was designed to split a record-size block into N (N = # of member devices) pieces and send the uselessly tiny requests to spinning rust, when we know the massive delays entailed in head seeks and rotational delay. The ZFS mirror and load-balanced configurations do the obviously correct thing: they don't split records, and they gain more by utilizing parallel access. I can't imagine the code path for RAIDZ would be so hard to fix.

I've read posts back to '06, and all I see is lamenting about the horrendous drop in IOPS, about sizing RAIDZ to ~4+P and trying to claw back performance by combining multiple such vdevs. I understand RAIDZ will never equal mirroring, but it could get damn close if it didn't break requests down and, better yet, utilized copies=N and properly placed the copies on disparate spindles. This is somewhat analogous to what the likes of 3PAR do, and it's not rocket science.

An 8-disk mirror and a RAIDZ 8+2P with copies=2 give me the same amount of storage, but the latter is a hell of a lot more resilient, and max IOPS should be higher to boot. A non-broken-up RAIDZ 4+P would still be half the IOPS of the 8-disk mirror, but I'd at least save a bundle of coin in either reduced spindle count or slower drives.

With all the great things ZFS is capable of, why hasn't this been redesigned long ago? What glaringly obvious truth am I missing?
On Mon, Jan 4, 2010 at 2:27 AM, matthew patton <pattonme at yahoo.com> wrote:
> I find it baffling that RaidZ(2,3) was designed to split a record-size block into N (N = # of member devices) pieces and send the uselessly tiny requests to spinning rust [...]
> With all the great things ZFS is capable of, why hasn't this been redesigned long ago? What glaringly obvious truth am I missing?

It is the sacrifice that was made to remove the write-hole vulnerability that existed in RAID5/6. Personally, I am thinking now that the write hole isn't so bad, and that with COW writes and a write log the vulnerability really could be marginalized.

If you are running copies=2, you could use hardware RAID5/6 with battery-backed write cache for the RAID and present it as a couple of LUNs to ZFS, which should provide higher performance with data resiliency in place. Say, with 14 drives: two 7-drive RAID6s, make a 2-vdev zpool out of them with copies=2. That should provide more than enough data resiliency and performance, at a cost in capacity; if the drives are large enough, though, that could be overcome.

-Ross
Chris Siebenmann <cks at cs.toronto.edu> wrote:
> People have already mentioned the RAID-[56] write hole, but it's more than that; in a never-overwrite system with multiple blocks in one RAID stripe, how do you handle updates to some of the blocks?
>
> See: http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSSensibleRAID

Oh, that's easy. NetApp's been doing this since forever. A little extra metadata is nothing to worry about; snapshots, by comparison, create massive metadata footprints. Since it's all COW, why not do full-stripe writes all the time? Assume a 4-disk raidZ (3+P):

A1 A2 A3 AP
B1 B2 B3 BP
C1 C2 C3 CP
...

Then the transaction group timer fires and dumps a bunch of records needing syncing: A2', B2', C3', D2', A1', B3'. I write these out in totally new/empty stripes as:

A1' A2' C3' XP
B2' B3' D2' XP

I don't have to read any of the original blocks, and parity is calculated in memory. Then I just modify the metadata to mark the original blocks as invalid/superseded. But for XOR and stripe-recovery purposes we can leave the original stripe perfectly alone. When a full stripe is no longer valid (all blocks superseded) and isn't part of a snapshot, it gets put on the "clean/ready for reuse" list.

After a while one could potentially end up with all of the "A" blocks sitting on just one spindle. They are still fully protected, but a sequential read of A1-A3 will obviously be much slower than if they were properly spread across 3 spindles. This is where array scrubbing would step in and rebalance the A series. On the other hand, the elevator algorithm applied to the transaction group could order things such that all 'x1' blocks go on spindle 1, 'x2' blocks on spindle 2, etc. If there aren't enough for a particular spindle, just use an empty block to fill the hole, or, if that is too wasteful, resort to the less optimal ordering for the leftovers and let scrubbing eventually take care of it.

Note that my representation mimics RAID4 in layout. You can of course move the parity block around; indeed, the parity-block spindle is a simple function of stripe index and array width, e.g. for stripe N on width W, parity is on spindle W - (N mod W). Is distributed parity worth doing? No, I don't think so.
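A minimal Python sketch of the scheme described above: the full-stripe packing and the W - (N mod W) parity rotation come straight from the text, while everything else (names, 1-based spindle numbering, equal-sized records, zero-padding of the tail stripe) is assumed for illustration; this is not how vdev_raidz.c actually allocates.

  from functools import reduce

  def parity_spindle(stripe_n: int, width: int) -> int:
      """1-based parity spindle for stripe N on width W: W - (N mod W)."""
      return width - (stripe_n % width)

  def full_stripe_writes(dirty: list[bytes], width: int, next_stripe: int):
      """Pack dirty records into fresh stripes; parity is XORed purely in
      memory, with no read-modify-write of live stripes."""
      data_cols = width - 1
      out = []
      for i in range(0, len(dirty), data_cols):
          chunk = dirty[i:i + data_cols]
          chunk += [bytes(len(dirty[0]))] * (data_cols - len(chunk))  # pad tail
          parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunk)
          stripe_n = next_stripe + len(out)
          out.append((stripe_n, chunk, parity, parity_spindle(stripe_n, width)))
      return out

  # txg flush of 6 dirty records onto a 4-wide (3+P) layout -> 2 full stripes
  stripes = full_stripe_writes([b"A1'", b"A2'", b"C3'", b"D2'", b"B2'", b"B3'"], 4, 0)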
On Jan 3, 2010, at 11:27 PM, matthew patton wrote:
> I find it baffling that RaidZ(2,3) was designed to split a record-size block into N (N = # of member devices) pieces and send the uselessly tiny requests to spinning rust [...] I can't imagine the code path for RAIDZ would be so hard to fix.

Knock yourself out :-)
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c

> I've read posts back to '06 and all I see is lamenting about the horrendous drop in IOPS [...] This is somewhat analogous to what the likes of 3PAR do, and it's not rocket science.

That is not the issue for small, random reads. For all reads, the checksum is verified. When you spread the record across multiple disks, you then need to read the record back from those disks. In general, this means that as long as the recordsize is larger than the requested small read, your performance will approach the N/(N-P) * IOPS limit. At the pathological edge, you can set recordsize to 512 bytes and you end up with mirroring (!) The small, random read performance model I developed only calculates the above IOPS limit and does not consider recordsize.

The physical I/O is much more difficult to correlate to the logical I/O because of all of the coalescing and caching that occurs at all of the lower levels in the stack.

> An 8-disk mirror and a RAIDZ 8+2P with copies=2 give me the same amount of storage [...] why hasn't this been redesigned long ago? What glaringly obvious truth am I missing?

Performance, dependability, space: pick two.
-- richard
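For concreteness, a small calculator for the stated N/(N-P) * IOPS limit, reading it as: each logical read costs N-P disk I/Os (one per data column), out of an aggregate budget of N disk-worths of IOPS. The 100 IOPS per disk is an assumed figure, and real results are blurred by the coalescing and caching mentioned above.

  def raidz_small_read_iops(n: int, p: int, disk_iops: float = 100.0) -> float:
      """N/(N-P) * IOPS limit for small random reads on a raidz vdev."""
      return n / (n - p) * disk_iops

  print(raidz_small_read_iops(8, 2))  # 8-disk raidz2: ~133 logical IOPS
  print(8 * 100.0)                    # 8 disks of mirrors: each read hits 1 disk, ~800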
Richard Elling writes:
> That is not the issue for small, random reads. For all reads, the checksum is verified. When you spread the record across multiple disks, you then need to read the record back from those disks. [...]
> Performance, dependability, space: pick two.

If you store record X in one column like RAID-5 or -6 does, then you need to generate parity for that record X by grouping it with other unrelated records Y, Z, T, etc. When X is freed in the filesystem, it still holds parity information protecting Y, Z, and T, so you can't get rid of what was stored at X. If you try to store new data in X and in the associated parity but fail in mid-stream, you hit the RAID-5 write hole. Moreover, now that X is not referenced in the filesystem, no checksum is associated with it any longer, and if bit rot occurs in X and the disk holding Y dies, resilvering would generate garbage for Y. This seems to force us to chunk up disks with every unit checksummed, even if freed. Secure deletion becomes a problem as well.
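A toy XOR illustration of that failure mode; this is plain parity arithmetic, not ZFS code, and the four-byte records are purely illustrative:

  def xor(*blocks: bytes) -> bytes:
      out = bytearray(blocks[0])
      for b in blocks[1:]:
          out = bytearray(x ^ y for x, y in zip(out, b))
      return bytes(out)

  X, Y, Z, T = b"XXXX", b"YYYY", b"ZZZZ", b"TTTT"
  P = xor(X, Y, Z, T)                  # parity written when the stripe was live

  X_rotted = b"XXXQ"                   # X is freed, unchecksummed, silently rots
  Y_rebuilt = xor(X_rotted, Z, T, P)   # disk holding Y dies; resilver uses stale X
  assert Y_rebuilt != Y                # garbage: rot in "free" space corrupted Y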
And you can end up madly searching for free stripes and repositioning old blocks in partial stripes, even if the pool is just 10% filled up. Can one do this with RAID-DP?

http://blogs.sun.com/roch/entry/need_inodes

That said, I truly am for an evolution for random read workloads. RAID-Z on 4K sectors is quite appealing: it means that small objects become nearly mirrored, with good random read performance, while large objects are stored efficiently.

-r
On 05/01/2010 16:00, Roch wrote:
> That said, I truly am for an evolution for random read workloads. RAID-Z on 4K sectors is quite appealing: it means that small objects become nearly mirrored, with good random read performance, while large objects are stored efficiently.

Have you got any benchmarks available (comparing 512B to 4K to classical RAID-5)?

The problem is that while RAID-Z is really good for some workloads, it is really bad for others. Sometimes having L2ARC might effectively mitigate the problem, but for some workloads it won't (due to the huge size of the working set). In such environments RAID-Z2 offers much worse performance than a similarly configured NetApp (RAID-DP, same number of disk drives). If ZFS provided another RAID-5/RAID-6-like protection with different characteristics, so that writing to a pool would be slower but reading from it would be much faster (comparable to RAID-DP), some customers would be very happy. Then maybe a new kind of cache device would be needed to buffer writes to NV storage to make writes faster (like "HW" arrays have been doing for years).

A possible *workaround* is to use SVM to set up RAID-5 and create a zfs pool on top of it. How does SVM handle the R5 write hole? IIRC SVM doesn't offer RAID-6.

-- 
Robert Milkowski
http://milek.blogspot.com
On Tue, Jan 05, 2010 at 04:49:00PM +0000, Robert Milkowski wrote:
> A possible *workaround* is to use SVM to set up RAID-5 and create a zfs pool on top of it. How does SVM handle the R5 write hole? IIRC SVM doesn't offer RAID-6.

As far as I know, it does not address it. It's possible that adding a transaction volume would help by replaying anything that affected the volume, but I don't know that sufficient information is present. Symantec Volume Manager offers an explicit RAID-5 log device; there doesn't appear to be any corresponding object in SVM.

-- Darren
On Jan 5, 2010, at 17:49, Robert Milkowski wrote:
> Have you got any benchmarks available (comparing 512B to 4K to classical RAID-5)?

Using an 8K 'soft' sector prototype on an otherwise plain raid-z layout, we got 8X more random reads than with 512B sectors, as would be expected.

> The problem is that while RAID-Z is really good for some workloads, it is really bad for others.

The bigger sector makes raid-z like mirroring for small records, so performance of raid-z will be very good, and it's also space-efficient for large objects.

> Sometimes having L2ARC might effectively mitigate the problem, but for some workloads it won't (due to the huge size of the working set). [...] If ZFS provided another RAID-5/RAID-6-like protection with different characteristics [...] some customers would be very happy.

Agreed.

> Then maybe a new kind of cache device would be needed to buffer writes to NV storage to make writes faster (like "HW" arrays have been doing for years).

Writes are not the problem, and we have log devices to offload them. It's really about maintaining the integrity of a RAID-5-type layout in the presence of bit rot, even if such bit rot occurs within free space.

> A possible *workaround* is to use SVM to set up RAID-5 and create a zfs pool on top of it. How does SVM handle the R5 write hole? IIRC SVM doesn't offer RAID-6.

It doesn't.
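Rough arithmetic consistent with that 8X result; the geometry here (8 data columns plus parity, 8 KB records) is an assumption chosen for the example rather than anything Roch stated:

  def data_disks_touched(record: int, sector: int, data_cols: int) -> int:
      """Data-column I/Os needed to read one record from a raidz stripe."""
      return min(max(1, record // sector), data_cols)

  DATA_COLS = 8                                        # assumed vdev geometry
  small = data_disks_touched(8192, 512, DATA_COLS)     # 8: every data disk seeks
  large = data_disks_touched(8192, 8192, DATA_COLS)    # 1: record fits one column
  print(small / large)                                 # ~8x more concurrent random reads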
On Jan 5, 2010, at 8:49 AM, Robert Milkowski wrote:
> Have you got any benchmarks available (comparing 512B to 4K to classical RAID-5)?

Not fair! A 512-byte random write workload will absolutely clobber a RAID-5 implementation; it is the RAID-5 pathological worst case. For many arrays, even a 4 KB random write workload will suck most heinously. The raidz pathological worst case is a random read from a many-column raidz where files have 128 KB records. The inflated-read problem is why it makes sense to match recordsize for fixed-record workloads; this includes CIFS workloads, which use 4 KB records. It is also why having many columns in the raidz does not improve performance for large records. Hence the 3-to-9 raidz disk limit recommendation in the zpool man page.
http://www.baarf.com

> The problem is that while RAID-Z is really good for some workloads, it is really bad for others. Sometimes having L2ARC might effectively mitigate the problem, but for some workloads it won't (due to the huge size of the working set). In such environments RAID-Z2 offers much worse performance than a similarly configured NetApp (RAID-DP, same number of disk drives). [...]

This still does not address the record checksum. This is only a problem for small, random read workloads, which means L2ARC is a good solution. If the L2ARC is a set of HDDs, then you could gain some advantage, but IMHO HDD and good performance do not belong in the same sentence anymore. Game over -- SSDs win.

> A possible *workaround* is to use SVM to set up RAID-5 and create a zfs pool on top of it. How does SVM handle the R5 write hole? IIRC SVM doesn't offer RAID-6.

IIRC, SVM does a prewrite. Dog slow. Also, SVM is, AFAICT, on life support. The source is out there if anyone wants to carry it forward. Actually, many of us would be quite happy for SVM to fade from our memory :-)
-- richard
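To put a number on the inflated-read problem: because the checksum covers the whole record, the full record comes off the disks no matter how small the request. A sketch of the ratio (matching recordsize to the application I/O size, e.g. zfs set recordsize=4k on the relevant filesystem, removes the inflation):

  def read_inflation(recordsize: int, request: int) -> float:
      """Bytes read (and checksummed) from the vdev per logical read."""
      return recordsize / request

  print(read_inflation(128 * 1024, 4 * 1024))  # 32x for 4 KB reads of 128 KB records
  print(read_inflation(4 * 1024, 4 * 1024))    # 1x once recordsize matches the workload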
On 05/01/2010 18:37, Roch Bourbonnais wrote:
> Writes are not the problem, and we have log devices to offload them. It's really about maintaining the integrity of a RAID-5-type layout in the presence of bit rot, even if such bit rot occurs within free space.

How is it addressed in RAID-DP?

-- 
Robert Milkowski
http://milek.blogspot.com
On 05/01/2010 18:49, Richard Elling wrote:
> This still does not address the record checksum. This is only a problem for small, random read workloads, which means L2ARC is a good solution. If the L2ARC is a set of HDDs, then you could gain some advantage, but IMHO HDD and good performance do not belong in the same sentence anymore. Game over -- SSDs win.

As I wrote, sometimes the working set is so big that, L2ARC or not, there is virtually no difference, and it is not practical to deploy an L2ARC several TBs in size or bigger. For such a workload RAID-DP behaves much better (many small random reads, not that many writes).
On 6/01/2010 3:00 AM, Roch wrote:
[snipped for space]
> That said, I truly am for an evolution for random read workloads. RAID-Z on 4K sectors is quite appealing: it means that small objects become nearly mirrored, with good random read performance, while large objects are stored efficiently.

Sold! Let's do that then! :-)

Seriously, are there design or architectural reasons why this isn't done by default, or at least offered as an option? Or is it just a "no one's had time to implement it yet" thing? I understand that 4K sectors might be less space-efficient for lots of small files, but I suspect lots of us would happily make that trade-off!

Thanks,
Tristan
On Jan 5, 2010, at 11:30 AM, Robert Milkowski wrote:
> As I wrote, sometimes the working set is so big that, L2ARC or not, there is virtually no difference, and it is not practical to deploy an L2ARC several TBs in size or bigger. For such a workload RAID-DP behaves much better (many small random reads, not that many writes).

If you are doing small, random reads on dozens of TB of data, then you've got a much bigger problem on your hands... kinda like counting grains of sand on the beach during low tide :-). Hopefully, you do not have to randomly update that data, because your file system isn't COW :-). Fortunately, most workloads are not of that size and scope.

Since there are already 1 TB SSDs on the market, the only thing keeping the HDD market alive is the low $/TB. Moore's Law predicts that cost advantage will pass. SSDs are already the low $/IOPS winners.
-- richard
On Jan 5, 2010, at 11:56 AM, Tristan Ball wrote:
> Seriously, are there design or architectural reasons why this isn't done by default, or at least offered as an option? Or is it just a "no one's had time to implement it yet" thing?

Waiting for the hardware to become widely available might be a long wait. See also PSARC 2008/769:
http://arc.opensolaris.org/caselog/PSARC/2008/769/inception.materials/design_doc

> I understand that 4K sectors might be less space-efficient for lots of small files, but I suspect lots of us would happily make that trade-off!

+1 (for better reliability, too!)
-- richard
On Tue, 5 Jan 2010, Richard Elling wrote:
> Since there are already 1 TB SSDs on the market, the only thing keeping the HDD market alive is the low $/TB. Moore's Law predicts that cost advantage will pass. SSDs are already the low $/IOPS winners.

SSD vendors are still working to stabilize their designs, and most of them seem unworthy of use in anything more than a laptop computer. A number of computer vendors (e.g. Apple & Dell) who offered SSDs in their computers encountered an unexpectedly high rate of product failure.

According to Sun's own engineers, Moore's Law is very bad for enterprise SSDs. FLASH devices built to very small geometries are more likely to wear out and forget, and current design trends are moving in a direction contrary to the requirements of enterprise SSDs. See
http://www.eetimes.com/showArticle.jhtml?articleID=219200284

Perhaps innovative designers like Suncast will figure out how to build reliable SSDs based on parts which are more likely to wear out and forget.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On 6/01/2010 7:19 AM, Richard Elling wrote:
> If you are doing small, random reads on dozens of TB of data, then you've got a much bigger problem on your hands... [...] Since there are already 1 TB SSDs on the market, the only thing keeping the HDD market alive is the low $/TB. Moore's Law predicts that cost advantage will pass. SSDs are already the low $/IOPS winners.

These workloads (small random reads over huge datasets) might be getting more common in some environments, because it seems to be what you get when you consolidate virtual machine storage. We've got a moderately large number of virtual machines (a mix of Debian, Win2K, and Win2K3) running a very large set of applications, and our reads are all over the place! :-( I have to say I remain impressed at how well the ARC behaves, but even then our hit rate is often not wonderful.

I _dream_ about being able to afford to build out my entire storage from cheap/large SSDs. My guess is that in about 2 years I'll be able to, which is one of the reasons we've essentially put a hold on buying "enterprise storage" or fast FC/SCSI disks. A large part of the justification for FC/SCSI disks is their performance, and they're going to be completely eclipsed within the lifetime of any serious mid-range to high-end storage array. Until that day we make do with large SATA drives, mirrored, with relatively high spindle counts to avoid long per-disk queues. :-)

T

PS: OK, I know other tier-1 storage vendors have started integrating SSDs as well, but they hadn't when we started our current round of storage upgrades, and I still think opensolaris + SATA HDDs + SSDs gives us a cleaner, cheaper, and easier upgrade path than most tier-1 vendors can provide.
On 05/01/2010 20:19, Richard Elling wrote:
> If you are doing small, random reads on dozens of TB of data, then you've got a much bigger problem on your hands... kinda like counting grains of sand on the beach during low tide :-). [...] Fortunately, most workloads are not of that size and scope.

Well, nevertheless some environments are like that (and no, I'm not speculating), and the truth is that NetApp with RAID-DP and the same number of disk drives proved to be faster than RAID-Z2, even with the help of SSDs as L2ARC. The point is that NetApp provided the capacity of RAID-6 and the protection of dual parity while delivering better performance than RAID-Z2 in that environment. In other workloads RAID-Z2 will be better, but not in this particular environment.

All I'm saying is that having yet another RAID type in ZFS, one which offers capacity similar to RAID-5/RAID-6 but with different performance characteristics (small random reads on par with RAID-DP while sacrificing write performance), would be beneficial for some environments. RAID-Z with a bigger sector size could improve performance, but the provided capacity could be much less than RAID-5/6, so it might not necessarily be an apples-to-apples comparison (though still useful for some environments).

-- 
Robert Milkowski
http://milek.blogspot.com
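A back-of-the-envelope look at that capacity trade-off; the 8+1 geometry and the simple per-object parity accounting below are assumptions for illustration, not ZFS's actual allocator:

  import math

  def raidz_efficiency(obj_bytes: int, sector: int, data_cols: int, parity: int) -> float:
      """Fraction of allocated sectors that hold user data for one object."""
      d = math.ceil(obj_bytes / sector)
      p = math.ceil(d / data_cols) * parity   # parity sectors per stripe group
      return d / (d + p)

  # 4K sectors, 8 data columns + 1 parity (assumed geometry)
  print(raidz_efficiency(4 * 1024, 4096, 8, 1))    # 0.50: small objects ~ mirrored
  print(raidz_efficiency(128 * 1024, 4096, 8, 1))  # ~0.89: large objects ~ RAID-5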
On 05/01/2010 20:19, Richard Elling wrote:
> [...] Fortunately, most workloads are not of that size and scope.

Forgot to mention it in my last email: yes, I agree. The environment I'm talking about is rather unusual, and in most other cases where RAID-5/6 was considered, the performance of RAID-Z1/2 was good enough or even better.

-- 
Robert Milkowski
http://milek.blogspot.com
Many large-scale photo hosts start with NetApp as the default "good enough" way to handle multiple-TB storage. With a 1-5% cache on top, the workload is truly random-read over many TBs. But these workloads almost assume a frontend cache to take care of hot traffic, so L2ARC is just a nice implementation of that, not a silver bullet.

I agree that RAID-DP is much more scalable for reads than RAIDZx, and this basically turns into a cost concern at scale. The raw cost/GB for ZFS is much lower, so even a 3-way mirror could be used instead of NetApp. But this certainly reduces the cost advantage significantly.

mike

p.s. I managed the team that built blogger.com's photo hosting, and picasaweb.google.com, so I've seen some of this stuff at scale (neither of these uses NetApp). For large photos, it's pretty simple: the more independent spindles, the better.
On 05/01/2010 23:31, Michael Herf wrote:
> The raw cost/GB for ZFS is much lower, so even a 3-way mirror could be used instead of NetApp. But this certainly reduces the cost advantage significantly.

This is true to some extent. I didn't want to bring it up, as I wanted to focus only on the technical aspects.

-- 
Robert Milkowski
http://milek.blogspot.com
On Jan 5, 2010, at 16:06, Bob Friesenhahn wrote:
> Perhaps innovative designers like Suncast will figure out how to build reliable SSDs based on parts which are more likely to wear out and forget.

At which point we'll probably start seeing the memristor make an appearance in various devices. :)
Michael Herf wrote:
> I agree that RAID-DP is much more scalable for reads than RAIDZx, and this basically turns into a cost concern at scale. The raw cost/GB for ZFS is much lower, so even a 3-way mirror could be used instead of NetApp. But this certainly reduces the cost advantage significantly.

Has anyone compared RAID-Z2 against something like LSI MegaRAID RAID-6? If a sub-$1,000 RAID controller can save thousands of dollars worth of disks, it would somewhat put the lie to the idea that ZFS kills hardware RAID.

Wes Felter
On Jan 6, 2010, at 1:30 PM, Wes Felter wrote:
> Has anyone compared RAID-Z2 against something like LSI MegaRAID RAID-6? If a sub-$1,000 RAID controller can save thousands of dollars worth of disks, it would somewhat put the lie to the idea that ZFS kills hardware RAID.

ZFS doesn't kill "hardware RAID." First, there is no such thing as "hardware RAID"; there is only software RAID. Second, "hardware RAID" systems are pretty much useless without a file system, database, or some other application which can translate a set of blocks into something useful. Rather, ZFS works very nicely with "hardware RAID" systems or JBODs, iSCSI, et al. You can happily add the advantages of ZFS features on top of an LSI MegaRAID RAID-6 controller at little or no additional cost :-)
-- richard
On Wed, Jan 6, 2010 at 4:30 PM, Wes Felter <wesley at felter.org> wrote:
> Has anyone compared RAID-Z2 against something like LSI MegaRAID RAID-6? If a sub-$1,000 RAID controller can save thousands of dollars worth of disks, it would somewhat put the lie to the idea that ZFS kills hardware RAID.

A hardware RAID6 controller with a big battery-backed write cache will beat RAIDZ2 hands down. It avoids the write-hole problem with the battery cache, but you still have the possibility of silent data corruption to deal with. You could put two LSI MegaRAID controllers into a 2U box, each going to a storage enclosure set up as a RAID6 array, then build a zpool out of a mirrored vdev of the two LUNs. That takes RAIDZ2 out of the picture while providing integrity and performance. Extra cost is always assumed if you want both; if you want to add that redundancy, it will cost you double.

-Ross
Wilkinson, Alex
2010-Jan-07 06:39 UTC
[zfs-discuss] rethinking RaidZ and Record size [SEC=UNCLASSIFIED]
On Wed, Jan 06, 2010 at 02:22:19PM -0800, Richard Elling wrote:
> Rather, ZFS works very nicely with "hardware RAID" systems or JBODs, iSCSI, et al. [...]

I'm not sure how ZFS works very nicely with, say, an EMC Cx310 array?

-Alex
Richard Elling
2010-Jan-07 07:00 UTC
[zfs-discuss] rethinking RaidZ and Record size [SEC=UNCLASSIFIED]
On Jan 6, 2010, at 10:39 PM, Wilkinson, Alex wrote:
> I'm not sure how ZFS works very nicely with, say, an EMC Cx310 array?

Why would ZFS be any different than other file systems on a Cx310?
-- richard
Wilkinson, Alex
2010-Jan-07 07:09 UTC
[zfs-discuss] rethinking RaidZ and Record size [SEC=UNCLASSIFIED]
On Wed, Jan 06, 2010 at 11:00:49PM -0800, Richard Elling wrote:
> Why would ZFS be any different than other file systems on a Cx310?

Well, not specifically the filesystem, but using ZFS as a volume manager. Please see:
http://mail.opensolaris.org/pipermail/zfs-discuss/2009-April/028089.html

-Alex
Richard Elling
2010-Jan-07 17:57 UTC
[zfs-discuss] rethinking RaidZ and Record size [SEC=UNCLASSIFIED]
On Jan 6, 2010, at 11:09 PM, Wilkinson, Alex wrote:
> Well, not specifically the filesystem, but using ZFS as a volume manager. Please see:
> http://mail.opensolaris.org/pipermail/zfs-discuss/2009-April/028089.html

Choice is good, no? :-) If you choose not to use ZFS for RAID, then that is a perfectly reasonable choice. Many people have a mix of protected and unprotected storage of all sorts. But it is good to know that if you have very, very important data, you can protect it in many complementary ways.
-- richard