> My question is how efficient will ZFS be, given that
> it will be layered on top of the hardware RAID and
> write cache?

ZFS delivers its best performance when used standalone, directly on whole disks. By using ZFS on top of a HW RAID, you make your data susceptible to HW errors caused by the storage subsystem's RAID algorithm, and you slow down the I/O. You should see much better performance by not creating a HW RAID at all and instead adding all the disks in the 3320 enclosures to a ZFS RAIDZ pool. Additionally, given enough disks, it might be possible to squeeze out even better performance by creating several RAIDZ vdevs and striping across them. For a discussion of this aspect, please see the "WHEN TO (AND NOT TO) USE RAID-Z" treatise at http://blogs.sun.com/roch/entry/when_to_and_not_to.
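For illustration only (device names are made up, and the right split depends on how many disks the enclosures actually hold), the two pool layouts would look roughly like this:

    # one RAIDZ vdev across all disks in the enclosures
    zpool create tank raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0

    # or several smaller RAIDZ vdevs, which the pool stripes across
    zpool create tank \
        raidz c2t0d0 c2t1d0 c2t2d0 \
        raidz c2t3d0 c2t4d0 c2t5d0

The second form trades a little capacity for more vdevs and therefore more concurrent I/O, which is the point of Roch's blog entry above.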
przemolicc at poczta.fm
2006-Sep-04 13:20 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
On Mon, Sep 04, 2006 at 01:59:53AM -0700, UNIX admin wrote:
> > My question is how efficient will ZFS be, given that
> > it will be layered on top of the hardware RAID and
> > write cache?
>
> ZFS delivers best performance when used standalone, directly on entire disks. By using ZFS on top of a HW RAID, you make your data susceptible to HW errors caused by the storage subsystem's RAID algorithm, and slow down the I/O.
>
> You should see much better performance by not creating a HW RAID, then adding all the disks in the 3320 enclosures to a ZFS RAIDZ pool.

This is the case where I don't understand Sun's politics at all: Sun doesn't offer a really cheap JBOD which can be bought just for ZFS. And don't even tell me about the 3310/3320 JBODs - they are horribly expensive :-(

If Sun wants ZFS to be adopted quicker it should have such a _really_ cheap JBOD.

przemol
Torrey McMahon
2006-Sep-04 19:19 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
UNIX admin wrote:
>> My question is how efficient will ZFS be, given that
>> it will be layered on top of the hardware RAID and
>> write cache?
>
> ZFS delivers best performance when used standalone, directly on entire disks. By using ZFS on top of a HW RAID, you make your data susceptible to HW errors caused by the storage subsystem's RAID algorithm, and slow down the I/O.

This is simply not true. ZFS would protect against the same type of errors seen on an individual drive as it would on a pool made of HW RAID LUN(s). It might be overkill to layer ZFS on top of a LUN that is already protected in some way by the device's internal RAID code, but it does not "make your data susceptible to HW errors caused by the storage subsystem's RAID algorithm, and slow down the I/O".

True, ZFS can't manage past the LUN into the array. Guess what? ZFS can't get past the disk drive firmware either... and that's a good thing for all parties involved.
Peter Sundstrom
2006-Sep-04 20:49 UTC
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
Hmm. There appear to be differing opinions. Another way of putting my question: can anyone guarantee that ZFS will not perform worse than UFS on the array?

High-speed performance is not really an issue, hence the reason the disks are mirrored rather than striped. The client is more concerned with redundancy (hence the cautious approach of having 3 hot spares).
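For reference, keeping both the mirroring and the spares on the ZFS side might look something like the sketch below - assuming a build recent enough to support hot spares, and with purely illustrative device names:

    # mirrored pairs plus hot spares, all managed by ZFS
    zpool create tank \
        mirror c1t0d0 c2t0d0 \
        mirror c1t1d0 c2t1d0 \
        spare c1t2d0 c2t2d0 c1t3d0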
Torrey McMahon
2006-Sep-04 21:18 UTC
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
Depends on the workload. (Did I miss that email?)

Peter Sundstrom wrote:
> Hmm. There appear to be differing opinions.
>
> Another way of putting my question: can anyone guarantee that ZFS will not perform worse than UFS on the array?
>
> High-speed performance is not really an issue, hence the reason the disks are mirrored rather than striped. The client is more concerned with redundancy (hence the cautious approach of having 3 hot spares).
On 9/5/06, Torrey McMahon <Torrey.McMahon at sun.com> wrote:
> This is simply not true. ZFS would protect against the same type of
> errors seen on an individual drive as it would on a pool made of HW raid
> LUN(s). It might be overkill to layer ZFS on top of a LUN that is
> already protected in some way by the devices internal RAID code but it
> does not "make your data susceptible to HW errors caused by the storage
> subsystem's RAID algorithm, and slow down the I/O".

& Roch's recommendation to leave at least 1 layer of redundancy to ZFS allows the extension of ZFS's own redundancy features for some truly remarkable data reliability.

Perhaps the question should be how one could mix them to get the best of both worlds instead of going to either extreme.

> True, ZFS can't manage past the LUN into the array. Guess what? ZFS
> can't get past the disk drive firmware either... and that's a good thing
> for all parties involved.

--
Just me,
Wire ...
Robert Milkowski
2006-Sep-05 10:45 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
Hello Wee,

Tuesday, September 5, 2006, 10:58:32 AM, you wrote:

WYT> On 9/5/06, Torrey McMahon <Torrey.McMahon at sun.com> wrote:
>> This is simply not true. ZFS would protect against the same type of
>> errors seen on an individual drive as it would on a pool made of HW raid
>> LUN(s). It might be overkill to layer ZFS on top of a LUN that is
>> already protected in some way by the devices internal RAID code but it
>> does not "make your data susceptible to HW errors caused by the storage
>> subsystem's RAID algorithm, and slow down the I/O".

WYT> & Roch's recommendation to leave at least 1 layer of redundancy to ZFS
WYT> allows the extension of ZFS's own redundancy features for some truly
WYT> remarkable data reliability.

WYT> Perhaps the question should be how one could mix them to get the best
WYT> of both worlds instead of going to either extreme.

Depending on your data, it can sometimes be useful to create HW RAID LUNs and then do just striping on the ZFS side across at least two of them. That way you do not get full data protection, but you do get fs/pool metadata protection via ditto blocks. Of course, each LUN should be a HW RAID made of different physical disks.

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
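As a rough sketch of that layout, with hypothetical LUN names (each LUN being a hardware RAID volume on its own set of disks):

    # plain ZFS stripe across two hardware-RAID LUNs; pool and filesystem
    # metadata is still written twice (ditto blocks), so metadata damage on
    # one LUN can be healed from the copy on the other
    zpool create tank c4t0d0 c5t0d0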
Jonathan Edwards
2006-Sep-05 16:42 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
On Sep 5, 2006, at 06:45, Robert Milkowski wrote:
> [...]
> Depends on your data but sometimes it could be useful to create HW RAID
> and then do just striping on the ZFS side between at least two LUNs. That
> way you do not get data protection but fs/pool protection with ditto
> blocks. Of course each LUN is HW RAID made of different physical disks.

I remember working up a chart on this list about 2 months ago. Here are 10 options I can think of to summarize combinations of ZFS with HW redundancy:

 #   ZFS   ARRAY HW      CAPACITY   COMMENTS
 --  ---   --------      --------   --------
 1   R0    R1            N/2        hw mirror - no zfs healing (XXX)
 2   R0    R5            N-1        hw R5 - no zfs healing (XXX)
 3   R1    2 x R0        N/2        flexible, redundant, good perf
 4   R1    2 x R5        (N/2)-1    flexible, more redundant, decent perf
 5   R1    1 x R5        (N-1)/2    parity and mirror on same drives (XXX)
 6   RZ    R0            N-1        standard RAIDZ - no array RAID (XXX)
 7   RZ    R1 (tray)     (N/2)-1    RAIDZ+1
 8   RZ    R1 (drives)   (N/2)-1    RAID1+Z (highest redundancy)
 9   RZ    2 x R5        N-3        triple parity calculations (XXX)
 10  RZ    1 x R5        N-2        double parity calculations (XXX)

If you've invested in a RAID controller on an array, you might as well take advantage of it; otherwise you could probably get an old D1000 chassis somewhere and just run RAIDZ on JBOD.

If you're more concerned about redundancy than space, with the Sun/STK 3000 series dual-controller arrays I would either create at least 2 x RAID5 LUNs balanced across controllers and ZFS mirror them, or create at least 4 x RAID1 LUNs balanced across controllers and use RAIDZ. RAID0 isn't going to make that much sense, since you've got a 128KB txg commit on ZFS, which isn't going to be enough to do a full stripe in most cases.

.je
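To make a couple of those options concrete - LUN names are purely illustrative, with one set of LUNs per array controller:

    # option 4: ZFS mirror across two hardware RAID-5 LUNs
    zpool create tank mirror c4t0d0 c5t0d0

    # option 7: RAIDZ across four hardware RAID-1 LUNs, two per controller
    zpool create tank raidz c4t0d0 c4t1d0 c5t0d0 c5t1d0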
Torrey McMahon
2006-Sep-05 17:04 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
Wee Yeh Tan wrote:
>
> Perhaps the question should be how one could mix them to get the best
> of both worlds instead of going to either extreme.

In the specific case of a 3320, I think Jonathan's chart has a lot of good info that can be put to use. In the general case, well, I hate to say this, but it depends.

From what I've seen, the general discussions on this list tend toward "Make my small direct-connected desktop/server go as fast as possible". Once you leave that space and move to the opposite end of the spectrum, a large heterogeneous datacenter, you have to start looking at the overall data management strategy and how the different pieces of technology get implemented. (Site-to-site array replication being a good example.) That's where I think you'll find more interesting cases where RAID setups will be used with ZFS on top more often than not.

There are also the speed enhancements provided by a HW RAID array, and usually RAS too, compared to a native disk drive, but the numbers on that are still coming in and being analyzed. (See previous threads.)

--
Torrey McMahon
Sun Microsystems Inc.
Richard Elling - PAE
2006-Sep-05 17:06 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
Jonathan Edwards wrote:
> Here's 10 options I can think of to summarize combinations of zfs with
> hw redundancy:
>
> [chart of the 10 ZFS/array HW combinations elided]
>
> If you've invested in a RAID controller on an array, you might as well
> take advantage of it, otherwise you could probably get an old D1000
> chassis somewhere and just run RAIDZ on JBOD.

I think it would be good if RAIDoptimizer could be expanded to show these cases, too. Right now, the availability and performance models are simple. To go to this level, the models get more complex and there are many more tunables. However, for a few representative cases it might make sense to do deep analysis, even if that analysis does not get translated into a tool directly. We have the tools to do the deep analysis, but the models will need to be written and verified.

That said, does anyone want to see this sort of analysis? If so, what configurations should we do first? (Keep in mind that each config may take a few hours, maybe more, depending on the performance model.)
 -- richard
Wee Yeh Tan writes:
 > On 9/5/06, Torrey McMahon <Torrey.McMahon at sun.com> wrote:
 > > This is simply not true. ZFS would protect against the same type of
 > > errors seen on an individual drive as it would on a pool made of HW raid
 > > LUN(s). It might be overkill to layer ZFS on top of a LUN that is
 > > already protected in some way by the devices internal RAID code but it
 > > does not "make your data susceptible to HW errors caused by the storage
 > > subsystem's RAID algorithm, and slow down the I/O".
 >
 > & Roch's recommendation to leave at least 1 layer of redundancy to ZFS
 > allows the extension of ZFS's own redundancy features for some truly
 > remarkable data reliability.
 >
 > Perhaps the question should be how one could mix them to get the best
 > of both worlds instead of going to either extreme.
 >
 > > True, ZFS can't manage past the LUN into the array. Guess what? ZFS
 > > can't get past the disk drive firmware either... and that's a good thing
 > > for all parties involved.

Thinking some more about this. If your requirements do mandate some form of mirroring, then it truly seems that ZFS should take that in charge, if only because of the self-healing characteristics. So I feel the storage array's job is to export low-latency LUNs to ZFS.

I'd be happy to live with those simple LUNs, but I guess some storage will just refuse to export non-protected LUNs. Now we can definitely take advantage of the array's capability of exporting highly resilient LUNs; RAID-5 seems to fit the bill rather well here. Even a 9+1 LUN will be quite resilient and have a low block overhead. So we benefit from the array's resiliency as well as its low-latency characteristics. And we mirror data at the ZFS level, which means great performance, great data integrity, and great availability.

Note that ZFS write characteristics (all sequential) mean that we will commonly be filling full stripes on the LUNs, thus avoiding the partial-stripe performance pitfall.

If you must shy away from any form of mirroring, then it's either stripe your RAID-5 LUNs (performance edge for those who live dangerously) or RAIDZ around those RAID-5 LUNs (lower cost, survives LUN failures).

-r
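In zpool terms, and with made-up LUN names, the layouts just described are simply:

    # mirror data at the ZFS level across two RAID-5 LUNs from the array
    zpool create tank mirror c6t0d0 c7t0d0

    # without mirroring: either stripe the RAID-5 LUNs ...
    zpool create tank c6t0d0 c7t0d0
    # ... or RAIDZ around them, which survives the loss of a whole LUN
    zpool create tank raidz c6t0d0 c6t1d0 c7t0d0 c7t1d0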
Torrey McMahon
2006-Sep-06 21:58 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
Roch - PAE wrote:
> Thinking some more about this. If your requirements do
> mandate some form of mirroring, then it truly seems that ZFS
> should take that in charge, if only because of the
> self-healing characteristics. So I feel the storage array's
> job is to export low-latency LUNs to ZFS.

The hard part is getting a set of simple requirements. As you go into more complex data center environments you get hit with older Solaris revs, other OSs, SOX compliance issues, etc. etc. etc. The world where most of us seem to be playing with ZFS is on the lower end of the complexity scale. Sure, throw your desktop some fast SATA drives. No problem. Oh wait, you've got ten Oracle DBs on three E25Ks that need to be backed up every other blue moon ...

I agree with the general idea that an array, be it one disk or some RAID combination, should simply export low-latency LUNs. However, it's the features offered by the array - like site-to-site replication - used to meet more complex requirements that literally slow things down. In many cases you'll see years-old operational procedures causing those low-latency LUNs to slow down even more. Something really hard to get a customer to undo because a new-fangled file system is out. ;)

> I'd be happy to live with those simple LUNs, but I guess some
> storage will just refuse to export non-protected LUNs. Now
> we can definitely take advantage of the array's capability
> of exporting highly resilient LUNs; RAID-5 seems to fit the
> bill rather well here. Even a 9+1 LUN will be quite
> resilient and have a low block overhead.

I think the 99x0 used to do 3+1 only. Now it's 7+1, if I recall. Close enough, I suppose.

> So we benefit from the array's resiliency as well as its low
> latency characteristics. And we mirror data at the ZFS level,
> which means great performance, great data integrity, and
> great availability.
>
> Note that ZFS write characteristics (all sequential) mean
> that we will commonly be filling full stripes on the LUNs,
> thus avoiding the partial-stripe performance pitfall.

One thing comes to mind in that case. Many arrays do sequential detect on the blocks that come in to the front-end ports. If things get split up too much, or arrive out of order, or <insert some strange array characteristic here>, then you could induce more latency as the array does cartwheels trying to figure out what's going on.
Nicolas Dorfsman
2006-Sep-07 08:15 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
> The hard part is getting a set of simple requirements. As you go into
> more complex data center environments you get hit with older Solaris
> revs, other OSs, SOX compliance issues, etc. etc. etc. The world where
> most of us seem to be playing with ZFS is on the lower end of the
> complexity scale. Sure, throw your desktop some fast SATA drives. No
> problem. Oh wait, you've got ten Oracle DBs on three E25Ks that need to
> be backed up every other blue moon ...

Another factor is CPU use.

Does anybody really know what the effects of an intensive CPU workload on ZFS performance will be, and what the effects of ZFS RAID CPU computation on an intensive CPU workload will be?

I heard a story about a customer complaining about his high-end server's performance; when a guy came on site... and discovered beautiful SVM RAID-5 volumes, the solution was almost found.

Nicolas
Torrey McMahon
2006-Sep-07 17:22 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
Nicolas Dorfsman wrote:
>> The hard part is getting a set of simple
>> requirements. [...]
>
> Another factor is CPU use.
>
> Does anybody really know what the effects of an intensive CPU workload on
> ZFS performance will be, and what the effects of ZFS RAID CPU computation
> on an intensive CPU workload will be?
>
> I heard a story about a customer complaining about his high-end server's
> performance; when a guy came on site... and discovered beautiful SVM
> RAID-5 volumes, the solution was almost found.

RAID calculations take CPU time, but I haven't seen numbers on ZFS usage. SVM is known for using a fair bit of CPU when performing R5 calculations, and I'm sure other OSes have the same issue. EMC used to go around saying that offloading RAID calculations to their storage arrays would increase application performance because you would free up CPU time to do other stuff. The "EMC effect" is how they used to market it.
Richard Elling - PAE
2006-Sep-07 18:07 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
Torrey McMahon wrote:
> RAID calculations take CPU time, but I haven't seen numbers on ZFS usage.
> SVM is known for using a fair bit of CPU when performing R5 calculations,
> and I'm sure other OSes have the same issue. EMC used to go around saying
> that offloading RAID calculations to their storage arrays would increase
> application performance because you would free up CPU time to do other
> stuff. The "EMC effect" is how they used to market it.

In all modern processors, and most ancient processors, XOR takes 1 CPU cycle and is easily pipelined. Getting the data from the disk to the registers takes thousands or hundreds of thousands of CPU cycles. You will more likely feel the latency of the read-modify-write for RAID-5 than the CPU time needed for the XOR. ZFS avoids the read-modify-write, but does compression, so it is possible that a few more CPU cycles will be used. But it should still be a big win because CPU cycles are less expensive than disk I/O. Meanwhile, I think we're all looking for good data on this.
 -- richard
Richard Elling - PAE wrote:
> Torrey McMahon wrote:
>> RAID calculations take CPU time, but I haven't seen numbers on ZFS
>> usage. [...]
>
> In all modern processors, and most ancient processors, XOR takes 1 CPU
> cycle and is easily pipelined. Getting the data from the disk to the
> registers takes thousands or hundreds of thousands of CPU cycles. You
> will more likely feel the latency of the read-modify-write for RAID-5
> than the CPU time needed for the XOR. ZFS avoids the read-modify-write,
> but does compression, so it is possible that a few more CPU cycles will
> be used. But it should still be a big win because CPU cycles are less
> expensive than disk I/O. Meanwhile, I think we're all looking for good
> data on this.
> -- richard

I believe the true answer is (wait for it...) It Depends(TM) on what you're limited by. If your system, under your load, is CPU constrained, then ZFS calculating the RAIDZ parity (and checksum) is going to hurt; if you are I/O constrained, then having the otherwise idle CPU do the work (which is, of course, more than just an XOR instruction, but we all know that) may help.

The ZFS design center of mostly-idle CPUs is not always accurate, although most customers don't dare push the system to 100% utilization. It's when you _do_ hit that point, or when the extra overhead unexpectedly makes you hit or go beyond that point, that things can get interesting quickly.

- Pete
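A quick, rough way to see which side of that line a running system is on is just the ordinary observability tools (the pool name here is only an example):

    # per-CPU utilization: watch the usr/sys/idl columns under load
    mpstat 5

    # pool-level throughput and per-vdev balance for the same interval
    zpool iostat -v tank 5

If the CPUs show plenty of idle while the pool is saturated, the parity/checksum work is essentially free; if idle is near zero, the extra cycles start to matter.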
On 9/7/06, Torrey McMahon <Torrey.McMahon at sun.com> wrote:
> Nicolas Dorfsman wrote:
> > [...]
> > Does anybody really know what the effects of an intensive CPU workload on
> > ZFS performance will be, and what the effects of ZFS RAID CPU computation
> > on an intensive CPU workload will be?

With ZFS I have found that memory is a much greater limitation. Even my dual-300MHz U2 has no problem filling 2x 20MB/s SCSI channels, even with compression enabled, using raidz and 10k rpm 9GB drives; thanks to its 2GB of RAM it does great at everything I throw at it.

On the other hand, my Blade 1500 with 512MB of RAM, 3x 18GB 10k rpm drives on 2x 40MB/s SCSI channels, and the OS on an 80GB IDE drive, has problems interactively: as soon as you push ZFS hard it hogs all the RAM, and it may take 5 or 10 seconds to get a response in an xterm while the machine clears out RAM and loads its applications/data back in.

James Dickens
uadmin.blogspot.com

> Raid calculations take CPU time but I haven't seen numbers on ZFS usage.
> [...]
Richard Elling - PAE
2006-Sep-07 19:14 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
przemolicc at poczta.fm wrote:
> This is the case where I don't understand Sun's politics at all: Sun
> doesn't offer a really cheap JBOD which can be bought just for ZFS. And
> don't even tell me about the 3310/3320 JBODs - they are horribly expensive :-(

Yep, multipacks have been EOL for some time now -- killed by big disks. Back when disks were small, people would buy multipacks to attach to their workstations. There was a time when none of the workstations had internal disks, but I'd be dating myself :-)

For datacenter-class storage, multipacks were not appropriate. They only had single-ended SCSI interfaces, which have a limited cable budget, which limited their use in racks. Also, they weren't designed to be used in a rack environment, so they weren't mechanically appropriate either. I suppose you can still find them on eBay.

> If Sun wants ZFS to be adopted quicker it should have such a _really_ cheap
> JBOD.

I don't quite see this in my crystal ball. Rather, I see all of the SAS/SATA chipset vendors putting RAID in the chipset. Basically, you can't get a "dumb" interface anymore, except for fibre channel :-). In other words, if we were to design a system in a chassis with perhaps 8 disks, then we would also use a controller which does RAID. So, we're right back to square 1.
 -- richard
The bigger problem with system utilization for software RAID is the cache, not the CPU cycles proper. Simply preparing to write 1 MB of data will flush half of a 2 MB L2 cache. This hurts overall system performance far more than the few microseconds that XORing the data takes. (A similar effect occurs with file system buffering, and this is one reason why direct I/O is attractive for databases - there's no pollution of the system cache.)
przemolicc at poczta.fm
2006-Sep-08 08:09 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
On Thu, Sep 07, 2006 at 12:14:20PM -0700, Richard Elling - PAE wrote:
> przemolicc at poczta.fm wrote:
> > This is the case where I don't understand Sun's politics at all: Sun
> > doesn't offer a really cheap JBOD which can be bought just for ZFS. And
> > don't even tell me about the 3310/3320 JBODs - they are horribly expensive :-(
>
> Yep, multipacks have been EOL for some time now -- killed by big disks.
> [...]
>
> > If Sun wants ZFS to be adopted quicker it should have such a _really_ cheap
> > JBOD.
>
> I don't quite see this in my crystal ball. Rather, I see all of the SAS/SATA
> chipset vendors putting RAID in the chipset. Basically, you can't get a
> "dumb" interface anymore, except for fibre channel :-). In other words, if
> we were to design a system in a chassis with perhaps 8 disks, then we would
> also use a controller which does RAID. So, we're right back to square 1.

Richard, when I talk about a cheap JBOD I am thinking about home users/small servers/small companies. I guess you can sell 100 X4500s and at the same time 1000 (or even more) cheap JBODs to the small companies which for sure will not buy the big boxes. Yes, I know, you earn more selling the X4500. But what do you think - how did Linux find its way into data centers and become an important player in the OS space? Through home users/enthusiasts who became familiar with it and then started using the familiar things in their jobs. A proven way to achieve "world domination". ;-))

przemol
> Roch - PAE wrote:
> The hard part is getting a set of simple requirements. As you go into
> more complex data center environments you get hit with older Solaris
> revs, other OSs, SOX compliance issues, etc. etc. etc. The world where
> most of us seem to be playing with ZFS is on the lower end of the
> complexity scale.

I've been watching this thread and unfortunately fit this model. I'd hoped that ZFS might scale enough to solve my problem, but you seem to be saying that it's mostly untested in large-scale environments.

About 7 years ago we ran out of inodes on our UFS file systems. We used bFile as middleware for a while to distribute the files across multiple disks, and then switched to VFS on SAN about 5 years ago. Distribution across file systems and inode depletion continued to be a problem, so we switched middleware to another vendor that essentially compresses about 200 files into a single 10MB archive and uses a DB to find the file within the archive on the correct disk. An expensive, complex and slow but effective solution, until the latest license renewal when we got hit with a huge bill.

I'd love to go back to a pure file system model and have looked at Reiser4, JFS, NTFS and now ZFS for a way to support over 100 million small documents and 16TB. We average 2 file reads and 1 file write per second 24/7, with expected growth to 24TB. I'd be willing to scrap everything we have to find a non-proprietary long-term solution. ZFS looked like it might provide an answer. Are you saying it's not really suitable for this type of application?
Torrey McMahon writes:
 > Nicolas Dorfsman wrote:
 > > Does anybody really know what the effects of an intensive CPU workload
 > > on ZFS performance will be, and what the effects of ZFS RAID CPU
 > > computation on an intensive CPU workload will be?
 >
 > RAID calculations take CPU time, but I haven't seen numbers on ZFS usage.
 > [...]

I just measured quickly that a 1.2GHz SPARC can do [400-500] MB/sec of encoding (time spent in the misnamed function vdev_raidz_reconstruct) for a 3-disk raid-z group. Bigger groups should cost more, but I'd also expect the cost to decrease with increasing CPU frequency.

Note that the raidz cost is impacted by this:

  6460622 zio_nowait() doesn't live up to its name

-r
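One rough way to reproduce this kind of measurement on your own box, assuming DTrace is available, is a kernel profiling one-liner; the function names that show up (raidz parity, checksum) will vary by build, so treat this as a sketch rather than an exact recipe:

    # sample the kernel ~997 times/sec for 30s while a write load runs,
    # then look for raidz/checksum functions near the top of the output
    dtrace -n 'profile-997 /arg0/ { @[func(arg0)] = count(); } tick-30s { exit(0); }'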
Darren J Moffat
2006-Sep-08 08:41 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
przemolicc at poczta.fm wrote:
> Richard, when I talk about a cheap JBOD I am thinking about home users/small
> servers/small companies. I guess you can sell 100 X4500s and at the same
> time 1000 (or even more) cheap JBODs to the small companies which for sure
> will not buy the big boxes. Yes, I know, you earn more selling the X4500.
> But what do you think - how did Linux find its way into data centers
> and become an important player in the OS space? Through home users/enthusiasts
> who became familiar with it and then started using the familiar things in
> their jobs.

But Linux isn't a hardware vendor and doesn't make cheap JBODs or multipacks for the home user. So I don't see how we get from "Sun should make a cheap home-user JBOD" (which BTW we don't really have the channel to sell anyway) to "but Linux dominated this way".

--
Darren J Moffat
przemolicc at poczta.fm
2006-Sep-08 09:23 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
On Fri, Sep 08, 2006 at 09:41:58AM +0100, Darren J Moffat wrote:
> przemolicc at poczta.fm wrote:
> > Richard, when I talk about a cheap JBOD I am thinking about home users/small
> > servers/small companies. [...]
>
> But Linux isn't a hardware vendor and doesn't make cheap JBODs or
> multipacks for the home user.

Linux is used as a symbol.

> So I don't see how we get from "Sun should make a cheap home-user JBOD"
> (which BTW we don't really have the channel to sell anyway) to "but
> Linux dominated this way".

"Home user" = the tech/geek/enthusiast who is an admin at work.

[ Linux ]
The "home user" uses Linux at home and is satisfied with it. He/she then goes to work and says "Let's install/use it on less important servers". He/she (and management) is again satisfied with it. So let's use it on more important servers... etc.

[ ZFS ]
The "home user" uses ZFS (Solaris) at home (remember the easiness and even the web interface to ZFS operations!) to keep photos, music, etc. and is satisfied with it. He/she then goes to work and says "I have been using a fantastic filesystem for a while. Let's use it on less important servers". OK. Later on: "Works fine. Let's use it on more important ones...". Etc.

Yes, I know, a bit naive. But remember that not only Linux spreads this way; Solaris does as well. I guess most of the downloaded Solaris CDs/DVDs are for x86. You as a company "attack" at the high-end/midrange level. Let users/admins/fans "attack" at the lower-end level.

przemol
Robert Milkowski
2006-Sep-08 09:41 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
Hello James,

Thursday, September 7, 2006, 8:58:10 PM, you wrote:

JD> with ZFS I have found that memory is a much greater limitation, even
JD> my dual 300mhz u2 has no problem filling 2x 20MB/s scsi channels, even
JD> with compression enabled, using raidz and 10k rpm 9GB drives, thanks
JD> to its 2GB of ram it does great at everything I throw at it. On the
JD> other hand my blade 1500 ram 512MB with 3x 18GB 10k rpm drives using
JD> 2x 40MB/s scsi channels, os is on a 80GB ide drive, has problems
JD> interactively because as soon as you push zfs hard it hogs all the ram
JD> and may take 5 or 10 seconds to get response on xterms while the
JD> machine clears out ram and loads its applications/data back into ram.

IIRC there's a bug in the SPARC ata driver which, when combined with ZFS, expresses itself. Unless you use only ZFS on those SCSI drives...?

--
Best regards,
Robert                          mailto:rmilkowski at task.gda.pl
                                       http://milek.blogspot.com
zfs "hogs all the ram" under a sustained heavy write load. This is being tracked by: 6429205 each zpool needs to monitor it''s throughput and throttle heavy writers -r
Roch - PAE
2006-Sep-08 10:11 UTC
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
Jim Sloey writes:
 > > Roch - PAE wrote:
 > > The hard part is getting a set of simple requirements. [...]
 >
 > I've been watching this thread and unfortunately fit this model. I'd
 > hoped that ZFS might scale enough to solve my problem, but you seem to
 > be saying that it's mostly untested in large-scale environments.
 > [...]
 > I'd love to go back to a pure file system model and have looked at
 > Reiser4, JFS, NTFS and now ZFS for a way to support over 100 million
 > small documents and 16TB. We average 2 file reads and 1 file write per
 > second 24/7, with expected growth to 24TB. I'd be willing to scrap
 > everything we have to find a non-proprietary long-term solution.
 > ZFS looked like it might provide an answer. Are you saying it's not
 > really suitable for this type of application?

I don't think that was the point of the post. I read it to mean that some customers, because of considerations outside of ZFS, have a need to use storage arrays in ways that may not allow ZFS to develop its full potential. If you don't replicate within ZFS, then ZFS will not be able to heal corrupted blocks. But if your storage model allows for ZFS replication, then the quote is not aimed at your case.

Are you going to grow to 24TB using a few writes per second?

-r
On Fri, 8 Sep 2006, Jim Sloey wrote:
> > Roch - PAE wrote:
> > The hard part is getting a set of simple requirements. [...]
>
> I've been watching this thread and unfortunately fit this model. I'd
> hoped that ZFS might scale enough to solve my problem, but you seem to be
> saying that it's mostly untested in large-scale environments.
> [...]
> I'd be willing to scrap everything we have to find a non-proprietary
> long-term solution. ZFS looked like it might provide an answer. Are you
> saying it's not really suitable for this type of application?

No - that's not what he is saying. Personally I think (from the info presented) that ZFS would be a viable long-term solution to this storage headache. But the neat thing about ZFS is that, with a spare AMD-based box and as few as 5 low-cost SATA drives, you can actually try it[1].

Think about this for a second: you can put together a test ZFS box for less money than you would spend, in man-hours, talking about it as a _possible_ solution.

[1] 5 to 10 SATA drives won't get you 16TB - but it'll get you close enough to model the system with a substantial portion of your dataset.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  al at logical-approach.com
           Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris.Org Community Advisory Board (CAB) Member - Apr 2005
OpenSolaris Governing Board (OGB) Member - Feb 2006
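A minimal sketch of such a test box (device names are hypothetical, and the layout is only meant to mimic the document store, not match production):

    # one RAIDZ vdev over five SATA drives, plus a compressed filesystem
    # to stand in for the small-document archive
    zpool create testpool raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0
    zfs create testpool/docs
    zfs set compression=on testpool/docs

From there a simple load generator writing millions of small files will show whether the inode-depletion and distribution problems really do go away.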
Jim Sloey
2006-Sep-08 13:33 UTC
[zfs-discuss] Re: Re: Re: Recommendation ZFS on StorEdge 3320
rbourbon writes:
> I don't think that was the point of the post. I read it to mean that
> some customers, because of considerations outside of ZFS, have a need
> to use storage arrays in ways that may not allow ZFS to develop its
> full potential.

I've been following this thread because we have redundant load-balanced servers, a SAN, and replication to a disaster recovery site 800 miles away. We will probably not be able to use ZFS to its full potential (especially for replication); however, it does solve our inode depletion problem and eliminates middleware. Not trying to hijack the thread, just trying to learn from others' experience before I commit.

> Are you going to grow to 24TB using a few writes per
> second?

Actually 24TB (8TB growth from current capacity) is the low-end projection: 60 sec * 60 min * 24 hours * 365 days = 31,536,000 new files/year * 3 yrs until the next technology refresh/upgrade.
Richard Elling - PAE
2006-Sep-08 16:33 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
przemolicc at poczta.fm wrote:
>> I don't quite see this in my crystal ball. Rather, I see all of the SAS/SATA
>> chipset vendors putting RAID in the chipset. Basically, you can't get a
>> "dumb" interface anymore, except for fibre channel :-). In other words, if
>> we were to design a system in a chassis with perhaps 8 disks, then we would
>> also use a controller which does RAID. So, we're right back to square 1.
>
> Richard, when I talk about a cheap JBOD I am thinking about home users/small
> servers/small companies. I guess you can sell 100 X4500s and at the same
> time 1000 (or even more) cheap JBODs to the small companies which for sure
> will not buy the big boxes. [...]

I was looking for a new AM2-socket motherboard a few weeks ago. All of the ones I looked at had 2x IDE and 4x SATA with onboard (SATA) RAID. All were less than $150. In other words, the days of having a JBOD-only solution are over, except for single-disk systems. 4x 750 GBytes is a *lot* of data (and video).

There has been some recent discussion about eSATA JBODs in the press. I'm not sure they will gain much market share. iPods and flash drives have a much larger market share.

> A proven way to achieve "world domination". ;-))

Dang! I was planning to steal a cobalt bomb and hold the world hostage while I relax in my space station... zero-G, whee! :-)
 -- richard
Bill Sommerfeld
2006-Sep-08 18:05 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
On Fri, 2006-09-08 at 09:33 -0700, Richard Elling - PAE wrote:
> There has been some recent discussion about eSATA JBODs in the press. I'm not
> sure they will gain much market share. iPods and flash drives have a much
> larger market share.

Dunno about eSATA JBODs, but eSATA host ports have appeared on at least two HDTV-capable DVRs for storage expansion (it looks like one model of the Scientific Atlanta cable-box DVRs, as well as the shipping-any-day-now TiVo Series 3).

It's strange that they didn't go with FireWire, since it's already widely used for digital video.

- Bill
Ed Gould
2006-Sep-08 18:22 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
On Sep 8, 2006, at 9:33, Richard Elling - PAE wrote:
> I was looking for a new AM2 socket motherboard a few weeks ago. All of
> the ones I looked at had 2xIDE and 4xSATA with onboard (SATA) RAID. All
> were less than $150. In other words, the days of having a JBOD-only
> solution are over except for single disk systems. 4x750 GBytes is a
> *lot* of data (and video).

It's not clear to me that JBOD is dead. The (S)ATA RAID cards I've seen are really software RAID solutions that know just enough in the controller to let the BIOS boot off a RAID volume. None of the expensive RAID stuff is in the controller.

--Ed
Torrey McMahon
2006-Sep-08 18:35 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
Ed Gould wrote:
> On Sep 8, 2006, at 9:33, Richard Elling - PAE wrote:
>> I was looking for a new AM2 socket motherboard a few weeks ago. [...]
>> In other words, the days of having a JBOD-only solution are over except
>> for single disk systems. 4x750 GBytes is a *lot* of data (and video).
>
> It's not clear to me that JBOD is dead. The (S)ATA RAID cards I've
> seen are really software RAID solutions that know just enough in the
> controller to let the BIOS boot off a RAID volume. None of the
> expensive RAID stuff is in the controller.

If I read between the lines here, I think you're saying that the RAID functionality is in the chipset but the management can only be done by software running on the outside. (Right?)

A1000 anyone? :)
Ed Gould
2006-Sep-08 18:40 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
On Sep 8, 2006, at 11:35, Torrey McMahon wrote:
> If I read between the lines here, I think you're saying that the RAID
> functionality is in the chipset but the management can only be done by
> software running on the outside. (Right?)

No. All that's in the chipset is enough to read a RAID volume for boot. Block layout, RAID-5 parity calculations, and the rest are all done in software. I wouldn't be surprised if RAID-5 parity checking was absent on read for boot, but I don't actually know.

--Ed
Jonathan Edwards
2006-Sep-08 18:41 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
On Sep 8, 2006, at 14:22, Ed Gould wrote:
> It's not clear to me that JBOD is dead. The (S)ATA RAID cards I've
> seen are really software RAID solutions that know just enough in
> the controller to let the BIOS boot off a RAID volume. None of the
> expensive RAID stuff is in the controller.

Additionally, the only RAID levels many of them support favor mirroring and striping (RAID 0, 1, 10, etc.); not as many do parity.
Bennett, Steve
2006-Sep-08 22:14 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
> Dunno about eSATA jbods, but eSATA host ports have
> appeared on at least two HDTV-capable DVRs for storage
> expansion (looks like one model of the Scientific Atlanta
> cable box DVRs as well as on the shipping-any-day-now
> TiVo Series 3).
>
> It's strange that they didn't go with firewire since it's
> already widely used for digital video.

Cost? If you use eSATA it's pretty much just a physical connector onto the board, whereas I guess FireWire needs a 1394 interface (a couple of dollars?) plus a royalty to all the patent holders. It's probably not much, but I can't see how there can be *any* margin in consumer electronics these days...

Steve.
Richard Elling - PAE
2006-Sep-09 00:59 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
Ed Gould wrote:
> On Sep 8, 2006, at 11:35, Torrey McMahon wrote:
>> If I read between the lines here, I think you're saying that the RAID
>> functionality is in the chipset but the management can only be done by
>> software running on the outside. (Right?)
>
> No. All that's in the chipset is enough to read a RAID volume for
> boot. Block layout, RAID-5 parity calculations, and the rest are all
> done in software. I wouldn't be surprised if RAID-5 parity checking
> was absent on read for boot, but I don't actually know.

At Sun, we often use the LSI Logic LSISAS1064 series of SAS RAID controllers on motherboards for many products. [LSI claims support for Solaris 2.6!] These controllers have a built-in microcontroller (ARM 926, IIRC), firmware, and nonvolatile memory (NVSRAM) for implementing the RAID features. We manage them through the BIOS, OBP, or raidctl(1m). As Torrey says, very much like the A1000. Some of the fancier LSI products offer RAID 5, too.
 -- richard
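For reference, a rough sketch of driving such an onboard controller with raidctl(1m); the exact options and target names depend on the platform, so treat these as illustrative:

    # list any RAID volumes the controller already knows about
    raidctl -l

    # create a hardware mirror from two disks behind the controller
    raidctl -c c0t0d0 c0t1d0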
Anton B. Rang
2006-Sep-09 03:34 UTC
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320 - offtopic
The better SATA RAID cards have hardware support. One site comparing controllers is:

  http://tweakers.net/reviews/557

Five of the eight controllers they looked at implemented RAID in hardware; one of the others implemented only the XOR in hardware. Chips like the Adaptec AIC-8210 implement multiple SATA ports, RAID-5, RAID-6, and a microcontroller in a single chip.

JBOD probably isn't dead, simply because motherboard manufacturers are unlikely to pay the extra $10 it might cost to use a RAID-enabled chip rather than a plain chip (and the cost is more if you add cache RAM); but basic RAID is at least cheap.

Of course, having RAID in the HBA is a single point of failure!
Richard Elling - PAE
2006-Sep-09 05:28 UTC
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320 - offtopic
Anton B. Rang wrote:
> JBOD probably isn't dead, simply because motherboard manufacturers are unlikely to pay
> the extra $10 it might cost to use a RAID-enabled chip rather than a plain chip (and
> the cost is more if you add cache RAM); but basic RAID is at least cheap.

NVidia MCPs (later NForce chipsets) also do RAID. The NForce 5x0 systems even do RAID-5 and sparing (with 6 SATA ports). Using special-purpose RAID chips won't be necessary for desktops or low-end systems. Moore's law says that we can continue to integrate more and more functions onto fewer parts.

> Of course, having RAID in the HBA is a single point of failure!

At this level, and price point, there are many SPOFs. Indeed, there is always at least one SPOF.
 -- richard
Frank Cusack
2006-Sep-09 06:11 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320 - offtopic
On September 8, 2006 5:59:47 PM -0700 Richard Elling - PAE <Richard.Elling at Sun.COM> wrote:
> Ed Gould wrote:
>> No. All that's in the chipset is enough to read a RAID volume for
>> boot. Block layout, RAID-5 parity calculations, and the rest are all
>> done in software. [...]
>
> At Sun, we often use the LSI Logic LSISAS1064 series of SAS RAID controllers
> on motherboards for many products. [LSI claims support for Solaris 2.6!]
> These controllers have a built-in microcontroller (ARM 926, IIRC), firmware,
> and nonvolatile memory (NVSRAM) for implementing the RAID features. We manage
> them through the BIOS, OBP, or raidctl(1m). As Torrey says, very much like
> the A1000. Some of the fancier LSI products offer RAID 5, too.

Yes, some (many) of the RAID controllers do all the RAID in hardware. I don't see where Ed was disputing that. But there will always be a [large] market for cheaper but less capable products, and so at least for a while to come there will be these not-quite-RAID cards. Probably for a very long while. Winmodem, anyone?

-frank
On September 7, 2006 12:25:47 PM -0700 "Anton B. Rang" <Anton.Rang at Sun.COM> wrote:
> The bigger problem with system utilization for software RAID is the
> cache, not the CPU cycles proper. Simply preparing to write 1 MB of data
> will flush half of a 2 MB L2 cache. This hurts overall system performance
> far more than the few microseconds that XORing the data takes.

Interesting. So does this in any way invalidate the benchmarks recently posted here which showed raidz on JBOD outperforming a ZFS stripe on HW RAID-5? (That's my recollection; perhaps it's a mischaracterization or just plain wrong.)

I mean, even if raidz on JBOD wins a filesystem benchmark, when you have an actual application with a working set that is more than filesystem data, the benchmark results could be misleading. Ultimately, you do want to use your actual application as the benchmark, but certainly generic benchmarks should at least be helpful.

-frank
On Sep 9, 2006, at 1:32 AM, Frank Cusack wrote:
> On September 7, 2006 12:25:47 PM -0700 "Anton B. Rang"
> <Anton.Rang at Sun.COM> wrote:
>> The bigger problem with system utilization for software RAID is the
>> cache, not the CPU cycles proper. Simply preparing to write 1 MB of data
>> will flush half of a 2 MB L2 cache. This hurts overall system performance
>> far more than the few microseconds that XORing the data takes.
>
> Interesting. So does this in any way invalidate the benchmarks recently
> posted here which showed raidz on JBOD outperforming a ZFS stripe on HW
> RAID-5?

No. There are, in fact, two reasons why RAID-Z is likely to outperform hardware RAID-5, at least in certain types of I/O benchmarks. First, RAID-5 requires read-modify-write cycles when full stripes aren't being written; and ZFS tends to issue small and pretty much random I/O (in my experience), which is the worst case for RAID-5. Second, performing RAID on the main CPU is faster, or at least just as fast, as doing it in hardware.

There are also cases where hardware RAID-5 will likely outperform ZFS. One is when there is a large RAM cache (which is not being flushed by ZFS -- one issue to be addressed is that the commands ZFS uses to control the write cache on plain disks tend to effectively disable the NVRAM cache on hardware RAID controllers). Another is when the I/O bandwidth being used is near the maximum capacity of the host channel, because doing software RAID requires moving more data over this channel. (If you have sufficient channels to dedicate one per disk, as is the case with SATA, this doesn't come into play.) This is particularly noticeable during reconstruction, since the channels are being used both to transfer data and to reconstruct it, whereas in a hardware RAID-5 box (of moderate cost, at least) they are typically overprovisioned. A third is if the system CPU or memory bandwidth is heavily used by your application, for instance a database running under heavy load. In this case, the added CPU, cache, and memory-bandwidth cost of software RAID will stress the application.

> Ultimately, you do want to use your actual application as the benchmark,
> but certainly generic benchmarks should at least be helpful.

They're helpful in measuring what the benchmark measures. ;-) If the benchmark measures how quickly you can get data from host RAM to disk, which is typically the case, it won't tell you anything about how much CPU was used in the process. Real applications, however, often care. There's a reason why we use interrupt-driven controllers, even though you get better performance of the I/O itself with polling. :-)

Anton
Anton B. Rang writes:
 > The bigger problem with system utilization for software RAID is the cache,
 > not the CPU cycles proper. Simply preparing to write 1 MB of data will flush
 > half of a 2 MB L2 cache. This hurts overall system performance far more than
 > the few microseconds that XORing the data takes.

With ZFS, on most deployments we'll bring the data into cache for the checksums anyway, so I guess the raid-z cost will be just incremental. Now, would we gain anything by generating combined ZFS functions for 'checksum+parity' and 'checksum+parity+compression'?

-r
UNIX admin
2006-Sep-12 18:12 UTC
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
> This is simply not true. ZFS would protect against
> the same type of errors seen on an individual drive as it would on a
> pool made of HW raid LUN(s). It might be overkill to layer ZFS on top of a
> LUN that is already protected in some way by the devices internal
> RAID code but it does not "make your data susceptible to HW errors
> caused by the storage subsystem's RAID algorithm, and slow down the I/O".

I disagree, and vehemently at that. I maintain that if HW RAID is used, the chance of data corruption is much higher, and ZFS would have a lot more repairing to do than it would if it were used directly on disks. Problems with HW RAID algorithms have been plaguing us for at least 15 years or more. The venerable Sun StorEdge T3 comes to mind!

Further, while it is perfectly logical to me that doing RAID calculations twice is slower than doing them once, you maintain that is not the case, perhaps because one calculation is implemented in FW/HW? Well, why don't you simply try it out? Once with both RAID HW and ZFS, and once with just ZFS directly on the disks?

RAID HW is very likely to have a slower CPU or CPUs than any modern system that ZFS will be running on. Even if we assume that the HW RAID's CPU is the same speed or faster than the CPU in the server, you still have TWICE the amount of work that has to be performed for every write: once by the hardware and once by the software (ZFS). Caches might help some, but I fail to see how double the amount of work (and hidden, abstracted complexity) would be as fast or faster than just using ZFS directly on the disks.
UNIX admin
2006-Sep-12 18:35 UTC
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
> There are also the speed enhancement provided by a HW raid array, and usually RAS too, compared to a native disk drive but the numbers on that are still coming in and being analyzed. (See previous threads.)

Speed enhancements? What is the baseline of comparison?

Hardware RAIDs essentially come down to two features: a cache which reorders data for optimal disk writes, and parity calculation which is offloaded from the server's CPU.

But HW calculations still take time, and the in-between, battery-backed cache serves to replace the individual disk caches, because the traditional file system approach had to have some assurance that the data made it to disk in one way or another.

With ZFS, however, the in-between cache is obsolete, as individual disk caches can be used directly. I also openly question whether even the dedicated RAID HW is faster than the newest CPUs in modern servers.

Unless there is something that I'm missing, I fail to see the benefit of a HW RAID in tandem with ZFS. In my view, this holds especially true when one gets into SAN storage like the SE6920, EMC and Hitachi products.

Furthermore, need I remind you of the buggy SE6920 firmware? I don't trust it as far as I can throw it. Or, let's put it this way: I trust Mr. Bonwick a whole lot more than some firmware writers.

This message posted from opensolaris.org
On September 12, 2006 11:35:54 AM -0700 UNIX admin <tripivceta at hotmail.com> wrote:

>> There are also the speed enhancement provided by a HW raid array, and usually RAS too, compared to a native disk drive but the numbers on that are still coming in and being analyzed. (See previous threads.)

It would be nice if you would attribute your quotes. Maybe this is a limitation of the web interface?

> Speed enhancements? What is the baseline of comparison?
>
> Hardware RAIDs essentially come down to two features: a cache which reorders data for optimal disk writes, and parity calculation which is offloaded from the server's CPU.
>
> But HW calculations still take time, and the in-between, battery-backed cache serves to replace the individual disk caches, because the traditional file system approach had to have some assurance that the data made it to disk in one way or another.
>
> With ZFS, however, the in-between cache is obsolete, as individual disk caches can be used directly. I also openly question whether even the dedicated RAID HW is faster than the newest CPUs in modern servers.
>
> Unless there is something that I'm missing, I fail to see the benefit of a HW RAID in tandem with ZFS. In my view, this holds especially true when one gets into SAN storage like the SE6920, EMC and Hitachi products.

I agree with your basic point, that the HW RAID cache is obsoleted by zfs (which seems to be substantiated here by benchmark results), but I think you slightly mischaracterize its use. The speed of the HW RAID CPU is irrelevant; the parity is XOR, which is extremely fast on any CPU when compared to disk write speed. What is relevant is, as Anton points out, the CPU cache on the host system. Parity calculations kill the cache and will hurt memory-intensive apps, so offloading them may help in the ufs case. (Not for zfs, as I understand from reading here, since checksums still have to be done. I would argue that this is *absolutely essential* [and zfs obsoletes all other filesystems], and therefore the gain in the ufs-on-HW-RAID-5 case is worthless due to the correctness tradeoff.)

It would be interesting to have a zfs enabled HBA to offload the checksum and parity calculations. How much of zfs would such an HBA have to understand?

-frank
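One quick way to sanity-check the "XOR is fast relative to disk" claim is a micro-benchmark along these lines. This is my own sketch, not anything from the thread; the 64 MB buffer size and the use of clock_gettime() are arbitrary choices, and the number it prints includes memory-bandwidth effects, which is part of the point being made about cache pressure.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

/*
 * Rough XOR-parity throughput check: XOR a 64 MB source buffer into a
 * parity buffer and report MB/s. Compare the result against the
 * sequential write speed of the disks behind the pool.
 */
#define BUF_BYTES (64UL * 1024 * 1024)

int
main(void)
{
    size_t nwords = BUF_BYTES / sizeof (uint64_t);
    uint64_t *data = malloc(BUF_BYTES);
    uint64_t *parity = calloc(nwords, sizeof (uint64_t));
    struct timespec t0, t1;

    if (data == NULL || parity == NULL)
        return (1);
    for (size_t i = 0; i < nwords; i++)
        data[i] = i;            /* touch every page before timing */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < nwords; i++)
        parity[i] ^= data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) +
        (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("XOR throughput: %.0f MB/s\n", (BUF_BYTES / 1e6) / secs);

    free(data);
    free(parity);
    return (0);
}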
Robert Milkowski
2006-Sep-12 20:25 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
Hello Frank,

Tuesday, September 12, 2006, 9:41:05 PM, you wrote:

FC> It would be interesting to have a zfs enabled HBA to offload the checksum
FC> and parity calculations. How much of zfs would such an HBA have to
FC> understand?

That wouldn't be end-to-end checksumming anymore, right? At that point you might as well disable ZFS checksumming entirely and rely on the HW RAID alone.

--
Best regards,
Robert    mailto:rmilkowski at task.gda.pl
          http://milek.blogspot.com
Torrey McMahon
2006-Sep-12 20:56 UTC
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
UNIX admin wrote:
>> This is simply not true. ZFS would protect against the same type of errors seen on an individual drive as it would on a pool made of HW raid LUN(s). It might be overkill to layer ZFS on top of a LUN that is already protected in some way by the devices internal RAID code but it does not "make your data susceptible to HW errors caused by the storage subsystem's RAID algorithm, and slow down the I/O".
>
> I disagree, and vehemently at that. I maintain that if HW RAID is used, the chance of data corruption is much higher, and ZFS would have a lot more repairing to do than it would if it were used directly on disks. Problems with HW RAID algorithms have been plaguing us for at least 15 years; the venerable Sun StorEdge T3 comes to mind!

Please expand on your logic. Remember that ZFS works on top of LUNs. A disk drive by itself is a LUN when added to a ZFS pool. A LUN can also be comprised of multiple disk drives striped together and presented to a host as one logical unit. Or a LUN can be offered by a virtualization gateway that in turn imports RAID array LUNs that are really made up of individual disk drives. Or ... insert a million different ways to give a host something called a LUN that allows the host to read and write blocks. They could be really slow LUNs because they're two hamsters shuffling zeros and ones back and forth on little wheels. (OK, that might be too slow.)

Outside of the cache enabling when entire disk drives are presented to the pool, ZFS doesn't care what the LUN is made of. ZFS reliability features are available and work on top of the LUNs you give it and the configuration you use. The type of LUN is inconsequential at the ZFS level. If I had 12 LUNs that were single disk drives and created a RAIDZ pool, it would have the same reliability at the ZFS level as if I presented it 12 LUNs that were really quad-mirrors from 12 independent HW RAID arrays. You can make the argument that the 12-disk-drive config is easier to use, or that the overall reliability of the 12-quad-mirror-LUN system is higher, but from ZFS's point of view it's the same: it's happily writing blocks, checking checksums, reading things from the LUNs, etc. etc. etc.

On top of that, disk drives are not some simple beast that just coughs up I/O when you want it to. A modern disk drive does all sorts of stuff under the covers to speed up I/O and - surprise - increase the reliability of the drive as much as possible. If you think you're really writing "straight to disk", you're not. Cache, ZBR, and bad block re-allocation all come into play.

As for problems with specific RAID arrays, including the T3, you are preaching to the choir, but I'm definitely not going to get into a pissing contest over specific components having more or fewer bugs than another.

> Further, while it is perfectly logical to me that doing RAID calculations twice is slower than doing them once, you maintain that is not the case, perhaps because one calculation is implemented in FW/HW?

As the man says, "It depends". A really fast RAID array might be responding to I/O requests faster than a single disk drive. It might not, given the nature of the I/O coming in. Don't think of it in terms of RAID calculations taking a certain amount of time. Think of it in terms of having to meet a specific set of requirements to manage your data. I'll be the first to say that if you're going to be putting ZFS on a desktop, then a simple JBOD is the box to look at. If you're going to look at an enterprise data center, the answer is going to be different. That is something a lot of people on this alias seem to be missing. Stating that ZFS on JBODs is the answer to everything is the punchline of the "When all you have is a hammer..." routine.
James C. McPherson
2006-Sep-13 05:14 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
Richard Elling wrote:
> Frank Cusack wrote:
>> It would be interesting to have a zfs enabled HBA to offload the checksum
>> and parity calculations. How much of zfs would such an HBA have to
>> understand?
> [warning: chum]
> Disagree. HBAs are pretty wimpy. It is much less expensive and more
> efficient to move that (flexible!) function into the main CPUs.

I think Richard is in the groove here. All the HBA chip implementation documentation that I've seen (publicly available, of course) indicates that these chips are already highly optimized engines, and I don't think that adding extra functionality like checksum and parity calculations would be an efficient use of silicon/SoI.

cheers,
James
Richard Elling
2006-Sep-13 06:45 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
Frank Cusack wrote:
> It would be interesting to have a zfs enabled HBA to offload the checksum
> and parity calculations. How much of zfs would such an HBA have to
> understand?

[warning: chum]
Disagree. HBAs are pretty wimpy. It is much less expensive and more efficient to move that (flexible!) function into the main CPUs.
 -- richard
James C. McPherson wrote:
> Richard Elling wrote:
>> Frank Cusack wrote:
>>> It would be interesting to have a zfs enabled HBA to offload the checksum
>>> and parity calculations. How much of zfs would such an HBA have to
>>> understand?
>> [warning: chum]
>> Disagree. HBAs are pretty wimpy. It is much less expensive and more
>> efficient to move that (flexible!) function into the main CPUs.
>
> I think Richard is in the groove here. All the HBA chip implementation documentation that I've seen (publicly available, of course) indicates that these chips are already highly optimized engines, and I don't think that adding extra functionality like checksum and parity calculations would be an efficient use of silicon/SoI.

HBAs work on an entirely different layer than where checksumming data would be efficient. If we use the OSI-style model for this, HBAs work at layer 1. And, as James mentioned, they are highly specialized ASICs for doing just bus-level communications; it's not as if there is extra general-purpose compute power available (or that it could even be built in). Checksumming for ZFS requires filesystem-level knowledge, which is effectively up at OSI layer 6 or 7, and well beyond the understanding of a lowly HBA (it's just passing bits back and forth, and has no conception of what they mean). Essentially, moving block checksumming into the HBA would at best be similar to what we see with super-low-cost RAID controllers and the XOR function. Remember how well that works?

Now, building ZFS-style checksum capability (or just hardware checksum capability for ZFS to call) is indeed proper and possible for _real_ hardware RAID controllers, as they are much more akin to standard general-purpose CPUs (indeed, most now use a GP processor anyway).

We're back into the old argument of "put it on a co-processor, then move it onto the CPU, then move it back onto a co-processor" cycle. Personally, with modern CPUs being so under-utilized these days, and all ZFS-bound data having to move through main memory in any case (whether hardware checksum-assisted or not), use the CPU. Hardware-assist for checksum sounds nice, but I can't think of it actually being more efficient than doing it on the CPU (it won't actually help performance), so why bother with extra hardware?

-Erik
Casper.Dik at Sun.COM
2006-Sep-13 09:25 UTC
[zfs-discuss] Re: Recommendation ZFS on StorEdge 3320
> We're back into the old argument of "put it on a co-processor, then move
> it onto the CPU, then move it back onto a co-processor" cycle.
> Personally, with modern CPUs being so under-utilized these days, and all
> ZFS-bound data having to move through main memory in any case (whether
> hardware checksum-assisted or not), use the CPU. Hardware-assist for
> checksum sounds nice, but I can't think of it actually being more
> efficient than doing it on the CPU (it won't actually help performance),
> so why bother with extra hardware?

Plus, it moves part of the resiliency away from where we knew the data was good (the CPU/computer), across a bus/fabric/whatnot, possibly causing checksums to be computed over incorrect data. We already see that with IP checksum off-loading and broken hardware, and with broken VLAN switches recomputing the Ethernet CRC.

Casper
Anton B. Rang
2006-Sep-13 15:57 UTC
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
> It would be interesting to have a zfs enabled HBA to offload the checksum
> and parity calculations. How much of zfs would such an HBA have to
> understand?

That's an interesting question.

For parity, it's actually pretty easy. One can envision an HBA which took a group of related write commands and computed the parity on the fly, using it for a final write command. This would, however, probably limit the size of a block that could be written to whatever amount of memory was available for buffering on the HBA. (Of course, memory is relatively cheap these days, but it's still not free, so the HBA might have only a few megabytes.)

The checksum is more difficult. If you're willing to delay writing an indirect block until all of its children have been written [*], then we can just compute the checksum for each block as it goes out, and that's easy [**] -- easier than the parity, in fact, since there's no buffering required beyond the checksum itself. ZFS in fact does delay this write at present. However, I've argued in the past that ZFS shouldn't delay it, but should write indirect blocks in parallel with the data blocks. It would be interesting to determine whether the performance improvement of doing checksums on the HBA would outweigh the potential benefit of writing indirect blocks in parallel. Maybe it would for larger writes.

Anyone got an FPGA programmer and an open-source SATA implementation? :-) (Unfortunately storage protocols have a complex analog side, and except for 1394, I'm not aware of any implementations that separate the digital/analog, which makes prototyping a lot harder, at least without much more detailed documentation on the controllers than you're likely to find.)

-- Anton

[*] Actually, you don't need to delay until the writes have made it to disk, but since you want to compute the checksum as the data goes out to the disk rather than making a second pass over it, you'd need to wait until the data has at least been sent to the drive cache.

[**] For SCSI and FC, there's added complexity in that the drives can request data out-of-order. You can disable this, but at the cost of some performance on high-end drives.

This message posted from opensolaris.org
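The "compute the checksum for each block as it goes out" idea might look roughly like this hypothetical sketch. The fletcher-style accumulator below is a simplification standing in for whatever checksum the pool uses, not the actual ZFS code, and the chunked interface stands in for data being handed to the HBA piece by piece.

#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical incremental checksum, updated chunk by chunk as data
 * streams out toward the drive, so no second pass over the buffer is
 * needed. Assumes data is transferred in order (see the [**] caveat
 * above); loosely fletcher-style, not the ZFS implementation.
 */
typedef struct stream_cksum {
    uint64_t a;
    uint64_t b;
} stream_cksum_t;

static void
cksum_init(stream_cksum_t *c)
{
    c->a = 0;
    c->b = 0;
}

static void
cksum_update(stream_cksum_t *c, const uint64_t *chunk, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++) {
        c->a += chunk[i];
        c->b += c->a;
    }
}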
Anton B. Rang
2006-Sep-13 16:21 UTC
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
> just measured quickly that a 1.2GHz sparc can do [400-500]MB/sec
> of encoding (time spent in misnamed function
> vdev_raidz_reconstruct) for a 3 disk raid-z group.

Strange, that seems very low.

Ah, I see. The current code loops through each buffer, either copying or XORing it into the parity. This likely would perform quite a bit better if it were reworked to go through more than one buffer at a time when doing the XOR. (Reading the partial parity is expensive.)

Actually, this would be an instance where using assembly language, or even processor-dependent code, would be useful. Since the prefetch buffers on UltraSPARC are only applicable to floating-point loads, we should probably use prefetch and the VIS xor instructions. (Even calling bcopy instead of using the existing copy loop would help.)

FWIW, on large systems we ought to be aiming to sustain 8 GB/s or so of writes, and using 16 CPUs just for parity computation seems inordinately painful. :-)

This message posted from opensolaris.org
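For illustration, here is a rough sketch of the rework described above (hypothetical code, not the actual vdev_raidz implementation): instead of streaming each data column through the parity buffer one at a time, several source columns are combined per pass, so the partial parity is read back far less often.

#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical multi-source parity pass. Combining several data columns
 * per sweep means the partial parity is loaded and stored once per group
 * of columns instead of once per column, which is the expensive part.
 * Real code would also want prefetching and (on UltraSPARC) VIS xor.
 */
static void
parity_xor_multi(uint64_t *parity, const uint64_t **cols, int ncols,
    size_t nwords)
{
    int c = 0;

    /* Combine columns four at a time. */
    for (; c + 4 <= ncols; c += 4) {
        for (size_t i = 0; i < nwords; i++) {
            parity[i] ^= cols[c][i] ^ cols[c + 1][i] ^
                cols[c + 2][i] ^ cols[c + 3][i];
        }
    }
    /* Handle any leftover columns one at a time. */
    for (; c < ncols; c++) {
        for (size_t i = 0; i < nwords; i++)
            parity[i] ^= cols[c][i];
    }
}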
Anton B. Rang
2006-Sep-13 16:25 UTC
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
> With ZFS, however, the in-between cache is obsolete, as individual disk caches can be used
> directly. I also openly question whether even the dedicated RAID HW is faster than the newest
> CPUs in modern servers.

Individual disk caches are typically in the 8-16 MB range; for 15 disks, that gives you at most around 240 MB in aggregate. A RAID array with 15 drives behind it might have 2-4 GB of cache. That's a big improvement.

The dedicated RAID hardware may not be faster than the newest CPUs, but as a friend of mine has pointed out, even though delegating a job to somebody else often means it's done more slowly, it frees him up to do his other work.

(It's also worth pondering the difference between latency and bandwidth. When parity is computed inline with the data path, as is often the case for hardware controllers, the bandwidth is relatively low, since it happens at the speed of data transfer to an individual disk, but the latency is effectively zero, since it adds no time to the transfer.)

This message posted from opensolaris.org
Roch - PAE
2006-Sep-14 09:53 UTC
[zfs-discuss] Re: Re: Recommendation ZFS on StorEdge 3320
"With ZFS however the in-between cache is obsolete, as individual disk caches can be used directly." The statement needs to be qualified. Storage cache, if protected, works great to reduce critical op latency. ZFS when it writes to disk cache, will flush data out before return to say an O_DSYNC write. The application level latency is not improved by the disk write cache. But a battery protected mirrored storage cache should act as a latency reductor, thus improving some workloads. ____________________________________________________________________________________ Performance, Availability & Architecture Engineering Roch Bourbonnais Sun Microsystems, Icnc-Grenoble Senior Performance Analyst 180, Avenue De L''Europe, 38330, Montbonnot Saint Martin, France Roch.Bourbonnais at Sun.Com http://blogs.sun.com/roller/page/roch