David J. Orman
2006-Jun-01 21:35 UTC
[zfs-discuss] question about ZFS performance for webserving/java
Just as a hypothetical (not looking for exact science here folks..), how would ZFS fare (in your educated opinion) in this situation:

1 - Machine with 8 10k rpm SATA drives. High performance machine of sorts (ie dual proc, etc.. let's weed out cpu/memory/bus bandwidth as much as possible from the equation).

2 - Workload is webserving, well - application serving. Java app server 9, various java applications requiring database access (mostly small tables/data elements, but millions and millions of rows).

3 - App server would be running in one zone, with a (NFS) mounted ZFS filesystem as storage.

4 - DB server (PgSQL) would be running in another zone, with a (NFS) mounted ZFS filesystem as storage.

5 - Multiple disk redundancy is needed. So, I'm assuming two raid-z pools of 3 drives each, mirrored is the solution. If people have a better suggestion, tell me! :P

6 - OS will be Sol10U2, OS/Root FS will be installed on mirrored drives, using UFS (my only choice..)

Now, please eliminate CPU/RAM from this equation, assume the server has 4 cores of goodness powering it, and 32 gigs of ram. No, running on a ram-disk isn't what I'm asking for. :P

* NFS being optional, just curious what the difference would be, as getting a T1000 + building an external storage box is an option. I just can't justify Sun's crazy storage pricing at the moment.

How would ZFS perform (educated opinions, I realize I won't be getting exact answers) in this situation? I can't be more specific because I don't have the HW in front of me, I'm trying to get a feel for the "correct" solution before I make huge purchases.

If anything else is needed, please feel free to ask!

Thanks,
David
Robert Milkowski
2006-Jun-01 22:16 UTC
[zfs-discuss] question about ZFS performance for webserving/java
Hello David,

Thursday, June 1, 2006, 11:35:41 PM, you wrote:

DJO> Just as a hypothetical (not looking for exact science here
DJO> folks..), how would ZFS fare (in your educated opinion) in this situation:

DJO> 1 - Machine with 8 10k rpm SATA drives. High performance machine
DJO> of sorts (ie dual proc, etc.. let's weed out cpu/memory/bus
DJO> bandwidth as much as possible from the equation).

DJO> 2 - Workload is webserving, well - application serving. Java app
DJO> server 9, various java applications requiring database access
DJO> (mostly small tables/data elements, but millions and millions of rows).

DJO> 3 - App server would be running in one zone, with a (NFS)
DJO> mounted ZFS filesystem as storage.

DJO> 4 - DB server (PgSQL) would be running in another zone, with a
DJO> (NFS) mounted ZFS filesystem as storage.

DJO> 5 - Multiple disk redundancy is needed. So, I'm assuming two
DJO> raid-z pools of 3 drives each, mirrored is the solution. If
DJO> people have a better suggestion, tell me! :P

DJO> 6 - OS will be Sol10U2, OS/Root FS will be installed on mirrored
DJO> drives, using UFS (my only choice..)

DJO> Now, please eliminate CPU/RAM from this equation, assume the
DJO> server has 4 cores of goodness powering it, and 32 gigs of ram.
DJO> No, running on a ram-disk isn't what I'm asking for. :P

DJO> * NFS being optional, just curious what the difference would be,
DJO> as getting a T1000 + building an external storage box is an
DJO> option. I just can't justify Sun's crazy storage pricing at the moment.

DJO> How would ZFS perform (educated opinions, I realize I won't be
DJO> getting exact answers) in this situation. I can't be more
DJO> specific because I don't have the HW in front of me, I'm trying
DJO> to get a feel for the "correct" solution before I make huge purchases.

DJO> If anything else is needed, please feel free to ask!

I guess there'll be a lot of small random reads. That means raid-z could
be a bad choice if you want the highest read IO/s.

Consider raid-10 - this should give you good redundancy and the highest
read IO/s. However, writing can be slower than in raid-z. Actual
redundancy will be better than what you proposed.

Now the database - this would be interesting. Maybe setting recordsize
will be required.

-- 
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
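For illustration, setting the recordsize Robert mentions might look like the following sketch; the filesystem name "tank/db" is hypothetical, 8k matches PostgreSQL's default block size, and the property only affects files created after it is set:

  # tune the DB filesystem's record size to the database block size
  zfs set recordsize=8k tank/db
  # verify the setting
  zfs get recordsize tank/db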
Matthew Ahrens
2006-Jun-01 22:30 UTC
[zfs-discuss] question about ZFS performance for webserving/java
On Thu, Jun 01, 2006 at 11:35:41AM -1000, David J. Orman wrote:

> 3 - App server would be running in one zone, with a (NFS) mounted ZFS
> filesystem as storage.
>
> 4 - DB server (PgSQL) would be running in another zone, with a (NFS)
> mounted ZFS filesystem as storage.

Why would you use NFS?  These zones are on the same machine as the
storage, right?  You can simply export filesystems in your pool to the
various zones (see zfs(1m) and zonecfg(1m) manpages).  This will result
in better performance.

> 5 - Multiple disk redundancy is needed. So, I'm assuming two raid-z
> pools of 3 drives each, mirrored is the solution. If people have a
> better suggestion, tell me! :P

There is no need for multiple pools.  Perhaps you meant two raid-z
groups (aka "vdevs") in a single pool?  Also, wouldn't you want to use
all 8 disks, therefore use two 4-disk raid-z groups?  This way you would
get 3 disks worth of usable space.

Depending on how much space you need, you should consider using a single
double-parity RAID-Z group with your 8 disks.  This would give you 6
disks worth of usable space.  Given that you want to be able to tolerate
two failures, that is probably your best solution.  Other solutions
would include three 3-way mirrors (if you can fit another drive in your
machine), giving you 3 disks worth of usable space.

--matt
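As a rough sketch of the two suggestions above (the pool, filesystem, zone, and device names are hypothetical, and double-parity RAID-Z is only available in releases that include the raidz2 vdev type):

  # single double-parity RAID-Z group across all 8 disks
  zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 \
      c1t4d0 c1t5d0 c1t6d0 c1t7d0

  # export a filesystem from the pool to a zone instead of using NFS
  zfs create tank/appdata
  zonecfg -z appzone
  zonecfg:appzone> add dataset
  zonecfg:appzone:dataset> set name=tank/appdata
  zonecfg:appzone:dataset> end
  zonecfg:appzone> commit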
David J. Orman
2006-Jun-01 22:52 UTC
[zfs-discuss] question about ZFS performance for webserving/java
----- Original Message -----
From: Matthew Ahrens <ahrens at eng.sun.com>
Date: Thursday, June 1, 2006 12:30 pm
Subject: Re: [zfs-discuss] question about ZFS performance for webserving/java

> Why would you use NFS?  These zones are on the same machine as the
> storage, right?  You can simply export filesystems in your pool to the
> various zones (see zfs(1m) and zonecfg(1m) manpages).  This will result
> in better performance.

This is why I noted it as optional, and gave my reasoning (a T1000 with
a separate box handling storage, exporting via NFS to the T1000). I'm
not investing in the black hole that is FC, no way, and I don't know how
to cram 8+ SATA ports into a T1000. I can't justify the price of the
T2000 at this point. But again, NFS was *optional*. Using a home-built
box, I would be using directly attached storage.

> There is no need for multiple pools.  Perhaps you meant two raid-z
> groups (aka "vdevs") in a single pool?  Also, wouldn't you want to use
> all 8 disks, therefore use two 4-disk raid-z groups?  This way you
> would get 3 disks worth of usable space.

Yes, I meant what you specified, sorry for my lack of knowledge. :) I
need two of the disks for the root FS, because U2 won't allow me to make
the root FS on ZFS. Otherwise, I'd love to use all 8.

> Depending on how much space you need, you should consider using a
> single double-parity RAID-Z group with your 8 disks.  This would give
> you 6 disks worth of usable space.  Given that you want to be able to
> tolerate two failures, that is probably your best solution.  Other
> solutions would include three 3-way mirrors (if you can fit another
> drive in your machine), giving you 3 disks worth of usable space.

That would be ideal; unfortunately that won't be in U2 for various
reasons (I won't argue this point, although I really think the "process"
is hurting Solaris in this regard - this should have been included, as
lots of people need at least two-disk redundancy.)

Thanks,
David
Robert Milkowski
2006-Jun-01 23:17 UTC
[zfs-discuss] question about ZFS performance for webserving/java
Hello David,

Friday, June 2, 2006, 12:52:05 AM, you wrote:

DJO> ----- Original Message -----
DJO> From: Matthew Ahrens <ahrens at eng.sun.com>
DJO> Date: Thursday, June 1, 2006 12:30 pm
DJO> Subject: Re: [zfs-discuss] question about ZFS performance for webserving/java

>> There is no need for multiple pools.  Perhaps you meant two raid-z
>> groups (aka "vdevs") in a single pool?  Also, wouldn't you want to use
>> all 8 disks, therefore use two 4-disk raid-z groups?  This way you
>> would get 3 disks worth of usable space.

DJO> Yes, I meant what you specified, sorry for my lack of knowledge. :)

DJO> I need two of the disks for the root FS, because U2 won't allow
DJO> me to make the root FS on ZFS. Otherwise, I'd love to use all 8.

The system itself won't take too much space. You can create one large
slice from the remaining space on the two system disks, and the same
slice on each of the other disks. Then you can create one large pool
from the 8 such slices. The remaining space on the other disks could be
used for swap, for example, or for another, smaller pool.

Now I would consider creating raid-10, and not raid-z, something like:

  zpool create local mirror s1 s2 mirror s3 s4 mirror s5 s6 mirror s7 s8

Then I would probably create local/zones/db and put the zone there, then
create the additional filesystems needed in that one pool.

btw. in such a config the write cache would be off by default - I'm not
sure whether that will be a problem or not.

-- 
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
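A minimal sketch of placing a zone on such a pool, with hypothetical filesystem and zone names (the raid-10 pool creation is shown above; an alternative is to delegate a dataset to the zone with zonecfg's "add dataset" as Matthew described):

  # filesystem hierarchy for zones, one filesystem per zone
  zfs create local/zones
  zfs create local/zones/db

  # configure the zone with its root on that filesystem
  zonecfg -z db
  zonecfg:db> create
  zonecfg:db> set zonepath=/local/zones/db
  zonecfg:db> commit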
Erik Trimble
2006-Jun-01 23:37 UTC
[zfs-discuss] ZFS performance metric/cookbook/whitepaper
Maybe the best thing here is to have us (i.e. the people on this list)
come up with a set of standard and expected use cases, and have the ZFS
team tell us what the relative performance/tradeoffs are.  I mean,
rather than us just asking a bunch of specific cases, a good whitepaper
Best Practices / Cookbook for ZFS would be nice.

For instance:

compare UFS/Solaris Volume Manager against ZFS in:

    [random|sequential][small|large][read|write]
  on
    UFS/SVM: Raid-1, Raid-5, Raid 0+1
    ZFS:     RaidZ, Mirrors

Relative Performance of HW RAID vs JBOD, e.g.

    3510FC w/ RAID using ZFS
  vs
    3510FC JBOD using ZFS

I know a bunch of this has been discussed before (and I've read most of
it :-), but collecting it in one place and filling out the actual
analysis would be Really Nice.

-- 
Erik Trimble
Java System Support
Mailstop:  usca14-102
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
David J. Orman
2006-Jun-02 02:03 UTC
[zfs-discuss] question about ZFS performance for webserving/java
----- Original Message -----
From: Robert Milkowski <rmilkowski at task.gda.pl>
Date: Thursday, June 1, 2006 1:17 pm
Subject: Re[2]: [zfs-discuss] question about ZFS performance for webserving/java

> Hello David,
>
> The system itself won't take too much space. You can create one large
> slice from the remaining space on the two system disks, and the same
> slice on each of the other disks. Then you can create one large pool
> from the 8 such slices. The remaining space on the other disks could
> be used for swap, for example, or for another, smaller pool.

Ok, sorry I'm not up to speed on Solaris/software raid types. So you're
saying create a couple of slices on each disk. One set of slices I'll
use to make a raid of some sort (maybe 10) and use UFS on that (for the
initial install - can this be done during installation??), and then use
the rest of the slices on the disks to do the zfs/raid for everything
else?

> Now I would consider creating raid-10, and not raid-z, something like:
>
>   zpool create local mirror s1 s2 mirror s3 s4 mirror s5 s6 mirror s7 s8
>
> Then I would probably create local/zones/db and put the zone there,
> then create the additional filesystems needed in that one pool.
>
> btw. in such a config the write cache would be off by default - I'm
> not sure whether that will be a problem or not.

Ok. I'll keep that in mind. I'm just making sure this is feasible. The
technical details I can work out later.

David
Please add to the list the differences between locally and remotely
attached vdevs: FC, SCSI/SATA, or iSCSI. This is the part that is
troubling me most, as there are wildly different performance
characteristics when you use NFS with any of these backends with the
various configs of ZFS.

Another thing is when the cache should or should not be used on backend
RAID devices (the RAID vs JBOD point made already). The wild difference
is between small and large file writes, and how the backend can go from
10's of MB/sec to 10's of KB/sec. Really.

On 6/1/06, Erik Trimble <Erik.Trimble at sun.com> wrote:
>
> Maybe the best thing here is to have us (i.e. the people on this list)
> come up with a set of standard and expected use cases, and have the ZFS
> team tell us what the relative performance/tradeoffs are.  I mean,
> rather than us just asking a bunch of specific cases, a good whitepaper
> Best Practices / Cookbook for ZFS would be nice.
>
> For instance:
>
> compare UFS/Solaris Volume Manager against ZFS in:
>
>     [random|sequential][small|large][read|write]
>   on
>     UFS/SVM: Raid-1, Raid-5, Raid 0+1
>     ZFS:     RaidZ, Mirrors
>
> Relative Performance of HW RAID vs JBOD, e.g.
>
>     3510FC w/ RAID using ZFS
>   vs
>     3510FC JBOD using ZFS
>
> I know a bunch of this has been discussed before (and I've read most of
> it :-), but collecting it in one place and filling out the actual
> analysis would be Really Nice.
>
> --
> Erik Trimble
> Java System Support
> Mailstop:  usca14-102
> Phone:  x17195
> Santa Clara, CA
> Timezone: US/Pacific (GMT-0800)
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
Robert Milkowski
2006-Jun-02 06:56 UTC
[zfs-discuss] question about ZFS performance for webserving/java
Hello David,

Friday, June 2, 2006, 4:03:45 AM, you wrote:

DJO> ----- Original Message -----
DJO> From: Robert Milkowski <rmilkowski at task.gda.pl>
DJO> Date: Thursday, June 1, 2006 1:17 pm
DJO> Subject: Re[2]: [zfs-discuss] question about ZFS performance for webserving/java

>> Hello David,
>>
>> The system itself won't take too much space. You can create one large
>> slice from the remaining space on the two system disks, and the same
>> slice on each of the other disks. Then you can create one large pool
>> from the 8 such slices. The remaining space on the other disks could
>> be used for swap, for example, or for another, smaller pool.

DJO> Ok, sorry I'm not up to speed on Solaris/software raid types. So
DJO> you're saying create a couple of slices on each disk. One set of
DJO> slices I'll use to make a raid of some sort (maybe 10) and use
DJO> UFS on that (for the initial install - can this be done during
DJO> installation??), and then use the rest of the slices on the disks
DJO> to do the zfs/raid for everything else?

Exactly. And you can do it during installation.

I do it with jumpstart - a server with 6 disks - on the first two disks
the system is installed on a mirror (SVM+UFS, configured automatically
by the jumpstart profile) plus one additional slice (the rest of the
disk). Then I create a zpool on these slices.

So it can look like:

  c0t0d0s1  c0t1d0s1   SVM mirror, UFS   /
  c0t0d0s3  c0t1d0s3   SVM mirror, UFS   /var
  c0t0d0s4  c0t1d0s4   SVM mirror, UFS   /opt
  c0t0d0s6  c0t1d0s6   SVM metadb
  c0t2d0s1  c0t2d0s1   SVM mirror, SWAP
            (SWAP /s1 size > sizeof(/ + /var + /opt))

  zpool create local mirror c0t0d0s0 c0t1d0s0 mirror c0t2d0s0 c0t3d0s0 mirror c0t4d0s0 c0t5d0s0

or any other raid supported by ZFS.

I put SWAP on other disks than / intentionally, so disk space isn't
wasted and we have as much space for s0 as possible. I use the s0 slice
here to emphasize that it would be a good idea to have s0 start from
cylinder 0 (or 1) - which means from the beginning of the disks - as we
can expect that the ZFS pool will get most of the IOs, not /, /var or
/opt. In that config I would probably skip creating a separate /opt.

Then on such a pool I would put the zones (each in its own zfs
filesystem under a common hierarchy like local/zones) and the other
filesystems. I would also consider setting atime=off for most of them.

-- 
Best regards,
 Robert                          mailto:rmilkowski at task.gda.pl
                                 http://milek.blogspot.com
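A minimal sketch of that last step, assuming the pool name used above and hypothetical per-zone filesystem names; because ZFS properties are inherited, setting atime=off on the parent covers the filesystems created under it:

  zfs create local/zones
  # inherited by every filesystem created below local/zones
  zfs set atime=off local/zones
  zfs create local/zones/web
  zfs create local/zones/db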
Roch Bourbonnais - Performance Engineering
2006-Jun-02 09:07 UTC
[zfs-discuss] question about ZFS performance for webserving/java
You propose ((2-way mirrored) x RAID-Z (3+1)). That gives you 3 data
disks worth, and you'd have to lose 2 disks in each mirror (4 total) to
lose data.

For the random read load you describe, I would expect the per-device
cache to work nicely; that is, file blocks stored close together in time
in the past may also be read back close together in time. Basically, I
updated page Foo and page Bar at some point in the past because they
contain shared information or reference one another, and a client
pulling one page hits the other soon after.

But if file records are written fully independently of the read (input)
pattern, then you'd be in the low range of response time. The best case
would give you up to 6 disks worth of IOPS serving capacity (maybe even
more). If the device cache fails miserably, then you'd have 2 disks
worth of input IOPS.

Now if you buy one more disk, you could envision (3-way mirror) x
(3-disk dynamic stripe). Same amount of data as before but 9 disks worth
of IOPS; but some 3-disk failures may put data at risk.

Client NFS for input traffic seems quite ok to me. It is mostly for
output that NFS can be an issue in general. NFS causes individual client
threads doing updates to operate very much in synchronization with the
storage subsystem. This contrasts with a local FS that can work much
more asynchronously. With a directly attached FS, we can much better
decouple application updates to memory, and FS updates to storage.

-r

David J. Orman writes:
 > Just as a hypothetical (not looking for exact science here folks..),
 > how would ZFS fare (in your educated opinion) in this situation:
 >
 > 1 - Machine with 8 10k rpm SATA drives. High performance machine of
 > sorts (ie dual proc, etc.. let's weed out cpu/memory/bus bandwidth as
 > much as possible from the equation).
 >
 > 2 - Workload is webserving, well - application serving. Java app
 > server 9, various java applications requiring database access (mostly
 > small tables/data elements, but millions and millions of rows).
 >
 > 3 - App server would be running in one zone, with a (NFS) mounted ZFS
 > filesystem as storage.
 >
 > 4 - DB server (PgSQL) would be running in another zone, with a (NFS)
 > mounted ZFS filesystem as storage.
 >
 > 5 - Multiple disk redundancy is needed. So, I'm assuming two raid-z
 > pools of 3 drives each, mirrored is the solution. If people have a
 > better suggestion, tell me! :P
 >
 > 6 - OS will be Sol10U2, OS/Root FS will be installed on mirrored
 > drives, using UFS (my only choice..)
 >
 > Now, please eliminate CPU/RAM from this equation, assume the server
 > has 4 cores of goodness powering it, and 32 gigs of ram. No, running
 > on a ram-disk isn't what I'm asking for. :P
 >
 > * NFS being optional, just curious what the difference would be, as
 > getting a T1000 + building an external storage box is an option. I
 > just can't justify Sun's crazy storage pricing at the moment.
 >
 > How would ZFS perform (educated opinions, I realize I won't be getting
 > exact answers) in this situation. I can't be more specific because I
 > don't have the HW in front of me, I'm trying to get a feel for the
 > "correct" solution before I make huge purchases.
 >
 > If anything else is needed, please feel free to ask!
 >
 > Thanks,
 > David
 > _______________________________________________
 > zfs-discuss mailing list
 > zfs-discuss at opensolaris.org
 > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
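The 9-disk alternative Roch describes (a dynamic stripe of three 3-way mirrors) would be created roughly as below; the pool and device names are hypothetical:

  # three 3-way mirror vdevs, striped together by the pool
  zpool create tank mirror c1t0d0 c1t1d0 c1t2d0 \
      mirror c1t3d0 c1t4d0 c1t5d0 \
      mirror c1t6d0 c1t7d0 c1t8d0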
Rainer Orth
2006-Jun-02 09:09 UTC
[zfs-discuss] question about ZFS performance for webserving/java
Robert Milkowski <rmilkowski at task.gda.pl> writes:

> So it can look like:
[...]
> c0t2d0s1  c0t2d0s1   SVM mirror, SWAP
>           (SWAP /s1 size > sizeof(/ + /var + /opt))

You can avoid this by swapping to a zvol, though at the moment this
requires a fix for CR 6405330. Unfortunately, since one cannot yet dump
to a zvol, one needs a dedicated dump device in this case ;-(

	Rainer

-----------------------------------------------------------------------------
Rainer Orth, Faculty of Technology, Bielefeld University
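For reference, once that CR is fixed, swapping to a zvol would look something like this sketch (pool name and volume size are hypothetical):

  # create a 2 GB zvol and add it as a swap device
  zfs create -V 2g local/swapvol
  swap -a /dev/zvol/dsk/local/swapvol
  # list swap devices to verify
  swap -l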
Gavin Maltby
2006-Jun-02 10:14 UTC
[zfs-discuss] question about ZFS performance for webserving/java
On 06/02/06 10:09, Rainer Orth wrote:

> Robert Milkowski <rmilkowski at task.gda.pl> writes:
>
>> So it can look like:
> [...]
>> c0t2d0s1  c0t2d0s1   SVM mirror, SWAP
>>           (SWAP /s1 size > sizeof(/ + /var + /opt))
>
> You can avoid this by swapping to a zvol, though at the moment this
> requires a fix for CR 6405330. Unfortunately, since one cannot yet dump
> to a zvol, one needs a dedicated dump device in this case ;-(

Dedicated dump devices are *always* best, so this is no loss. Dumping
through filesystem code when it may be that code itself which caused the
panic is badness.

Gavin
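For reference, configuring a dedicated dump device is a single dumpadm invocation; the slice name below is hypothetical:

  # point savecore at a dedicated dump slice
  dumpadm -d /dev/dsk/c0t2d0s3
  # show the resulting dump configuration
  dumpadm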
Rainer Orth
2006-Jun-02 10:20 UTC
[zfs-discuss] question about ZFS performance for webserving/java
Gavin Maltby writes:

> > You can avoid this by swapping to a zvol, though at the moment this
> > requires a fix for CR 6405330. Unfortunately, since one cannot yet
> > dump to a zvol, one needs a dedicated dump device in this case ;-(
>
> Dedicated dump devices are *always* best, so this is no loss. Dumping
> through filesystem code when it may be that code itself which caused
> the panic is badness.

True, but it adds another complication to boot disk creation, although
it would nonetheless be nice to have this at least as an option. In
addition, if the single dump slice isn't mirrored and the disk is
wedged, you lose the dump completely (though this should be rare).
Adding in an SVM mirror if one doesn't need SVM otherwise (as will
happen once ZFS boot integrates) certainly isn't attractive here (and
adds another software layer between the dump code and the devices).

	Rainer

-----------------------------------------------------------------------------
Rainer Orth, Faculty of Technology, Bielefeld University
Darren J Moffat
2006-Jun-02 10:33 UTC
[zfs-discuss] question about ZFS performance for webserving/java
Rainer Orth wrote:

> Gavin Maltby writes:
>
>>> You can avoid this by swapping to a zvol, though at the moment this
>>> requires a fix for CR 6405330. Unfortunately, since one cannot yet
>>> dump to a zvol, one needs a dedicated dump device in this case ;-(
>>
>> Dedicated dump devices are *always* best, so this is no loss. Dumping
>> through filesystem code when it may be that code itself which caused
>> the panic is badness.
>
> True, but it adds another complication to boot disk creation, although
> it would nonetheless be nice to have this at least as an option. In
> addition, if the single dump slice isn't mirrored and the disk is
> wedged, you lose the dump completely (though this should be rare).
> Adding in an SVM mirror if one doesn't need SVM otherwise (as will
> happen once ZFS boot integrates) certainly isn't attractive here (and
> adds another software layer between the dump code and the devices).

The ZFS boot project is fixing that CR so that you can dump to a zvol.

-- 
Darren J Moffat
Rainer Orth
2006-Jun-02 10:41 UTC
[zfs-discuss] question about ZFS performance for webserving/java
Darren J Moffat writes:

> >>> You can avoid this by swapping to a zvol, though at the moment this
> >>> requires a fix for CR 6405330. Unfortunately, since one cannot yet
> >>> dump to a zvol, one needs a dedicated dump device in this case ;-(
> >>
> >> Dedicated dump devices are *always* best, so this is no loss. Dumping
> >> through filesystem code when it may be that code itself which caused
> >> the panic is badness.
> >
> > True, but it adds another complication to boot disk creation, although
> > it would nonetheless be nice to have this at least as an option. In
> > addition, if the single dump slice isn't mirrored and the disk is
> > wedged, you lose the dump completely (though this should be rare).
> > Adding in an SVM mirror if one doesn't need SVM otherwise (as will
> > happen once ZFS boot integrates) certainly isn't attractive here (and
> > adds another software layer between the dump code and the devices).
>
> The ZFS boot project is fixing that CR so that you can dump to a zvol.

Cool, though I hope you aren't talking about CR 6405330, which I
mentioned above: that one is just about a zvol used as a swap device not
being added during a regular boot, since zvols are created later than
swapadd is run. I've contributed a fix already and Eric Lowe is working
to integrate it.

	Rainer