I thought I'd share some lessons learned testing Oracle APS on Solaris 10 using ZFS as backend storage. I just got done running two months' worth of performance tests on a V490 (32GB/4x1.8GHz dual-core proc system with 2x Sun 2G HBAs on separate fabrics) while varying how I managed storage. Storage used included EMC CX-600 disk (both presented as a LUN and as exported disks) and Pillar Axiom disk, using ZFS for all filesystems, some filesystems, and no filesystems vs. VxVM and UFS combinations. The specifics of my timing data won't be generally useful (I simply needed to drive down the timing of one large job), but this layout has been generally useful in keeping latency down among my ordinary ERP and other Oracle loads. These suggestions have been taken from the performance wiki, other mailing lists, posts made here, and my own guesstimates.

-Everything VxFS was pretty fast out of the box, but I expected that.
-Having everything vanilla UFS was a bit slower on filebench tests, but dragged out my plan during real loads.
-8k blocksize for datafiles is essential. Both filebench and live testing prove this out. (A rough sketch of the pool layout I ended up with is at the end of this post.)
-Separating the redo log from the data pool. Redo logs blew chunks on every ZFS installation, driving up my total process time in every case (the job is redo intensive at important junctures).
-Redo logs on a RAID 10 LUN on EMC using forcedirectio,noatime beat the same LUN using VxFS multiple times (didn't test Quick I/O, which we don't normally use anyway). Slicing and presenting LUNs from the same RAID group was faster than slicing a single LUN from the OS (for synchronized redo logs, primary sync on one group, secondary on the other), but it didn't get any faster or seriously drop my latency overhead when I used entirely separate RAID 10 groups. Using EMC LUNs was consistently faster than exporting the disks and making Veritas or DiskSuite LUNs.
-Separating /backup onto a separate pool made huge differences during backups. I use low-priority Axiom disk here.
-Exporting disks from EMC and using those to build RAID 10 mirrors. This is annoying, as I'd prefer to create the mirrors on EMC so I can take advantage of the hot spares and the backend processing, but the kernel still takes a crap every time a single non-redundant (to ZFS) device backs up and causes a bus reset.
-For my particular test, 7x RAID 10 (14 73GB 15k drives) ended up being as fast or faster than the same number of drives split into EMC LUNs with VxFS on them. With 11x (22 drives) and /backup and redo logs on the main pool, the drives always stay at high latency and performance craps out during backups.
-I tried futzing with sd:sd_max_throttle with values from 20 (low water mark) to 64 (high water mark without errors) and my particular process didn't seem to benefit. Left this value at 20, since EMC still recommends it.
-No particular value for PowerPath vs. MPxIO other than price.
-The set_arc.sh script (when it worked the first couple of times). The program grabs many GB of memory, so fighting the ARC cache for the right to mmap was a huge impediment.
-Pillar was pretty quick when I created multiple LUNs and strung them together as one big stripe, but wasn't as consistent in IO usage or overall time as EMC.

A couple of things that don't relate to the IO layout, but were important for APS:
-Sun systems need faster processors to process the jobs faster. Oracle beat our data processing time by 4x+ on an 8x1GHz system (we had 1.5GHz USIV+), and this bugged the hell out of me until I found out that they were running an HP 9000 rp4440, which has a combined memory bandwidth of 12.9GB/s no matter how many processors are running. USIV+ maxes out around 2.4GB/s per proc, but that scales the more procs you have working. This is all swell for RDBMS loads talking to shared memory, but most of the time in APS is spent running a single-threaded job that loads many gigabytes of memory and then processes the data. For that case, big, unscalable memory bandwidth beat the hell out of scalable procs at higher MHz. Going from 1.3GHz procs to 1.8GHz cut total running time (even with other improvements) by about 60%.
-MPSS using 4M pages vs. normal 8k pages made no real difference. While trapstat -T wasn't really showing a high percentage of misses, there was an assumption that anything that allowed the process to read data into memory faster would help performance. Maybe if we could actually recompile the binary, but by setting the environment to use the library we got nothing more than more 4M misses.
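For anyone who wants the shape of it, here's roughly what the final layout looks like in commands. Pool and device names are placeholders rather than my actual LUNs, and the mirror count is trimmed for brevity:

    # data pool: EMC-exported disks, ZFS doing the mirroring, plus a hot spare
    zpool create oradata mirror c2t0d0 c3t0d0 mirror c2t1d0 c3t1d0 \
        mirror c2t2d0 c3t2d0 spare c2t3d0
    zfs set recordsize=8k oradata        # match the 8k Oracle db_block_size
    zfs create oradata/oradb

    # separate pool for /backup on the low-priority Axiom LUNs
    zpool create orabkup c4t0d0
    zfs create orabkup/backup
    zfs set mountpoint=/backup orabkup/backup

Redo logs live outside ZFS entirely, on the UFS forcedirectio filesystems described above.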
I'm sorry dude, I can't make head or tail of your post. What is your point?
General Oracle zpool/zfs tuning, from my tests with Oracle 9i and the APS Memory Based Planner and filebench. All tests completed using Solaris 10 Update 2 and Update 3:

-Use zpools with 8k blocksize for data.
-Don't use zfs for redo logs - use ufs with directio and noatime. Building redo logs on EMC RAID 10 pools presented as separate devices seemed to produce the most %busy headroom for the log volumes during high activity.
-When using highly available SAN storage, export the disks as LUNs and use zfs to do your redundancy - using array redundancy (say 5 mirrors that you will zpool together as a stripe) will cause the machine to crap out and die if any of those mirrored devices, say, gets too much IO and causes the machine to do a bus reset. At that point it's better to export 10 disks and let zpool make your mirrors and your hot spares. When using Pillar storage, where you don't have direct access to the disk devices, I just made multiple LUNs and wasted a few extra blocks to give zpool local redundancy.
-I found no big performance difference using PowerPath or MPxIO, though device names are easier to use in PowerPath and MPxIO is cheaper.
-Using the set_arc.sh script (mdb -k'ing a ceiling for the ARC cache) to keep the ARC low and a lot of memory wide open is essential for Oracle performance. Its effectiveness is a little inconsistent, but I believe that's being looked into now. It'll be great when I can set a ceiling in /etc/system in Update 4 (see the sketch after this list).
-sd:sd_max_throttle testing didn't seem to present any great gain for values higher than 20, the EMC-recommended setting.
-My best rule of thumb for creating zpools is to determine the number of disks I'd normally use if I were creating the same Oracle setup using EMC LUNs, then apply about the same number of devices to the zpool. Still working on a better way to use only the storage I need but still get top performance.

Specific performance notes for Oracle's APS and Memory Based Planner:
-After a point, IO and filesystem tuning doesn't seem to gain performance benefits.
-Other than having the memory overhead the Memory Based Planner normally wants, excess memory doesn't seem to help. Memory size and Oracle caching seem to do less than having wider memory bandwidth.
-Faster processors were the best way to ensure direct performance gain in the Memory Based Planner tests.
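For what it's worth, both of those tunables end up as one-liners in /etc/system. The values here are examples rather than a recommendation, and the ARC ceiling only becomes settable this way once Update 4 arrives:

    * throttle outstanding commands per LUN, per EMC's recommendation
    set sd:sd_max_throttle = 20

    * cap the ARC (example: 8GB on a 32GB box) so the Memory Based Planner
    * isn't fighting it for memory; Solaris 10 Update 4 and later only,
    * earlier updates need the mdb/set_arc.sh workaround
    set zfs:zfs_arc_max = 0x200000000

Both take effect at the next reboot.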
Jeff,

This is great information. Thanks for sharing. Quick I/O is almost required if you want VxFS with Oracle. We ran a benchmark a few years back and found that VxFS is fairly cache hungry, and UFS with directio beats VxFS without Quick I/O hands down.

Take a look at what mpstat says on xcalls. See if you can limit that factor by binding the query process to either the processor or lgroup (see the sketch below). I suspect this should give you better times.

-- Just me, Wire ...

On 3/17/07, JS <jeff.sutch at acm.org> wrote:
> I thought I'd share some lessons learned testing Oracle APS on Solaris 10 using ZFS as backend storage. [...]
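To be concrete, something like this is what I have in mind (the CPU id and pid below are only examples):

    # watch the xcal column; one CPU doing most of the cross-calls is the tell
    mpstat 5

    # bind the planner process to a single processor
    pbind -b 16 12345      # 12345 = pid of the planner process (example)

psrset would also work if you want to fence off a whole processor set for it.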
JS wrote:
> General Oracle zpool/zfs tuning, from my tests with Oracle 9i and the APS Memory Based Planner and filebench. All tests completed using Solaris 10 Update 2 and Update 3:
>
> -Use zpools with 8k blocksize for data.

definitely!

> -Don't use zfs for redo logs - use ufs with directio and noatime. Building redo logs on EMC RAID 10 pools presented as separate devices seemed to produce the most %busy headroom for the log volumes during high activity.

We are currently recommending separate (ZFS) file systems for redo logs. Did you try that? Or did you go straight to a separate UFS file system for redo logs?

> [...]
> -sd:sd_max_throttle testing didn't seem to present any great gain for values higher than 20, the EMC-recommended setting.

This is not surprising. We see more issues with this when there is mixed storage, because other devices can be penalized by EMC's requirements. Some day, perhaps, our grandchildren will have a protocol that does proper flow control and we won't need this :-)
-- richard
> -When using highly available SAN storage, export the disks as LUNs and use zfs to do your redundancy - using array redundancy (say 5 mirrors that you will zpool together as a stripe) will cause the machine to crap out and die if any of those mirrored devices, say, gets too much IO and causes the machine to do a bus reset.

This sounds interesting to me! Did you find that a SCSI bus reset leads to a kernel panic? What do you get in the logs?

Thanks,
Gino
The big problem is that if you don't do your redundancy in the zpool, then the loss of a single device flatlines the system. This occurs in single-device pools, stripes, or concats. Sun support has said in support calls and SunSolve docs that this is by design, but I've never seen the loss of any other filesystem cause a machine to halt and dump core. Multiple bus resets can create a condition that makes the kernel believe that the device is no longer available. This was a persistent problem, especially on Pillar, until I started setting sd_max_throttle down.

"Why on earth would I not want to make redundant devices in zfs, when its reliability is so much better than other RAIDs?" This is the problem that says "I want the management ease of ZFS, but I don't want to have to jump through hoops in my SAN to present LUNs when the reliability is basically good enough." While I can knit multiple LUNs together in Pillar (wasting space on already redundant storage), it's easier to manage - say, for backup devices or small storage for a zone - to simply create a LUN and import it as a single zpool, adding space when necessary (see the sketch below).

Another great use of this would be to create mirrors on EMC and then knit those together as a stripe, taking advantage of my existing failover devices and zfs speed and management all at the same time. Unfortunately this bug puts the kibosh on that.
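For reference, the convenience case I mean is literally just this (pool and device names are placeholders):

    # one array LUN as a quick pool for a zone or scratch space
    # (no ZFS-level redundancy, so this is exactly the case that can
    # flatline the box today if that LUN has trouble)
    zpool create zonepool c6t0d0
    zfs create zonepool/zone1

    # grow it later by adding another LUN
    zpool add zonepool c6t1d0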
JS writes:
> The big problem is that if you don't do your redundancy in the zpool, then the loss of a single device flatlines the system. This occurs in single-device pools, stripes, or concats. Sun support has said in support calls and SunSolve docs that this is by design, but I've never seen the loss of any other filesystem cause a machine to halt and dump core. Multiple bus resets can create a condition that makes the kernel believe that the device is no longer available. This was a persistent problem, especially on Pillar, until I started setting sd_max_throttle down.

Such failures are certainly not "by design" and my understanding is that it's being very actively worked on. This said, redundancy in the zpool is a great idea. At the least it protects the path between the filesystem and the storage.

-r
I'd definitely prefer owning a sort of SAN solution that would basically just be trays of JBODs exported through redundant controllers, with enterprise-level service. The world is still playing catch-up to integrate with all the possibilities of zfs.
JS wrote:
> I'd definitely prefer owning a sort of SAN solution that would basically just be trays of JBODs exported through redundant controllers, with enterprise-level service. The world is still playing catch-up to integrate with all the possibilities of zfs.

It was called the A5000, later A5100 and A5200. I've still got the scars, and Torrey looks like one of the X-Men. If you think that a disk drive vendor can write better code than an OS/systems vendor, then you're due for a sad realization.
-- richard
Did you try using ZFS compression on the Oracle filesystems?
I didn't see an advantage in this scenario, though I use zfs/compression happily on my NFS user directory.
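For anyone curious, turning it on is just a property set (the pool/filesystem name here is an example):

    zfs set compression=on tank/home     # default lzjb compression
    zfs get compressratio tank/home      # see what it's actually buying you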
> We are currently recommending separate (ZFS) file systems for redo logs. Did you try that? Or did you go straight to a separate UFS file system for redo logs?

I'd answered this directly in email originally.

The answer is that yes, I tested using zfs for log pools among a number of disk layouts, and performance times were terrible on every one - no better than using a main zpool and carving off /log slices. Run times went down (good) and disk %busy stayed low on all the ufs/directio setups.
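For reference, the redo filesystems were nothing exotic - plain UFS mounted with directio, roughly like this (device and mount point names are placeholders):

    newfs /dev/rdsk/c5t0d0s0
    mkdir -p /oraredo1
    mount -F ufs -o forcedirectio,noatime /dev/dsk/c5t0d0s0 /oraredo1

    # or permanently, via /etc/vfstab:
    # /dev/dsk/c5t0d0s0  /dev/rdsk/c5t0d0s0  /oraredo1  ufs  2  yes  forcedirectio,noatime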
> > We are currently recommending separate (ZFS) file systems for redo logs. Did you try that? Or did you go straight to a separate UFS file system for redo logs?
>
> I'd answered this directly in email originally.
>
> The answer is that yes, I tested using zfs for log pools among a number of disk layouts, and performance times were terrible on every one - no better than using a main zpool and carving off /log slices. Run times went down (good) and disk %busy stayed low on all the ufs/directio setups.

This is surprising. ZFS should do well with redo logs on a different pool. What are your IO rates (IOPS/MB/s) for the log devices? Do you have an iostat from when you tried that?

thanks,
-neel
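Even something simple captured during the redo-heavy stretch would be enough to compare (the pool name below is just an example):

    iostat -xn 5                  # per-device throughput, service times and %b for the log LUNs
    zpool iostat -v logpool 5     # per-vdev ops and bandwidth when the logs were on ZFS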
So, if your array is something big like an HP XP12000, you wouldn't just make a zpool of one big LUN (LUSE volume), you'd split it in two and make a mirror when creating the zpool?

If the array has redundancy built in, you're suggesting adding another layer of redundancy using ZFS on top of that?

We're looking to use this in our environment. Just wanted some clarification.
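Concretely, I'm picturing something like this on the zpool side (pool and device names are just placeholders):

    # two LUNs carved from the already RAID-protected XP12000,
    # mirrored again by ZFS so checksum errors can be detected and repaired
    zpool create orapool mirror c7t0d0 c7t1d0

    # periodic scrub verifies every block against its checksum;
    # any repairs show up in the status error counters
    zpool scrub orapool
    zpool status -v orapool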
Basically, you would add a ZFS redundancy level if you want to be protected from silent data corruption (data corruption that could occur somewhere along the IO path):

- the XP12000 has all the features to protect from hardware failure (no SPOF)
- ZFS has all the features to protect from silent data corruption (no "SPOC", C = corruption)

This seems like overprotection, but it's the price to pay when dealing with large amounts of data nowadays.

selim

--
------------------------------------------------------
Blog: http://fakoli.blogspot.com/

On Dec 4, 2007 2:54 PM, Sean Parkinson <sean.parkinson at fda.hhs.gov> wrote:
> So, if your array is something big like an HP XP12000, you wouldn't just make a zpool of one big LUN (LUSE volume), you'd split it in two and make a mirror when creating the zpool? [...]
Seconded. Redundant controllers means you get one controller that locks them both up as much as it means you've got backup.

Best Regards,
Jason

On Mar 21, 2007 4:03 PM, Richard Elling <Richard.Elling at sun.com> wrote:
> JS wrote:
> > I'd definitely prefer owning a sort of SAN solution that would basically just be trays of JBODs exported through redundant controllers, with enterprise-level service. The world is still playing catch-up to integrate with all the possibilities of zfs.
>
> It was called the A5000, later A5100 and A5200. I've still got the scars, and Torrey looks like one of the X-Men. If you think that a disk drive vendor can write better code than an OS/systems vendor, then you're due for a sad realization.
> -- richard