I thought I'd share some lessons learned testing Oracle APS on Solaris 10 using ZFS as backend storage. I just got done running two months' worth of performance tests on a V490 (32GB/4x1.8GHz dual-core proc system with 2x Sun 2G HBAs on separate fabrics) while varying how I managed storage. Storage used included EMC CX-600 disk (both presented as a LUN and as exported disks) and Pillar Axiom disk, using ZFS for all filesystems, some filesystems, and no filesystems vs. VxVM and UFS combinations. The specifics of my timing data won't be generally useful (I simply needed to drive down the timing of one large job), but this layout has been generally useful in keeping latency down among my ordinary ERP and other Oracle loads. These suggestions have been taken from the performance wiki, other mailing lists, posts made here, and my own guesstimates.

-Everything VxFS was pretty fast out of the box, but I expected that.
-Having everything vanilla UFS was a bit slower on filebench tests, but dragged out my plan during real loads.
-8k blocksize for datafiles is essential. Both filebench and live testing prove this out. (A rough sketch of the pool layout I ended up with is at the end of this post.)
-Separating the redo log from the data pool. Redo logs blew chunks on every ZFS installation, driving up my total process time in every case (the job is redo intensive at important junctures).
-Redo logs on a RAID 10 LUN on EMC using forcedirectio,noatime beat the same LUN using VxFS multiple times (didn't test Quick I/O, which we don't normally use anyway). Slicing and presenting LUNs from the same RAID group was faster than slicing a single LUN from the OS (for synchronized redo logs, primary sync on one group, secondary on the other), but it didn't get any faster or seriously drop my latency overhead when I used entirely separate RAID 10 groups. Using EMC LUNs was consistently faster than exporting the disks and making Veritas or DiskSuite LUNs.
-Separating /backup onto a separate pool made huge differences during backups. I use low-priority Axiom disk here.
-Exporting disks from EMC and using those to build RAID 10 mirrors. This is annoying, as I'd prefer to create the mirrors on EMC so I can take advantage of the hot spares and the backend processing, but the kernel still takes a crap every time a single non-redundant (to ZFS) device backs up and causes a bus reset.
-For my particular test, 7x RAID 10 (14 73GB 15k drives) ended up being as fast or faster than the same number of drives split into EMC LUNs with VxFS on them. With 11x (22 drives) and /backup and redo logs on the main pool, the drives always stay at high latency and performance craps out during backups.
-I tried futzing with sd:sd_max_throttle with values from 20 (low water mark) to 64 (high water mark without errors) and my particular process didn't seem to benefit. Left this value at 20, since EMC still recommends it.
-No particular value for PowerPath vs. MPxIO other than price.
-The set_arc.sh script (when it worked the first couple of times). The program grabs many GB of memory, so fighting the ARC cache for the right to mmap was a huge impediment.
-Pillar was pretty quick when I created multiple LUNs and strung them together as one big stripe, but wasn't as consistent in IO usage or overall time as EMC.

A couple of things that don't relate to the IO layout, but were important for APS:
-Sun systems need faster processors to process the jobs faster. Oracle beat our data processing time by 4x+ on an 8x1GHz system (we had 1.5GHz USIV+), and this bugged the hell out of me until I found out that they were running an HP 9000 rp4440, which has a combined memory bandwidth of 12.9GB/s no matter how many processors are running. USIV+ maxes out around 2.4GB/s per proc, but that scales the more procs you have working. This is all swell for RDBMS loads talking to shared memory, but most of the time in APS is spent running a single-threaded job that loads many gigabytes of memory and then processes the data. For that case, big, unscalable memory bandwidth beat the hell out of scalable procs at higher MHz. Going from 1.3GHz procs to 1.8GHz cut total running time (even with other improvements) by about 60%.
-MPSS using 4M pages vs. normal 8k pages made no real difference. While trapstat -T wasn't really showing a high percentage of misses, there was an assumption that anything that allowed the process to read data into memory faster would help performance. Maybe if we could actually recompile the binary, but by setting the environment to use the library we got nothing more than more 4M misses.
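For anyone who wants the shape of it, here's roughly what the final layout looks like in commands. Pool and device names are placeholders rather than my actual LUNs, and the mirror count is trimmed for brevity:

    # data pool: EMC-exported disks, ZFS doing the mirroring, plus a hot spare
    zpool create oradata mirror c2t0d0 c3t0d0 mirror c2t1d0 c3t1d0 \
        mirror c2t2d0 c3t2d0 spare c2t3d0
    zfs set recordsize=8k oradata        # match the 8k Oracle db_block_size
    zfs create oradata/oradb

    # separate pool for /backup on the low-priority Axiom LUNs
    zpool create orabkup c4t0d0
    zfs create orabkup/backup
    zfs set mountpoint=/backup orabkup/backup

Redo logs live outside ZFS entirely, on the UFS forcedirectio filesystems described above.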
I'm sorry dude, I can't make head or tail of your post. What is your point?
General Oracle zpool/zfs tuning, from my tests with Oracle 9i and the APS Memory Based Planner and filebench. All tests completed using Solaris 10 Update 2 and Update 3:

-Use zpools with 8k blocksize for data.
-Don't use zfs for redo logs - use ufs with directio and noatime. Building redo logs on EMC RAID 10 pools presented as separate devices seemed to produce the most %busy headroom for the log volumes during high activity.
-When using highly available SAN storage, export the disks as LUNs and use zfs to do your redundancy - using array redundancy (say 5 mirrors that you will zpool together as a stripe) will cause the machine to crap out and die if any of those mirrored devices, say, gets too much IO and causes the machine to do a bus reset. At that point it's better to export 10 disks and let zpool make your mirrors and your hot spares. When using Pillar storage, where you don't have direct access to the disk devices, I just made multiple LUNs and wasted a few extra blocks to give zpool local redundancy.
-I found no big performance difference using PowerPath or MPxIO, though device names are easier to use in PowerPath and MPxIO is cheaper.
-Using the set_arc.sh script (mdb -k'ing a ceiling for the ARC cache) to keep the ARC low and a lot of memory wide open is essential for Oracle performance. Its effectiveness is a little inconsistent, but I believe that's being looked into now. It'll be great when I can set a ceiling in /etc/system in Update 4 (see the sketch after this list).
-sd:sd_max_throttle testing didn't seem to present any great gain for values higher than 20, the EMC-recommended setting.
-My best rule of thumb for creating zpools is to determine the number of disks I'd normally use if I were creating the same Oracle setup using EMC LUNs, then apply about the same number of devices to the zpool. Still working on a better way to use only the storage I need but still get top performance.

Specific performance notes for Oracle's APS and Memory Based Planner:
-After a point, IO and filesystem tuning doesn't seem to gain performance benefits.
-Other than having the memory overhead the Memory Based Planner normally wants, excess memory doesn't seem to help. Memory size and Oracle caching seem to do less than having wider memory bandwidth.
-Faster processors were the best way to ensure direct performance gain in the Memory Based Planner tests.
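For what it's worth, both of those tunables end up as one-liners in /etc/system. The values here are examples rather than a recommendation, and the ARC ceiling only becomes settable this way once Update 4 arrives:

    * throttle outstanding commands per LUN, per EMC's recommendation
    set sd:sd_max_throttle = 20

    * cap the ARC (example: 8GB on a 32GB box) so the Memory Based Planner
    * isn't fighting it for memory; Solaris 10 Update 4 and later only,
    * earlier updates need the mdb/set_arc.sh workaround
    set zfs:zfs_arc_max = 0x200000000

Both take effect at the next reboot.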
Jeff,

This is great information. Thanks for sharing. Quick I/O is almost required if you want VxFS with Oracle. We ran a benchmark a few years back and found that VxFS is fairly cache hungry, and UFS with directio beats VxFS without Quick I/O hands down.

Take a look at what mpstat says on xcalls. See if you can limit that factor by binding the query process to either the processor or lgroup (see the sketch below). I suspect this should give you better times.

-- Just me, Wire ...

On 3/17/07, JS <jeff.sutch at acm.org> wrote:
> I thought I'd share some lessons learned testing Oracle APS on Solaris 10 using ZFS as backend storage. [...]
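To be concrete, something like this is what I have in mind (the CPU id and pid below are only examples):

    # watch the xcal column; one CPU doing most of the cross-calls is the tell
    mpstat 5

    # bind the planner process to a single processor
    pbind -b 16 12345      # 12345 = pid of the planner process (example)

psrset would also work if you want to fence off a whole processor set for it.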
JS wrote:
> General Oracle zpool/zfs tuning, from my tests with Oracle 9i and the APS Memory Based Planner and filebench. All tests completed using Solaris 10 Update 2 and Update 3:
>
> -Use zpools with 8k blocksize for data.

definitely!

> -Don't use zfs for redo logs - use ufs with directio and noatime. Building redo logs on EMC RAID 10 pools presented as separate devices seemed to produce the most %busy headroom for the log volumes during high activity.

We are currently recommending separate (ZFS) file systems for redo logs. Did you try that? Or did you go straight to a separate UFS file system for redo logs?

> [...]
> -sd:sd_max_throttle testing didn't seem to present any great gain for values higher than 20, the EMC-recommended setting.

This is not surprising. We see more issues with this when there is mixed storage, because other devices can be penalized by EMC's requirements. Some day, perhaps, our grandchildren will have a protocol that does proper flow control and we won't need this :-)
-- richard
> -When using highly available SAN storage, export the disks as LUNs and use zfs to do your redundancy - using array redundancy (say 5 mirrors that you will zpool together as a stripe) will cause the machine to crap out and die if any of those mirrored devices, say, gets too much IO and causes the machine to do a bus reset.

This sounds interesting to me! Did you find that a SCSI bus reset leads to a kernel panic? What do you get in the logs?

Thanks,
Gino
The big problem is that if you don't do your redundancy in the zpool, then the loss of a single device flatlines the system. This occurs in single-device pools, stripes, or concats. Sun support has said in support calls and SunSolve docs that this is by design, but I've never seen the loss of any other filesystem cause a machine to halt and dump core. Multiple bus resets can create a condition that makes the kernel believe that the device is no longer available. This was a persistent problem, especially on Pillar, until I started setting sd_max_throttle down.

"Why on earth would I not want to make redundant devices in zfs, when its reliability is so much better than other RAIDs?" This is the problem that says "I want the management ease of ZFS, but I don't want to have to jump through hoops in my SAN to present LUNs when the reliability is basically good enough." While I can knit multiple LUNs together in Pillar (wasting space on already redundant storage), it's easier to manage - say, for backup devices or small storage for a zone - to simply create a LUN and import it as a single zpool, adding space when necessary (see the sketch below).

Another great use of this would be to create mirrors on EMC and then knit those together as a stripe, taking advantage of my existing failover devices and zfs speed and management all at the same time. Unfortunately this bug puts the kibosh on that.
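For reference, the convenience case I mean is literally just this (pool and device names are placeholders):

    # one array LUN as a quick pool for a zone or scratch space
    # (no ZFS-level redundancy, so this is exactly the case that can
    # flatline the box today if that LUN has trouble)
    zpool create zonepool c6t0d0
    zfs create zonepool/zone1

    # grow it later by adding another LUN
    zpool add zonepool c6t1d0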
JS writes:
> The big problem is that if you don't do your redundancy in the zpool, then the loss of a single device flatlines the system. This occurs in single-device pools, stripes, or concats. Sun support has said in support calls and SunSolve docs that this is by design, but I've never seen the loss of any other filesystem cause a machine to halt and dump core. Multiple bus resets can create a condition that makes the kernel believe that the device is no longer available. This was a persistent problem, especially on Pillar, until I started setting sd_max_throttle down.

Such failures are certainly not "by design" and my understanding is that it's being very actively worked on. This said, redundancy in the zpool is a great idea. At the least it protects the path between the filesystem and the storage.

-r
I'd definitely prefer owning a sort of SAN solution that would basically just be trays of JBODs exported through redundant controllers, with enterprise-level service. The world is still playing catch-up to integrate with all the possibilities of zfs.
JS wrote:
> I'd definitely prefer owning a sort of SAN solution that would basically just be trays of JBODs exported through redundant controllers, with enterprise-level service. The world is still playing catch-up to integrate with all the possibilities of zfs.

It was called the A5000, later A5100 and A5200. I've still got the scars, and Torrey looks like one of the X-Men. If you think that a disk drive vendor can write better code than an OS/systems vendor, then you're due for a sad realization.
-- richard
Did you try using ZFS compression on the Oracle filesystems?
I didn't see an advantage in this scenario, though I use zfs/compression happily on my NFS user directory.
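For anyone curious, turning it on is just a property set (the pool/filesystem name here is an example):

    zfs set compression=on tank/home     # default lzjb compression
    zfs get compressratio tank/home      # see what it's actually buying you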
> We are currently recommending separate (ZFS) file systems for redo logs. Did you try that? Or did you go straight to a separate UFS file system for redo logs?

I'd answered this directly in email originally.

The answer is that yes, I tested using zfs for log pools among a number of disk layouts, and performance times were terrible on every one - no better than using a main zpool and carving off /log slices. Run times went down (good) and disk %busy stayed low on all the ufs/directio setups.
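For reference, the redo filesystems were nothing exotic - plain UFS mounted with directio, roughly like this (device and mount point names are placeholders):

    newfs /dev/rdsk/c5t0d0s0
    mkdir -p /oraredo1
    mount -F ufs -o forcedirectio,noatime /dev/dsk/c5t0d0s0 /oraredo1

    # or permanently, via /etc/vfstab:
    # /dev/dsk/c5t0d0s0  /dev/rdsk/c5t0d0s0  /oraredo1  ufs  2  yes  forcedirectio,noatime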
> > We are currently recommending separate (ZFS) file systems for redo logs. Did you try that? Or did you go straight to a separate UFS file system for redo logs?
>
> I'd answered this directly in email originally.
>
> The answer is that yes, I tested using zfs for log pools among a number of disk layouts, and performance times were terrible on every one - no better than using a main zpool and carving off /log slices. Run times went down (good) and disk %busy stayed low on all the ufs/directio setups.

This is surprising. ZFS should do well with redo logs on a different pool. What are your IO rates (IOPS/MB/s) for the log devices? Do you have an iostat from when you tried that?

thanks,
-neel
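Even something simple captured during the redo-heavy stretch would be enough to compare (the pool name below is just an example):

    iostat -xn 5                  # per-device throughput, service times and %b for the log LUNs
    zpool iostat -v logpool 5     # per-vdev ops and bandwidth when the logs were on ZFS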
So, if your array is something big like an HP XP12000, you wouldn't just make a zpool of one big LUN (LUSE volume), you'd split it in two and make a mirror when creating the zpool?

If the array has redundancy built in, you're suggesting adding another layer of redundancy using ZFS on top of that?

We're looking to use this in our environment. Just wanted some clarification.
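Concretely, I'm picturing something like this on the zpool side (pool and device names are just placeholders):

    # two LUNs carved from the already RAID-protected XP12000,
    # mirrored again by ZFS so checksum errors can be detected and repaired
    zpool create orapool mirror c7t0d0 c7t1d0

    # periodic scrub verifies every block against its checksum;
    # any repairs show up in the status error counters
    zpool scrub orapool
    zpool status -v orapool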
Basically, you would add a ZFS redundancy level if you want to be protected from silent data corruption (data corruption that could occur somewhere along the IO path):

- the XP12000 has all the features to protect from hardware failure (no SPOF)
- ZFS has all the features to protect from silent data corruption (no "SPOC", C = corruption)

This seems like overprotection, but it's the price to pay when dealing with large amounts of data nowadays.

selim

--
------------------------------------------------------
Blog: http://fakoli.blogspot.com/

On Dec 4, 2007 2:54 PM, Sean Parkinson <sean.parkinson at fda.hhs.gov> wrote:
> So, if your array is something big like an HP XP12000, you wouldn't just make a zpool of one big LUN (LUSE volume), you'd split it in two and make a mirror when creating the zpool? [...]
Seconded. Redundant controllers means you get one controller that locks them both up as much as it means you've got backup.

Best Regards,
Jason

On Mar 21, 2007 4:03 PM, Richard Elling <Richard.Elling at sun.com> wrote:
> JS wrote:
> > I'd definitely prefer owning a sort of SAN solution that would basically just be trays of JBODs exported through redundant controllers, with enterprise-level service. The world is still playing catch-up to integrate with all the possibilities of zfs.
>
> It was called the A5000, later A5100 and A5200. I've still got the scars, and Torrey looks like one of the X-Men. If you think that a disk drive vendor can write better code than an OS/systems vendor, then you're due for a sad realization.
> -- richard