Working on a POC for high IO workloads, and I'm running into a bottleneck
that I'm not sure I can solve. Testbed looks like this:

SuperMicro 6026-6RFT+ barebones w/ dual 5506 CPUs, 72GB RAM, and ESXi
VM - 4GB RAM, 1 vCPU
Connectivity: dual 10Gbit Ethernet to Cisco Nexus 5010

Target Nexenta system:

Intel barebones, dual Xeon 5620 CPUs, 192GB RAM, Nexenta 3.1.3 Enterprise
Intel x520 dual port 10Gbit Ethernet - LACP Active VPC to Nexus 5010 switches
2x LSI 9201-16E HBAs, 1x LSI 9200-8e HBA
5 DAEs (3 in use for this test)
1 DAE - connected (multipathed) to the LSI 9200-8e, loaded w/ 6x Stec
ZeusRAM SSDs striped for ZIL and 6x OCZ Talos C 230GB drives for L2ARC
2 DAEs - connected (multipathed) to one LSI 9201-16E - 24x 600GB 15k
Seagate Cheetah drives
Obviously data integrity is not guaranteed.

Testing using IOMeter from a Windows guest, 10GB test file, queue depth of
64. I have a share set up with 4k recordsize, compression disabled, and
access time disabled, and am seeing performance as follows:

~50,000 IOPS 4k random read. 200MB/sec, 30% CPU utilization on Nexenta,
~90% utilization on guest OS. I'm guessing guest OS is bottlenecking.
Going to try physical hardware next week.
~25,000 IOPS 4k random write. 100MB/sec, ~70% CPU utilization on Nexenta,
~45% CPU utilization on guest OS. Feels like Nexenta CPU is bottleneck.
Load average of 2.5.

A quick test with 128k recordsize and 128k IO looked to be 400MB/sec;
can't remember CPU utilization on either side. Will retest and report
those numbers.

It feels like something is adding more overhead here than I would expect
on the 4k recordsize/IO workloads. Any thoughts where I should start on
this? I'd really like to see closer to 10Gbit performance here, but it
seems like the hardware isn't able to cope with it?
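For reference, the share settings described above map onto standard
zfs(1M) properties; a minimal sketch, with a hypothetical pool/dataset
name:

    zfs create tank/share
    zfs set recordsize=4K tank/share      # match the 4k test IO size
    zfs set compression=off tank/share
    zfs set atime=off tank/share          # "access time disabled"
    zfs set sharenfs=on tank/share        # export the share over NFS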
On Tue, 24 Jul 2012, matthewb at flash.shanje.com wrote:

> ~50,000 IOPS 4k random read. 200MB/sec, 30% CPU utilization on Nexenta,
> ~90% utilization on guest OS. I'm guessing guest OS is bottlenecking.
> Going to try physical hardware next week.
> ~25,000 IOPS 4k random write. 100MB/sec, ~70% CPU utilization on Nexenta,
> ~45% CPU utilization on guest OS. Feels like Nexenta CPU is bottleneck.
> Load average of 2.5.
>
> A quick test with 128k recordsize and 128k IO looked to be 400MB/sec;
> can't remember CPU utilization on either side. Will retest and report
> those numbers.
>
> It feels like something is adding more overhead here than I would expect
> on the 4k recordsize/IO workloads. Any thoughts where I should start on
> this? I'd really like to see closer to 10Gbit performance here, but it
> seems like the hardware isn't able to cope with it?

All systems have a bottleneck. You are highly unlikely to get close to
10Gbit performance with 4k random synchronous writes. 25K IOPS seems
pretty good to me.

The 2.4GHz clock rate of the 4-core Xeon CPU you are using is not terribly
high. Performance would likely be better with a higher-clocked, more
modern design with more cores.

Verify that the zfs checksum algorithm you are using is a low-cost one and
that you have not enabled compression or deduplication.

You did not tell us how your zfs pool is organized, so it is impossible to
comment further.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
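A quick way to check the settings Bob mentions (dataset name hypothetical):

    zfs get checksum,compression,dedup,recordsize,atime tank/share

With checksum=on, ZFS uses its default fletcher-based checksum, which is
the low-cost case; sha256 is the notably more expensive option.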
Pool is 6x striped Stec ZeusRAM as ZIL, 6x OCZ Talos C 230GB drives as
L2ARC, and 24x 15k SAS drives striped (no parity, no mirroring) - I know,
terrible for reliability, but I just want to see what kind of IO I can hit.

Checksum is ON - can't recall what default is right now.
Compression is off.
Dedupe is off.

Trying to figure out vdbench right now, but apparently that's beyond my
abilities at 8:30PM :(

-----Original Message-----
From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
Sent: Tuesday, July 24, 2012 8:13 PM
To: matthewb at flash.shanje.com
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] IO load questions

> Verify that the zfs checksum algorithm you are using is a low-cost one
> and that you have not enabled compression or deduplication.
>
> You did not tell us how your zfs pool is organized, so it is impossible
> to comment further.
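A rough zpool sketch of the layout described above - hypothetical device
names, only a few of the 24 data disks shown, and no redundancy anywhere:

    # plain (striped) data disks; "log" devices form a striped ZIL,
    # "cache" devices become L2ARC
    zpool create tank \
        c2t0d0 c2t1d0 c2t2d0 c2t3d0 \
        log c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 \
        cache c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0
    zpool status tank    # confirms how the vdevs are laid out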
Important question, what is the interconnect? iSCSI? FC? NFS?
 -- richard

On Jul 24, 2012, at 9:44 AM, matthewb at flash.shanje.com wrote:

> Working on a POC for high IO workloads, and I'm running into a bottleneck
> that I'm not sure I can solve. Testbed looks like this:
>
> [...]
>
> It feels like something is adding more overhead here than I would expect
> on the 4k recordsize/IO workloads. Any thoughts where I should start on
> this? I'd really like to see closer to 10Gbit performance here, but it
> seems like the hardware isn't able to cope with it?

Theoretical peak performance for a single 10GbE wire is near 300k IOPS @
4KB, unidirectional. This workload is extraordinarily difficult to achieve
with a single client using any of the popular storage protocols.
 -- richard

--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
NFS - iSCSI and FC/FCoE to come once I get it into the proper lab.

From: Richard Elling [mailto:richard.elling at gmail.com]
Sent: Tuesday, July 24, 2012 11:36 PM
To: matthewb at flash.shanje.com
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] IO load questions

> Important question, what is the interconnect? iSCSI? FC? NFS?
>  -- richard
On Jul 25, 2012, at 7:34 AM, Matt Breitbach wrote:

> NFS - iSCSI and FC/FCoE to come once I get it into the proper lab.

ok, so NFS for these tests.

I'm not convinced a single ESXi box can drive the load to saturate 10GbE.
Also, depending on how you are configuring the system, the I/O that you
think is 4KB might look very different coming out of ESXi. Use nfssvrtop
or one of the many dtrace one-liners for observing NFS traffic to see what
is really on the wire. And I'm very interested to know if you see 16KB
reads during the "write-only" workload.

more below...

> ~50,000 IOPS 4k random read. 200MB/sec, 30% CPU utilization on Nexenta,
> ~90% utilization on guest OS. I'm guessing guest OS is bottlenecking.
> Going to try physical hardware next week.
> ~25,000 IOPS 4k random write. 100MB/sec, ~70% CPU utilization on Nexenta,
> ~45% CPU utilization on guest OS. Feels like Nexenta CPU is bottleneck.
> Load average of 2.5.

For cases where you are not bandwidth limited, larger recordsizes can be
more efficient. There is no good rule-of-thumb for this, and larger
recordsizes will, at some point, hit the bandwidth bottlenecks. I've had
good luck with 8KB and 32KB recordsize for ESXi+Windows over NFS. I've
never bothered to test 16KB, due to lack of time.

> A quick test with 128k recordsize and 128k IO looked to be 400MB/sec;
> can't remember CPU utilization on either side. Will retest and report
> those numbers.

It would not surprise me to see a CPU bottleneck on the ESXi side at these
levels.
 -- richard
--
ZFS Performance and Training
Richard.Elling at RichardElling.com
+1-760-896-4422
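For the kind of on-the-wire observation suggested above (nfssvrtop or a
dtrace one-liner), a minimal sketch is something like the following. It
assumes the illumos/Solaris nfsv3 DTrace provider, and that the
op-read-start/op-write-start probes expose the protocol READ3args/WRITE3args
in args[2]; treat it as a starting point rather than a tested recipe.

    # count NFSv3 operations by type, reporting every 5 seconds
    dtrace -n 'nfsv3:::op-*-start { @ops[probename] = count(); }
               tick-5sec { printa(@ops); trunc(@ops); }'

    # distribution of read/write sizes actually seen on the wire
    # (args[2]->count is assumed to be the per-op byte count)
    dtrace -n 'nfsv3:::op-read-start  { @["read bytes"]  = quantize(args[2]->count); }
               nfsv3:::op-write-start { @["write bytes"] = quantize(args[2]->count); }'

If ESXi really issues 4KB writes, the write-size distribution should
cluster at 4096; larger or mixed sizes would point at the hypervisor or
guest rather than the wire.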
Trey,
Thanks for the enlightening info. I was really hoping that this system
could deliver more NFS IOPS out of RAM, but based on your results I'm
guessing that's just not possible with my hardware. Per chance have you
tried any of the software FCoE drivers for OI with your Intel x520 and
gotten any results there? I'm currently attached to a Nexus 5010 w/ no
storage licensure, so I can't test FCoE right now - moving to the same
(5548) switches you have next week to get some FCoE tests going. Would
love to see FCoE results, or anyone running RoCE/IB setups utilizing RDMA.
-----Original Message-----
From: Palmer, Trey [mailto:Trey.Palmer at gtri.gatech.edu]
Sent: Wednesday, July 25, 2012 8:22 PM
To: Richard Elling; Matt Breitbach
Cc: zfs-discuss at opensolaris.org
Subject: RE: [zfs-discuss] IO load questions
BTW these SSDs are 480GB Talos 2's.
________________________________________
From: Palmer, Trey
Sent: Wednesday, July 25, 2012 9:20 PM
To: Richard Elling; Matt Breitbach
Cc: zfs-discuss at opensolaris.org
Subject: RE: [zfs-discuss] IO load questions
Matt,
I've been testing an all-SSD array with Filebench. As Richard implied, I
think your results are about what you can expect for NFS. My results on
faster hardware are not blowing yours away.
I've been testing 8K records, but I tried 4K a few times (with 4K
recordsize, natch) without that much improvement.
I have found that the hardware (CPUs) makes a pretty big difference.
My test ZFS server is:
OI 151a5
HP Gen8, 2 x Xeon E5-2630, 384GB RAM
2 LSI 9205-8e's
Supermicro SC417 JBOD with 3 24x2.5 dual-port SAS backplanes
40 OCZ SSDs split between 2 SAS expanders, connected to separate SAS cards
mirrored zpool, recordsize=8K, atime=off, sync=disabled, primarycache=metadata
filebench directio=1, 32-128 total threads
Server and clients are single-connected via Intel X520 to a Nexus 5548.
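The dataset tuning described above corresponds to standard properties; a
minimal sketch with hypothetical pool/dataset names (the two mirror pairs
shown are placeholders for the full 40-SSD layout):

    zpool create fbtank mirror c5t0d0 c5t1d0 mirror c5t2d0 c5t3d0
    zfs create fbtank/fb
    zfs set recordsize=8K fbtank/fb
    zfs set atime=off fbtank/fb
    zfs set sync=disabled fbtank/fb          # no ZIL commits; test-only
    zfs set primarycache=metadata fbtank/fb  # keep file data out of ARC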
I tested with three different NFS clients, all running OI 151a5 or Solaris 11.
Here are the best results I got for read-only and ~70/30 read/write:
Dual-5530: 53K read, 36/15K read/write
Sparc T4-1: 62K read, 40/18K read/write
Dual E5-2630: 86K read, 49/23K read/write
On the local server I get these results:
168K read
76K write
115/45K read/write
85/62K read/write
Just for my own edification I set primarycache=all, directio=0 and ran
read tests on local pools on all three machines. This really shows the
difference made by the hardware. Peak rates were:
T4-1 397K
Dual-E5 345K
Dual-5530 182K
Also, latencies go up as you go down the chart. The T4-1 and dual-E5
reached peak results at 64/72 threads; the dual-G6 didn't scale above 24.
The E5 ZFS server can do uncached reads from the zfs pool almost as fast as
the dual-5530 can read from memory!!! (though latencies are much higher, 0.7
vs 0.1 ms).
The T4 is pretty impressive for even moderately threaded workloads, in this
test keeping up with the E5 at 8-12 threads and passing it handily at 24.
A giant leap over Niagara 2.
iperf shows the T4's network throughput to be slower than the E5's, which
likely explains it being slower for NFS but faster from memory. But we
don't have the mezzanine cards, so it's using a likely-suboptimal X520.
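For what it's worth, the iperf comparison mentioned above is typically run
along these lines (hostname hypothetical; iperf 2.x flags):

    iperf -s                           # on the ZFS server
    iperf -c zfs-server -P 4 -t 30     # on the client: 4 parallel streams, 30s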
-- Trey
From: zfs-discuss-bounces at opensolaris.org
[zfs-discuss-bounces at opensolaris.org] on behalf of Richard Elling
[richard.elling at gmail.com]
Sent: Wednesday, July 25, 2012 11:05 AM
To: Matt Breitbach
Cc: zfs-discuss at opensolaris.org
Subject: Re: [zfs-discuss] IO load questions

> I'm not convinced a single ESXi box can drive the load to saturate 10GbE.
> [...]
> Theoretical peak performance for a single 10GbE wire is near 300k IOPS @
> 4KB, unidirectional. This workload is extraordinarily difficult to
> achieve with a single client using any of the popular storage protocols.