Hi!

We've put 28x 750GB SATA drives in a RAIDZ2 pool (a single vdev) and we get about 80MB/s in sequential read or write. We're running local tests on the server itself (no network involved). Is that what we should be expecting? It seems slow to me.

Thanks
> Hi! We've put 28x 750GB SATA drives in a RAIDZ2 pool (a single vdev) and we get about 80MB/s in sequential read or write. We're running local tests on the server itself (no network involved). Is that what we should be expecting? It seems slow to me.

Please read the ZFS best practices guide at http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

To summarise, putting 28 disks in a single vdev is not something you would do if you want performance. You'll end up with roughly the IOPS of a single drive. Split it up into smaller (<10 disk) vdevs and try again. If you need high performance, put them in a striped mirror (aka RAID1+0).

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases adequate and relevant synonyms exist in Norwegian.
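For illustration, a striped-mirror (RAID1+0) pool of those 28 drives could be created along these lines; the pool name and device names below are placeholders and will differ on the actual system:

  zpool create tank \
      mirror c1t0d0 c1t1d0 \
      mirror c1t2d0 c1t3d0 \
      mirror c1t4d0 c1t5d0
  # ...continue until all 14 mirror pairs are listed
  zpool status tank    # confirm the vdev layout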
> To summarise, putting 28 disks in a single vdev is not something you would do if you want performance. You'll end up with roughly the IOPS of a single drive. Split it up into smaller (<10 disk) vdevs and try again. If you need high performance, put them in a striped mirror (aka RAID1+0).

A little addition - for 28 drives, I guess I'd choose four vdevs with seven drives each in raidz2. You'll lose space, but it'll be four times faster, and four times safer. Better safe than sorry...

Vennlige hilsener / Best regards

roy
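Similarly, a rough sketch of the four-by-seven raidz2 layout (again with placeholder pool and device names) would be:

  zpool create tank \
      raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
      raidz2 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 c1t12d0 c1t13d0 \
      raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 \
      raidz2 c2t7d0 c2t8d0 c2t9d0 c2t10d0 c2t11d0 c2t12d0 c2t13d0

Each seven-disk raidz2 vdev yields the capacity of five disks, so this layout keeps 20 of the 28 drives' worth of space while quadrupling the random IOPS of the single-vdev pool.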
> To summarise, putting 28 disks in a single vdev is not something you would do if you want performance. You'll end up with roughly the IOPS of a single drive. Split it up into smaller (<10 disk) vdevs and try again. If you need high performance, put them in a striped mirror (aka RAID1+0).

> A little addition - for 28 drives, I guess I'd choose four vdevs with seven drives each in raidz2. You'll lose space, but it'll be four times faster, and four times safer. Better safe than sorry...

Ok... so we've rebuilt the pool as 14 pairs of mirrors, each pair having one disk in each of the two JBODs. Now we're getting about 500-1000 IOPS (according to zpool iostat) and 20-30MB/sec in random read on a big database. Does that sound right?

Thanks
> Ok... so we've rebuilt the pool as 14 pairs of mirrors, each pair having one disk in each of the two JBODs. Now we're getting about 500-1000 IOPS (according to zpool iostat) and 20-30MB/sec in random read on a big database. Does that sound right?

At first glance, it seems low. Can you try to benchmark that with bonnie++ or iozone? You should be getting ten times that performance even for database usage.

Vennlige hilsener / Best regards

roy
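For example, with test sizes large enough that the ARC cannot cache the whole working set (the directory and sizes below are only placeholders; aim for at least roughly twice the machine's RAM):

  bonnie++ -d /tank/bench -s 256g -u nobody
  iozone -i 0 -i 2 -r 8k -s 16g -t 8    # run from a directory on the pool: sequential write plus random read, 8 threads

iozone's -r record size is best set close to the database's own I/O size so the random-read numbers are comparable to the real workload.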
On 7/3/2010 2:22 PM, Roy Sigurd Karlsbakk wrote:
>> Ok... so we've rebuilt the pool as 14 pairs of mirrors, each pair having one disk in each of the two JBODs. Now we're getting about 500-1000 IOPS (according to zpool iostat) and 20-30MB/sec in random read on a big database. Does that sound right?
>
> At first glance, it seems low. Can you try to benchmark that with bonnie++ or iozone? You should be getting ten times that performance even for database usage.
>
> Vennlige hilsener / Best regards
>
> roy

Actually, that sounds pretty much like what should be expected. 7200RPM SATA drives aren't going to be able to do more than 100 IOPS or so each under ideal conditions. So, the best case for your setup would be 1400 IOPS (each pair of mirrors is going to have slightly more IOPS than a single disk). You're reporting 40-70% of maximum, which is well within the expected range.

Similarly with the read speeds. If you're reading only small amounts each time, then your aggregate throughput isn't going to be anywhere near the theoretical streaming read. I've found that getting more than 5-10MB/s out of a single 7200RPM drive when doing random reads isn't possible. I would possibly expect your aggregate total to be more than 20-30MB/s, but it's still entirely possible, especially if most of your reads are small (which only span a couple of the mirrored pairs, rather than the full "stripe" width).

So, on the whole, I think that's about the best you can expect, given that your workload is NOT an optimal one for the types of disk you have.

Depending on your table layout and internal data design/schema, it might actually be to your benefit to split your disks into multiple POOLS, and put different DB tables on different pools. I'd suggest trying it out with 3 pools of mirrors, and see what that gets you.

And, of course, as for all database work, you need to get your DB indexes into RAM (or on a very fast SSD). Having them on disk is going to severely penalize your performance.

--
Erik Trimble
Java System Support
Santa Clara, CA
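A quick way to sanity-check the per-disk IOPS figure while the database load is running is to watch the physical disks on the Nexenta/OpenSolaris side:

  iostat -xn 5

The r/s and w/s columns give per-disk operations per second; 7200RPM SATA disks levelling off around 100 combined ops with a high %b (busy) value are simply seek-bound, which matches the numbers above.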
> Ok... so we've rebuilt the pool as 14 pairs of mirrors, each pair having one disk in each of the two JBODs. Now we're getting about 500-1000 IOPS (according to zpool iostat) and 20-30MB/sec in random read on a big database. Does that sound right?

Seems right, as Erik said. By the way, do you use SSDs for L2ARC/SLOG here? If not, that might help quite a bit.

Vennlige hilsener / Best regards

roy
> Seems right, as Erik said. By the way, do you use SSDs for L2ARC/SLOG here? If not, that might help quite a bit.

I have 8x 100GB SLC SSDs (the Samsung ones that Dell sells) for the L2ARC and 2x 4GB DDRDrive X1s in mirror for the SLOG. The server also has 128GB of RAM, and I can see 100GB+ of it being used for the ARC. I'll also double the RAM to 256GB and add 4 more SSDs (total 1.2TB) for the L2ARC once I'm ready to go to production. I will eventually connect a total of 75 SATA drives and 84 SAS 15K drives to it; I just want to make sure that I get the most out of what I have.

When I run a dozen large SQL queries at once (they can take >10 mins) I consistently get 300-1000 IOPS and 10-30MB/sec from the pool (according to zpool iostat).

What I don't understand is why, when I run a single query, I get <100 IOPS and <3MB/sec. The setup can obviously do better, so where is the bottleneck? I don't see any CPU core on either side being maxed out, so it can't be that...

The database is MySQL, it runs on a Linux box that connects to the Nexenta server through 10GbE using iSCSI.

You're very helpful btw, thanks a lot!

Ian
>> Ok... so we've rebuilt the pool as 14 pairs of mirrors, each pair having one disk in each of the two JBODs. Now we're getting about 500-1000 IOPS (according to zpool iostat) and 20-30MB/sec in random read on a big database. Does that sound right?

I am not sure who wrote the above text since the attribution quoting is all botched up (Gmail?) in this thread. Regardless, it is worth pointing out that 'zpool iostat' only reports the I/O operations which were actually performed. It will not report the operations which did not need to be performed because the data was already in cache. A quite busy system can still report very little via 'zpool iostat' if it has enough RAM to cache the requested data.

Bob

--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
On Sun, Jul 4, 2010 at 11:28 AM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> It is worth pointing out that 'zpool iostat' only reports the I/O operations which were actually performed. It will not report the operations which did not need to be performed because the data was already in cache. A quite busy system can still report very little via 'zpool iostat' if it has enough RAM to cache the requested data.
>
> Bob

Very good point. You can use a combination of "zpool iostat" and fsstat to see the effect of reads that didn't turn into physical I/Os.

--
Mike Gerdts
http://mgerdts.blogspot.com/
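For example, run both side by side while the queries are active (the pool name is a placeholder, and for a zvol exported over iSCSI the logical view may be better taken from the initiator side):

  fsstat zfs 5            # logical reads/writes seen at the filesystem layer
  zpool iostat -v tank 5  # physical I/O actually issued to each vdev

A large gap between the logical and physical read rates means the ARC/L2ARC is absorbing most of the requests.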
On Sun, Jul 4, 2010 at 10:08 AM, Ian D <reward72 at hotmail.com> wrote:
> What I don't understand is why, when I run a single query, I get <100 IOPS and <3MB/sec. The setup can obviously do better, so where is the bottleneck? I don't see any CPU core on either side being maxed out, so it can't be that...

In what way is CPU contention being monitored? "prstat" without options is nearly useless for a multithreaded app on a multi-CPU (or multi-core/multi-thread) system. mpstat is only useful if threads never migrate between CPUs. "prstat -mL" gives a nice picture of how busy each LWP (thread) is.

When viewed with "prstat -mL", a thread that has usr+sys at 100% cannot go any faster, unless you can get the CPU to go faster, as I suggest below. From my understanding (perhaps not 100% correct on the rest of this paragraph): the time spent in TRP may be reclaimed by running the application in a processor set with interrupts disabled on all of its processors. If TFL or DFL are high, optimizing the use of cache may be beneficial. Examples of how you can optimize the use of cache include using the FX scheduler with a priority that gives relatively long time slices, using processor sets to keep other processes off of the same caches (which are often shared by multiple cores), or perhaps disabling CPUs (threads) to ensure that only a single core is using each cache. With current-generation Intel CPUs, this can allow the CPU clock rate to increase, thereby allowing more work to get done.

> The database is MySQL, it runs on a Linux box that connects to the Nexenta server through 10GbE using iSCSI.

Oh, since the database runs on Linux I guess you need to dig up top's equivalent of "prstat -mL". Unfortunately, I don't think that Linux has microstate accounting, and as such you may not have visibility into time spent on traps, text faults, and data faults on a per-process basis.

Have you done any TCP tuning? Based on the numbers you cite above, it looks like you are doing about 32 KB I/Os. I think you can perform a test that involves mainly the network if you use netperf with options like:

  netperf -H $host -t TCP_RR -r 32768 -l 30

That is speculation based on reading http://www.netperf.org/netperf/training/Netperf.html. Someone else (perhaps on networking or performance lists) may have better tests to run.

--
Mike Gerdts
http://mgerdts.blogspot.com/
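On Linux, a rough per-thread view (without microstate accounting, as noted) is top in threads mode or pidstat from the sysstat package; the PID below is a placeholder:

  top -H -p <mysqld-pid>        # -H shows individual threads
  pidstat -t -p <mysqld-pid> 5

A single mysqld thread pinned near 100% CPU, or one spending most of its time in D (uninterruptible I/O wait) state, points at where the single-query bottleneck sits.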
> In what way is CPU contention being monitored? "prstat" without options is nearly useless for a multithreaded app on a multi-CPU (or multi-core/multi-thread) system. mpstat is only useful if threads never migrate between CPUs. "prstat -mL" gives a nice picture of how busy each LWP (thread) is.

Using "prstat -mL" on the Nexenta box shows no serious activity.

> Oh, since the database runs on Linux I guess you need to dig up top's equivalent of "prstat -mL". Unfortunately, I don't think that Linux has microstate accounting, and as such you may not have visibility into time spent on traps, text faults, and data faults on a per-process basis.

If CPU is the bottleneck then it's probably on the Linux box. Using "top" the following is typical of what I get:

top - 15:04:11 up 24 days, 4:13, 6 users, load average: 5.87, 5.79, 5.85
Tasks: 307 total, 1 running, 306 sleeping, 0 stopped, 0 zombie
Cpu0  :  0.6%us,  0.3%sy,  0.0%ni, 98.4%id,  0.0%wa,  0.3%hi,  0.3%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 96.2%id,  3.8%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  2.2%us,  5.1%sy,  0.0%ni, 55.0%id, 36.4%wa,  0.0%hi,  1.3%si,  0.0%st
Cpu3  :  3.3%us,  1.3%sy,  0.0%ni,  0.0%id, 95.0%wa,  0.3%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.7%sy,  0.0%ni, 98.7%id,  0.3%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 99.3%id,  0.7%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  1.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 : 16.6%us, 10.9%sy,  0.0%ni,  0.0%id, 70.6%wa,  0.3%hi,  1.6%si,  0.0%st
Cpu11 :  0.6%us,  1.0%sy,  0.0%ni, 66.9%id, 31.5%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.3%us,  0.0%sy,  0.0%ni, 95.7%id,  4.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  1.5%us,  0.0%sy,  0.0%ni, 98.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.7%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  74098512k total, 73910728k used,   187784k free,    96948k buffers
Swap:  2104488k total,      208k used,  2104280k free, 63210472k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
17652 mysql     20   0 3553m 3.1g 5472 S   38  4.4 247:51.80 mysqld
16301 mysql     20   0 4275m 3.3g 5980 S    4  4.7   5468:33 mysqld
16006 mysql     20   0 4434m 3.3g 5888 S    3  4.6   5034:06 mysqld
12822 root      15  -5     0    0    0 S    2  0.0  22:00.50 scsi_wq_39

> Have you done any TCP tuning?

Some, yes, but since I've seen much more throughput on other tests I've made, I don't think it's the bottleneck here.

Thanks!
On Jul 4, 2010, at 8:08 AM, Ian D wrote:
> What I don't understand is why, when I run a single query, I get <100 IOPS and <3MB/sec. The setup can obviously do better, so where is the bottleneck? I don't see any CPU core on either side being maxed out, so it can't be that...
>
> The database is MySQL, it runs on a Linux box that connects to the Nexenta server through 10GbE using iSCSI.

Check your Nagle algorithm setting. Since you didn't mention what OS you are running, it is a little bit hard for us to guess and give you the command. For a typical OpenSolaris-based distribution, try looking at the output of:

  ndd /dev/tcp tcp_naglim_def

If the value is not 1, then try setting it to 1. For NexentaStor users, this is easily changed in the GUI under the Settings -> Preferences -> Network form. For OpenStorage users, it should be set by default.
 -- richard

--
Richard Elling
richard at nexenta.com
+1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010 http://nexenta-rotterdam.eventbrite.com/
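If it is not 1 and you want to change it from the shell on the OpenSolaris/Nexenta side, the matching set command is:

  ndd -set /dev/tcp tcp_naglim_def 1

Note that an ndd change made this way does not survive a reboot unless it is also added to a startup script.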
On Sun, Jul 4, 2010 at 2:08 PM, Ian D <reward72 at hotmail.com> wrote:
> Mem:  74098512k total, 73910728k used,   187784k free,    96948k buffers
> Swap:  2104488k total,      208k used,  2104280k free, 63210472k cached
>
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 17652 mysql     20   0 3553m 3.1g 5472 S   38  4.4 247:51.80 mysqld
> 16301 mysql     20   0 4275m 3.3g 5980 S    4  4.7   5468:33 mysqld
> 16006 mysql     20   0 4434m 3.3g 5888 S    3  4.6   5034:06 mysqld
> 12822 root      15  -5     0    0    0 S    2  0.0  22:00.50 scsi_wq_39

Is that 38% of one CPU or 38% of all CPUs? How many CPUs does the Linux box have? I don't mean the number of sockets, I mean number of sockets * number of cores * number of threads per core. My recollection of top is that the CPU percentage is:

  (pcpu_t2 - pcpu_t1) / (interval * ncpus)

where pcpu_t* is the process CPU time at a particular time. If you have a two-socket quad-core box with hyperthreading enabled, that is 2 * 4 * 2 = 16 CPUs. 38% of 16 CPUs can be roughly 6 CPUs running as fast as they can (and 10 of them idle) or 16 CPUs each running at about 38%. In the "I don't have a CPU bottleneck" argument, there is a big difference.

If PID 16301 has a single thread that is doing significant work, on the hypothetical 16-CPU box this means that it is spending about 2/3 of the time on CPU. If the workload does:

  while ( 1 ) {
      issue I/O request
      get response
      do cpu-intensive work
  }

it is only trying to do I/O 1/3 of the time. Further, it has put a single high-latency operation between its bursts of CPU activity.

One other area of investigation that I didn't mention before: your stats imply that the Linux box is getting data 32 KB at a time. How does 32 KB compare to the database block size? How does 32 KB compare to the block size on the relevant zfs filesystem or zvol? Are blocks aligned at the various layers?

http://blogs.sun.com/dlutz/entry/partition_alignment_guidelines_for_unified

--
Mike Gerdts
http://mgerdts.blogspot.com/
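On the Nexenta side the relevant ZFS block sizes can be read directly (the dataset and zvol names below are placeholders):

  zfs get volblocksize tank/mysql-vol   # for an iSCSI zvol; fixed at creation time
  zfs get recordsize tank/mysql         # for a filesystem dataset

For comparison, InnoDB's default page size is 16 KB, so 32 KB transfers would mean something in the path is issuing or coalescing requests larger than the database's own pages.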
> Is that 38% of one CPU or 38% of all CPUs? How many CPUs does the Linux box have? I don't mean the number of sockets, I mean number of sockets * number of cores * number of threads per core.

The server has two Intel X5570s; they are quad core and have hyperthreading. It would say 800% if it was fully used. I've never seen that, but I've seen processes running at 350%+.

> One other area of investigation that I didn't mention before: your stats imply that the Linux box is getting data 32 KB at a time. How does 32 KB compare to the database block size? How does 32 KB compare to the block size on the relevant zfs filesystem or zvol? Are blocks aligned at the various layers?

Those are all good questions but they are going beyond my level of expertise. That's why I'll be wise and soon retain the services of our friend Richard Elling here for a few days of consulting :)

Thanks!
Ian
> The database is MySQL, it runs on a Linux box that connects to the Nexenta server through 10GbE using iSCSI.

Just a short question - wouldn't it be easier, and perhaps faster, to just have the MySQL DB on an NFS share? iSCSI adds complexity, both on the target and the initiator. Also, are you using jumbo frames? That can usually help a bit with either access protocol.

Vennlige hilsener / Best regards

roy
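For reference, jumbo frames only help if they are enabled end to end: on the Linux initiator, on every switch port in the path, and on the Nexenta target. A rough sketch with placeholder interface names (some Solaris drivers also need the MTU raised in their driver.conf):

  ip link set eth2 mtu 9000                  # Linux side
  dladm set-linkprop -p mtu=9000 ixgbe0      # OpenSolaris/Nexenta side
  ping -M do -s 8972 <target-ip>             # from Linux: verify a 9000-byte frame passes unfragmented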
> Just a short question - wouldn't it be easier, and perhaps faster, to just have the MySQL DB on an NFS share? iSCSI adds complexity, both on the target and the initiator.

Yes, we did try both and we didn't notice any difference in terms of performance. I've read conflicting opinions on which is best and the majority seems to say that iSCSI is better for databases, but I don't have any strong preference myself...

> Also, are you using jumbo frames? That can usually help a bit with either access protocol.

Yes. It was off early on and we did notice a significant difference once we switched it on. Turning Nagle off as suggested by Richard also seems to have made a little difference.

Thanks
> Yes, we did try both and we didn't notice any difference in terms of performance. I've read conflicting opinions on which is best and the majority seems to say that iSCSI is better for databases, but I don't have any strong preference myself...

Have you tried monitoring the I/O with vmstat or sar/sysstat? That should show the I/O speed as seen from Linux, and should be more relevant than the "real" I/O speed to/from the drives.

Vennlige hilsener / Best regards

roy
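For example, on the Linux box (the iSCSI block device name is a placeholder):

  vmstat 5                  # overall blocks in/out, run queue, iowait
  iostat -xm 5 /dev/sdb     # per-device MB/s, queue size and await (sysstat package)
  sar -d -p 5 10            # per-device activity sampled over time

Comparing the throughput and request sizes seen here with what zpool iostat reports on the Nexenta side shows whether requests are being served from the ARC or are getting held up in the network/iSCSI layer.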
On Jul 5, 2010, at 4:19 AM, Ian D wrote:
>> Also, are you using jumbo frames? That can usually help a bit with either access protocol.
>
> Yes. It was off early on and we did notice a significant difference once we switched it on. Turning Nagle off as suggested by Richard also seems to have made a little difference. Thanks

You need to disable Nagle on both ends: client and server.
 -- richard

--
Richard Elling
richard at nexenta.com
Under FreeBSD I've seen zpool scrub sustain nearly 500 MB/s in pools with large files (a pool with eight MIRROR vdevs on two Silicon Image 3124 controllers). You need to carefully look for bottlenecks in the hardware. You don't indicate how the disks are attached. I would measure the total bandwidth of the disks outside of ZFS (i.e., dd if=disk of=/dev/null bs=128k, one per disk on all disks simultaneously).
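A sketch of that parallel raw-disk read test on the Solaris/Nexenta side (device names are placeholders; use the disks that actually make up the pool, and note this is read-only):

  for d in c1t0d0 c1t1d0 c1t2d0 c1t3d0; do
      dd if=/dev/rdsk/${d}s0 of=/dev/null bs=128k count=8192 &
  done
  wait
  iostat -xn 5   # watch per-disk and aggregate MB/s while the dd processes run

If the aggregate bandwidth stops scaling as more disks are added, the limit is in the controller, expander or bus rather than in ZFS.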