Gray Carper
2008-Oct-14 07:31 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hey, all!

We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on an x4200 head node. In trying to discover optimal ZFS pool construction settings, we've run a number of iozone tests, so I thought I'd share them with you and see if you have any comments, suggestions, etc.

First, on a single Thumper, we ran baseline tests on the direct-attached storage (which is collected into a single ZFS pool comprised of four raidz2 groups)...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
Write: 123919
Rewrite: 146277
Read: 383226
Reread: 383567
Random Read: 84369
Random Write: 121617

[8GB file size, 512KB record size]
Command:
Write: 373345
Rewrite: 665847
Read: 2261103
Reread: 2175696
Random Read: 2239877
Random Write: 666769

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
Write: 517092
Rewrite: 541768
Read: 682713
Reread: 697875
Random Read: 89362
Random Write: 488944

These results look very nice, though you'll notice that the random read numbers tend to be pretty low on the 1GB and 64GB tests (relative to their sequential counterparts), while the 8GB random (and sequential) read is unbelievably good.

Now we move to the head node's iSCSI aggregate ZFS pool...
[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /volumes/data-iscsi/perftest/1gbtest
Write: 127108
Rewrite: 120704
Read: 394073
Reread: 396607
Random Read: 63820
Random Write: 5907

[8GB file size, 512KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f /volumes/data-iscsi/perftest/8gbtest
Write: 235348
Rewrite: 179740
Read: 577315
Reread: 662253
Random Read: 249853
Random Write: 274589

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /volumes/data-iscsi/perftest/64gbtest
Write: 190535
Rewrite: 194738
Read: 297605
Reread: 314829
Random Read: 93102
Random Write: 175688

Generally speaking, the results look good, but you'll notice that random writes are atrocious on the 1GB tests and random reads are not so great on the 1GB and 64GB tests, while the 8GB test looks great across the board. Voodoo! ;> Incidentally, I ran all these tests against the ZFS pool in disk, raidz1, and raidz2 modes - there were no significant changes in the results.

So, how concerned should we be about the low scores here and there? Any suggestions on how to improve our configuration? And how excited should we be about the 8GB tests? ;>

Thanks so much for any input you have!
-Gray
---
University of Michigan
Medical School Information Services
--
This message posted from opensolaris.org
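For anyone wanting to reproduce this matrix, the runs above can be scripted roughly as follows. This is a sketch, not Gray's actual script: `-i 0` selects write/rewrite, `-i 1` read/reread, `-i 2` random read/write, and since the 8GB command line wasn't preserved in the post, its `-r 512k` here is inferred from the "[8GB file size, 512KB record size]" header.

```shell
#!/bin/sh
# Sketch of the iozone test matrix described above.
# -i 0: write/rewrite  -i 1: read/reread  -i 2: random read/write
TARGET=/data-das/perftest    # or /volumes/data-iscsi/perftest on the head node

iozone -i 0 -i 1 -i 2 -r 1k   -s 1g  -f "$TARGET/1gbtest"
iozone -i 0 -i 1 -i 2 -r 512k -s 8g  -f "$TARGET/8gbtest"   # record size inferred from the header
iozone -i 0 -i 1 -i 2 -r 1m   -s 64g -f "$TARGET/64gbtest"
```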
James C. McPherson
2008-Oct-14 12:10 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Gray Carper wrote:
> Hey, all!
>
> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
> targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on
> an x4200 head node. In trying to discover optimal ZFS pool construction
> settings, we've run a number of iozone tests, so I thought I'd share them
> with you and see if you have any comments, suggestions, etc.

[snip]

Which build are you running? Have you done any system or ZFS tuning?

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Gray Carper
2008-Oct-14 12:30 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hey there, James!

We're actually running NexentaStor v1.0.8, which is based on b85. We haven't done any tuning ourselves, but I suppose it is possible that Nexenta did. If there's something specific you have in mind, I'd be happy to look for it.

Thanks!
-Gray

On Tue, Oct 14, 2008 at 8:10 PM, James C. McPherson <James.McPherson at sun.com> wrote:
> Gray Carper wrote:
>> Hey, all!
>>
>> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
>> targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on
>> an x4200 head node. In trying to discover optimal ZFS pool construction
>> settings, we've run a number of iozone tests, so I thought I'd share them
>> with you and see if you have any comments, suggestions, etc.
>
> [snip]
>
> Which build are you running? Have you done any system
> or ZFS tuning?
>
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20081014/ef6f99ff/attachment.html>
James C. McPherson
2008-Oct-14 12:33 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Gray Carper wrote:
> Hey there, James!
>
> We're actually running NexentaStor v1.0.8, which is based on b85. We
> haven't done any tuning ourselves, but I suppose it is possible that
> Nexenta did. If there's something specific you'd like me to look for,
> I'd be happy to.

Hi Gray,
So build 85.... that's getting a bit long in the tooth now.

I know there have been *lots* of ZFS, Marvell SATA and iSCSI fixes and enhancements since then which went into OpenSolaris. I know they're in Solaris Express and the updated binary distro form of os2008.05 - I just don't know whether Erast and the Nexenta clan have included them in what they are releasing as 1.0.8.

Erast - could you chime in here please? Unfortunately I've got no idea about Nexenta.

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Gray Carper
2008-Oct-14 12:44 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hello again! (And hellos to Erast, who has been a huge help to me many, many times! :>)

As I understand it, Nexenta 1.1 should be released in a matter of weeks and it'll be based on build 101. We are waiting for that with bated breath, since it includes some very important Active Directory integration fixes, but this sounds like another reason to be excited about it. Maybe this is a discussion that should be tabled until we are able to upgrade?

-Gray

On Tue, Oct 14, 2008 at 8:33 PM, James C. McPherson <James.McPherson at sun.com> wrote:
> Gray Carper wrote:
>> Hey there, James!
>>
>> We're actually running NexentaStor v1.0.8, which is based on b85. We
>> haven't done any tuning ourselves, but I suppose it is possible that
>> Nexenta did. If there's something specific you'd like me to look for,
>> I'd be happy to.
>
> Hi Gray,
> So build 85.... that's getting a bit long in the tooth now.
>
> I know there have been *lots* of ZFS, Marvell SATA and iSCSI
> fixes and enhancements since then which went into OpenSolaris.
> I know they're in Solaris Express and the updated binary distro
> form of os2008.05 - I just don't know whether Erast and the
> Nexenta clan have included them in what they are releasing as 1.0.8.
>
> Erast - could you chime in here please? Unfortunately I've got no
> idea about Nexenta.
>
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
James C. McPherson
2008-Oct-14 12:47 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Gray Carper wrote:
> Hello again! (And hellos to Erast, who has been a huge help to me many,
> many times! :>)
>
> As I understand it, Nexenta 1.1 should be released in a matter of weeks
> and it'll be based on build 101. We are waiting for that with bated
> breath, since it includes some very important Active Directory
> integration fixes, but this sounds like another reason to be excited
> about it. Maybe this is a discussion that should be tabled until we are
> able to upgrade?

Yup, I think that's probably the best thing. And thanks for passing on the info about the 1.1 release, I'll keep that in my back pocket :)

cheers,
James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Gray Carper
2008-Oct-14 12:59 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Howdy! Sounds good. We'll upgrade to 1.1 (b101) as soon as it is released, re-run our battery of tests, and see where we stand.

Thanks!
-Gray

On Tue, Oct 14, 2008 at 8:47 PM, James C. McPherson <James.McPherson at sun.com> wrote:
> Gray Carper wrote:
>> Hello again! (And hellos to Erast, who has been a huge help to me many,
>> many times! :>)
>>
>> As I understand it, Nexenta 1.1 should be released in a matter of weeks
>> and it'll be based on build 101. We are waiting for that with bated
>> breath, since it includes some very important Active Directory
>> integration fixes, but this sounds like another reason to be excited
>> about it. Maybe this is a discussion that should be tabled until we are
>> able to upgrade?
>
> Yup, I think that's probably the best thing. And thanks
> for passing on the info about the 1.1 release, I'll keep
> that in my back pocket :)
>
> cheers,
> James
>
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
gcarper at umich.edu | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
Akhilesh Mritunjai
2008-Oct-14 14:05 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Just a random spectator here, but I think the artifacts you're seeing here are not due to file size, but rather due to record size. What is the ZFS record size?

On a personal note, I wouldn't do non-concurrent (?) benchmarks. They are at best useless and at worst misleading for ZFS.

- Akhilesh.
--
This message posted from opensolaris.org
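For what it's worth, iozone can generate the kind of concurrent load Akhilesh is describing via its throughput mode. A sketch (the thread count, sizes, and paths here are illustrative, not a recommendation):

```shell
# iozone throughput mode: four workers running write/rewrite, read/reread
# and random I/O concurrently. -F takes one scratch file per worker.
iozone -i 0 -i 1 -i 2 -t 4 -r 128k -s 1g \
    -F /volumes/data-iscsi/perftest/t1 /volumes/data-iscsi/perftest/t2 \
       /volumes/data-iscsi/perftest/t3 /volumes/data-iscsi/perftest/t4
```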
Bob Friesenhahn
2008-Oct-14 14:36 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Tue, 14 Oct 2008, Gray Carper wrote:
>
> So, how concerned should we be about the low scores here and there?
> Any suggestions on how to improve our configuration? And how excited
> should we be about the 8GB tests? ;>

The level of concern should depend on how you expect your storage pool to actually be used. It seems that it should work great for bulk storage, but not to support a database, or ultra high-performance super-computing applications.

The good 8GB performance is due to successful ZFS ARC caching in RAM, and because the record size is reasonable given the ZFS block size and the buffering ability of the intermediate links. You might see somewhat better performance using a 256K record size.

It may take quite a while to fill 150TB up.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Erast Benson
2008-Oct-14 15:04 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
James, all serious ZFS bug fixes have been back-ported to b85, as well as the marvell and other sata drivers. Not everything is possible to back-port, of course, but I would say all critical things are there. This includes the ZFS ARC optimization patches, for example.

On Tue, 2008-10-14 at 22:33 +1000, James C. McPherson wrote:
> Gray Carper wrote:
>> Hey there, James!
>>
>> We're actually running NexentaStor v1.0.8, which is based on b85. We
>> haven't done any tuning ourselves, but I suppose it is possible that
>> Nexenta did. If there's something specific you'd like me to look for,
>> I'd be happy to.
>
> Hi Gray,
> So build 85.... that's getting a bit long in the tooth now.
>
> I know there have been *lots* of ZFS, Marvell SATA and iSCSI
> fixes and enhancements since then which went into OpenSolaris.
> I know they're in Solaris Express and the updated binary distro
> form of os2008.05 - I just don't know whether Erast and the
> Nexenta clan have included them in what they are releasing as 1.0.8.
>
> Erast - could you chime in here please? Unfortunately I've got no
> idea about Nexenta.
>
> James C. McPherson
> --
> Senior Kernel Software Engineer, Solaris
> Sun Microsystems
> http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Brent Jones
2008-Oct-14 17:29 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Tue, Oct 14, 2008 at 12:31 AM, Gray Carper <gcarper at umich.edu> wrote:
> Hey, all!
>
> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI targets
> over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool on an x4200
> head node. In trying to discover optimal ZFS pool construction settings,
> we've run a number of iozone tests, so I thought I'd share them with you
> and see if you have any comments, suggestions, etc.
>
> First, on a single Thumper, we ran baseline tests on the direct-attached
> storage (which is collected into a single ZFS pool comprised of four
> raidz2 groups)...
>
> [1GB file size, 1KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
> Write: 123919
> Rewrite: 146277
> Read: 383226
> Reread: 383567
> Random Read: 84369
> Random Write: 121617
>
> [8GB file size, 512KB record size]
> Command:
> Write: 373345
> Rewrite: 665847
> Read: 2261103
> Reread: 2175696
> Random Read: 2239877
> Random Write: 666769
>
> [64GB file size, 1MB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
> Write: 517092
> Rewrite: 541768
> Read: 682713
> Reread: 697875
> Random Read: 89362
> Random Write: 488944
>
> These results look very nice, though you'll notice that the random read
> numbers tend to be pretty low on the 1GB and 64GB tests (relative to their
> sequential counterparts), but the 8GB random (and sequential) read is
> unbelievably good.
>
> Now we move to the head node's iSCSI aggregate ZFS pool...
>
> [1GB file size, 1KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /volumes/data-iscsi/perftest/1gbtest
> Write: 127108
> Rewrite: 120704
> Read: 394073
> Reread: 396607
> Random Read: 63820
> Random Write: 5907
>
> [8GB file size, 512KB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 512 -s 8g -f /volumes/data-iscsi/perftest/8gbtest
> Write: 235348
> Rewrite: 179740
> Read: 577315
> Reread: 662253
> Random Read: 249853
> Random Write: 274589
>
> [64GB file size, 1MB record size]
> Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /volumes/data-iscsi/perftest/64gbtest
> Write: 190535
> Rewrite: 194738
> Read: 297605
> Reread: 314829
> Random Read: 93102
> Random Write: 175688
>
> Generally speaking, the results look good, but you'll notice that random
> writes are atrocious on the 1GB tests and random reads are not so great
> on the 1GB and 64GB tests, but the 8GB test looks great across the board.
> Voodoo! ;> Incidentally, I ran all these tests against the ZFS pool in
> disk, raidz1, and raidz2 modes - there were no significant changes in the
> results.
>
> So, how concerned should we be about the low scores here and there? Any
> suggestions on how to improve our configuration? And how excited should
> we be about the 8GB tests? ;>
>
> Thanks so much for any input you have!
> -Gray
> ---
> University of Michigan
> Medical School Information Services
> --
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

Your setup sounds very interesting, particularly how you export iSCSI to another head unit. Can you give me some more details on your file system layout, and how you mount it on the head unit? Sounds like a pretty clever way to export awesomely large volumes!

Regards,

--
Brent Jones
brent at servuhome.net
James C. McPherson
2008-Oct-14 21:52 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Erast Benson wrote:
> James, all serious ZFS bug fixes have been back-ported to b85, as well
> as the marvell and other sata drivers. Not everything is possible to
> back-port, of course, but I would say all critical things are there.
> This includes the ZFS ARC optimization patches, for example.

Excellent!

James
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Gray Carper
2008-Oct-15 04:51 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hey there, Bob!

Looks like you and Akhilesh (thanks, Akhilesh!) are driving at a similar, very valid point. I'm currently using the default recordsize (128K) on all of the ZFS pools (those of the iSCSI target nodes and the aggregate pool on the head node).

I should've mentioned something about how the storage will be used in my original post, so I'm glad you brought it up. It will all be presented over NFS and CIFS as a 10GbE+Infiniband NAS which will serve a number of organizations. Some organizations will simply use their area for end-user file sharing, others will use it as a disk backup target, others for databases, and still others for HPC data crunching (gene sequences). Each of these uses will be on different filesystems, of course, so I expect it would be good to set different recordsize parameters for each one. Do you have any suggestions on good starting sizes for each? I'd imagine filesharing might benefit from a relatively small record size (64K?), image-based backup targets might like a pretty large record size (256K?), databases just need recordsizes to match their block sizes, and HPC... I have no idea. Heh. I expect I'll need to get in contact with the HPC lab to see what kind of profile they have (whether they deal with tiny files or big files, etc). What do you think?

Today I'm going to try a few non-ZFS-related tweaks (disabling the Nagle algorithm on the iSCSI initiator and increasing the MTU everywhere to 9000). I'll give those a shot and see if they yield performance enhancements.

-Gray

On Tue, Oct 14, 2008 at 10:36 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> On Tue, 14 Oct 2008, Gray Carper wrote:
>>
>> So, how concerned should we be about the low scores here and there? Any
>> suggestions on how to improve our configuration? And how excited should
>> we be about the 8GB tests? ;>
>
> The level of concern should depend on how you expect your storage pool
> to actually be used. It seems that it should work great for bulk storage,
> but not to support a database, or ultra high-performance super-computing
> applications. The good 8GB performance is due to successful ZFS ARC
> caching in RAM, and because the record size is reasonable given the ZFS
> block size and the buffering ability of the intermediate links. You might
> see somewhat better performance using a 256K record size.
>
> It may take quite a while to fill 150TB up.
>
> Bob
> =====================================
> Bob Friesenhahn
> bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer, http://www.GraphicsMagick.org/

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
gcarper at umich.edu | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
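On Solaris-derived builds of that era, the two tweaks Gray mentions would look roughly like the following. This is a hedged sketch: `tcp_naglim_def` is the documented Solaris ndd knob that effectively disables Nagle when set to 1, and `nxge0` is only an example interface name; every switch port and hop in the path must also be configured for jumbo frames.

```shell
# Effectively disable Nagle by shrinking the TCP coalescing limit to 1 byte.
ndd -set /dev/tcp tcp_naglim_def 1

# Jumbo frames: raise the MTU on the 10GbE link (nxge0 is an example name).
ifconfig nxge0 mtu 9000
```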
Gray Carper
2008-Oct-15 09:50 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Howdy, Brent!

Thanks for your interest! We're pretty enthused about this project over here and I'd be happy to share some details with you (and anyone else who cares to peek). In this post I'll try to hit the major configuration bullet-points, but I can also throw you command-line level specifics if you want them.

1. The six Thumper iSCSI target nodes, and the iSCSI initiator head node, all have a high-availability network configuration by marrying link aggregation and IP multipathing techniques. Each machine has four 1GbE interfaces and one 10GbE interface (we could have had two 10GbE interfaces, but we decided to save some cash ;>). We link aggregate the four 1GbE interfaces together to create a fatter 4Gb pipe, then we use IP multipathing to group together the 10GbE interface and the 4Gb aggregation. Through this we create a virtual "service IP" which can float back and forth, automatically, between the two interfaces in the event of a network path failure. The preferred home is the 10GbE interface, but if that dies (or any part of its network path dies, like a switch somewhere down the line), then the service IP migrates to the 4Gb aggregate (which is on a completely separate network path) within four seconds. Whenever the 10GbE interface is happy again, the service IP automatically migrates back to its home.

2. The head node also has an Infiniband interface which plugs it into our HPC compute cluster network, giving the cluster direct access to whatever storage it needs.

3. All six iSCSI nodes have a redundant disk configuration using four ZFS raidz2 groups, each containing 10 drives which are spread across five controllers. Six additional disks, from a sixth controller, also live in the pool as spares. This results in a 28.4TB data pool for each node that can survive disk and controller failures.

4. Each of the six iSCSI nodes presents the entirety of its 28TB pool through CHAP-authenticated iSCSI targets. (See http://docs.sun.com/app/docs/doc/819-5461/gechv?a=view for more info on that process.)

5. The NAS head node has wrangled up all six of the iSCSI targets (using "iscsiadm add discovery-address ...") and joined them to create ~150TB of usable storage (using "zpool create" against the devices created with iscsiadm). With that, we've been able to carve up the storage into multiple ZFS filesystems, each with its own recordsize, quota, permissions, NFS/CIFS shares, etc.

I think that about covers the high-level stuff. If there's any area you want to dive deeper into, fire away!

-Gray

On Wed, Oct 15, 2008 at 1:29 AM, Brent Jones <brent at servuhome.net> wrote:
> On Tue, Oct 14, 2008 at 12:31 AM, Gray Carper <gcarper at umich.edu> wrote:
>> Hey, all!
>>
>> We've recently used six x4500 Thumpers, all publishing ~28TB iSCSI
>> targets over ip-multipathed 10GB ethernet, to build a ~150TB ZFS pool
>> on an x4200 head node. In trying to discover optimal ZFS pool
>> construction settings, we've run a number of iozone tests, so I thought
>> I'd share them with you and see if you have any comments, suggestions,
>> etc.
>>
>> [snip]
>
> Your setup sounds very interesting, particularly how you export iSCSI to
> another head unit. Can you give me some more details on your file system
> layout, and how you mount it on the head unit?
> Sounds like a pretty clever way to export awesomely large volumes!
>
> Regards,
>
> --
> Brent Jones
> brent at servuhome.net

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
gcarper at umich.edu | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
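Put concretely, the initiator-side plumbing Gray describes in step 5 would look roughly like this. A sketch only: the IP addresses and device names are placeholders, and the CHAP setup from step 4 (covered by the docs.sun.com link above) is omitted.

```shell
# On the x4200 head node: point the initiator at each of the six Thumpers...
iscsiadm add discovery-address 192.168.1.101:3260   # repeat for the other five
iscsiadm modify discovery --sendtargets enable

# ...confirm the six LUNs are visible...
iscsiadm list target
format < /dev/null      # the iSCSI LUNs appear as new cXtYdZ devices

# ...then build the aggregate pool from them (placeholder device names)
zpool create data-iscsi c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0

# Carve it up into per-use filesystems, each with its own settings.
zfs create data-iscsi/homes
zfs set quota=10T data-iscsi/homes
zfs set sharenfs=on data-iscsi/homes
```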
Bob Friesenhahn
2008-Oct-15 14:33 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Wed, 15 Oct 2008, Gray Carper wrote:
> be good to set different recordsize parameters for each one. Do you have
> any suggestions on good starting sizes for each? I'd imagine filesharing
> might benefit from a relatively small record size (64K?), image-based
> backup targets might like a pretty large record size (256K?), databases
> just need recordsizes to match their block sizes, and HPC... I have no
> idea. Heh. I expect I'll need to get in contact with the HPC lab to see
> what kind of profile they have (whether they deal with tiny files or big
> files, etc). What do you think?

Pretty much the *only* reason to reduce the ZFS recordsize from its default of 128K is to support relatively unusual applications like databases which do random read/writes of small (often 8K) blocks. For sequential I/O, 128K is fine even if the application (or client) does reads/writes using much smaller blocks.

For small-block random I/O you will find that ZFS performance improves immensely when the ZFS recordsize matches the application recordsize. The reason for this is that ZFS does I/O using its full blocksize and so there is more latency and waste of I/O bandwidth and CPU if ZFS needs to process a 128K block for each 8K block update.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
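Bob's advice translates into per-filesystem settings on the head node's pool. A sketch (filesystem names and the 8K figure are illustrative; note that recordsize only affects files written after the property is changed):

```shell
# Leave general file sharing at the 128K default; tune only the database area
# to match the database's block size (8K is a common example).
zfs create data-iscsi/db
zfs set recordsize=8K data-iscsi/db

# The backup-image area can simply stay at the default large recordsize.
zfs create data-iscsi/backups

# Verify the settings.
zfs get recordsize data-iscsi/db data-iscsi/backups
```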
Archie Cowan
2008-Oct-15 15:01 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
I just stumbled upon this thread somehow and thought I'd share my ZFS-over-iSCSI experience.

We recently abandoned a similar configuration with several pairs of x4500s exporting zvols as iscsi targets and mirroring them for "high availability" with T5220s.

Initially, our performance was also good using iozone tests, but, in testing the resilvering processes with 10TB of data, it was abysmal. It took over a month for a 10TB x4500 mirror that was mostly mirrored to resilver back into health with its pair. So, not exactly a highly available configuration... if one x4500 had gone unhealthy while the other was still resilvering, we'd have been in a real bad place.

Also, "zfs send" operations on filesystems hosted by the iscsi zpool couldn't push out more than a few kilobytes per second. Yes, we had all the multipathing, vlans, memory buffering and all kinds of nonsense to keep the network from being the bottleneck, but to not much benefit. This was our plan for keeping our remote sites' filesystems in sync, so it was vital.

Maybe we did something completely wrong with our setup, but I'd suggest you verify how long it takes to resilver new x4500s into your iscsi pools, and also see how well it does when your zpools are almost full. Our initial good performance test results were too good to be true, and it turned out that they weren't the whole story.

Good luck.
--
This message posted from opensolaris.org
Richard Elling
2008-Oct-15 17:39 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Archie Cowan wrote:
> I just stumbled upon this thread somehow and thought I'd share my zfs
> over iscsi experience.
>
> We recently abandoned a similar configuration with several pairs of
> x4500s exporting zvols as iscsi targets and mirroring them for "high
> availability" with T5220s.

In general, such tasks would be better served by T5220 (or the new T5440 :-) and J4500s. This would change the data paths from:

    client --<net>-- T5220 --<net>-- X4500 --<SATA>-- disks

to

    client --<net>-- T5440 --<SAS>-- disks

With the J4500 you get the same storage density as the X4500, but with SAS access (some would call this direct access). You will have much better bandwidth and lower latency between the T5440 (server) and disks while still having the ability to multi-head the disks.

The J4500 is a relatively new system, so this option may not have been available at the time Archie was building his system.
-- richard

> Initially, our performance was also good using iozone tests, but, in
> testing the resilvering processes with 10tb of data, it was abysmal. It
> took over a month for a 10tb x4500 mirror that was mostly mirrored to
> resilver back into health with its pair. So, not exactly a highly
> available configuration... if the other x4500 went unhealthy while the
> other was still resilvering we'd have been in a real bad place.
>
> Also, "zfs send" operations on filesystems hosted by the iscsi zpool
> couldn't push out more than a few kilobytes per second. Yes, we had all
> the multipathing, vlans, memory buffering and all kinds of nonsense to
> keep the network from being the bottleneck but to not much benefit. This
> was our plan for keeping our remote sites' filesystems in sync so it was
> vital.
>
> Maybe we did something completely wrong with our setup, but I'd suggest
> you verify how long it takes to resilver new x4500s into your iscsi pools
> and also see how well it does when your zpools are almost full. Our
> initial good performance test results were too good to be true and it
> turned out that they weren't the whole story.
>
> Good luck.
> --
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Akhilesh Mritunjai
2008-Oct-15 18:20 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hi Gray,

You've got a nice setup going there. A few comments:

1. Do not tune ZFS without a proven test case to show otherwise, except...
2. For databases. Tune the recordsize for that particular FS to match the DB recordsize.

A few questions...

* How are you divvying up the space?
* How are you taking care of redundancy?
* Are you aware that each layer of ZFS needs its own redundancy?

Since you have got a mixed use case here, I would be surprised if a single general config would cover all, though it might do with some luck.
Ross
2008-Oct-15 18:47 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Am I right in thinking your top-level zpool is a raid-z pool consisting of six 28TB iSCSI volumes? If so, that's a very nice setup; it's what we'd be doing if we had that kind of cash :-)
Miles Nordin
2008-Oct-15 18:54 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
>>>>> "gc" == Gray Carper <gcarper at umich.edu> writes:

    gc> 5. The NAS head node has wrangled up all six of the iSCSI
    gc> targets

are you using raidz on the head node? It sounds like simple striping, which is probably dangerous with the current code. This kind of sucks, because with simple striping you will get the performance of the 6 mega-spindles, while in a raidz you don't just get less storage, you get ~1/6th the seek bandwidth. but that's better than losing a whole pool. It's not even fully effective redundancy if resilvering/scrubbing takes 3 days per 1TB, but if it just stops the pool from becoming corrupt and unimportable then it's done its job.

how are you backing up that much storage? or is it all ephemeral? It's common to lose a whole pool, so I'd have thought you'd want to, for example, keep home directories on a main pool and a backup pool, but keep only one copy of the backup dumps, since in theory they have corresponding originals somewhere else.

If you did split your x4500 * 6 into two pools, I wonder how you'd lay out a ``main pool'' and ``backup pool'' such that they'd be unlikely to get corrupt together. make them on disjoint sets of iscsi target nodes? keep the backup pool exported? you could keep the backup pool imported so other groups can write their backups there, and spread it across all 6 targets, but declare a recurring noon - 3pm maintenance window for the backup pool, in which you: export, test-import, export, take snapshots of the zvols on the target nodes, import. Normally you would need II to use device snapshots for corruption protection, but since you have two layers of ZFS you can use this remedial trick without learning how to use AVS. but only while the pool is exported, because otherwise there's no way to take all six snapshots at the same instant. or you could just hope.

I'm most interested in failure testing. What happens when you reboot nodes or break network connections while there's write activity to the pool? That is nice that the ``service address'' fails over and fails back, and that you've somehow extended the heartbeat all the way from target to head node, but does this actually work well with iSCSI? Does the iSCSI TCP circuit get broken and remade when the address moves, and does this cause errors, or even cause corruption if it happens while writing to the pool? How about something more drastic, like rebooting the x4500s---does the head node patiently wait and then continue where it left off, like clients are supposed to when their NFS server reboots, or does it panic, or does it freeze for a couple minutes, mark the target down and continue, and then throw a bunch of CKSUM errors when the target comes back? The last one is what happens for me, but I have a mirror vdev on the head node, so my setup's different.

If you can get this setup to work sanely in error scenarios, I think it can potentially have an availability advantage, because some of the driver problems causing freezes and kernel panics and hung ports we've seen won't hurt you---you can just reboot the whole target node, so shitty drivers become merely irritating to the sysadmin instead of an availability problem. but my expectation is, you can't. It sounds really scary to me, to be honest, like: 200 eggs, one basket. and the basket is made of duct tape.

i'm less interested in performance. I can think of a bunch of silly performance-test questions, but I found most interesting Archie's experience about how performance can influence effective reliability. Here are the silly questions:

have you tried any other layouts? like exporting individual disks with iSCSI? My intuition is that this would not work well because of TCP congestion, and I also worry the iscsi target would freeze the whole box when one drive failed, a behavior which could be statistically significant to the overall system's reliability when there are so many drives involved. but I wonder.

also a simple SVM stripe, or maybe two or three stripes per box, might be faster by avoiding zvol COW. (also, know that Linux has an iSCSI target, too. actually it has three right now: IET, scst, and stgt)

any end-to-end testing yet? how is the performance of NFS or CIFS or...what are you hoping to use over the infiniband again, comstar/iSER or is it just IP+NFS? i don't know much about IB.

are there fast disks in the head node that you could use to experiment with slog or l2arc? since slogs can't be removed without destroying the pool, you might want testing of NFS+slog/NFS-slog before the pool has real data on it.

can you try with and without RED on the switches? i've always wondered if this makes a difference, but not bothered to check it because my targets are too slow.
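Miles' maintenance-window trick for the backup pool can be sketched roughly as follows. This is a hedged sketch, not a tested procedure: the pool name "backup", the hostnames thumper1..thumper6, and the per-target zvol name data/iscsivol are all hypothetical stand-ins for whatever the real setup uses.

```shell
# Sketch of the export / test-import / snapshot / import cycle described above.
# "backup" is the head-node pool; thumper1..6 are the iSCSI target nodes,
# each assumed to export a zvol named data/iscsivol.

zpool export backup    # quiesce: no writers, all six targets now consistent
zpool import backup    # test-import: verify the pool still imports cleanly
zpool export backup    # export again before snapshotting

# With the pool exported, the six zvols are mutually consistent, so
# per-target snapshots taken now form a usable point-in-time copy.
for t in thumper1 thumper2 thumper3 thumper4 thumper5 thumper6; do
    ssh "$t" zfs snapshot "data/iscsivol@maintenance-$(date +%Y%m%d)"
done

zpool import backup    # reopen the pool for backup writers
```

The export is what substitutes for Instant Image / AVS here: it guarantees the six snapshots describe the same instant, which six live snapshots could not.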
Miles Nordin
2008-Oct-15 19:04 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
>>>>> "r" == Ross <myxiplx at hotmail.com> writes:

    r> Am I right in thinking your top level zpool is a raid-z pool
    r> consisting of six 28TB iSCSI volumes? If so that's a very
    r> nice setup,

not if it scrubs at 400GB/day, and 'zfs send' is uselessly slow. Also, I am thinking the J4500 Richard mentioned may be more robust to single disk failures not taking down the whole box, compared to a device with a Solaris kernel in it.

s/very nice/stupidly large capacity/
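To put Miles' 400GB/day figure in context against the ~150TB pool under discussion, a quick back-of-the-envelope in shell arithmetic (the rate is Miles' number; the pool size is the original poster's):

```shell
# Full-scrub duration for a ~150TB pool at a 400GB/day scrub rate.
pool_tb=150
rate_gb_per_day=400
days=$(( pool_tb * 1024 / rate_gb_per_day ))
echo "full scrub: about $days days"   # -> full scrub: about 384 days
```

At that rate a single full scrub takes over a year, which is the scalability concern behind both Archie's and Miles' warnings.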
Marcelo Leal
2008-Oct-15 19:24 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Are you talking about what he had in the "logic of the configuration at top level", or are you saying his top-level pool is a raidz? I would think his top-level zpool is a raid0...
Bob Friesenhahn
2008-Oct-15 19:35 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Wed, 15 Oct 2008, Marcelo Leal wrote:
> Are you talking about what he had in the "logic of the configuration at top level", or you are saying his top level pool is a raidz?
> I would think his top level zpool is a raid0...

ZFS does not support RAID0 (simple striping).

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Tomas Ögren
2008-Oct-15 19:40 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On 15 October, 2008 - Bob Friesenhahn sent me these 0,6K bytes:

> On Wed, 15 Oct 2008, Marcelo Leal wrote:
>
> > Are you talking about what he had in the "logic of the configuration at top level", or you are saying his top level pool is a raidz?
> > I would think his top level zpool is a raid0...
>
> ZFS does not support RAID0 (simple striping).

zpool create mypool disk1 disk2 disk3

Sure it does.

/Tomas
--
Tomas Ögren, stric at acc.umu.se, http://www.acc.umu.se/~stric/
|- Student at Computing Science, University of Umeå
`- Sysadmin at {cs,acc}.umu.se
Marcelo Leal
2008-Oct-15 20:08 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
So, there is no raid10 in a solaris/zfs setup? I'm talking about "no redundancy"...
Bob Friesenhahn
2008-Oct-15 20:45 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Wed, 15 Oct 2008, Tomas Ögren wrote:
>> ZFS does not support RAID0 (simple striping).
>
> zpool create mypool disk1 disk2 disk3
>
> Sure it does.

This is load-share, not RAID0. Also, to answer the other fellow: since ZFS does not support RAID0, it also does not support RAID 1+0 (10). :-)

With RAID0 and 8 drives in a stripe, if you send a 128K block of data, it gets split up into eight chunks, with a chunk written to each drive. With ZFS's load share, that 128K block of data only gets written to one of the eight drives, and no striping takes place. The next write is highly likely to go to a different drive. Load share seems somewhat similar to RAID0, but it is easy to see that it is not by looking at the drive LEDs on a drive array while writes are taking place.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
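For readers following Marcelo's RAID10 question: the layout ZFS administrators usually reach for in place of RAID 1+0 is a dynamic stripe of mirror vdevs. A minimal sketch, with hypothetical pool and disk names (and, per Bob's point, the top level is load-share allocation rather than a true fixed-width stripe):

```shell
# Dynamic "stripe" across two 2-way mirror vdevs -- the common ZFS
# stand-in for RAID 1+0. Each mirror survives one disk failure;
# writes are spread across the two mirrors by ZFS's allocator.
zpool create tank \
    mirror c0t1d0 c1t1d0 \
    mirror c0t2d0 c1t2d0

zpool status tank    # shows two top-level mirror vdevs
```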
Richard Elling
2008-Oct-15 22:57 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Bob Friesenhahn wrote:
> On Wed, 15 Oct 2008, Tomas Ögren wrote:
>>> ZFS does not support RAID0 (simple striping).
>>
>> zpool create mypool disk1 disk2 disk3
>>
>> Sure it does.
>
> This is load-share, not RAID0. Also, to answer the other fellow,
> since ZFS does not support RAID0, it also does not support RAID 1+0
> (10). :-)

Technically correct. But beware of operational definitions. From the SNIA Dictionary, http://www.snia.org/education/dictionary

RAID Level 0
    [Storage System] Synonym for data striping.

RAID Level 1
    [Storage System] Synonym for mirroring.

RAID Level 10
    Not defined at SNIA, but generally agreed to be data stripes of mirrors.

Data Striping
    [Storage System] A disk array data mapping technique in which fixed-length sequences of virtual disk data addresses are mapped to sequences of member disk addresses in a regular rotating pattern. Disk striping is commonly called RAID Level 0 or RAID 0 because of its similarity to common RAID data mapping techniques. It includes no redundancy, however, so strictly speaking, the appellation RAID is a misnomer.

mirroring
    [Storage System] A configuration of storage in which two or more identical copies of data are maintained on separate media; also known as RAID Level 1, disk shadowing, real-time copy, and t1 copy.

ZFS dynamic stripes are not restricted to fixed-length sequences, so they are not, technically, data stripes by the SNIA definition. ZFS mirrors do fit the SNIA definition of mirroring, though ZFS does so by logical address, not a physical block offset. You will often see people describe ZFS mirroring with multiple top-level vdevs as RAID-1+0 (or 10), because that is a well-known thing. But if you see this in any of the official documentation, then please file a bug.

> With RAID0 and 8 drives in a stripe, if you send a 128K block of data,
> it gets split up into eight chunks, with a chunk written to each
> drive. With ZFS's load share, that 128K block of data only gets
> written to one of the eight drives and no striping takes place. The
> next write is highly likely to go to a different drive. Load share
> seems somewhat similar to RAID0 but it is easy to see that it is not
> by looking at the drive LEDs on a drive array while writes are taking
> place.

ZFS allocates data in slabs. By default, the slabs are 1 MByte each. So a vdev is divided into a collection of slabs, and when ZFS fills a slab, it moves on to another. With a dynamic stripe, the next slab may be on a different vdev, depending on how much free space is available. So you may see many I/Os hitting one disk, just because they happen to be allocated on the same vdev, perhaps in the same slab.
 -- richard
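An easy way to observe the allocation behavior Bob and Richard describe on a live system, without watching drive LEDs, is per-vdev I/O statistics (the pool name here is hypothetical):

```shell
# Report per-vdev I/O every 5 seconds. With dynamic striping you will
# often see write bursts concentrate on one top-level vdev at a time
# (slab-by-slab allocation) rather than the perfectly even split a
# fixed-width RAID0 stripe would show.
zpool iostat -v tank 5
```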
Ross
2008-Oct-16 07:45 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Well, obviously recovery scenarios need testing, but I still don't see it being that bad. My thinking on this is:

1. Loss of a server is very much the worst case scenario. Disk errors are much more likely, and with raid-z2 pools on the individual servers these should not pose a problem. I also would not expect to see disk failures downing an entire x4500. Sun have sold an awful lot of these now, enough for me to feel any such problems should be a thing of the past.

2. Even when a server does fail, the nature of ZFS is such that you would not expect to lose your data, nor should you be expecting to resilver the entire 28TB. A motherboard / backplane / PSU failure will offline that server, but once the faulted components are replaced your pool will come back online. Once the pool is online, ZFS has the ability to resilver just the changed data, meaning that your rebuild time will be simply proportional to the time the server was down.

Of course these failure modes would need testing, as would rebuild times. I don't see 'zfs send' performance being an issue though, not unless Gray has another 150TB of storage lying around that he's not telling us about. :-)

There are always going to be some tradeoffs between risk, capacity and price, but I expect that the benefits of this setup far outweigh the negatives.

Ross
Gray Carper
2008-Oct-16 07:50 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Howdy!

Very valuable advice here (and from Bob, who made similar comments - thanks, Bob!). I think, then, we'll generally stick to 128K recordsizes. In the case of databases, we'll stray as appropriate, and we may also stray with the HPC compute cluster if we can demonstrate that it is worth it.

To answer your questions below...

Currently, we have a single pool, in a "load share" configuration (no raidz), that collects all the storage (which answers Ross' question too). From that we carve filesystems on demand. There are many more tests planned for that construction, though, so we are not married to it.

Redundancy abounds. ;> Since the pool doesn't employ raidz, it isn't internally redundant, but we plan to replicate the pool's data to an identical system (which is not yet built) at another site. Our initial userbase doesn't need the replication, however, because they use the system for little more than scratch space. Huge genomic datasets are dumped on the storage, analyzed, and the results (which are much smaller) get sent elsewhere. Everything is wiped out soon after that and the process starts again. Future projected uses of the storage, however, would be far less tolerant of loss, so I expect we'll want to reconfigure the pool in raidz.

I see that Archie and Miles have shared some harrowing concerns, which we take very seriously. I don't think I'll be able to reply to them today, but I certainly will in the near future (particularly once we've completed some more of our induced failure scenarios).

Sidenote: Today we made eight network/iSCSI related tweaks that, in aggregate, have resulted in dramatic performance improvements (some I just hadn't gotten around to yet, others suggested by Sun's Mertol Ozyoney)...

- disabling the Nagle algorithm on the head node
- setting each iSCSI target block size to match the ZFS record size of 128K
- disabling "thin provisioning" on the iSCSI targets
- enabling jumbo frames everywhere (each switch and NIC)
- raising ddi_msix_alloc_limit to 8
- raising ip_soft_rings_cnt to 16
- raising tcp_deferred_acks_max to 16
- raising tcp_local_dacks_max to 16

Rerunning the same tests, we now see...

[1GB file size, 1KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1k -s 1g -f /data-das/perftest/1gbtest
Write: 143373
Rewrite: 183170
Read: 433205
Reread: 435503
Random Read: 90118
Random Write: 19488

[8GB file size, 512KB record size]
Command: iozone -i 0 -i 1 -i 2 -r 512k -s 8g -f /volumes/data-iscsi/perftest/8gbtest
Write: 463260
Rewrite: 449280
Read: 1092291
Reread: 881044
Random Read: 442565
Random Write: 565565

[64GB file size, 1MB record size]
Command: iozone -i 0 -i 1 -i 2 -r 1m -s 64g -f /data-das/perftest/64gbtest
Write: 357199
Rewrite: 342788
Read: 609553
Reread: 645618
Random Read: 218874
Random Write: 339624

Thanks so much to everyone for all their great contributions!
-Gray

On Thu, Oct 16, 2008 at 2:20 AM, Akhilesh Mritunjai <mritun+opensolaris at gmail.com> wrote:

> Hi Gray,
>
> You've got a nice setup going there, few comments:
>
> 1. Do not tune ZFS without a proven test-case to show otherwise, except...
> 2. For databases. Tune recordsize for that particular FS to match DB recordsize.
>
> Few questions...
>
> * How are you divvying up the space ?
> * How are you taking care of redundancy ?
> * Are you aware that each layer of ZFS needs its own redundancy ?
>
> Since you have got a mixed use case here, I would be surprized if a general config would cover all, though it might do with some luck.

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
gcarper at umich.edu | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
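Gray's list of tunables can be sketched concretely for readers who want to try the same thing. Treat this as an assumption-laden sketch, not Gray's actual procedure: the values are his, but whether each tunable belongs in ndd or /etc/system should be verified against the Solaris release in use, and the interface name nxge0 is hypothetical.

```shell
# Hedged sketch of the head-node tweaks (Solaris 10 / OpenSolaris era).

# Runtime TCP tunables via ndd (not persistent across reboots):
ndd -set /dev/tcp tcp_naglim_def 1          # 1 effectively disables Nagle
ndd -set /dev/tcp tcp_deferred_acks_max 16
ndd -set /dev/tcp tcp_local_dacks_max 16

# Persistent kernel settings in /etc/system (take effect at next boot):
cat >> /etc/system <<'EOF'
set ddi_msix_alloc_limit=8
set ip:ip_soft_rings_cnt=16
EOF

# Jumbo frames on the 10GbE interface (interface name hypothetical);
# every switch in the path must be configured for jumbo frames as well:
ifconfig nxge0 mtu 9000
```

The iSCSI-side changes (128K block size, thin provisioning off) are made in the target configuration on each Thumper rather than on the head node.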
Ross
2008-Oct-16 07:58 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Miles makes a good point here: you really need to look at how this copes with various failure modes. Based on my experience, iSCSI is something that may cause you problems. When I tested this kind of setup last year, I found that the entire pool hung for 3 minutes any time an iSCSI volume went offline. It looked like a relatively simple thing to fix if you can recompile the iSCSI driver, and there is talk about making the timeout adjustable, but for me that was enough to put our project on hold for now.

Ross
Gray Carper
2008-Oct-16 08:00 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Oops - one thing I meant to mention: We only plan to cross-site replicate data for those folks who require it. The HPC data crunching would have no use for it, so that filesystem wouldn't be replicated. In reality, we only expect a select few users, with relatively small filesystems, to actually need replication. (Which raises the question: Why build an identical 150TB system to support that? Good question. I think we'll reevaluate. ;>)

-Gray

On Thu, Oct 16, 2008 at 3:50 PM, Gray Carper <gcarper at umich.edu> wrote:

> Howdy!
>
> Very valuable advice here (and from Bob, who made similar comments -
> thanks, Bob!). I think, then, we'll generally stick to 128K recordsizes. In
> the case of databases, we'll stray as appropriate, and we may also stray
> with the HPC compute cluster if we can demonstrate that it is worth it.
>
> [...]

--
Gray Carper
MSIS Technical Services
University of Michigan Medical School
gcarper at umich.edu | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
Miles Nordin
2008-Oct-16 19:01 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
>>>>> "r" == Ross <myxiplx at hotmail.com> writes:

    r> 1. Loss of a server is very much the worst case scenario.
    r> Disk errors are much more likely, and with raid-z2 pools on
    r> the individual servers

yeah, it kind of sucks that the slow resilvering speed enforces this two-tier scheme. Also, if you're going to have 1000 spinning platters you'll have a drive failure every four days or so---you need to be able to do more than one resilver at a time, and you need to do resilvers without interrupting scrubs, which could take so long to run that you run them continuously. The ZFS-on-zvol hack lets you do both to a point, but I think it's an ugly workaround for lack of scalability in flat ZFS, not the ideal way to do things.

    r> A motherboard / backplane / PSU failure will offline that
    r> server, but once the faulted components are replaced your pool
    r> will come back online. Once the pool is online, ZFS has the
    r> ability to resilver just the changed data,

except that is not what actually happens for my iSCSI setup. If I 'zpool offline' the target before taking it down, it usually does work as you describe---a relatively fast resilver kicks off, and no CKSUM errors appear later. I've used it gently. I haven't offlined a raidz2 device for three weeks while writing gigabytes to the pool in the mean time, but for my gentle use it does seem to work.

But if the iSCSI target goes down unexpectedly---ex., because I pull the network cord---it does come back online and does resilver, but latent CKSUM errors show up weeks later. Also, if the head node reboots during a resilver, ZFS totally forgets what it was doing, and upon reboot just blindly mounts the unclean component as if it were clean, later calling all the differences CKSUM errors. the same thing happens if you offline a device, then reboot. The ``persistent'' offlining doesn't seem to work, and in any case the device comes online without a proper resilver.

SVM had dirty-region logging stored in the metadb so that resilvers could continue where they left off across reboots. I believe SVM usually did a full resilver when a component disappeared, but am not sure this was always the case. Anyway, ZFS doesn't seem to have a similar capability, at least not one that works.

so, in practice, whenever any iSCSI component goes away unexpectedly---target server failure, power failure, kernel panic, L2 spanning tree reconfiguration, whatever---you have to scrub the whole pool from the head node.

It's interesting how the speed and optimisation of these maintenance activities limit pool size. It's not just full scrubs. If the filesystem is subject to corruption, you need a backup. If the filesystem takes two months to back up / restore, then you need really solid incremental backup/restore features, and the backup needs to be a cold spare, not just a backup---restoring means switching the roles of the primary and backup system, not actually moving data.

finally, for really big pools, even O(n) might be too slow. The ZFS best practice guide for converting UFS to ZFS says ``start multiple rsync's in parallel,'' but I think we're finding zpool scrubs and zfs sends are not well-parallelized.

These reliability limitations and performance characteristics of maintenance tasks seem to make a sort of max-pool-size Wall, beyond which you end up painted into corners. If they were made better, I think you'd later hit another wall at the maximum amount of data you could push through one head node, and would have to switch to some QFS/GFS/OCFS-type separate-data-and-metadata filesystem, and to match ZFS this filesystem would have to do scrubs, resilvers, and backups in a distributed way, not just distribute normal data access.

A month ago I might have ranted, ``head node speed puts a cap on how _busy_ the filesystem can be, not how big it can be, so ZFS (modulo a lot of bug fixes) could be fantastic for data sets of virtually unlimited size even with its single-initiator, single-head-node limitation, so long as the pool gets very light access.'' Now, I don't think so, because scrubbing/resilvering/backup-restore has to flow through the head node, too.

This observation also means my preference for a ``recovery tool'' that treats corrupt pools as read-only, over fsck (online or offline), isn't very scalable. The original zfs kool-aid ``online maintenance'' model of doing a cheap fsck at import time and a long O(n) fsck through online scrubs is the only one with a future in a world where maintenance activities can take months.
Marion Hakanson
2008-Oct-16 19:20 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
carton at Ivy.NET said:
> It's interesting how the speed and optimisation of these maintenance
> activities limit pool size. It's not just full scrubs. If the filesystem is
> subject to corruption, you need a backup. If the filesystem takes two months
> to back up / restore, then you need really solid incremental backup/restore
> features, and the backup needs to be a cold spare, not just a
> backup---restoring means switching the roles of the primary and backup
> system, not actually moving data.

I'll chime in here with feeling uncomfortable with such a huge ZFS pool, and also with my discomfort with the ZFS-over-iSCSI-on-ZFS approach. There just seem to be too many moving parts depending on each other, any one of which can make the entire pool unavailable.

For the stated usage of the original poster, I think I would aim toward turning each of the Thumpers into an NFS server, configuring the head node as a pNFS/NFSv4.1 metadata server, and letting all the clients speak parallel NFS to the "cluster" of file servers. You'll end up with a huge logical pool, but a Thumper outage should result only in loss of access to the data on that particular system. The work of scrub/resilver/replication can be divided among the servers rather than all living on a single head node.

Regards,

Marion
Erast Benson
2008-Oct-16 19:43 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
pNFS is NFS-centric, of course, and it is not yet stable, is it? btw, what is the ETA for pNFS putback?

On Thu, 2008-10-16 at 12:20 -0700, Marion Hakanson wrote:
> I'll chime in here with feeling uncomfortable with such a huge ZFS pool,
> and also with my discomfort of the ZFS-over-ISCSI-on-ZFS approach. There
> just seem to be too many moving parts depending on each other, any one of
> which can make the entire pool unavailable.
>
> For the stated usage of the original poster, I think I would aim toward
> turning each of the Thumpers into an NFS server, configure the head-node
> as a pNFS/NFSv4.1 metadata server, and let all the clients speak parallel-NFS
> to the "cluster" of file servers. You'll end up with a huge logical pool,
> but a Thumper outage should result only in loss of access to the data on
> that particular system. The work of scrub/resilver/replication can be
> divided among the servers rather than all living on a single head node.
>
> Regards,
>
> Marion
Nicolas Williams
2008-Oct-16 19:53 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Thu, Oct 16, 2008 at 12:20:36PM -0700, Marion Hakanson wrote:
> I'll chime in here with feeling uncomfortable with such a huge ZFS pool,
> and also with my discomfort of the ZFS-over-ISCSI-on-ZFS approach. There
> just seem to be too many moving parts depending on each other, any one of
> which can make the entire pool unavailable.

But does it work well enough? It may be faster than NFS if there's only one
client for each volume (unless you have fast slog devices for the ZIL). And
it'd have better semantics too (e.g., no need for the client and server to
agree on identities/domains).

Nico
--
Marion Hakanson
2008-Oct-16 19:54 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Richard.Elling at Sun.COM said:
> In general, such tasks would be better served by T5220 (or the new T5440 :-)
> and J4500s. This would change the data paths from:
>   client --<net>-- T5220 --<net>-- X4500 --<SATA>-- disks
> to
>   client --<net>-- T5440 --<SAS>-- disks
>
> With the J4500 you get the same storage density as the X4500, but with SAS
> access (some would call this direct access). You will have much better
> bandwidth and lower latency between the T5440 (server) and disks while still
> having the ability to multi-head the disks.

There's an odd economic factor here, if you're in the .edu sector: The Sun
Education Essentials promotional price list has the X4540 priced lower than
a bare J4500 (not on the promotional list, but with a standard EDU
discount). We have a project under development right now which might be
served well by one of these EDU X4540's with a J4400 attached to it.

The spec sheets for the J4400 and J4500 say you can chain together enough
of them to make a pool of 192 drives. I'm unsure about the bandwidth of
these daisy-chained SAS interconnects, though. Any thoughts as to how high
one might scale an X4540-plus-J4x00 solution? How does the X4540's internal
disk bandwidth compare to that of the (non-RAID) SAS HBA?

Regards,

Marion
Miles Nordin
2008-Oct-16 20:30 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:

    nw> But does it work well enough? It may be faster than NFS if

You're talking about different things. Gray is using NFS period between
the storage cluster and the compute cluster, no iSCSI.

Gray's (``does it work well enough''):
  iSCSI within storage cluster
  NFS to storage consumers

Marion's (less ``uncomfortable''):
  nothing(?) within storage cluster
  pNFS to storage consumers

but Marion's is not really possible at all, and won't be for a while with
other groups' choice of storage-consumer platform, so it'd have to be
GlusterFS or some other goofy fringe FUSEy thing or a not-very-general
crude in-house hack.

I guess since Gray is copying data in and out all the time he doesn't have
to worry about the glacial-restore problem and the corruption problem.

If it were my worry, I'd definitely include NFS clients in the performance
test because iSCSI is high-latency, and the NFS clients could be more
latency-sensitive than the local benchmark. I might test coalescing in the
big data separately from running the crunching, because maybe the big data
can be copied in with pax-over-netcat, or something other than NFS, and
maybe the crunching could treat the big data as read-only and write its
small result to a fast standalone ZFS server, which would make NFS faster.

And I'd get the small important data that needs backup off this mess (but
please let us know how the failure-simulation testing goes!).
Nicolas Williams
2008-Oct-16 20:44 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Thu, Oct 16, 2008 at 04:30:28PM -0400, Miles Nordin wrote:
> >>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:
>
>     nw> But does it work well enough? It may be faster than NFS if
>
> You're talking about different things. Gray is using NFS period
> between the storage cluster and the compute cluster, no iSCSI.

I was replying to Marion's comment about "ZFS-over-ISCSI-on-ZFS," not to
Gray.

I can see why one might worry about ZFS-over-iSCSI-on-ZFS. Two layers of
copy-on-write might interact in odd ways that kill performance. But if you
want ZFS-over-iSCSI in the first place then ZFS-over-iSCSI-on-ZFS sounds
like the correct approach IF it can perform well enough.

ZFS-over-iSCSI could certainly perform better than NFS, but again, it may
depend on what kind of ZIL devices you have.

Nico
--
Miles Nordin
2008-Oct-16 21:11 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
>>>>> "nw" == Nicolas Williams <Nicolas.Williams at sun.com> writes:
>>>>> "mh" == Marion Hakanson <hakansom at ohsu.edu> writes:

    nw> I was replying to Marion's [...]
    nw> ZFS-over-iSCSI could certainly perform better than NFS,

better than what, ZFS-over-'mkfile'-files-on-NFS? No one was suggesting
that. Do you mean better than pNFS?

It sounded at first like you meant iSCSI-over-ZFS should perform better
than NFS, but no one's suggesting that either.

  Gray:   NFS over ZFS over iSCSI over ZFS over disk
  Marion: pNFS over ZFS over disk

They are both using the same amount of {,p}NFS.
David Magda
2008-Oct-16 22:29 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Oct 16, 2008, at 15:20, Marion Hakanson wrote:
> For the stated usage of the original poster, I think I would aim toward
> turning each of the Thumpers into an NFS server, configure the head-node
> as a pNFS/NFSv4.1

It's a shame that Lustre isn't available on Solaris yet either.
Marion Hakanson
2008-Oct-16 23:46 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
carton at Ivy.NET said:
> but Marion's is not really possible at all, and won't be for a while with
> other groups' choice of storage-consumer platform, so it'd have to be
> GlusterFS or some other goofy fringe FUSEy thing or not-very-general crude
> in-house hack.

Well, of course the magnitude of the fringe factor is in the eye of the
beholder. I didn't intend to make pNFS seem like a done deal. I don't
quite yet think of OpenSolaris as a "done deal" either, still using
Solaris-10 here in production, but since this is an OpenSolaris mailing
list I should be more careful.

Anyway, from looking over the wiki/blog info, apparently the sticking
point with pNFS may be client-side availability -- there are only Linux
and (Open)Solaris NFSv4.1 clients just yet. Still, pNFS claims to be
backwards compatible with NFSv3 clients: If you point a traditional NFS
client at the pNFS metadata server, the MDS is supposed to relay the data
from the backend data servers.

dmagda at ee.ryerson.ca said:
> It's a shame that Lustre isn't available on Solaris yet either.

Actually, that may not be so terribly fringey, either. Lustre and Sun's
Scalable Storage product can make use of Thumpers:
  http://www.sun.com/software/products/lustre/
  http://www.sun.com/servers/cr/scalablestorage/

Apparently it's possible to have a Solaris/ZFS data-server for Lustre
backend storage:
  http://wiki.lustre.org/index.php?title=Lustre_OSS/MDS_with_ZFS_DMU

I see they do not yet have anything other than Linux clients, so that's a
limitation. But you can share out a Lustre filesystem over NFS, potentially
from multiple Lustre clients. Maybe via CIFS/samba as well.

Lastly, I've considered the idea of using Shared-QFS to glue together
multiple Thumper-hosted iSCSI LUN's. You could add shared-QFS clients
(acting as NFS/CIFS servers) if the client load needed more than one. Then
SAM-FS would be a possibility for backup/replication.

Anyway, I do feel that none of this stuff is quite "there" yet. But my
experience with ZFS on fibre-channel SAN storage, that sinking feeling I've
had when a little connectivity glitch resulted in a ZFS panic, makes me
wonder if non-redundant ZFS on an iSCSI SAN is "there" yet, either. So far
none of our lost-connection incidents resulted in pool corruption, but we
have only 4TB or so. Restoring that much from tape is feasible, but even if
Gray's 150TB of data can be recreated, it would take weeks to reload it.

If it's decided that the clustered-filesystem solutions aren't feasible
yet, the suggestion I've seen that I liked the best was Richard's, with a
bad-boy server SAS-connected to multiple J4500's. But since Gray's project
already has the X4500's, I guess they'd have to find another use for them
(:-).

Regards,

Marion
Ross
2008-Oct-17 08:31 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Some of that is very worrying, Miles, do you have bug IDs for any of those
problems?

I'm guessing the problem of the device being reported ok after the reboot
could be this one:
http://bugs.opensolaris.org/view_bug.do?bug_id=6582549

And could the errors after the reboot be one of these?
http://bugs.opensolaris.org/view_bug.do?bug_id=6558852
http://bugs.opensolaris.org/view_bug.do?bug_id=6675685

I don't have the same concerns myself that you guys have over massive
pools since we're working at a much smaller scale, but even so it's no
good ZFS having one of its main selling features as "only resilvers the
missing data" if it can't be relied upon to do that every time in
real-world situations.

Incidentally, even with those resilver bugs, a few back-of-the-envelope
calculations make me think that this might not be too bad with 10Gb
ethernet:

  Server size: 28TB
  Interconnect speed: 10Gb/s (call it 8Gb/s of actual bandwidth)
  Usage: 70% (worst-case scenario - pool dies while under heavy load)

That gives us an available resilver bandwidth of 3Gb/s, which I'll divide
by two since that has to be used for both reads and writes. 28TB at
1.5Gb/s gives a resilver time of around 42 hours, and changing some of the
assumptions by dropping pool usage to 20% brings that down to 16 hours.

It's still a long time, but for a rare disaster-recovery scenario for a
large pool, I think I could live with it.
-- 
This message posted from opensolaris.org
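Ross's envelope arithmetic above can be sketched as a small shell helper
(a hypothetical function, not from the thread; note his 3Gb/s figure
appears to take the idle fraction of the raw 10Gb/s link, so the sketch
does the same):

```shell
# Rough resilver-time estimate, mirroring the envelope math above.
# args: pool_tb link_gbps busy_pct
resilver_hours() {
    awk -v tb="$1" -v gbps="$2" -v busy="$3" 'BEGIN {
        # bandwidth left for resilvering, halved because the same
        # link carries both the reads and the writes
        avail = gbps * (1 - busy/100) / 2
        print tb * 1000 * 8 / avail / 3600   # TB -> Gbit, sec -> hours
    }'
}

resilver_hours 28 10 70   # heavy load: ~41.5 hours
resilver_hours 28 10 20   # light load: ~15.6 hours
```

This reproduces the ~42-hour and ~16-hour figures in the post; as Miles
notes downthread, real resilvers rarely run anywhere near wire speed, so
treat these as lower bounds.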
Miles Nordin
2008-Oct-17 18:35 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
>>>>> "r" == Ross <myxiplx at hotmail.com> writes:

     r> do you have bug IDs for any of those problems?

yeah, some of them, so maybe they will be fixed in s10u6. Sometimes the
bug report writer has a narrower idea of the problem than I do, but
bugs.opensolaris.org is still encouraging. Also note that there are secret
bugs, usually a bad thing, but you could pervert it into reason for even
more optimism.

  6592835 6602697 6722540 6698575 6675685

     r> Server size: 28TB Interconnect speed: 10Gb/s (call it 8Gb/s of
     r> actual bandwidth) Usage: 70%

     r> That gives us an available resilver bandwidth of 3Gb/s, which
     r> I'll divide by two since that has to be used for both reads
     r> and writes.

well...10Gbit/s Ethernet is full duplex, but none of that matters. I
don't think people are reporting resilvers at anywhere near wire speed.
Gary Mills
2008-Oct-20 18:28 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
On Thu, Oct 16, 2008 at 03:50:19PM +0800, Gray Carper wrote:
> Sidenote: Today we made eight network/iSCSI-related tweaks that, in
> aggregate, have resulted in dramatic performance improvements (some I
> just hadn't gotten around to yet, others suggested by Sun's Mertol
> Ozyoney)...
> - disabling the Nagle algorithm on the head node
> - setting each iSCSI target block size to match the ZFS record size of 128K
> - disabling "thin provisioning" on the iSCSI targets
> - enabling jumbo frames everywhere (each switch and NIC)
> - raising ddi_msix_alloc_limit to 8
> - raising ip_soft_rings_cnt to 16
> - raising tcp_deferred_acks_max to 16
> - raising tcp_local_dacks_max to 16

Can you tell us which of those changes made the most dramatic improvement?
I have a similar situation here, with a 2-TB ZFS pool on a T2000 using
iSCSI to a NetApp file server. Is there any way to tell in advance if any
of those changes will make a difference? Many of them seem to be server
resources. How can I determine their current usage?

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
Jim Dunham
2008-Oct-20 20:21 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Gary,

>> Sidenote: Today we made eight network/iSCSI-related tweaks that, in
>> aggregate, have resulted in dramatic performance improvements (some I
>> just hadn't gotten around to yet, others suggested by Sun's Mertol
>> Ozyoney)...
>> - disabling the Nagle algorithm on the head node
>> - setting each iSCSI target block size to match the ZFS record size of 128K
>> - disabling "thin provisioning" on the iSCSI targets
>> - enabling jumbo frames everywhere (each switch and NIC)
>> - raising ddi_msix_alloc_limit to 8
>> - raising ip_soft_rings_cnt to 16
>> - raising tcp_deferred_acks_max to 16
>> - raising tcp_local_dacks_max to 16
>
> Can you tell us which of those changes made the most dramatic
> improvement?

>> - disabling the Nagle algorithm on the head node

This will have a dramatic effect on most I/Os, except for large sequential
writes.

>> - setting each iSCSI target block size to match the ZFS record size of 128K
>> - enabling jumbo frames everywhere (each switch and NIC)

These will have a positive effect for large writes, both sequential and
random.

>> - disabling "thin provisioning" on the iSCSI targets

This only has a benefit for file-based or dsk-based backing stores. If one
uses rdsk backing stores of any type, this is not an issue.

Jim

> I have a similar situation here, with a 2-TB ZFS pool on a T2000 using
> iSCSI to a NetApp file server. Is there any way to tell in advance if any
> of those changes will make a difference? Many of them seem to be server
> resources. How can I determine their current usage?

Jim Dunham
Storage Platform Software Group
Sun Microsystems, Inc.
Gray Carper
2008-Oct-21 00:54 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hey, Jim! Thanks so much for the excellent assist on this - much better
than I could have ever answered it! I thought I'd add a little bit on the
other four...

- raising ddi_msix_alloc_limit to 8
For PCI cards that use up to 8 interrupts, which our 10GbE adapters do.
The previous value of 2 could cause some CPU interrupt bottlenecks. So
far, this has been more of a preventative measure - we haven't seen a case
where this really made any performance impact.

- raising ip_soft_rings_cnt to 16
This increases the number of kernel threads associated with packet
processing and is specifically meant to reduce the latency in handling
10GbE. This showed a small performance improvement.

- raising tcp_deferred_acks_max to 16
This reduces the number of ACK packets sent, thus reducing the overall TCP
overhead. This showed a small performance improvement.

- raising tcp_local_dacks_max to 16
This also slows down ACK packets and showed a tiny performance improvement.

Overall, we have found these four settings to not make a whole lot of
difference, but every little bit helps. ;> The four that Jim went through
were much more impactful, particularly the enabling of jumbo frames and
the disabling of the Nagle algorithm.

-Gray

On Tue, Oct 21, 2008 at 4:21 AM, Jim Dunham <James.Dunham at sun.com> wrote:
> [...]

-- 
Gray Carper
MSIS Technical Services
University of Michigan Medical School
gcarper at umich.edu | skype: graycarper | 734.418.8506
http://www.umms.med.umich.edu/msis/
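For reference, the tunables discussed in this exchange are typically
applied on Solaris 10/OpenSolaris roughly as sketched below. The parameter
names are taken from the thread, but the exact syntax and defaults vary by
release, so verify against your system's documentation before applying:

```shell
# /etc/system entries (take effect at next reboot)
set ddi_msix_alloc_limit=8
set ip:ip_soft_rings_cnt=16

# TCP tunables via ndd (run as root; these do not survive a reboot,
# so they usually also go into a boot-time script)
ndd -set /dev/tcp tcp_naglim_def 1          # effectively disables Nagle
ndd -set /dev/tcp tcp_deferred_acks_max 16
ndd -set /dev/tcp tcp_local_dacks_max 16

# Current values can be inspected the same way, e.g.:
ndd -get /dev/tcp tcp_deferred_acks_max
```

This also answers Gary Mills's question upthread about checking current
usage: `ndd -get` reports the present value of each TCP/IP tunable.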
Robert Milkowski
2008-Oct-22 17:18 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hello Richard,

Wednesday, October 15, 2008, 6:39:49 PM, you wrote:

RE> Archie Cowan wrote:
>> I just stumbled upon this thread somehow and thought I'd share my zfs
>> over iscsi experience.
>>
>> We recently abandoned a similar configuration with several pairs of
>> x4500s exporting zvols as iscsi targets and mirroring them for "high
>> availability" with T5220s.

RE> In general, such tasks would be better served by T5220 (or the new T5440 :-)
RE> and J4500s. This would change the data paths from:
RE>   client --<net>-- T5220 --<net>-- X4500 --<SATA>-- disks
RE> to
RE>   client --<net>-- T5440 --<SAS>-- disks

RE> With the J4500 you get the same storage density as the X4500, but
RE> with SAS access (some would call this direct access). You will have
RE> much better bandwidth and lower latency between the T5440 (server)
RE> and disks while still having the ability to multi-head the disks. The
RE> J4500 is a relatively new system, so this option may not have been
RE> available at the time Archie was building his system.

Has MPxIO for the J4500 (SAS) been backported to S10 yet?

-- 
Best regards,
Robert Milkowski                        mailto:milek at task.gda.pl
                                        http://milek.blogspot.com
Robert Milkowski
2008-Oct-22 17:26 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Hello Bob,

Wednesday, October 15, 2008, 9:45:52 PM, you wrote:

BF> On Wed, 15 Oct 2008, Tomas Ögren wrote:
>>> ZFS does not support RAID0 (simple striping).
>>
>> zpool create mypool disk1 disk2 disk3
>>
>> Sure it does.

BF> This is load-share, not RAID0. Also, to answer the other fellow,
BF> since ZFS does not support RAID0, it also does not support RAID 1+0
BF> (10). :-)

BF> With RAID0 and 8 drives in a stripe, if you send a 128K block of data,
BF> it gets split up into eight chunks, with a chunk written to each
BF> drive. With ZFS's load share, that 128K block of data only gets

Well, it depends on your stripe width - that would generally be true only
if your stripe width were 16KB. If you set up a 128KB stripe width, you
would end up with one or two I/Os to one or two disk drives, depending on
whether your write was stripe-aligned or not. ZFS will make sure that
every fs block is stripe-aligned when doing a RAID-0-like configuration
(aka ZFS dynamic striping). However, that's not true for raidz{1|2}.

-- 
Best regards,
Robert Milkowski                        mailto:milek at task.gda.pl
                                        http://milek.blogspot.com
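The arithmetic behind Robert's point can be sketched with a small shell
helper (hypothetical, not from the thread) counting per-disk I/Os for a
single write on a classic RAID0 stripe:

```shell
# I/Os generated by one write on a classic RAID0 stripe.
# args: write_kb stripe_kb aligned(1|0)
ios_per_write() {
    ios=$(( ($1 + $2 - 1) / $2 ))        # ceiling division
    # an unaligned write straddles one extra stripe-unit boundary
    [ "$3" -eq 0 ] && ios=$((ios + 1))
    echo "$ios"
}

ios_per_write 128 16 1    # -> 8: all eight drives, as Bob described
ios_per_write 128 128 1   # -> 1: one drive with a 128KB stripe width
ios_per_write 128 128 0   # -> 2: two drives if the write is unaligned
```

ZFS's dynamic striping avoids the unaligned case entirely by keeping each
filesystem block on one top-level vdev, which is Robert's point above.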
Richard Elling
2008-Oct-22 19:44 UTC
[zfs-discuss] ZFS-over-iSCSI performance testing (with low random access results)...
Robert Milkowski wrote:
> [...]
> Has MPxIO for the J4500 (SAS) been backported to S10 yet?

It is not a J4500 feature, it will depend on the HBA and driver. mpt(7d)
has it in Solaris 10 5/08 (update 5) and patches are available for update
4. When in doubt, check the man page for your driver.
 -- richard