I'm getting my first exposure to a DDN storage array and it's enlightening :-}

Mostly it just does what I expected of it with little trouble, but I'm having a
hard time getting the read side working as fast as it ought to be able to go.
Does anybody have experience they'd like to share, tuning the kernel/driver or
the array?

I'm using a 2.6.15 kernel and QLogic 2462 HBAs with the 8.01.07 driver. Using
the anticipatory scheduler, and tweaking up the readahead size for the blockdev,
I can get around 300MB/s by using 4 threads on a port, or about 3/4 of the
expected max. Writes max out easily. The DDN's stats say that the large
majority of my reads are only 256K, even though the requests are larger than
that.

I tried incorporating the blkdev-max-io-size-selection and
increase-sglist-size patches from CFS, but that didn't really help; my reads
are still maxing out at 256K.

If anybody's been through this kind of thing and has experiences, rumors, or
war stories about what kinds of tuning in this area yield good results, I'd
love to talk to you!
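The readahead and scheduler settings mentioned above are per-blockdev knobs
that can be inspected and changed from the shell; a minimal sketch, assuming a
placeholder device /dev/sdb and purely illustrative values:

    # readahead, in 512-byte sectors (16384 sectors is about 8MB)
    blockdev --getra /dev/sdb
    blockdev --setra 16384 /dev/sdb

    # which elevator the queue is using, and switching it
    cat /sys/block/sdb/queue/scheduler
    echo deadline > /sys/block/sdb/queue/scheduler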
John R. Dunning wrote:
> I'm getting my first exposure to a DDN storage array and it's enlightening :-}
>
> Mostly it just does what I expected of it with little trouble, but I'm having
> a hard time getting the read side working as fast as it ought to be able to
> go. Does anybody have experience they'd like to share, tuning the
> kernel/driver or the array?
>
> I'm using a 2.6.15 kernel and QLogic 2462 HBAs with the 8.01.07 driver. Using
> the anticipatory scheduler, and tweaking up the readahead size for the
> blockdev, I can get around 300MB/s by using 4 threads on a port, or about 3/4
> of the expected max. Writes max out easily. The DDN's stats say that the
> large majority of my reads are only 256K, even though the requests are larger
> than that.
>
> I tried incorporating the blkdev-max-io-size-selection and
> increase-sglist-size patches from CFS, but that didn't really help; my reads
> are still maxing out at 256K.
>
> If anybody's been through this kind of thing and has experiences, rumors, or
> war stories about what kinds of tuning in this area yield good results, I'd
> love to talk to you!

I know this won't help you, but for posterity's sake: use IBGD 1.8.2 if you
have an InfiniBand DDN array. I tried for 4 days with OFED 1.1.1 to get decent
I/O performance and never could. 30 minutes to install IBGD and I was pushing
700MB/sec through the port using dd.
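The kind of dd run described here would look something like the following; the
device name, block size, and count are placeholders rather than the values
actually used, and iflag=direct requires a reasonably recent coreutils:

    # sequential read straight off the SRP block device, bypassing the page cache
    dd if=/dev/sdb of=/dev/null bs=1M count=4096 iflag=direct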
On Friday 18 May 2007 10:06:35 am Daniel Leaberry wrote:
> John R. Dunning wrote:
> > I'm getting my first exposure to a DDN storage array and it's
> > enlightening :-}
> >
> > Mostly it just does what I expected of it with little trouble, but I'm
> > having a hard time getting the read side working as fast as it ought to be
> > able to go. Does anybody have experience they'd like to share, tuning the
> > kernel/driver or the array?
> >
> > I'm using a 2.6.15 kernel and QLogic 2462 HBAs with the 8.01.07 driver.
> > Using the anticipatory scheduler, and tweaking up the readahead size for
> > the blockdev, I can get around 300MB/s by using 4 threads on a port, or
> > about 3/4 of the expected max. Writes max out easily. The DDN's stats say
> > that the large majority of my reads are only 256K, even though the
> > requests are larger than that.
> >
> > I tried incorporating the blkdev-max-io-size-selection and
> > increase-sglist-size patches from CFS, but that didn't really help; my
> > reads are still maxing out at 256K.
> >
> > If anybody's been through this kind of thing and has experiences, rumors,
> > or war stories about what kinds of tuning in this area yield good
> > results, I'd love to talk to you!
>
> I know this won't help you, but for posterity's sake: use IBGD 1.8.2 if
> you have an InfiniBand DDN array. I tried for 4 days with OFED 1.1.1 to
> get decent I/O performance and never could. 30 minutes to install IBGD
> and I was pushing 700MB/sec through the port using dd.

Do you have actual numbers for your OFED test? If so, please send a message to
the OpenFabrics General mailing list (general@lists.openfabrics.org) letting
them know of this performance degradation. The more details we have of a
slow-down in the SRP performance, the more chance we have of OFED finally
fixing whatever the problem is (or at least getting Mellanox to pony up what
the difference is between IBGD's and OFED's SRP client code and explain why
they haven't submitted changes).

-- 
Makia Minich <minich@ornl.gov>
National Center for Computation Science
Oak Ridge National Laboratory
Phone: 865.574.7460
--*--
Imagine no possessions
I wonder if you can
- John Lennon
In message <17997.38024.295869.482039@gs105.sicortex.com>, "John R. Dunning" writes:
> I tried incorporating the blkdev-max-io-size-selection and
> increase-sglist-size patches from cfs, but that didn't really help, my reads
> are still maxing out at 256K.

the srp initiator creates a virtual scsi device driver.  this virtual
device driver has a .max_sectors parameter associated with it.  you can
tune this with max_sect= during login for the openfabrics stack.
no idea how this is tuned on ibgold.

take a look at
/sys/block/sd<whatever>/queue/{max_hw_sectors_kb,max_sectors_kb}

if you aren't using direct i/o, use direct i/o.  you could just tune
the page size of the ddn to 256k.
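A sketch of what this looks like on the OFED stack; the SRP target identifiers
and device names below are placeholders, and max_sect support depends on the
ib_srp version in use:

    # what the block layer will currently build per request
    cat /sys/block/sdb/queue/max_hw_sectors_kb
    cat /sys/block/sdb/queue/max_sectors_kb

    # SRP login asking for larger transfers (2048 sectors = 1MB)
    echo "id_ext=...,ioc_guid=...,dgid=...,pkey=ffff,service_id=...,max_sect=2048" \
        > /sys/class/infiniband_srp/srp-mthca0-1/add_target

    # direct-I/O read, so the page cache and readahead are out of the picture
    dd if=/dev/sdb of=/dev/null bs=1M count=1024 iflag=direct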
How much luck did you have with this tuning and OFED's SRP? What performance
are you seeing? We had done quite a bit of testing playing with this option,
but saw very little improvement in performance (if I remember correctly, the
block sizes did increase, but performance was still down).

On Friday 18 May 2007 10:38:29 am chas williams - CONTRACTOR wrote:
> In message <17997.38024.295869.482039@gs105.sicortex.com>, "John R. Dunning" writes:
> > I tried incorporating the blkdev-max-io-size-selection and
> > increase-sglist-size patches from cfs, but that didn't really help, my
> > reads are still maxing out at 256K.
>
> the srp initiator creates a virtual scsi device driver.  this virtual
> device driver has a .max_sectors parameter associated with it.  you can
> tune this with max_sect= during login for the openfabrics stack.
> no idea how this is tuned on ibgold.
>
> take a look at
> /sys/block/sd<whatever>/queue/{max_hw_sectors_kb,max_sectors_kb}
>
> if you aren't using direct i/o, use direct i/o.  you could just tune
> the page size of the ddn to 256k.

-- 
Makia Minich <minich@ornl.gov>
National Center for Computation Science
Oak Ridge National Laboratory
Phone: 865.574.7460
--*--
Imagine no possessions
I wonder if you can
- John Lennon
Makia Minich wrote:
> How much luck did you have with this tuning and OFED's SRP? What performance
> are you seeing? We had done quite a bit of testing playing with this option,
> but saw very little improvement in performance (if I remember correctly, the
> block sizes did increase, but performance was still down).

That's what I saw as well. I eventually got great performance writing with
/dev/sg* devices by tuning srp_sg_tablesize (it defaults to 12, which sent
48KB I/Os to the array), but I could never get /dev/sd* devices to perform,
and reading was always stuck at 128KB I/Os no matter what I passed in to
srp_sg_tablesize.

> On Friday 18 May 2007 10:38:29 am chas williams - CONTRACTOR wrote:
> > In message <17997.38024.295869.482039@gs105.sicortex.com>, "John R. Dunning" writes:
> > > I tried incorporating the blkdev-max-io-size-selection and
> > > increase-sglist-size patches from cfs, but that didn't really help, my
> > > reads are still maxing out at 256K.
> >
> > the srp initiator creates a virtual scsi device driver.  this virtual
> > device driver has a .max_sectors parameter associated with it.  you can
> > tune this with max_sect= during login for the openfabrics stack.
> > no idea how this is tuned on ibgold.
> >
> > take a look at
> > /sys/block/sd<whatever>/queue/{max_hw_sectors_kb,max_sectors_kb}
> >
> > if you aren't using direct i/o, use direct i/o.  you could just tune
> > the page size of the ddn to 256k.
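For anyone trying to reproduce this, srp_sg_tablesize is a module parameter of
ib_srp; a minimal sketch of setting it, with the value chosen purely for
illustration:

    # load ib_srp with a larger scatter/gather table (the default was 12)
    modprobe ib_srp srp_sg_tablesize=255

    # or persistently, in /etc/modprobe.conf on kernels of that vintage:
    #   options ib_srp srp_sg_tablesize=255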
well... i suspect tuning the i/o sizes to be larger didn't make a big
difference on reads.  it helps to get the write to at least match the page
size of the ddn's memory cache (2MB as i recall, but this can be tuned to a
smaller value).  this will let most devices "write through" the memory cache
directly to disk.  as you get farther and farther away from your storage, you
need to increase the message size to offset bandwidth*delay.

after a bit of fiddling, we managed to get:

        Using Minimum Record Size 1024 KB
        Auto Mode 2. This option is obsolete. Use -az -i0 -i1
        O_DIRECT feature enabled
        Command line used: /data1/iozone.ia64 -f testfile -y 1024k -A -I
        Output is in Kbytes/sec
        Time Resolution = 0.000001 seconds.
        Processor cache size set to 1024 Kbytes.
        Processor cache line size set to 32 bytes.
        File stride size set to 17 * record size.

              KB  reclen    write  rewrite     read    reread
          524288    1024   270275   430822   354823    355126
          524288    2048   412733   545186   673329    679848
          524288    4096   533884   619260  1048551   1053483
          524288    8192   606596   665102  1201478   1192968
          524288   16384   662077   698136  1333838   1334341

this was a single host using 2 ddr adapters striped across 8 luns on the ddn.
each lun on the ddn was across 14 tiers.  obviously a 512MB test file fits
inside the ddn cache.  the ddn should be able to go faster, but my single
host couldn't push harder.

In message <200705181114.50119.minich@ornl.gov>, Makia Minich writes:
> How much luck did you have with this tuning and OFED's SRP? What performance
> are you seeing? We had done quite a bit of testing playing with this option,
> but saw very little improvement in performance (if I remember correctly, the
> block sizes did increase, but performance was still down).
>
> On Friday 18 May 2007 10:38:29 am chas williams - CONTRACTOR wrote:
> > In message <17997.38024.295869.482039@gs105.sicortex.com>, "John R. Dunning" writes:
> > > I tried incorporating the blkdev-max-io-size-selection and
> > > increase-sglist-size patches from cfs, but that didn't really help, my
> > > reads are still maxing out at 256K.
> >
> > the srp initiator creates a virtual scsi device driver.  this virtual
> > device driver has a .max_sectors parameter associated with it.  you can
> > tune this with max_sect= during login for the openfabrics stack.
> > no idea how this is tuned on ibgold.
> >
> > take a look at
> > /sys/block/sd<whatever>/queue/{max_hw_sectors_kb,max_sectors_kb}
> >
> > if you aren't using direct i/o, use direct i/o.  you could just tune
> > the page size of the ddn to 256k.
>
> --
> Makia Minich <minich@ornl.gov>
> National Center for Computation Science
> Oak Ridge National Laboratory
> Phone: 865.574.7460
> --*--
> Imagine no possessions
> I wonder if you can
> - John Lennon
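To make the "match the cache page size" point concrete, a direct-I/O write at
a 2MB block size would look roughly like this. The device name and count are
placeholders, and note that this overwrites whatever is on the device:

    # 2MB direct writes, sized to write through a 2MB cache segment
    dd if=/dev/zero of=/dev/sdb bs=2M count=512 oflag=direct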
On May 18, 2007  07:56 -0400, John R. Dunning wrote:
> I'm using a 2.6.15 kernel and QLogic 2462 HBAs with the 8.01.07 driver. Using
> the anticipatory scheduler, and tweaking up the readahead size for the
> blockdev, I

For a DDN you should probably use the noop or deadline scheduler.
Anticipatory is really tuned for desktop workloads.

> can get around 300MB/s by using 4 threads on a port, or about 3/4 of the
> expected max. Writes max out easily. The DDN's stats say that the large
> majority of my reads are only 256K, even though the requests are larger than
> that.

What tool are you using to measure performance?  I'd strongly suggest using
the lustre-iokit, which has several components in order to test the bare-disk,
local filesystem, network, and Lustre-filesystem layers independently.

Lustre can consistently generate 1MB IOs to the underlying filesystem because
it submits the IO in 1MB chunks, unlike the kernel's read() and write() calls,
which submit IO in 4kB chunks and hope the elevator can merge them.

See also the DDN tuning section in the Lustre manual.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
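For the archives: the bare-disk piece of the lustre-iokit is sgpdd-survey,
which is driven by environment variables. The exact variable names differ
between iokit versions, so treat this as an approximate sketch rather than a
recipe, and the sg device is a placeholder:

    # scan thread and region counts against one sg device
    size=8192 crghi=16 thrhi=64 scsidevs=/dev/sg0 ./sgpdd-survey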
From: Andreas Dilger <adilger@clusterfs.com>
Date: Fri, 18 May 2007 12:43:48 -0600

    On May 18, 2007  07:56 -0400, John R. Dunning wrote:
    > I'm using a 2.6.15 kernel and QLogic 2462 HBAs with the 8.01.07 driver.
    > Using the anticipatory scheduler, and tweaking up the readahead size for
    > the blockdev, I

    For a DDN you should probably use the noop or deadline scheduler.
    Anticipatory is really tuned for desktop workloads.

Yes, others have said the same thing. I've tried them both, but so far there's
not much difference. The evidence is that something in the block layer is
breaking up read requests, which seems to negate any effect I might be getting
from the iosched.

I found /proc/sys/vm/block_dump, added some extra instrumentation to it, and
turned it on. On the write side, I'm seeing nice big requests (though the
sizes are a bit all over the place), but on the read side it seems to be
willing to go up to 32 elements in the bio and never any higher. That
statement so far seems to be true regardless of what I use for readahead
values, what scheduler tuning params I give it, what kind of request size the
higher level thinks it's issuing, etc. It's behaving like there's something
with an arbitrary limit on the size of a read request, but I haven't yet
figured out what that is.

    > can get around 300MB/s by using 4 threads on a port, or about 3/4 of the
    > expected max. Writes max out easily. The DDN's stats say that the large
    > majority of my reads are only 256K, even though the requests are larger
    > than that.

    What tool are you using to measure performance?

Various. Mostly iozone and timing dd and stuff like that. I'm not (yet)
running Lustre against the DDN.

    I'd strongly suggest using the lustre-iokit, which has several components
    in order to test the bare-disk, local filesystem, network, and
    Lustre-filesystem layers independently.

Ok. I tried an older version of it last year, and it didn't seem to be telling
me anything I hadn't already found out by other means. EEB shipped me a newer
version, which I've unpacked and am currently trying to figure out how to
build. It seems to be set up such that I have to autoconf it, but trying to do
that causes errors. Hints?

    Lustre can consistently generate 1MB IOs to the underlying filesystem
    because it submits the IO in 1MB chunks, unlike the kernel's read() and
    write() calls, which submit IO in 4kB chunks and hope the elevator can
    merge them.

    See also the DDN tuning section in the Lustre manual.

Ok, will do. Thanks.
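For reference, the stock block_dump sysctl (without the extra instrumentation
mentioned above) logs one line per submitted request to the kernel log; the
device name below is a placeholder:

    echo 1 > /proc/sys/vm/block_dump    # noisy; remember to turn it back off
    dd if=/dev/sdb of=/dev/null bs=1M count=16 iflag=direct
    dmesg | tail    # lines roughly like: dd(1234): READ block 123456 on sdb
    echo 0 > /proc/sys/vm/block_dump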
From: "John R. Dunning" <jrd@sicortex.com> Date: Fri, 18 May 2007 15:10:45 -0400 It seems to be set up such that I have to autoconf it, but trying to do that causes errors. Hints? Ok, never mind, I realized that this version doesn''t contain ior, so it''s all shell scripts, no building required.