mark
2010-Jun-14 21:20 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
Hi Everyone,

I'm trying to diagnose some performance concerns we are having about our Lustre deployment. It seems to be a fairly multifaceted problem involving how ifort does buffered writes along with how we have Lustre set up.

What I've identified so far is that our RAID stripe size on the OSTs is 768KB (6 * 128KB chunks) and the partitions are not being mounted with -o stripe. We have 2 LUNs per controller, and each virtual disk has 2 partitions, with the 2nd one being the Lustre file system. It is possible the partitions are not aligned. Most of the client-side settings are at their defaults (i.e. 8 RPCs in flight, 32MB dirty cache per OST, etc.). The journals are on separate SSDs. Our OSSes are probably oversubscribed.

What we've noticed is that with certain apps we get *really* bad performance to the OSTs -- as bad as 500-800KB/s to one OST. The best performance I've seen to an OST is around 300MB/s, with 500MB/s being more or less the upper bound imposed by IB.

Right now I'm trying to verify that fragmentation is happening as I would expect given the configuration mentioned above. I just learned about brw_stats, so I tried examining it for one of our OSTs (it looks like Lustre must have been restarted recently, given how little data there is):

disk fragmented I/Os    ios    %  cum %  |   ios    %  cum %
1:                        0    0     0   |   215    9     9
2:                        0    0     0   |  2004   89    98
3:                        0    0     0   |    22    0    99
4:                        0    0     0   |     2    0    99
5:                        0    0     0   |     5    0    99
6:                        0    0     0   |     2    0    99
7:                        1  100   100   |     1    0   100

disk I/O size           ios    %  cum %  |   ios    %  cum %
4K:                       3   42    42   |    17    0     0
8K:                       0    0    42   |    17    0     0
16K:                      0    0    42   |    22    0     1
32K:                      0    0    42   |    73    1     2
64K:                      1   14    57   |   292    6     9
128K:                     0    0    57   |   385    8    18
256K:                     3   42   100   |    88    2    20
512K:                     0    0   100   |  1229   28    48
1M:                       0    0   100   |  2218   51   100

My questions are:

1) Does a disk fragmentation count of "1" mean that those I/Os were fragmented, or would that be "0"?

2) Does the disk I/O size mean what Lustre actually wrote or what it wanted to write? What does that number mean in the context of our 768KB stripe size, since it lists so many I/Os at 1M?

Thanks,
Mark
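P.S. For reference, I'm pulling these numbers roughly as follows (just a sketch: the OST name is only an example, and my understanding -- which I haven't verified -- is that writing to the file clears the counters):

# on the OSS, dump the per-OST bulk I/O histograms
lctl get_param obdfilter.*.brw_stats

# or read one OST directly (example target name)
cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats

# clear the counters before starting a new test (assumption: any write resets them)
echo 0 > /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats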
Kevin Van Maren
2010-Jun-15 20:19 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
Life is much easier with a 1MB (or 512KB) native RAID stripe size.

It looks like most IOs are being broken into 2 pieces. See
https://bugzilla.lustre.org/show_bug.cgi?id=22850
for a few tweaks that would help get IOs > 512KB to disk. See also Bug 9945.

But you are also seeing IOs "combined" into pieces that are between 1 and 2 RAID stripes, so set /sys/block/sd*/queue/max_sectors_kb to 768, so that the IO scheduler does not "help" too much.

There are mkfs options to tell ldiskfs your native RAID stripe size. You probably also want to change the client stripe size (lfs setstripe) to be an integral multiple of the RAID size (i.e., not the default 1MB).

Also note that those are power-of-2 buckets, so your 768KB chunks aren't going to be listed as "768".

Kevin
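To make that concrete, something like the following (a sketch only: the device name, MGS NID, and directory are placeholders, the exact mke2fs extended-option spellings should be checked against your e2fsprogs, and the mkfs step obviously only applies when (re)building an OST):

# keep the IO scheduler from merging requests past one 768KB RAID stripe
echo 768 > /sys/block/sdb/queue/max_sectors_kb

# tell ldiskfs the RAID geometry at format time:
# 128KB chunk / 4KB block = stride of 32; 6 data disks * 32 = stripe-width of 192
mkfs.lustre --ost --fsname=lustre --mgsnode=192.168.1.10@o2ib \
    --mkfsoptions="-E stride=32,stripe-width=192" /dev/sdb

# make the Lustre file stripe size an integral multiple of the RAID stripe
lfs setstripe -s 768k -c -1 /mnt/lustre/testdir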
Andreas Dilger
2010-Jun-15 21:08 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
Also, setting the max RPC size on the client to 768kB would avoid the need for each RPC to generate 2 IO requests.

It is possible with newer tune2fs to set the RAID stripe size, and the allocator (mballoc) will use that size. There is a bug open to transfer this "optimal" size to the client, but it hasn't gotten much attention since most sites are set up with a 1MB stripe size.

Cheers, Andreas
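For an already-formatted OST, a sketch of the tune2fs route (the device is a placeholder, the extended-option names should be double-checked against your tune2fs man page, and with 128KB chunks on 6 data disks the values work out to 32 and 192 4KB blocks):

# set the RAID geometry hints in the ldiskfs superblock
tune2fs -E stride=32,stripe_width=192 /dev/sdb2

# verify what the superblock now reports
dumpe2fs -h /dev/sdb2 | grep -i raid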
Brian J. Murrell
2010-Jun-15 21:51 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
On Tue, 2010-06-15 at 15:08 -0600, Andreas Dilger wrote:
> Also setting the max RPC size on the client to be 768kB would avoid
> the need for each RPC to generate 2 IO requests.

I wondered about this too, but I recall doing an on-site at a client one time where this was necessary. They had done a 7+1 RAID5 before we got involved (or we'd have counselled them otherwise) and therefore did not get 1MB RAID stripes. Setting the client max RPC size didn't really have the effect that we thought it would/should.

That said, that was quite a while ago (the 1.6.x timeframe), and it could very well have been a bug that has since been fixed.

b.
Mark Nelson
2010-Jun-15 21:55 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
Hi Kevin and Andreas,

Thank you both for the excellent information! At this point I doubt I'll be able to configure the raid arrays for a 1MB stripe size (as much as I would like to). How can I change the max RPC size to 768KB on the client?

So far on my list:

- work with tune2fs to set good stripe parameters for the FS.
- mount with -o stripe=N (is this needed if tune2fs sets the defaults?)
- examine the alignment of the Lustre partitions.
- set /sys/block/sd*/queue/max_sectors_kb to 768.
- set the client stripe size to 768KB.
- change the max RPC size on the clients to 768KB (not sure how yet).
- upgrade to 1.8.4 to get the benefit of the patches mentioned in bug #22850.

I am also considering:

- increasing RPCs in flight.
- increasing the dirty cache size.
- disabling lnet debugging.
- changing the OST service thread count.
- checking out the MDS configuration and RAID.

Anything I'm missing?

Thanks,
Mark

-- 
Mark Nelson, Lead Software Developer
Minnesota Supercomputing Institute
Phone: (612)626-4479
Email: mark at msi.umn.edu
Kevin Van Maren
2010-Jun-15 21:58 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
Mark Nelson wrote:
> Thank you both for the excellent information! At this point I doubt
> I'll be able to configure the raid arrays for a 1MB stripe size (as
> much as I would like to). How can I change the max RPC size to 768KB
> on the client?

Set max_pages_per_rpc (drop it from 256 to 192), the same way you would set max_rpcs_in_flight, with something like:

# lctl conf_param lustre.osc.max_pages_per_rpc=192

BTW, a blatant plug, but see:
http://www.oracle.com/us/support/systems/advanced-customer-services/readiness-service-lustre-ds-077261.pdf
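As for the other client-side items on your list, those are also lctl tunables. Roughly (the values below are only examples to show the parameter names, not recommendations, and they are set on the clients):

# more RPCs in flight and more dirty cache per OSC
lctl set_param osc.*.max_rpcs_in_flight=32
lctl set_param osc.*.max_dirty_mb=64

# disable most Lustre/LNET debug logging
lctl set_param debug=0

# afterwards, the "pages per rpc" histogram shows whether RPCs are really 192 pages
lctl get_param osc.*.rpc_stats

Note that set_param changes do not survive a remount; use conf_param (as above) for anything you want to make permanent.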
Bernd Schubert
2010-Jun-16 21:40 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
On Tuesday 15 June 2010, Kevin Van Maren wrote:
> It looks like most IOs are being broken into 2 pieces. See
> https://bugzilla.lustre.org/show_bug.cgi?id=22850
> for a few tweaks that would help get IOs > 512KB to disk. See also Bug 9945

I played with a similar patch (the blkdev defines) some time ago, but didn't notice any performance improvement on the DDN S2A 9900. Before increasing those values I got IOs of up to 7MB; after doubling MAX_HW_SEGMENTS and MAX_PHYS_SEGMENTS the max IO size doubled to 14MB. Unfortunately, more IOs in between the "magic" good IO sizes also came up (magic good here being 1, 2, 3, ..., 14MB), e.g. lots of 1008KB or 2032KB IOs. Example numbers from a production system (counts are in hex):

Length    Port 1           Port 2           Port 3           Port 4
Kbytes    Reads   Writes   Reads   Writes   Reads   Writes   Reads   Writes
  960     1DCD    2EEB     1E44    3532     1431    1D7E     14FB    2284
  976     1ACD    34AC     1A0F    48EB     12E2    24AE     11E1    257F
  992     1D46    3787     1CA7    51EB     144C    2E9B     1354    3A62
 1008     100A5   11B5C    10391   13765    A9B8    FBED     9E9A    D457
 1024     BFD41D  111F3C4  BFBE47  11A110D  8C316B  C95178   8E5A9F  C83850
 1040     583     625      538     6C3      3F3     513      413     337
 ...
 2032     551     1260     50D     136B     3E4     1218     3C8     BA1
 2048     41B85   FDB21    3B8D1   101857   31088   B78E0    2C4A5   92F48
 2064     FB      20       108     24       BE      19       C7      10
 2080     E3      2F       E6      37       AA      44       C7      1B
 ...
 7152     55      6C7      58      80C      60      70D      3F      3B4
 7168     449F    E335     417C    E743     3332    AB34     3686    A568
 7184     29      1        14      2        19      1        14      0

I don't think it matters to any storage system whether the max IO is 7MB or 14MB, but those sizes in between are rather annoying, and from the output of brw_stats I *sometimes* have no idea how they can happen. On the particular system I took the numbers from, users mostly don't do streaming writes, so there the reason is clear.

After tuning the FZJ system (Kevin should know that system), the SLES11 kernel with chained scatter-gather lists (so the blkdev patch is mostly not required anymore) can do IO sizes of up to 12MB. Unfortunately, quite a few 1008s still show up out of the blue, without an obvious reason (during my streaming writes with obdecho).

Cheers,
Bernd

-- 
Bernd Schubert
DataDirect Networks
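A quick way to see what the block layer will currently accept, before and after this kind of tuning (a sketch; it assumes the OSTs sit on plain sd* devices):

# current and hardware-limit maximum request sizes, per disk
for q in /sys/block/sd*/queue; do
    echo "$q: $(cat $q/max_sectors_kb) / $(cat $q/max_hw_sectors_kb) KB"
done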