Hiya,

I've had another go at fixing the problem I was seeing a few months ago:
http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010315.html
and which we are seeing again now as we are setting up a new machine
with 128k chunk software raid (md) RAID6 8+2, e.g.

  Lustre: test-OST000d: underlying device md5 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2560

I came up with the attached simple core kernel change, which fixes the
problem and seems stable enough under initial stress testing, but a core
scsi tweak seems a little drastic to me - is there a better way to do it?

Without this patch, and despite raising all disks to a ridiculously huge
max_sectors_kb, all Lustre 1M rpc's are still fragmented into two 512k
chunks before being sent to md :-/  Likely md then aggregates them again,
'cos performance isn't totally dismal, which it would be if it was 100%
read-modify-writes for each stripe write.

With the patch, 1M i/o's are being fed to md (according to brw_stats),
and performance is a little better for RAID6 8+2 with 128k chunks, and a
bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed half
512k and half 1M i/o's by Lustre).

The one-liner is a core kernel change, so perhaps some Lustre/kernel
block device/md people can look at it and see if it's acceptable for
inclusion in standard Lustre OSS kernels, or whether it breaks
assumptions in the core scsi layer somehow.

IMHO the best solution would be to apply the patch, and then have a
/sys/block/md*/queue/ for md devices so that max_sectors_kb and
max_hw_sectors_kb can be tuned without recompiling the kernel...
is that possible?

The patch is against 2.6.18-128.1.14.el5-lustre1.8.1.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

--- linux-2.6.18.x86_64.lustre/include/linux/blkdev.h	2009-08-18 17:40:51.000000000 +1000
+++ linux-2.6.18.x86_64.lustre.hackBlock/include/linux/blkdev.h	2009-08-21 13:47:55.000000000 +1000
@@ -778,7 +778,7 @@
 #define MAX_PHYS_SEGMENTS 128
 #define MAX_HW_SEGMENTS 128
 #define SAFE_MAX_SECTORS 255
-#define BLK_DEF_MAX_SECTORS 1024
+#define BLK_DEF_MAX_SECTORS 2048

 #define MAX_SEGMENT_SIZE 65536
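[For concreteness, "raising all disks to a ridiculously huge max_sectors_kb"
amounts to something like the sketch below; the device names and the 1024
value are placeholders rather than anything from this thread.]

  # Raise the per-request size limit on each RAID6 member disk (sdb..sdk
  # are placeholders for the 8+2 members).  max_sectors_kb cannot exceed
  # the hardware limit, so check max_hw_sectors_kb first.
  for d in sdb sdc sdd sde sdf sdg sdh sdi sdj sdk; do
      cat  /sys/block/$d/queue/max_sectors_kb /sys/block/$d/queue/max_hw_sectors_kb
      echo 1024 > /sys/block/$d/queue/max_sectors_kb
  done

  # There is no /sys/block/md*/queue/max_sectors_kb on this kernel, so the
  # md device itself stays at the BLK_DEF_MAX_SECTORS default no matter how
  # the member disks are tuned - hence the patch above.
  ls /sys/block/md5/queue/ 2>/dev/null || echo "no queue tunables for md5"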
On Aug 26, 2009  00:46 -0400, Robin Humble wrote:
> I've had another go at fixing the problem I was seeing a few months ago:
> http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010315.html
> and which we are seeing again now as we are setting up a new machine
> with 128k chunk software raid (md) RAID6 8+2, e.g.
> Lustre: test-OST000d: underlying device md5 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2560
>
> without this patch, and despite raising all disks to a ridiculously
> huge max_sectors_kb, all Lustre 1M rpc's are still fragmented into two
> 512k chunks before being sent to md :-/  likely md then aggregates them
> again 'cos performance isn't totally dismal, which it would be if it was
> 100% read-modify-writes for each stripe write.

Yes, we've seen this same issue, but haven't been able to tweak the
/sys tunables correctly to get MD RAID to agree.  I wonder if the
problem is that the /sys/block/*/queue/max_* tunables are being set
too late in the MD startup, and it has picked up the 1024 sectors
value too early, and never updates it afterward?

> with the patch, 1M i/o's are being fed to md (according to brw_stats),
> and performance is a little better for RAID6 8+2 with 128k chunks, and
> a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed
> half 512k and half 1M i/o's by Lustre).

This was the other question I'd asked internally.  If the array is
formatted with 64kB chunks then 512k IOs shouldn't cause any read-modify-
write operations and (in theory) give the same performance as 1M IOs on
a 128kB chunksize array.  What is the relative performance of the
64kB and 128kB configurations?

> the one-liner is a core kernel change, so perhaps some Lustre/kernel
> block device/md people can look at it and see if it's acceptable for
> inclusion in standard Lustre OSS kernels, or whether it breaks
> assumptions in the core scsi layer somehow.
>
> IMHO the best solution would be to apply the patch, and then have a
> /sys/block/md*/queue/ for md devices so that max_sectors_kb and
> max_hw_sectors_kb can be tuned without recompiling the kernel...
> is that possible?
>
> the patch is against 2.6.18-128.1.14.el5-lustre1.8.1

> --- linux-2.6.18.x86_64.lustre/include/linux/blkdev.h	2009-08-18 17:40:51.000000000 +1000
> +++ linux-2.6.18.x86_64.lustre.hackBlock/include/linux/blkdev.h	2009-08-21 13:47:55.000000000 +1000
> @@ -778,7 +778,7 @@
>  #define MAX_PHYS_SEGMENTS 128
>  #define MAX_HW_SEGMENTS 128
>  #define SAFE_MAX_SECTORS 255
> -#define BLK_DEF_MAX_SECTORS 1024
> +#define BLK_DEF_MAX_SECTORS 2048
>
>  #define MAX_SEGMENT_SIZE 65536

This patch definitely looks reasonable, and since we already patch
the server kernel it doesn't appear to be a huge problem to include
it.  Can you please create a bug and attach the patch there.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
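[In case it helps anyone test the "picked up the value too early" theory, one
way is to raise the member-disk limits and then stop and reassemble the array
before remounting the OST.  A rough sketch with made-up device names and mount
point; the OST must be offline while doing it.]

  # Take the OST offline and stop the array first.
  umount /mnt/ost0                 # placeholder OST mount point
  mdadm --stop /dev/md5

  # Raise the member-disk limits *before* the array exists again.
  for d in sdb sdc sdd sde sdf sdg sdh sdi sdj sdk; do
      echo 1024 > /sys/block/$d/queue/max_sectors_kb
  done

  # Reassemble and remount, then watch for the "should be tuned for
  # larger I/O requests" console message from Lustre.
  mdadm --assemble /dev/md5 /dev/sd[b-k]1   # placeholder member devices
  mount -t lustre /dev/md5 /mnt/ost0
  dmesg | tail -20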
Andreas Dilger wrote:
> On Aug 26, 2009  00:46 -0400, Robin Humble wrote:
>> I've had another go at fixing the problem I was seeing a few months ago:
>> http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010315.html
>> and which we are seeing again now as we are setting up a new machine
>> with 128k chunk software raid (md) RAID6 8+2, e.g.
>> Lustre: test-OST000d: underlying device md5 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2560
>>
>> without this patch, and despite raising all disks to a ridiculously
>> huge max_sectors_kb, all Lustre 1M rpc's are still fragmented into two
>> 512k chunks before being sent to md :-/  likely md then aggregates them
>> again 'cos performance isn't totally dismal, which it would be if it was
>> 100% read-modify-writes for each stripe write.

They are "RCW"s (reconstruct-writes) rather than "RMW"s (read-modify-writes).

> Yes, we've seen this same issue, but haven't been able to tweak the
> /sys tunables correctly to get MD RAID to agree.  I wonder if the
> problem is that the /sys/block/*/queue/max_* tunables are being set
> too late in the MD startup, and it has picked up the 1024 sectors
> value too early, and never updates it afterward?

There is no tunable for the "md" device (but there needs to be!), and so
it appears that the (kernel) default value is used.  This is true even if
the "sd" devices are set large before "md" is loaded.

>> with the patch, 1M i/o's are being fed to md (according to brw_stats),
>> and performance is a little better for RAID6 8+2 with 128k chunks, and
>> a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed
>> half 512k and half 1M i/o's by Lustre).
>
> This was the other question I'd asked internally.  If the array is
> formatted with 64kB chunks then 512k IOs shouldn't cause any read-modify-
> write operations and (in theory) give the same performance as 1M IOs on
> a 128kB chunksize array.  What is the relative performance of the
> 64kB and 128kB configurations?
>
>> the one-liner is a core kernel change, so perhaps some Lustre/kernel
>> block device/md people can look at it and see if it's acceptable for
>> inclusion in standard Lustre OSS kernels, or whether it breaks
>> assumptions in the core scsi layer somehow.
>>
>> IMHO the best solution would be to apply the patch, and then have a
>> /sys/block/md*/queue/ for md devices so that max_sectors_kb and
>> max_hw_sectors_kb can be tuned without recompiling the kernel...
>> is that possible?
>>
>> the patch is against 2.6.18-128.1.14.el5-lustre1.8.1
>
>> --- linux-2.6.18.x86_64.lustre/include/linux/blkdev.h	2009-08-18 17:40:51.000000000 +1000
>> +++ linux-2.6.18.x86_64.lustre.hackBlock/include/linux/blkdev.h	2009-08-21 13:47:55.000000000 +1000
>> @@ -778,7 +778,7 @@
>>  #define MAX_PHYS_SEGMENTS 128
>>  #define MAX_HW_SEGMENTS 128
>>  #define SAFE_MAX_SECTORS 255
>> -#define BLK_DEF_MAX_SECTORS 1024
>> +#define BLK_DEF_MAX_SECTORS 2048
>>
>>  #define MAX_SEGMENT_SIZE 65536
>
> This patch definitely looks reasonable, and since we already patch
> the server kernel it doesn't appear to be a huge problem to include
> it.  Can you please create a bug and attach the patch there.
>
> Cheers, Andreas

Already done, Bug 20533.  Has a few other notes.

Kevin
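[As a quick sanity check on the numbers in the console message quoted above
(plain arithmetic, not from the thread): the limits are counted in 512-byte
sectors, which is why the stock default splits Lustre's 1M i/o's in half and
why 2048 is enough.]

  echo $(( 1024 * 512 / 1024 ))   # BLK_DEF_MAX_SECTORS=1024 ->  512 kB max per request
  echo $(( 2048 * 512 / 1024 ))   # patched value 2048       -> 1024 kB, one full 1M RPC
  echo $(( 2560 * 512 / 1024 ))   # max_hw_sectors=2560      -> 1280 kB upper limit reported for md5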
On Wed, Aug 26, 2009 at 04:11:12AM -0600, Andreas Dilger wrote:
>On Aug 26, 2009  00:46 -0400, Robin Humble wrote:
>> with the patch, 1M i/o's are being fed to md (according to brw_stats),
>> and performance is a little better for RAID6 8+2 with 128k chunks, and
>> a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed
>> half 512k and half 1M i/o's by Lustre).
>This was the other question I'd asked internally.  If the array is
>formatted with 64kB chunks then 512k IOs shouldn't cause any read-modify-
>write operations and (in theory) give the same performance as 1M IOs on
>a 128kB chunksize array.  What is the relative performance of the
>64kB and 128kB configurations?

on these 1TB SATA RAID6 8+2's and external journals, with 1 client
writing to 1 OST, with 2.6.18-128.1.14.el5 + lustre1.8.1 + blkdev/md
patches from https://bugzilla.lustre.org/show_bug.cgi?id=20533 so that
128k chunk md gets 1M i/o's and 64k chunk md gets 512k i/o's then ->

  client max_rpcs_in_flight 8
  md chunk   write (MB/s)   read (MB/s)
    64k          185            345
   128k          235            390

so 128k chunks are 10-30% quicker than 64k in this particular setup on
big streaming i/o tests (1G of 1M lmdd's).

having said that, 1.6.7.2 servers do better than 1.8.1 on some configs
(I haven't had time to figure out why), but the trend of 128k chunks
being faster than 64k chunks remains.  also if the i/o load was messier
and involved smaller i/o's then 64k chunks might claw something back -
probably not enough though.

BTW, whilst we're on the topic - what does this part of brw_stats mean?

                             read        |       write
  disk fragmented I/Os   ios   %  cum %  |   ios     %  cum %
  1:                    5742 100   100   | 103186  100   100

this is for the 128k chunk case, where the rest of brw_stats says I'm
seeing 1M rpc's and 1M i/o's, but I'm not sure what '1' disk fragmented
i/o's means - should it be 0?  or does '1' mean unfragmented?

sorry for packing too many questions into one email, but these slowish
SATA disks seem to need a lot of rpc's in flight for good performance.
32 max_dirty_mb (the default) and 32 max_rpcs_in_flight seems a good
magic combo.  with that I get:

  client max_rpcs_in_flight 32
  md chunk   write (MB/s)   read (MB/s)
    64k          275            450
   128k          395            480

which is a lot faster...
with a heavier load of 20 clients hammering 4 OSS's, each with 4 R6 8+2
OSTs, I still see about a 10% advantage for clients with 32 rpcs.

is there a down side to running clients with max_rpcs_in_flight 32?
the initial production machine will be ~1500 clients and ~25 OSS's.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
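[For reference, the client-side tuning described above is just the standard
OSC parameters; roughly as below - the wildcards match every OSC on the
client, so narrow them if you only want to touch one filesystem.]

  # allow more concurrent 1MB RPCs per OSC (the default is 8)
  lctl set_param osc.*.max_rpcs_in_flight=32

  # max_dirty_mb already defaults to 32; set explicitly only for clarity
  lctl set_param osc.*.max_dirty_mb=32

  # confirm what the client is actually using
  lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb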
On Aug 28, 2009  01:28 -0400, Robin Humble wrote:
> on these 1TB SATA RAID6 8+2's and external journals, with 1 client
> writing to 1 OST, with 2.6.18-128.1.14.el5 + lustre1.8.1 + blkdev/md
> patches from https://bugzilla.lustre.org/show_bug.cgi?id=20533 so that
> 128k chunk md gets 1M i/o's and 64k chunk md gets 512k i/o's then ->
>
>   client max_rpcs_in_flight 8
>   md chunk   write (MB/s)   read (MB/s)
>     64k          185            345
>    128k          235            390
>
> so 128k chunks are 10-30% quicker than 64k in this particular setup on
> big streaming i/o tests (1G of 1M lmdd's).

Hmm, that is too bad.  I would have hoped that there was minimal
difference between the smaller and the larger chunk size, given that
they are still doing 1MB writes to disk and the data + parity amount
is the same.

It would be interesting to see what the performance is if you change
the RPC size to simulate clients doing smaller IOs:

  lctl set_param osc.*.max_pages_per_rpc={128,64}

Depending on how well-behaved your applications are this could make a
noticeable difference in "real world" application performance.

You could also check the brw_stats "pages per bulk r/w" on an existing
filesystem that has been running for a while in order to see the actual
IO size.  Granted, without your patch it will be maxed out at 128 pages,
but if there is a significant fraction of IO below that you may still
be better off with the smaller chunk size.

> BTW, whilst we're on the topic - what does this part of brw_stats mean?
>
>                              read        |       write
>   disk fragmented I/Os   ios   %  cum %  |   ios     %  cum %
>   1:                    5742 100   100   | 103186  100   100
>
> this is for the 128k chunk case, where the rest of brw_stats says I'm
> seeing 1M rpc's and 1M i/o's, but I'm not sure what '1' disk fragmented
> i/o's means - should it be 0?  or does '1' mean unfragmented?

That means the read/write request was submitted to disk in a single
fragment, which is ideal.  On my system there are a small number of read
requests that have "0" fragments.  These are for reads of a hole, or at
EOF, that return no data at all.

> sorry for packing too many questions into one email, but these slowish
> SATA disks seem to need a lot of rpc's in flight for good performance.
> 32 max_dirty_mb (the default) and 32 max_rpcs_in_flight seems a good
> magic combo.  with that I get:
>
>   client max_rpcs_in_flight 32
>   md chunk   write (MB/s)   read (MB/s)
>     64k          275            450
>    128k          395            480
>
> which is a lot faster...
> with a heavier load of 20 clients hammering 4 OSS's, each with 4 R6 8+2
> OSTs, I still see about a 10% advantage for clients with 32 rpcs.

Interesting.  We haven't tuned this recently except in the case of WAN,
but I guess the bandwidth of disks and networks is increasing enough
that it just needs more 1MB RPCs to keep the pipe full.

We've also tested 4MB RPCs (bug 16900 has a patch), but this gave us
mixed performance in our environment.  You could give this a try if you
are interested and report the results here.

> is there a down side to running clients with max_rpcs_in_flight 32?
> the initial production machine will be ~1500 clients and ~25 OSS's.

For 1500 clients it shouldn't be an issue, though it can make for longer
latency for some operations.  In the past we were also limited by the
number of request buffers on the server, but that is dynamic these days
and flow-controlled, and we've tested with up to 26000 clients on a
single filesystem (192 OSSes).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
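[For anyone wanting to follow the suggestion above, reading (and clearing)
brw_stats and shrinking the RPC size look roughly like this; the
reset-by-writing behaviour is from memory for 1.8, so treat it as an
assumption and double-check on your version.]

  # on the OSS: dump the per-OST histograms, including "pages per bulk r/w"
  # and "disk I/O size"
  lctl get_param obdfilter.*.brw_stats

  # clear the counters before a fresh test run (writing to the file resets
  # it - assumed behaviour, verify on your release)
  lctl set_param obdfilter.*.brw_stats=0

  # on a client: simulate applications doing smaller i/o's
  # (with 4kB pages, 128 pages = 512kB RPCs, 64 pages = 256kB RPCs)
  lctl set_param osc.*.max_pages_per_rpc=128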