Hiya,

I've had another go at fixing the problem I was seeing a few months ago:
http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010315.html
and which we are seeing again now as we are setting up a new machine
with 128k chunk software raid (md) RAID6 8+2, e.g.

  Lustre: test-OST000d: underlying device md5 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2560

I came up with the attached simple core kernel change, which fixes the
problem and seems stable enough under initial stress testing, but a core
scsi tweak seems a little drastic to me - is there a better way to do it?

Without this patch, and despite raising all disks to a ridiculously huge
max_sectors_kb, all Lustre 1M rpc's are still fragmented into two 512k
chunks before being sent to md :-/  Likely md then aggregates them again,
'cos performance isn't totally dismal, which it would be if it was 100%
read-modify-writes for each stripe write.

With the patch, 1M i/o's are being fed to md (according to brw_stats),
and performance is a little better for RAID6 8+2 with 128k chunks, and a
bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed half
512k and half 1M i/o's by Lustre).

The one-liner is a core kernel change, so perhaps some Lustre/kernel
block device/md people can look at it and see if it's acceptable for
inclusion in standard Lustre OSS kernels, or whether it breaks
assumptions in the core scsi layer somehow.

IMHO the best solution would be to apply the patch, and then have a
/sys/block/md*/queue/ for md devices so that max_sectors_kb and
max_hw_sectors_kb can be tuned without recompiling the kernel...
is that possible?

The patch is against 2.6.18-128.1.14.el5-lustre1.8.1.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility

--- linux-2.6.18.x86_64.lustre/include/linux/blkdev.h	2009-08-18 17:40:51.000000000 +1000
+++ linux-2.6.18.x86_64.lustre.hackBlock/include/linux/blkdev.h	2009-08-21 13:47:55.000000000 +1000
@@ -778,7 +778,7 @@
 #define MAX_PHYS_SEGMENTS 128
 #define MAX_HW_SEGMENTS 128
 #define SAFE_MAX_SECTORS 255
-#define BLK_DEF_MAX_SECTORS 1024
+#define BLK_DEF_MAX_SECTORS 2048

 #define MAX_SEGMENT_SIZE 65536
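[For concreteness, "raising all disks to a ridiculously huge max_sectors_kb"
amounts to something like the sketch below; the device names and the 1024
value are placeholders rather than anything from this thread.]

  # Raise the per-request size limit on each RAID6 member disk (sdb..sdk
  # are placeholders for the 8+2 members).  max_sectors_kb cannot exceed
  # the hardware limit, so check max_hw_sectors_kb first.
  for d in sdb sdc sdd sde sdf sdg sdh sdi sdj sdk; do
      cat  /sys/block/$d/queue/max_sectors_kb /sys/block/$d/queue/max_hw_sectors_kb
      echo 1024 > /sys/block/$d/queue/max_sectors_kb
  done

  # There is no /sys/block/md*/queue/max_sectors_kb on this kernel, so the
  # md device itself stays at the BLK_DEF_MAX_SECTORS default no matter how
  # the member disks are tuned - hence the patch above.
  ls /sys/block/md5/queue/ 2>/dev/null || echo "no queue tunables for md5"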
On Aug 26, 2009  00:46 -0400, Robin Humble wrote:
> I've had another go at fixing the problem I was seeing a few months ago:
> http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010315.html
> and which we are seeing again now as we are setting up a new machine
> with 128k chunk software raid (md) RAID6 8+2, e.g.
> Lustre: test-OST000d: underlying device md5 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2560
>
> without this patch, and despite raising all disks to a ridiculously
> huge max_sectors_kb, all Lustre 1M rpc's are still fragmented into two
> 512k chunks before being sent to md :-/  likely md then aggregates them
> again 'cos performance isn't totally dismal, which it would be if it was
> 100% read-modify-writes for each stripe write.

Yes, we've seen this same issue, but haven't been able to tweak the
/sys tunables correctly to get MD RAID to agree.  I wonder if the
problem is that the /sys/block/*/queue/max_* tunables are being set
too late in the MD startup, and it has picked up the 1024 sectors
value too early, and never updates it afterward?

> with the patch, 1M i/o's are being fed to md (according to brw_stats),
> and performance is a little better for RAID6 8+2 with 128k chunks, and
> a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed
> half 512k and half 1M i/o's by Lustre).

This was the other question I'd asked internally.  If the array is
formatted with 64kB chunks then 512k IOs shouldn't cause any read-modify-
write operations and (in theory) give the same performance as 1M IOs on
a 128kB chunksize array.  What is the relative performance of the
64kB and 128kB configurations?

> the one-liner is a core kernel change, so perhaps some Lustre/kernel
> block device/md people can look at it and see if it's acceptable for
> inclusion in standard Lustre OSS kernels, or whether it breaks
> assumptions in the core scsi layer somehow.
>
> IMHO the best solution would be to apply the patch, and then have a
> /sys/block/md*/queue/ for md devices so that max_sectors_kb and
> max_hw_sectors_kb can be tuned without recompiling the kernel...
> is that possible?
>
> the patch is against 2.6.18-128.1.14.el5-lustre1.8.1

> --- linux-2.6.18.x86_64.lustre/include/linux/blkdev.h	2009-08-18 17:40:51.000000000 +1000
> +++ linux-2.6.18.x86_64.lustre.hackBlock/include/linux/blkdev.h	2009-08-21 13:47:55.000000000 +1000
> @@ -778,7 +778,7 @@
>  #define MAX_PHYS_SEGMENTS 128
>  #define MAX_HW_SEGMENTS 128
>  #define SAFE_MAX_SECTORS 255
> -#define BLK_DEF_MAX_SECTORS 1024
> +#define BLK_DEF_MAX_SECTORS 2048
>
>  #define MAX_SEGMENT_SIZE 65536

This patch definitely looks reasonable, and since we already patch
the server kernel it doesn't appear to be a huge problem to include
it.  Can you please create a bug and attach the patch there.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
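[In case it helps anyone test the "picked up the value too early" theory, one
way is to raise the member-disk limits and then stop and reassemble the array
before remounting the OST.  A rough sketch with made-up device names and mount
point; the OST must be offline while doing it.]

  # Take the OST offline and stop the array first.
  umount /mnt/ost0                 # placeholder OST mount point
  mdadm --stop /dev/md5

  # Raise the member-disk limits *before* the array exists again.
  for d in sdb sdc sdd sde sdf sdg sdh sdi sdj sdk; do
      echo 1024 > /sys/block/$d/queue/max_sectors_kb
  done

  # Reassemble and remount, then watch for the "should be tuned for
  # larger I/O requests" console message from Lustre.
  mdadm --assemble /dev/md5 /dev/sd[b-k]1   # placeholder member devices
  mount -t lustre /dev/md5 /mnt/ost0
  dmesg | tail -20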
Andreas Dilger wrote:
> On Aug 26, 2009  00:46 -0400, Robin Humble wrote:
>> I've had another go at fixing the problem I was seeing a few months ago:
>> http://lists.lustre.org/pipermail/lustre-discuss/2009-April/010315.html
>> and which we are seeing again now as we are setting up a new machine
>> with 128k chunk software raid (md) RAID6 8+2, e.g.
>> Lustre: test-OST000d: underlying device md5 should be tuned for larger I/O requests: max_sectors = 1024 could be up to max_hw_sectors=2560
>>
>> without this patch, and despite raising all disks to a ridiculously
>> huge max_sectors_kb, all Lustre 1M rpc's are still fragmented into two
>> 512k chunks before being sent to md :-/  likely md then aggregates them
>> again 'cos performance isn't totally dismal, which it would be if it was
>> 100% read-modify-writes for each stripe write.

They are "RCW"s (reconstruct-writes) rather than "RMW"s (read-modify-writes).

> Yes, we've seen this same issue, but haven't been able to tweak the
> /sys tunables correctly to get MD RAID to agree.  I wonder if the
> problem is that the /sys/block/*/queue/max_* tunables are being set
> too late in the MD startup, and it has picked up the 1024 sectors
> value too early, and never updates it afterward?

There is no tunable for the "md" device (but there needs to be!), and so
it appears that the (kernel) default value is used.  This is true even if
the "sd" devices are set large before "md" is loaded.

>> with the patch, 1M i/o's are being fed to md (according to brw_stats),
>> and performance is a little better for RAID6 8+2 with 128k chunks, and
>> a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed
>> half 512k and half 1M i/o's by Lustre).
>
> This was the other question I'd asked internally.  If the array is
> formatted with 64kB chunks then 512k IOs shouldn't cause any read-modify-
> write operations and (in theory) give the same performance as 1M IOs on
> a 128kB chunksize array.  What is the relative performance of the
> 64kB and 128kB configurations?
>
>> the one-liner is a core kernel change, so perhaps some Lustre/kernel
>> block device/md people can look at it and see if it's acceptable for
>> inclusion in standard Lustre OSS kernels, or whether it breaks
>> assumptions in the core scsi layer somehow.
>>
>> IMHO the best solution would be to apply the patch, and then have a
>> /sys/block/md*/queue/ for md devices so that max_sectors_kb and
>> max_hw_sectors_kb can be tuned without recompiling the kernel...
>> is that possible?
>>
>> the patch is against 2.6.18-128.1.14.el5-lustre1.8.1
>
>> --- linux-2.6.18.x86_64.lustre/include/linux/blkdev.h	2009-08-18 17:40:51.000000000 +1000
>> +++ linux-2.6.18.x86_64.lustre.hackBlock/include/linux/blkdev.h	2009-08-21 13:47:55.000000000 +1000
>> @@ -778,7 +778,7 @@
>>  #define MAX_PHYS_SEGMENTS 128
>>  #define MAX_HW_SEGMENTS 128
>>  #define SAFE_MAX_SECTORS 255
>> -#define BLK_DEF_MAX_SECTORS 1024
>> +#define BLK_DEF_MAX_SECTORS 2048
>>
>>  #define MAX_SEGMENT_SIZE 65536
>
> This patch definitely looks reasonable, and since we already patch
> the server kernel it doesn't appear to be a huge problem to include
> it.  Can you please create a bug and attach the patch there.
>
> Cheers, Andreas

Already done, Bug 20533.  Has a few other notes.

Kevin
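[As a quick sanity check on the numbers in the console message quoted above
(plain arithmetic, not from the thread): the limits are counted in 512-byte
sectors, which is why the stock default splits Lustre's 1M i/o's in half and
why 2048 is enough.]

  echo $(( 1024 * 512 / 1024 ))   # BLK_DEF_MAX_SECTORS=1024 ->  512 kB max per request
  echo $(( 2048 * 512 / 1024 ))   # patched value 2048       -> 1024 kB, one full 1M RPC
  echo $(( 2560 * 512 / 1024 ))   # max_hw_sectors=2560      -> 1280 kB upper limit reported for md5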
On Wed, Aug 26, 2009 at 04:11:12AM -0600, Andreas Dilger wrote:
>On Aug 26, 2009  00:46 -0400, Robin Humble wrote:
>> with the patch, 1M i/o's are being fed to md (according to brw_stats),
>> and performance is a little better for RAID6 8+2 with 128k chunks, and
>> a bit worse for RAID6 8+2 with 64k chunks (which are curiously now fed
>> half 512k and half 1M i/o's by Lustre).
>This was the other question I'd asked internally.  If the array is
>formatted with 64kB chunks then 512k IOs shouldn't cause any read-modify-
>write operations and (in theory) give the same performance as 1M IOs on
>a 128kB chunksize array.  What is the relative performance of the
>64kB and 128kB configurations?

on these 1TB SATA RAID6 8+2's and external journals, with 1 client
writing to 1 OST, with 2.6.18-128.1.14.el5 + lustre1.8.1 + blkdev/md
patches from https://bugzilla.lustre.org/show_bug.cgi?id=20533 so that
128k chunk md gets 1M i/o's and 64k chunk md gets 512k i/o's then ->

  client max_rpcs_in_flight 8
  md chunk   write (MB/s)   read (MB/s)
    64k          185            345
   128k          235            390

so 128k chunks are 10-30% quicker than 64k in this particular setup on
big streaming i/o tests (1G of 1M lmdd's).

having said that, 1.6.7.2 servers do better than 1.8.1 on some configs
(I haven't had time to figure out why), but the trend of 128k chunks
being faster than 64k chunks remains.  also if the i/o load was messier
and involved smaller i/o's then 64k chunks might claw something back -
probably not enough though.

BTW, whilst we're on the topic - what does this part of brw_stats mean?

                             read        |       write
  disk fragmented I/Os   ios   %  cum %  |   ios     %  cum %
  1:                    5742 100   100   | 103186  100   100

this is for the 128k chunk case, where the rest of brw_stats says I'm
seeing 1M rpc's and 1M i/o's, but I'm not sure what '1' disk fragmented
i/o's means - should it be 0?  or does '1' mean unfragmented?

sorry for packing too many questions into one email, but these slowish
SATA disks seem to need a lot of rpc's in flight for good performance.
32 max_dirty_mb (the default) and 32 max_rpcs_in_flight seems a good
magic combo.  with that I get:

  client max_rpcs_in_flight 32
  md chunk   write (MB/s)   read (MB/s)
    64k          275            450
   128k          395            480

which is a lot faster...
with a heavier load of 20 clients hammering 4 OSS's, each with 4 R6 8+2
OSTs, I still see about a 10% advantage for clients with 32 rpcs.

is there a down side to running clients with max_rpcs_in_flight 32?
the initial production machine will be ~1500 clients and ~25 OSS's.

cheers,
robin
--
Dr Robin Humble, HPC Systems Analyst, NCI National Facility
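[For reference, the client-side tuning described above is just the standard
OSC parameters; roughly as below - the wildcards match every OSC on the
client, so narrow them if you only want to touch one filesystem.]

  # allow more concurrent 1MB RPCs per OSC (the default is 8)
  lctl set_param osc.*.max_rpcs_in_flight=32

  # max_dirty_mb already defaults to 32; set explicitly only for clarity
  lctl set_param osc.*.max_dirty_mb=32

  # confirm what the client is actually using
  lctl get_param osc.*.max_rpcs_in_flight osc.*.max_dirty_mb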
On Aug 28, 2009  01:28 -0400, Robin Humble wrote:
> on these 1TB SATA RAID6 8+2's and external journals, with 1 client
> writing to 1 OST, with 2.6.18-128.1.14.el5 + lustre1.8.1 + blkdev/md
> patches from https://bugzilla.lustre.org/show_bug.cgi?id=20533 so that
> 128k chunk md gets 1M i/o's and 64k chunk md gets 512k i/o's then ->
>
>   client max_rpcs_in_flight 8
>   md chunk   write (MB/s)   read (MB/s)
>     64k          185            345
>    128k          235            390
>
> so 128k chunks are 10-30% quicker than 64k in this particular setup on
> big streaming i/o tests (1G of 1M lmdd's).

Hmm, that is too bad.  I would have hoped that there was minimal
difference between the smaller and the larger chunk size, given that
they are still doing 1MB writes to disk and the data + parity amount
is the same.

It would be interesting to see what the performance is if you change
the RPC size to simulate clients doing smaller IOs:

  lctl set_param osc.*.max_pages_per_rpc={128,64}

Depending on how well-behaved your applications are this could make a
noticeable difference in "real world" application performance.

You could also check the brw_stats "pages per bulk r/w" on an existing
filesystem that has been running for a while in order to see the actual
IO size.  Granted, without your patch it will be maxed out at 128 pages,
but if there is a significant fraction of IO below that you may still
be better off with the smaller chunk size.

> BTW, whilst we're on the topic - what does this part of brw_stats mean?
>
>                              read        |       write
>   disk fragmented I/Os   ios   %  cum %  |   ios     %  cum %
>   1:                    5742 100   100   | 103186  100   100
>
> this is for the 128k chunk case, where the rest of brw_stats says I'm
> seeing 1M rpc's and 1M i/o's, but I'm not sure what '1' disk fragmented
> i/o's means - should it be 0?  or does '1' mean unfragmented?

That means the read/write request was submitted to disk in a single
fragment, which is ideal.  On my system there are a small number of read
requests that have "0" fragments.  These are for reads of a hole, or at
EOF, that return no data at all.

> sorry for packing too many questions into one email, but these slowish
> SATA disks seem to need a lot of rpc's in flight for good performance.
> 32 max_dirty_mb (the default) and 32 max_rpcs_in_flight seems a good
> magic combo.  with that I get:
>
>   client max_rpcs_in_flight 32
>   md chunk   write (MB/s)   read (MB/s)
>     64k          275            450
>    128k          395            480
>
> which is a lot faster...
> with a heavier load of 20 clients hammering 4 OSS's, each with 4 R6 8+2
> OSTs, I still see about a 10% advantage for clients with 32 rpcs.

Interesting.  We haven't tuned this recently except in the case of WAN,
but I guess the bandwidth of disks and networks is increasing enough
that it just needs more 1MB RPCs to keep the pipe full.

We've also tested 4MB RPCs (bug 16900 has a patch), but this gave us
mixed performance in our environment.  You could give this a try if you
are interested and report the results here.

> is there a down side to running clients with max_rpcs_in_flight 32?
> the initial production machine will be ~1500 clients and ~25 OSS's.

For 1500 clients it shouldn't be an issue, though it can make for longer
latency for some operations.  In the past we were also limited by the
number of request buffers on the server, but that is dynamic these days
and flow-controlled, and we've tested with up to 26000 clients on a
single filesystem (192 OSSes).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
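[For anyone wanting to follow the suggestion above, reading (and clearing)
brw_stats and shrinking the RPC size look roughly like this; the
reset-by-writing behaviour is from memory for 1.8, so treat it as an
assumption and double-check on your version.]

  # on the OSS: dump the per-OST histograms, including "pages per bulk r/w"
  # and "disk I/O size"
  lctl get_param obdfilter.*.brw_stats

  # clear the counters before a fresh test run (writing to the file resets
  # it - assumed behaviour, verify on your release)
  lctl set_param obdfilter.*.brw_stats=0

  # on a client: simulate applications doing smaller i/o's
  # (with 4kB pages, 128 pages = 512kB RPCs, 64 pages = 256kB RPCs)
  lctl set_param osc.*.max_pages_per_rpc=128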