mark
2010-Jun-14 21:20 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
Hi Everyone,

I'm trying to diagnose some performance concerns we are having about our Lustre deployment. It seems to be a fairly multifaceted problem involving how ifort does buffered writes along with how we have Lustre set up.

What I've identified so far is that our RAID stripe size on the OSTs is 768KB (6 * 128KB chunks) and the partitions are not being mounted with -o stripe. We have 2 LUNs per controller, and each virtual disk has 2 partitions, with the 2nd one being the Lustre file system. It is possible the partitions are not aligned. Most of the client-side settings are at their defaults (i.e. 8 RPCs in flight, 32MB dirty cache per OST, etc.). The journals are on separate SSDs. Our OSSes are probably oversubscribed.

What we've noticed is that with certain apps we get *really* bad performance to the OSTs -- as bad as 500-800KB/s to one OST. The best performance I've seen to an OST is around 300MB/s, with 500MB/s being more or less the upper bound imposed by IB.

Right now I'm trying to verify that fragmentation is happening as I would expect given the configuration mentioned above. I just learned about brw_stats, so I tried examining it for one of our OSTs (it looks like Lustre must have been restarted recently, given how little data there is):

disk fragmented I/Os    ios    %  cum %  |   ios    %  cum %
1:                        0    0     0   |   215    9     9
2:                        0    0     0   |  2004   89    98
3:                        0    0     0   |    22    0    99
4:                        0    0     0   |     2    0    99
5:                        0    0     0   |     5    0    99
6:                        0    0     0   |     2    0    99
7:                        1  100   100   |     1    0   100

disk I/O size           ios    %  cum %  |   ios    %  cum %
4K:                       3   42    42   |    17    0     0
8K:                       0    0    42   |    17    0     0
16K:                      0    0    42   |    22    0     1
32K:                      0    0    42   |    73    1     2
64K:                      1   14    57   |   292    6     9
128K:                     0    0    57   |   385    8    18
256K:                     3   42   100   |    88    2    20
512K:                     0    0   100   |  1229   28    48
1M:                       0    0   100   |  2218   51   100

My questions are:

1) Does a disk fragmentation count of "1" mean that those I/Os were fragmented, or would that be "0"?

2) Does the disk I/O size mean what Lustre actually wrote or what it wanted to write? What does that number mean in the context of our 768KB stripe size, since it lists so many I/Os at 1M?

Thanks,
Mark
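P.S. For reference, I'm pulling these numbers roughly as follows (just a sketch: the OST name is only an example, and my understanding -- which I haven't verified -- is that writing to the file clears the counters):

# on the OSS, dump the per-OST bulk I/O histograms
lctl get_param obdfilter.*.brw_stats

# or read one OST directly (example target name)
cat /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats

# clear the counters before starting a new test (assumption: any write resets them)
echo 0 > /proc/fs/lustre/obdfilter/lustre-OST0000/brw_stats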
Kevin Van Maren
2010-Jun-15 20:19 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
Life is much easier with a 1MB (or 512KB) native RAID stripe size.

It looks like most IOs are being broken into 2 pieces. See
https://bugzilla.lustre.org/show_bug.cgi?id=22850
for a few tweaks that would help get IOs > 512KB to disk. See also Bug 9945.

But you are also seeing IOs "combined" into pieces that are between 1 and 2 RAID stripes, so set /sys/block/sd*/queue/max_sectors_kb to 768, so that the IO scheduler does not "help" too much.

There are mkfs options to tell ldiskfs your native RAID stripe size. You probably also want to change the client stripe size (lfs setstripe) to be an integral multiple of the RAID size (i.e., not the default 1MB).

Also note that those are power-of-2 buckets, so your 768KB chunks aren't going to be listed as "768".

Kevin
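To make that concrete, something like the following (a sketch only: the device name, MGS NID, and directory are placeholders, the exact mke2fs extended-option spellings should be checked against your e2fsprogs, and the mkfs step obviously only applies when (re)building an OST):

# keep the IO scheduler from merging requests past one 768KB RAID stripe
echo 768 > /sys/block/sdb/queue/max_sectors_kb

# tell ldiskfs the RAID geometry at format time:
# 128KB chunk / 4KB block = stride of 32; 6 data disks * 32 = stripe-width of 192
mkfs.lustre --ost --fsname=lustre --mgsnode=192.168.1.10@o2ib \
    --mkfsoptions="-E stride=32,stripe-width=192" /dev/sdb

# make the Lustre file stripe size an integral multiple of the RAID stripe
lfs setstripe -s 768k -c -1 /mnt/lustre/testdir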
Andreas Dilger
2010-Jun-15 21:08 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
Also, setting the max RPC size on the client to 768kB would avoid the need for each RPC to generate 2 IO requests.

It is possible with newer tune2fs to set the RAID stripe size, and the allocator (mballoc) will use that size. There is a bug open to transfer this "optimal" size to the client, but it hasn't gotten much attention since most sites are set up with a 1MB stripe size.

Cheers, Andreas
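For an already-formatted OST, a sketch of the tune2fs route (the device is a placeholder, the extended-option names should be double-checked against your tune2fs man page, and with 128KB chunks on 6 data disks the values work out to 32 and 192 4KB blocks):

# set the RAID geometry hints in the ldiskfs superblock
tune2fs -E stride=32,stripe_width=192 /dev/sdb2

# verify what the superblock now reports
dumpe2fs -h /dev/sdb2 | grep -i raid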
Brian J. Murrell
2010-Jun-15 21:51 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
On Tue, 2010-06-15 at 15:08 -0600, Andreas Dilger wrote:
> Also setting the max RPC size on the client to be 768kB would avoid
> the need for each RPC to generate 2 IO requests.

I wondered about this too, but I recall doing an on-site at a client one time where this was necessary. They had done a 7+1 RAID5 before we got involved (or we'd have counselled them otherwise) and therefore did not get 1MB RAID stripes. Setting the client max RPC size didn't really have the effect that we thought it would/should.

That said, that was quite a while ago (the 1.6.x timeframe), and it could very well have been a bug that has since been fixed.

b.
Mark Nelson
2010-Jun-15 21:55 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
Hi Kevin and Andreas,

Thank you both for the excellent information! At this point I doubt I'll be able to configure the raid arrays for a 1MB stripe size (as much as I would like to). How can I change the max RPC size to 768KB on the client?

So far on my list:

- work with tune2fs to set good stripe parameters for the FS.
- mount with -o stripe=N (is this needed if tune2fs sets the defaults?)
- examine the alignment of the Lustre partitions.
- set /sys/block/sd*/queue/max_sectors_kb to 768.
- set the client stripe size to 768KB.
- change the max RPC size on the clients to 768KB (not sure how yet).
- upgrade to 1.8.4 to get the benefit of the patches mentioned in bug #22850.

I am also considering:

- increasing RPCs in flight.
- increasing the dirty cache size.
- disabling lnet debugging.
- changing the OST service thread count.
- checking out the MDS configuration and RAID.

Anything I'm missing?

Thanks,
Mark

-- 
Mark Nelson, Lead Software Developer
Minnesota Supercomputing Institute
Phone: (612)626-4479
Email: mark at msi.umn.edu
Kevin Van Maren
2010-Jun-15 21:58 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
Mark Nelson wrote:
> Thank you both for the excellent information! At this point I doubt
> I'll be able to configure the raid arrays for a 1MB stripe size (as
> much as I would like to). How can I change the max RPC size to 768KB
> on the client?

Set max_pages_per_rpc (drop it from 256 to 192), the same way you would set max_rpcs_in_flight, with something like:

# lctl conf_param lustre.osc.max_pages_per_rpc=192

BTW, a blatant plug, but see:
http://www.oracle.com/us/support/systems/advanced-customer-services/readiness-service-lustre-ds-077261.pdf
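As for the other client-side items on your list, those are also lctl tunables. Roughly (the values below are only examples to show the parameter names, not recommendations, and they are set on the clients):

# more RPCs in flight and more dirty cache per OSC
lctl set_param osc.*.max_rpcs_in_flight=32
lctl set_param osc.*.max_dirty_mb=64

# disable most Lustre/LNET debug logging
lctl set_param debug=0

# afterwards, the "pages per rpc" histogram shows whether RPCs are really 192 pages
lctl get_param osc.*.rpc_stats

Note that set_param changes do not survive a remount; use conf_param (as above) for anything you want to make permanent.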
Bernd Schubert
2010-Jun-16 21:40 UTC
[Lustre-discuss] Using brw_stats to diagnose lustre performance
On Tuesday 15 June 2010, Kevin Van Maren wrote:
> It looks like most IOs are being broken into 2 pieces. See
> https://bugzilla.lustre.org/show_bug.cgi?id=22850
> for a few tweaks that would help get IOs > 512KB to disk. See also Bug 9945

I played with a similar patch (the blkdev defines) some time ago, but didn't notice any performance improvement on the DDN S2A 9900. Before increasing those values I got IOs of up to 7MB; after doubling MAX_HW_SEGMENTS and MAX_PHYS_SEGMENTS the max IO size doubled to 14MB. Unfortunately, more IOs in between the "magic" good IO sizes also came up (magic good here being 1, 2, 3, ..., 14MB), e.g. lots of 1008KB or 2032KB IOs. Example numbers from a production system (counts are in hex):

Length    Port 1           Port 2           Port 3           Port 4
Kbytes    Reads   Writes   Reads   Writes   Reads   Writes   Reads   Writes
  960     1DCD    2EEB     1E44    3532     1431    1D7E     14FB    2284
  976     1ACD    34AC     1A0F    48EB     12E2    24AE     11E1    257F
  992     1D46    3787     1CA7    51EB     144C    2E9B     1354    3A62
 1008     100A5   11B5C    10391   13765    A9B8    FBED     9E9A    D457
 1024     BFD41D  111F3C4  BFBE47  11A110D  8C316B  C95178   8E5A9F  C83850
 1040     583     625      538     6C3      3F3     513      413     337
 ...
 2032     551     1260     50D     136B     3E4     1218     3C8     BA1
 2048     41B85   FDB21    3B8D1   101857   31088   B78E0    2C4A5   92F48
 2064     FB      20       108     24       BE      19       C7      10
 2080     E3      2F       E6      37       AA      44       C7      1B
 ...
 7152     55      6C7      58      80C      60      70D      3F      3B4
 7168     449F    E335     417C    E743     3332    AB34     3686    A568
 7184     29      1        14      2        19      1        14      0

I don't think it matters to any storage system whether the max IO is 7MB or 14MB, but those sizes in between are rather annoying, and from the output of brw_stats I *sometimes* have no idea how they can happen. On the particular system I took the numbers from, users mostly don't do streaming writes, so there the reason is clear.

After tuning the FZJ system (Kevin should know that system), the SLES11 kernel with chained scatter-gather lists (so the blkdev patch is mostly not required anymore) can do IO sizes of up to 12MB. Unfortunately, quite a few 1008s still show up out of the blue, without an obvious reason (during my streaming writes with obdecho).

Cheers,
Bernd

-- 
Bernd Schubert
DataDirect Networks
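A quick way to see what the block layer will currently accept, before and after this kind of tuning (a sketch; it assumes the OSTs sit on plain sd* devices):

# current and hardware-limit maximum request sizes, per disk
for q in /sys/block/sd*/queue; do
    echo "$q: $(cat $q/max_sectors_kb) / $(cat $q/max_hw_sectors_kb) KB"
done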