Erich Focht
2007-Sep-26 15:50 UTC
[Lustre-discuss] Fwd: max_sectors_kb change doesn't help
Ooops, this was actually meant for lustre-discuss, not for lustre-devel.

---------- Forwarded Message ----------

Subject: [Lustre-devel] max_sectors_kb change doesn't help
Date: Wednesday 26 September 2007 17:41
From: Erich Focht <efocht at hpce.nec.com>
To: lustre-devel at clusterfs.com

Hi,

in /proc/fs/lustre/obdfilter/*/brw_stats I found that the disk I/O is done
in 512K pieces. Following the Lustre manual I changed
/sys/block/DEVICE/queue/max_sectors_kb to 1024 but the I/O sizes in Lustre
didn't change.

Is there any place in Lustre where I can enforce 1MB I/Os? What else can I do?
My RAID is connected through an mpt fusion driver and I cannot find any place
where the I/O size is limited to 512KB. In fact, without Lustre I am able to
write with larger I/Os, so I suspect it's not the driver's fault...

Regards,
Erich
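(For readers following along: the sysfs change referred to above is typically
made along these lines. This is only a sketch; sdX is a placeholder device,
not the one actually used in the original post.)

    cat /sys/block/sdX/queue/max_sectors_kb        # current limit, e.g. 512
    echo 1024 > /sys/block/sdX/queue/max_sectors_kb

    # then compare against what the OST reports on the server side
    grep -A 2 "disk I/O size" /proc/fs/lustre/obdfilter/*/brw_stats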
Andreas Dilger
2007-Sep-26 23:20 UTC
[Lustre-discuss] Fwd: max_sectors_kb change doesn't help
On Sep 26, 2007 17:50 +0200, Erich Focht wrote:
> ---------- Forwarded Message ----------
>
> Subject: [Lustre-devel] max_sectors_kb change doesn't help
> Date: Wednesday 26 September 2007 17:41
> From: Erich Focht <efocht at hpce.nec.com>
> To: lustre-devel at clusterfs.com
>
> Hi,
>
> in /proc/fs/lustre/obdfilter/*/brw_stats I found that the disk I/O is done
> in 512K pieces. Following the Lustre manual I changed
> /sys/block/DEVICE/queue/max_sectors_kb to 1024 but the I/O sizes in Lustre
> didn't change.
>
> Is there any place in Lustre where I can enforce 1MB I/Os? What else can I do?
> My RAID is connected through an mpt fusion driver and I cannot find any place
> where the I/O size is limited to 512KB. In fact, without Lustre I am able to
> write with larger I/Os, so I suspect it's not the driver's fault...

Did you try remounting the OSTs after changing the setting?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
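(A minimal remount sequence for one OST would look roughly like this; the
mount point and device are placeholders and a Lustre 1.6-style server mount
is assumed.)

    umount /mnt/ost0
    echo 1024 > /sys/block/sdX/queue/max_sectors_kb
    mount -t lustre /dev/sdX /mnt/ost0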
Erich Focht
2007-Sep-27 08:40 UTC
[Lustre-discuss] Fwd: max_sectors_kb change doesn't help
Hi Andreas,

On Thursday 27 September 2007 01:20, Andreas Dilger wrote:
> On Sep 26, 2007 17:50 +0200, Erich Focht wrote:
> > in 512K pieces. Following the Lustre manual I changed
> > /sys/block/DEVICE/queue/max_sectors_kb to 1024 but the I/O sizes in Lustre
> > didn't change.
>
> Did you try remounting the OSTs after changing the setting?

Yes, actually when I changed the settings the lustre modules were not loaded
yet. Where is the detection of the maximally allowed I/O size done in Lustre?

Thanks,
Erich
Andreas Dilger
2007-Sep-27 08:52 UTC
[Lustre-discuss] Fwd: max_sectors_kb change doesn't help
On Sep 27, 2007 10:40 +0200, Erich Focht wrote:
> On Thursday 27 September 2007 01:20, Andreas Dilger wrote:
> > On Sep 26, 2007 17:50 +0200, Erich Focht wrote:
> > > in 512K pieces. Following the Lustre manual I changed
> > > /sys/block/DEVICE/queue/max_sectors_kb to 1024 but the I/O sizes in Lustre
> > > didn't change.
> >
> > Did you try remounting the OSTs after changing the setting?
>
> Yes, actually when I changed the settings the lustre modules were not loaded
> yet. Where is the detection of the maximally allowed I/O size done in Lustre?

In fact there isn't any such detection in Lustre - it will push pages into
an IO until the block layer tells it to stop.

Please check /proc/fs/lustre/obdfilter/*/brw_stats to see if the IO requests
coming from the client are 1MB in size (256 pages), and if yes then the issue
would likely be in the block layer.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
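(A quick way to do that check on the OSS might look like the following sketch;
it only prints the first few rows of each histogram.)

    for f in /proc/fs/lustre/obdfilter/*/brw_stats; do
        echo "== $f"
        grep -A 2 "pages per bulk" "$f"    # RPC size coming from the clients
        grep -A 2 "disk I/O size"  "$f"    # what actually hits the disk
    done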
Erich Focht
2007-Sep-27 09:28 UTC
[Lustre-discuss] Fwd: max_sectors_kb change doesn't help
On Thursday 27 September 2007 10:52, Andreas Dilger wrote:
> In fact there isn't any such detection in Lustre - it will push pages into
> an IO until the block layer tells it to stop.
>
> Please check /proc/fs/lustre/obdfilter/*/brw_stats to see if the IO requests
> coming from the client are 1MB in size (256 pages), and if yes then the issue
> would likely be in the block layer.

The output is below. I see 256 pages per transfer. But I also see "disk
fragmented I/Os". Sounds somehow related, but can I influence the
fragmentation?

BTW: I'm running on a RHEL5 system, with noop I/O scheduler. The disks are
now connected through Emulex FC controllers, but I see the same behavior with
SAS storage attached through LSI Logic HCAs.

                              read     |      write
pages per bulk r/w     rpcs  %  cum %  |  rpcs   %  cum %
256:                      0  0     0   |   955 100   100

                              read     |      write
discontiguous pages    rpcs  %  cum %  |  rpcs   %  cum %
0:                        0  0     0   |   955 100   100

                              read     |      write
discontiguous blocks   rpcs  %  cum %  |  rpcs   %  cum %
0:                        0  0     0   |   955 100   100

                              read     |      write
disk fragmented I/Os    ios  %  cum %  |   ios   %  cum %
2:                        0  0     0   |   955 100   100

                              read     |      write
disk I/Os in flight     ios  %  cum %  |   ios   %  cum %
1:                        0  0     0   |   216  11    11
2:                        0  0     0   |   220  11    22
3:                        0  0     0   |   194  10    32
4:                        0  0     0   |   198  10    43
5:                        0  0     0   |   166   8    52
6:                        0  0     0   |   165   8    60
7:                        0  0     0   |   122   6    67
8:                        0  0     0   |   121   6    73
9:                        0  0     0   |   116   6    79
10:                       0  0     0   |   115   6    85
11:                       0  0     0   |    95   4    90
12:                       0  0     0   |    94   4    95
13:                       0  0     0   |    35   1    97
14:                       0  0     0   |    32   1    98
15:                       0  0     0   |     9   0    99
16:                       0  0     0   |     9   0    99
17:                       0  0     0   |     2   0    99
18:                       0  0     0   |     1   0   100

                              read     |      write
I/O time (1/1000s)      ios  %  cum %  |   ios   %  cum %
4:                        0  0     0   |     3   0     0
8:                        0  0     0   |    17   1     2
16:                       0  0     0   |    98  10    12
32:                       0  0     0   |   326  34    46
64:                       0  0     0   |   370  38    85
128:                      0  0     0   |   129  13    98
256:                      0  0     0   |    10   1    99
512:                      0  0     0   |     2   0   100

                              read     |      write
disk I/O size           ios  %  cum %  |   ios   %  cum %
512K:                     0  0     0   |  1910 100   100

Thanks, best regards,
Erich

--
Dr. Erich Focht
Solution Architecture Group, Linux R&D
NEC High Performance Computing Europe
Andreas Dilger
2007-Sep-27 10:34 UTC
[Lustre-discuss] Fwd: max_sectors_kb change doesn't help
On Sep 27, 2007 11:28 +0200, Erich Focht wrote:
> On Thursday 27 September 2007 10:52, Andreas Dilger wrote:
> > In fact there isn't any such detection in Lustre - it will push pages into
> > an IO until the block layer tells it to stop.
> >
> > Please check /proc/fs/lustre/obdfilter/*/brw_stats to see if the IO requests
> > coming from the client are 1MB in size (256 pages), and if yes then the issue
> > would likely be in the block layer.
>
> The output is below. I see 256 pages per transfer. But I also see "disk
> fragmented I/Os". Sounds somehow related, but can I influence the
> fragmentation?
>
> pages per bulk r/w     rpcs  %  cum %  |  rpcs   %  cum %
> 256:                      0  0     0   |   955 100   100
>
>                               read     |      write
> disk fragmented I/Os    ios  %  cum %  |   ios   %  cum %
> 2:                        0  0     0   |   955 100   100
>
>                               read     |      write
> disk I/O size           ios  %  cum %  |   ios   %  cum %
> 512K:                     0  0     0   |  1910 100   100

This generally points to the underlying layer fragmenting the IO, since the
"disk fragmented I/O" counter is only incremented when we can't add a page to
the existing bio (see "frags" in lustre/obdfilter/filter_io_26.c:filter_do_bio()).
The culprit is in "can_be_merged()" or "bio_add_page()".

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
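(One way to watch the fragmentation happen is to capture the D_INODE debug
messages emitted by filter_do_bio() on the OSS. The commands below are a
sketch from memory; the debug mask name "inode" and the exact lctl syntax may
differ between Lustre versions.)

    echo +inode > /proc/sys/lnet/debug     # enable D_INODE messages
    # ... run the write load ...
    lctl dk /tmp/lustre-debug.log          # dump the kernel debug buffer
    grep "bio++" /tmp/lustre-debug.log     # the "have to fragment" messages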
Erich Focht
2007-Oct-01 12:02 UTC
[Lustre-discuss] Fwd: max_sectors_kb change doesn't help
Hi Andreas,

On Thursday 27 September 2007 12:34, Andreas Dilger wrote:
> >                               read     |      write
> > disk I/O size           ios  %  cum %  |   ios   %  cum %
> > 512K:                     0  0     0   |  1910 100   100
>
> This generally points to the underlying layer fragmenting the IO, since the
> "disk fragmented I/O" counter is only incremented when we can't add a page to
> the existing bio (see "frags" in lustre/obdfilter/filter_io_26.c:filter_do_bio()).
> The culprit is in "can_be_merged()" or "bio_add_page()".

the Lustre debugging messages look like this:

00002000:00000002:3:1191233501.646369:0:15619:0:(filter_io_26.c:339:filter_do_bio()) bio++ sz 524288 vcnt 128(256) sectors 1024(1024) psg 18(128) hsg 18(64)

and are printed by the code:

                                /* Dang! I have to fragment this I/O */
                                CDEBUG(D_INODE, "bio++ sz %d vcnt %d(%d) "
                                       "sectors %d(%d) psg %d(%d) hsg %d(%d)\n",
                                       bio->bi_size,
                                       bio->bi_vcnt, bio->bi_max_vecs,
                                       bio->bi_size >> 9, q->max_sectors,
                                       bio_phys_segments(q, bio),
                                       q->max_phys_segments,
                                       bio_hw_segments(q, bio),
                                       q->max_hw_segments);

This actually suggests that q->max_sectors is 1024, although
/sys/block/sd*/queue/max_sectors_kb is set to 2048 (i.e. the value should be
4096 sectors).

Could this problem come from multipath? It is "assembling" the dm-* devices
out of the SCSI devices, presents the SCSI devices as "slaves", but has no
own settings for the queue parameters in /sys/block/dm-*. I tried increasing
the SCSI member devices' queue max_sectors_kb before starting the multipathd,
but it didn't help. Uhmmm, yes, I am using multipath... forgot to mention
earlier.

Best regards,
Erich
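(For what it's worth, comparing the limits seen by the SCSI members with the
dm device can be done roughly as below. The device names are placeholders,
and on this 2.6.18-based kernel the dm-* devices may not expose a queue/
directory at all, which is exactly the point above.)

    # limits of the SCSI member devices
    for d in sdX sdY; do
        echo -n "$d: "; cat /sys/block/$d/queue/max_sectors_kb
    done

    # the multipath device itself (the file may simply not exist)
    cat /sys/block/dm-0/queue/max_sectors_kb 2>/dev/null \
        || echo "dm-0 exposes no max_sectors_kb"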
Andreas Dilger
2007-Oct-02 05:10 UTC
[Lustre-discuss] Fwd: max_sectors_kb change doesn't help
On Oct 01, 2007 14:02 +0200, Erich Focht wrote:
> On Thursday 27 September 2007 12:34, Andreas Dilger wrote:
> > >                               read     |      write
> > > disk I/O size           ios  %  cum %  |   ios   %  cum %
> > > 512K:                     0  0     0   |  1910 100   100
> >
> > This generally points to the underlying layer fragmenting the IO, since the
> > "disk fragmented I/O" counter is only incremented when we can't add a page to
> > the existing bio (see "frags" in lustre/obdfilter/filter_io_26.c:filter_do_bio()).
> > The culprit is in "can_be_merged()" or "bio_add_page()".
>
> the Lustre debugging messages look like this:
>
> 00002000:00000002:3:1191233501.646369:0:15619:0:(filter_io_26.c:339:filter_do_bio()) bio++ sz 524288 vcnt 128(256) sectors 1024(1024) psg 18(128) hsg 18(64)
>
> and are printed by the code:
>
>                                 /* Dang! I have to fragment this I/O */
>                                 CDEBUG(D_INODE, "bio++ sz %d vcnt %d(%d) "
>                                        "sectors %d(%d) psg %d(%d) hsg %d(%d)\n",
>                                        bio->bi_size,
>                                        bio->bi_vcnt, bio->bi_max_vecs,
>                                        bio->bi_size >> 9, q->max_sectors,
>                                        bio_phys_segments(q, bio),
>                                        q->max_phys_segments,
>                                        bio_hw_segments(q, bio),
>                                        q->max_hw_segments);
>
> This actually suggests that q->max_sectors is 1024, although
> /sys/block/sd*/queue/max_sectors_kb is set to 2048 (i.e. the value should be
> 4096 sectors).
>
> Could this problem come from multipath? It is "assembling" the dm-* devices
> out of the SCSI devices, presents the SCSI devices as "slaves", but has no
> own settings for the queue parameters in /sys/block/dm-*. I tried increasing
> the SCSI member devices' queue max_sectors_kb before starting the multipathd,
> but it didn't help. Uhmmm, yes, I am using multipath... forgot to mention
> earlier.

It seems entirely possible, I haven't looked at the multipath code myself.
It should really build the q->max_* values as the minimum of all of the
underlying devices instead of using a default value.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
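(To illustrate the "minimum over the underlying devices" idea, a sketch like
the following would show what the stacked limit ought to be. It assumes the
/sys/block/dm-*/slaves/ symlinks resolve to the slave devices' sysfs
directories, which may not be the case on this 2.6.18 kernel; dm-0 is a
placeholder.)

    min=
    for f in /sys/block/dm-0/slaves/*/queue/max_sectors_kb; do
        v=$(cat "$f")
        # keep the smallest value seen so far
        if [ -z "$min" ] || [ "$v" -lt "$min" ]; then
            min=$v
        fi
    done
    echo "minimum max_sectors_kb across slaves: ${min:-unknown}"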
Erich Focht
2007-Oct-06 11:23 UTC
[Lustre-discuss] Fwd: max_sectors_kb change doesn't help
On Tuesday 02 October 2007 07:10, Andreas Dilger wrote:
> > Could this problem come from multipath? It is "assembling" the dm-* devices
> > out of the SCSI devices, presents the SCSI devices as "slaves", but has no
> > own settings for the queue parameters in /sys/block/dm-*. I tried increasing
> > the SCSI member devices' queue max_sectors_kb before starting the multipathd,
> > but it didn't help. Uhmmm, yes, I am using multipath... forgot to mention
> > earlier.
>
> It seems entirely possible, I haven't looked at the multipath code myself.
> It should really build the q->max_* values as the minimum of all of the
> underlying devices instead of using a default value.

Without multipath (mounting the SCSI devices directly) the change of
max_sectors_kb has the expected effect: Lustre stops fragmenting and I/Os are
done in 1MB chunks. I reckon I was unable to find the place where max_sectors
for the multipath queues is set; I will continue to look into this.

Regards,
Erich
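(The comparison described above, sketched with placeholder names rather than
the exact commands used: raise the limit on the SCSI device, mount the OST on
it directly instead of the dm-* device, run some writes, and recheck the
histogram.)

    echo 1024 > /sys/block/sdX/queue/max_sectors_kb
    mount -t lustre /dev/sdX /mnt/ost0
    # ... write load ...
    grep -A 2 "disk I/O size" /proc/fs/lustre/obdfilter/*/brw_stats
    # direct sd mount: 1M rows appear; via dm-*: everything stays at 512K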