I am setting up a new Lustre filesystem using LSI Engenio based disk enclosures with integrated dual RAID controllers. I configured the disks into 8+2 RAID6 groups with a 128KB segment size (chunk size). This hardware uses the mpt2sas kernel module on the Linux host side. I use the whole block device for each OST (to avoid any alignment issues). When running sgpdd-survey I see high throughput numbers (~3GB/s write, ~5GB/s read), and the controller stats show that the number of IOPS equals the number of MB/s. However, as soon as I put ldiskfs on the OSTs, obdfilter shows slower results (~2GB/s write, ~2GB/s read) and the controller stats show more than double the IOPS relative to MB/s. Looking at the output of iostat -m -x 1 and brw_stats I can see that a large number of I/O operations are smaller than 1MB, mostly 512KB. I know that some work was done on optimising the kernel block device layer to process 1MB I/O requests and that those changes were committed in Lustre 1.8.5. So I guess this I/O chopping happens below the Lustre stack, maybe in the mpt2sas driver?

I am hoping that someone in the Lustre community can shed some light on my problem.

In my setup I use:
Lustre 1.8.5
CentOS-5.5

Some parameters I tuned from the CentOS defaults:
deadline I/O scheduler
max_hw_sectors_kb=4096
max_sectors_kb=1024

brw_stats output
--
find /proc/fs/lustre/obdfilter/ -name "testfs-OST*" | while read ost; do cat $ost/brw_stats ; done | grep "disk I/O size" -A9

disk I/O size    ios   % cum % |   ios   % cum %
4K:              206   0     0 |   521   0     0
8K:              224   0     0 |   595   0     1
16K:             105   0     1 |   479   0     1
32K:             140   0     1 |  1108   1     3
64K:             231   0     1 |  1470   1     4
128K:            536   1     2 |  2259   2     7
256K:           1762   3     6 |  5644   6    14
512K:          31574  64    71 | 30431  35    50
1M:            14200  28   100 | 42143  49   100
--
disk I/O size    ios   % cum % |   ios   % cum %
4K:              187   0     0 |   457   0     0
8K:              244   0     0 |   598   0     1
16K:             109   0     1 |   481   0     1
32K:             129   0     1 |  1100   1     3
64K:             222   0     1 |  1408   1     4
128K:            514   1     2 |  2291   2     7
256K:           1718   3     6 |  5652   6    14
512K:          32222  65    72 | 29810  35    49
1M:            13654  27   100 | 42202  50   100
--
disk I/O size    ios   % cum % |   ios   % cum %
4K:              196   0     0 |   551   0     0
8K:              206   0     0 |   551   0     1
16K:              79   0     0 |   513   0     1
32K:             136   0     1 |  1048   1     3
64K:             232   0     1 |  1278   1     4
128K:            540   1     2 |  2172   2     7
256K:           1681   3     6 |  5679   6    13
512K:          31842  64    71 | 31705  37    51
1M:            14077  28   100 | 41789  48   100
--
disk I/O size    ios   % cum % |   ios   % cum %
4K:              190   0     0 |   486   0     0
8K:              200   0     0 |   547   0     1
16K:              93   0     0 |   448   0     1
32K:             141   0     1 |  1029   1     3
64K:             240   0     1 |  1283   1     4
128K:            558   1     2 |  2125   2     7
256K:           1716   3     6 |  5400   6    13
512K:          31476  64    70 | 29029  35    48
1M:            14366  29   100 | 42454  51   100
--
disk I/O size    ios   % cum % |   ios   % cum %
4K:              209   0     0 |   511   0     0
8K:              195   0     0 |   621   0     1
16K:              79   0     0 |   558   0     1
32K:             134   0     1 |  1135   1     3
64K:             245   0     1 |  1390   1     4
128K:            509   1     2 |  2219   2     7
256K:           1715   3     6 |  5687   6    14
512K:          31784  64    71 | 31172  36    50
1M:            14112  28   100 | 41719  49   100
--
disk I/O size    ios   % cum % |   ios   % cum %
4K:              201   0     0 |   500   0     0
8K:              241   0     0 |   604   0     1
16K:              82   0     1 |   584   0     1
32K:             130   0     1 |  1092   1     3
64K:             230   0     1 |  1331   1     4
128K:            547   1     2 |  2253   2     7
256K:           1695   3     6 |  5634   6    14
512K:          31501  64    70 | 31836  37    51
1M:            14343  29   100 | 41517  48   100
--

Wojciech Turek
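(For reference, one quick way to see the average request size the block devices are getting is iostat's avgrq-sz column. The sketch below assumes the CentOS 5 sysstat prints avgrq-sz, in 512-byte sectors, as the eighth field of "iostat -x", and that the OSTs sit on plain /dev/sd* devices; adjust the match for other device names.)

# average request size per sd device, converted from 512-byte sectors to KB
iostat -x 1 2 | awk '/^sd/ { printf "%s: avg req size %.0f KB\n", $1, $8 / 2 }'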
Yep, with 1.8.5 the problem is most likely in the (mpt2sas) driver, not in the rest of the kernel. Driver limits are not normally noticed by (non-Lustre) people, because the default kernel limits I/O to 512KB.

You may want to see Bug 22850 for the changes required, e.g., for the Emulex/lpfc driver.

Glancing at the stock RHEL5 kernel, it looks like the issue is MPT2SAS_SG_DEPTH, which is limited to 128. This appears to be set to match the default kernel limit, but it is possible there is also a driver/HW limit. You should be able to increase that to 256 and see if it works...

Also note that the size buckets are power-of-2, so a "1MB" entry is any I/O > 512KB and <= 1MB.

If you can't get the driver to reliably do full 1MB I/Os, change to a 64KB chunk and set max_sectors_kb to 512. This will help ensure you get aligned, full-stripe writes.

Kevin
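(As a concrete sketch of the fallback Kevin describes: with an 8+2 RAID6 and a 64KB chunk, a full stripe is 8 x 64KB = 512KB, so capping requests at 512KB keeps them full-stripe aligned. The device names below are placeholders, and the setting does not survive a reboot, so it would normally go into a boot or mount script.)

# cap the request size at 512KB on each OST block device (placeholder names)
for dev in sdb sdc sdd; do
    echo 512 > /sys/block/$dev/queue/max_sectors_kb
    cat /sys/block/$dev/queue/max_sectors_kb
done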
Hi Kevin,

Thanks for the very helpful answer. I tried your suggestion and recompiled the mpt2sas driver with the following changes:

--- mpt2sas_base.h	2010-01-16 20:57:30.000000000 +0000
+++ new_mpt2sas_base.h	2011-06-10 12:53:35.000000000 +0100
@@ -83,13 +83,13 @@
 #ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE
 #if CONFIG_SCSI_MPT2SAS_MAX_SGE < 16
 #define MPT2SAS_SG_DEPTH 16
-#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 128
-#define MPT2SAS_SG_DEPTH 128
+#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 256
+#define MPT2SAS_SG_DEPTH 256
 #else
 #define MPT2SAS_SG_DEPTH CONFIG_SCSI_MPT2SAS_MAX_SGE
 #endif
 #else
-#define MPT2SAS_SG_DEPTH 128 /* MAX_HW_SEGMENTS */
+#define MPT2SAS_SG_DEPTH 256 /* MAX_HW_SEGMENTS */
 #endif

 #if defined(TARGET_MODE)

However, I can still see that almost 50% of writes and slightly over 50% of reads fall into 512KB I/Os.

I am using device-mapper-multipath to manage active/passive paths; do you think that could have something to do with the I/O fragmentation?

Best regards,

Wojciech
Hi Kevin,

In my kernel .config I find the following lines:

CONFIG_SCSI_MPT2SAS=m
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
CONFIG_SCSI_MPT2SAS_LOGGING=y

I changed the SGE value to 256. Do I need to recompile the kernel before building the new module based on that .config?
It's possible there is another issue, but are you sure you (or RedHat) are not setting CONFIG_SCSI_MPT2SAS_MAX_SGE in your .config, which is preventing it from being set to 256? I don't have a machine using this driver.

You could put #warning in the code to see if you hit the non-256 code path when building, or printk max_sgl_entries in _base_allocate_memory_pools.

Kevin
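(A low-effort way to check Kevin's suspicion before instrumenting the code is to look at the config the module was actually built against. A sketch, assuming the module was built from the source tree of the running kernel:)

# does the distro .config pin the SGE limit back to 128?
grep CONFIG_SCSI_MPT2SAS /boot/config-$(uname -r)
grep CONFIG_SCSI_MPT2SAS_MAX_SGE /lib/modules/$(uname -r)/build/.config

# confirm which mpt2sas.ko actually gets loaded (stock vs. rebuilt copy)
modinfo mpt2sas | egrep '^(filename|version|srcversion)'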
Wojciech Turek wrote:
> Do I need to recompile the kernel before building the new module based on that .config?

No, but you do need to do something like "make oldconfig" to propagate the change in .config to the header files, and then rebuild the driver.

Kevin
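(A minimal sketch of that rebuild, assuming the edited .config lives in the source tree the running kernel was built from; the paths and the install location are assumptions, so adjust for your tree.)

cd /usr/src/kernels/$(uname -r)              # or wherever your kernel source lives
make oldconfig                               # propagate CONFIG_SCSI_MPT2SAS_MAX_SGE=256
make prepare scripts                         # regenerate headers needed for module builds
make M=drivers/scsi/mpt2sas modules          # rebuild just the mpt2sas module

# install the rebuilt module and reload it (keep a copy of the stock one)
cp /lib/modules/$(uname -r)/kernel/drivers/scsi/mpt2sas/mpt2sas.ko{,.orig}
cp drivers/scsi/mpt2sas/mpt2sas.ko /lib/modules/$(uname -r)/kernel/drivers/scsi/mpt2sas/
depmod -a
modprobe -r mpt2sas && modprobe mpt2sas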
Wojciech,

We have seen similar issues with DM-Multipath. Can you experiment with going straight to the block device, without DM-Multipath?

Thanks,

Galen
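(For the experiment Galen suggests, a sketch of how to find the sd path behind each multipath map and remount an OST on it directly; the mount point and device names are hypothetical.)

# list each dm-multipath map and its underlying sd paths;
# the path flagged "active" is the one to test against
multipath -ll

# remount one OST on the active sd path instead of the dm device
umount /mnt/ost0000
mount -t lustre /dev/sdc /mnt/ost0000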
Hi Galen,

I have tried your suggestion and mounted the OSTs directly on the /dev/sd<x> devices, but that didn't help and the I/O is still being fragmented.

Best regards,

Wojciech
Did you printk the SGE in the driver, to make sure it was being set properly? sg_tablesize may be being limited elsewhere, although the kernel patches in 1.8.5 should prevent that.

Do this:

# cat /sys/class/scsi_host/host*/sg_tablesize
This should be 256. If not, then this is still the issue.

# cat /sys/block/sd*/queue/max_hw_sectors_kb
This should be >= 1024.

# cat /sys/block/sd*/queue/max_sectors_kb
This should be 1024 (the Lustre mount sets it to max_hw_sectors_kb).

_base_allocate_memory_pools prints a bunch of helpful info using MPT2SAS_INFO_FMT -- it goes to KERN_INFO, and dinitprintk (MPT_DEBUG_INIT flag). Turn up the kernel verbosity and set the module parameter logging_level=0x20.

If you still don't have an answer, then look at these values in drivers/scsi/scsi_lib.c:

blk_queue_max_hw_segments(q, shost->sg_tablesize);
blk_queue_max_phys_segments(q, SCSI_MAX_PHYS_SEGMENTS);
blk_queue_max_sectors(q, shost->max_sectors);

Kevin

Wojciech Turek wrote:
> Hi Kevin,
>
> Unfortunately still no luck with 1MB I/O. I have forced my OSS to do 512KB I/O following your suggestion, setting max_sectors_kb to 512. I also recreated my HW RAID with 64KB chunks to align it with 512KB I/Os. I can see from brw_stats and the controller statistics that it does indeed do twice as many IOPS as throughput in MB/s, but performance isn't any better than before.
> From sgpdd-survey I know that this controller can do around 3GB/s write and 4GB/s read. Also, when running sgpdd-survey the controller stats show that the I/O is not fragmented (number of IOPS = throughput in MB/s). I also tried to bypass the multipath layer by mounting the sd devices directly, but that did not make any difference.
>
> If you have any more suggestions I will be happy to try them out.
>
> Best regards,
>
> Wojciech
>
> On 13 June 2011 15:13, Kevin Van Maren <kevin.van.maren at oracle.com> wrote:
>
>> Did you get it doing 1MB IOs?
>>
>> Kevin
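(A sketch of that verification pass. The debug flag is passed at module load time, which requires the OSTs on that OSS to be unmounted first; the grep patterns are only illustrative.)

# reload mpt2sas with init-time debug logging so the memory-pool
# sizing from _base_allocate_memory_pools shows up in the kernel log
modprobe -r mpt2sas
modprobe mpt2sas logging_level=0x20
dmesg | grep -i mpt2sas | tail -40

# then re-check the limit the SCSI midlayer picked up
cat /sys/class/scsi_host/host*/sg_tablesize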