I am setting up a new Lustre filesystem using LSI Engenio based disk enclosures with integrated dual RAID controllers. I configured the disks into 8+2 RAID6 groups with a 128KB segment size (chunk size). This hardware uses the mpt2sas kernel module on the Linux host side. I use the whole block device for each OST (to avoid any alignment issues). When running sgpdd-survey I see high throughput numbers (~3GB/s write, ~5GB/s read), and the controller stats show that the number of IOPS equals the number of MB/s. However, as soon as I put ldiskfs on the OSTs, obdfilter shows slower results (~2GB/s write, ~2GB/s read) and the controller stats show more than double the IOPS relative to MB/s. Looking at the output of iostat -m -x 1 and brw_stats I can see that a large number of I/O operations are smaller than 1MB, mostly 512KB. I know that some work was done on optimising the kernel block device layer to process 1MB I/O requests and that those changes were committed in Lustre 1.8.5. So I guess this I/O chopping happens below the Lustre stack, maybe in the mpt2sas driver?

I am hoping that someone in the Lustre community can shed some light on my problem.

In my setup I use:
Lustre 1.8.5
CentOS-5.5

Some parameters I tuned from the CentOS defaults:
deadline I/O scheduler
max_hw_sectors_kb=4096
max_sectors_kb=1024

brw_stats output
--
find /proc/fs/lustre/obdfilter/ -name "testfs-OST*" | while read ost; do cat $ost/brw_stats ; done | grep "disk I/O size" -A9

disk I/O size    ios   % cum % |   ios   % cum %
4K:              206   0     0 |   521   0     0
8K:              224   0     0 |   595   0     1
16K:             105   0     1 |   479   0     1
32K:             140   0     1 |  1108   1     3
64K:             231   0     1 |  1470   1     4
128K:            536   1     2 |  2259   2     7
256K:           1762   3     6 |  5644   6    14
512K:          31574  64    71 | 30431  35    50
1M:            14200  28   100 | 42143  49   100
--
disk I/O size    ios   % cum % |   ios   % cum %
4K:              187   0     0 |   457   0     0
8K:              244   0     0 |   598   0     1
16K:             109   0     1 |   481   0     1
32K:             129   0     1 |  1100   1     3
64K:             222   0     1 |  1408   1     4
128K:            514   1     2 |  2291   2     7
256K:           1718   3     6 |  5652   6    14
512K:          32222  65    72 | 29810  35    49
1M:            13654  27   100 | 42202  50   100
--
disk I/O size    ios   % cum % |   ios   % cum %
4K:              196   0     0 |   551   0     0
8K:              206   0     0 |   551   0     1
16K:              79   0     0 |   513   0     1
32K:             136   0     1 |  1048   1     3
64K:             232   0     1 |  1278   1     4
128K:            540   1     2 |  2172   2     7
256K:           1681   3     6 |  5679   6    13
512K:          31842  64    71 | 31705  37    51
1M:            14077  28   100 | 41789  48   100
--
disk I/O size    ios   % cum % |   ios   % cum %
4K:              190   0     0 |   486   0     0
8K:              200   0     0 |   547   0     1
16K:              93   0     0 |   448   0     1
32K:             141   0     1 |  1029   1     3
64K:             240   0     1 |  1283   1     4
128K:            558   1     2 |  2125   2     7
256K:           1716   3     6 |  5400   6    13
512K:          31476  64    70 | 29029  35    48
1M:            14366  29   100 | 42454  51   100
--
disk I/O size    ios   % cum % |   ios   % cum %
4K:              209   0     0 |   511   0     0
8K:              195   0     0 |   621   0     1
16K:              79   0     0 |   558   0     1
32K:             134   0     1 |  1135   1     3
64K:             245   0     1 |  1390   1     4
128K:            509   1     2 |  2219   2     7
256K:           1715   3     6 |  5687   6    14
512K:          31784  64    71 | 31172  36    50
1M:            14112  28   100 | 41719  49   100
--
disk I/O size    ios   % cum % |   ios   % cum %
4K:              201   0     0 |   500   0     0
8K:              241   0     0 |   604   0     1
16K:              82   0     1 |   584   0     1
32K:             130   0     1 |  1092   1     3
64K:             230   0     1 |  1331   1     4
128K:            547   1     2 |  2253   2     7
256K:           1695   3     6 |  5634   6    14
512K:          31501  64    70 | 31836  37    51
1M:            14343  29   100 | 41517  48   100
--

Wojciech Turek
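(For reference, one quick way to see the average request size the block devices are getting is iostat's avgrq-sz column. The sketch below assumes the CentOS 5 sysstat prints avgrq-sz, in 512-byte sectors, as the eighth field of "iostat -x", and that the OSTs sit on plain /dev/sd* devices; adjust the match for other device names.)

# average request size per sd device, converted from 512-byte sectors to KB
iostat -x 1 2 | awk '/^sd/ { printf "%s: avg req size %.0f KB\n", $1, $8 / 2 }'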
Yep, with 1.8.5 the problem is most likely in the (mpt2sas) driver, not in the rest of the kernel. Driver limits are not normally noticed by (non-Lustre) people, because the default kernel limits I/O to 512KB.

You may want to see Bug 22850 for the changes required, e.g., for the Emulex/lpfc driver.

Glancing at the stock RHEL5 kernel, it looks like the issue is MPT2SAS_SG_DEPTH, which is limited to 128. This appears to be set to match the default kernel limit, but it is possible there is also a driver/HW limit. You should be able to increase that to 256 and see if it works...

Also note that the size buckets are power-of-2, so a "1MB" entry is any I/O > 512KB and <= 1MB.

If you can't get the driver to reliably do full 1MB I/Os, change to a 64KB chunk and set max_sectors_kb to 512. This will help ensure you get aligned, full-stripe writes.

Kevin
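(As a concrete sketch of the fallback Kevin describes: with an 8+2 RAID6 and a 64KB chunk, a full stripe is 8 x 64KB = 512KB, so capping requests at 512KB keeps them full-stripe aligned. The device names below are placeholders, and the setting does not survive a reboot, so it would normally go into a boot or mount script.)

# cap the request size at 512KB on each OST block device (placeholder names)
for dev in sdb sdc sdd; do
    echo 512 > /sys/block/$dev/queue/max_sectors_kb
    cat /sys/block/$dev/queue/max_sectors_kb
done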
Hi Kevin,

Thanks for the very helpful answer. I tried your suggestion and recompiled the mpt2sas driver with the following changes:

--- mpt2sas_base.h	2010-01-16 20:57:30.000000000 +0000
+++ new_mpt2sas_base.h	2011-06-10 12:53:35.000000000 +0100
@@ -83,13 +83,13 @@
 #ifdef CONFIG_SCSI_MPT2SAS_MAX_SGE
 #if CONFIG_SCSI_MPT2SAS_MAX_SGE < 16
 #define MPT2SAS_SG_DEPTH 16
-#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 128
-#define MPT2SAS_SG_DEPTH 128
+#elif CONFIG_SCSI_MPT2SAS_MAX_SGE > 256
+#define MPT2SAS_SG_DEPTH 256
 #else
 #define MPT2SAS_SG_DEPTH CONFIG_SCSI_MPT2SAS_MAX_SGE
 #endif
 #else
-#define MPT2SAS_SG_DEPTH 128 /* MAX_HW_SEGMENTS */
+#define MPT2SAS_SG_DEPTH 256 /* MAX_HW_SEGMENTS */
 #endif

 #if defined(TARGET_MODE)

However, I can still see that almost 50% of writes and slightly over 50% of reads fall into 512KB I/Os.

I am using device-mapper-multipath to manage active/passive paths; do you think that could have something to do with the I/O fragmentation?

Best regards,

Wojciech
Hi Kevin,

In my kernel .config I find the following lines:

CONFIG_SCSI_MPT2SAS=m
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
CONFIG_SCSI_MPT2SAS_LOGGING=y

I changed the SGE value to 256. Do I need to recompile the kernel before building the new module based on that .config?
It's possible there is another issue, but are you sure you (or RedHat) are not setting CONFIG_SCSI_MPT2SAS_MAX_SGE in your .config, which is preventing it from being set to 256? I don't have a machine using this driver.

You could put #warning in the code to see if you hit the non-256 code path when building, or printk max_sgl_entries in _base_allocate_memory_pools.

Kevin
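(A low-effort way to check Kevin's suspicion before instrumenting the code is to look at the config the module was actually built against. A sketch, assuming the module was built from the source tree of the running kernel:)

# does the distro .config pin the SGE limit back to 128?
grep CONFIG_SCSI_MPT2SAS /boot/config-$(uname -r)
grep CONFIG_SCSI_MPT2SAS_MAX_SGE /lib/modules/$(uname -r)/build/.config

# confirm which mpt2sas.ko actually gets loaded (stock vs. rebuilt copy)
modinfo mpt2sas | egrep '^(filename|version|srcversion)'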
Wojciech Turek wrote:
> Do I need to recompile the kernel before building the new module based on that .config?

No, but you do need to do something like "make oldconfig" to propagate the change in .config to the header files, and then rebuild the driver.

Kevin
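(A minimal sketch of that rebuild, assuming the edited .config lives in the source tree the running kernel was built from; the paths and the install location are assumptions, so adjust for your tree.)

cd /usr/src/kernels/$(uname -r)              # or wherever your kernel source lives
make oldconfig                               # propagate CONFIG_SCSI_MPT2SAS_MAX_SGE=256
make prepare scripts                         # regenerate headers needed for module builds
make M=drivers/scsi/mpt2sas modules          # rebuild just the mpt2sas module

# install the rebuilt module and reload it (keep a copy of the stock one)
cp /lib/modules/$(uname -r)/kernel/drivers/scsi/mpt2sas/mpt2sas.ko{,.orig}
cp drivers/scsi/mpt2sas/mpt2sas.ko /lib/modules/$(uname -r)/kernel/drivers/scsi/mpt2sas/
depmod -a
modprobe -r mpt2sas && modprobe mpt2sas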
Wojciech,

We have seen similar issues with DM-Multipath. Can you experiment with going straight to the block device, without DM-Multipath?

Thanks,

Galen
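(For the experiment Galen suggests, a sketch of how to find the sd path behind each multipath map and remount an OST on it directly; the mount point and device names are hypothetical.)

# list each dm-multipath map and its underlying sd paths;
# the path flagged "active" is the one to test against
multipath -ll

# remount one OST on the active sd path instead of the dm device
umount /mnt/ost0000
mount -t lustre /dev/sdc /mnt/ost0000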
Hi Galen,

I have tried your suggestion and mounted the OSTs directly on the /dev/sd<x> devices, but that didn't help and the I/O is still being fragmented.

Best regards,

Wojciech
Did you printk the SGE in the driver, to make sure it was being set properly? sg_tablesize may be being limited elsewhere, although the kernel patches in 1.8.5 should prevent that.

Do this:

# cat /sys/class/scsi_host/host*/sg_tablesize
This should be 256. If not, then this is still the issue.

# cat /sys/block/sd*/queue/max_hw_sectors_kb
This should be >= 1024.

# cat /sys/block/sd*/queue/max_sectors_kb
This should be 1024 (the Lustre mount sets it to max_hw_sectors_kb).

_base_allocate_memory_pools prints a bunch of helpful info using MPT2SAS_INFO_FMT -- it goes to KERN_INFO, and dinitprintk (MPT_DEBUG_INIT flag). Turn up the kernel verbosity and set the module parameter logging_level=0x20.

If you still don't have an answer, then look at these values in drivers/scsi/scsi_lib.c:

blk_queue_max_hw_segments(q, shost->sg_tablesize);
blk_queue_max_phys_segments(q, SCSI_MAX_PHYS_SEGMENTS);
blk_queue_max_sectors(q, shost->max_sectors);

Kevin

Wojciech Turek wrote:
> Hi Kevin,
>
> Unfortunately still no luck with 1MB I/O. I have forced my OSS to do 512KB I/O following your suggestion, setting max_sectors_kb to 512. I also recreated my HW RAID with 64KB chunks to align it with 512KB I/Os. I can see from brw_stats and the controller statistics that it does indeed do twice as many IOPS as throughput in MB/s, but performance isn't any better than before.
> From sgpdd-survey I know that this controller can do around 3GB/s write and 4GB/s read. Also, when running sgpdd-survey the controller stats show that the I/O is not fragmented (number of IOPS = throughput in MB/s). I also tried to bypass the multipath layer by mounting the sd devices directly, but that did not make any difference.
>
> If you have any more suggestions I will be happy to try them out.
>
> Best regards,
>
> Wojciech
>
> On 13 June 2011 15:13, Kevin Van Maren <kevin.van.maren at oracle.com> wrote:
>
>> Did you get it doing 1MB IOs?
>>
>> Kevin
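(A sketch of that verification pass. The debug flag is passed at module load time, which requires the OSTs on that OSS to be unmounted first; the grep patterns are only illustrative.)

# reload mpt2sas with init-time debug logging so the memory-pool
# sizing from _base_allocate_memory_pools shows up in the kernel log
modprobe -r mpt2sas
modprobe mpt2sas logging_level=0x20
dmesg | grep -i mpt2sas | tail -40

# then re-check the limit the SCSI midlayer picked up
cat /sys/class/scsi_host/host*/sg_tablesize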