Dear list,

We have a brw_stats result for one OST covering the time since it was brought up. According to these statistics, about 50% of the disk I/Os are fragmented. I found an earlier discussion on this list that refers to the same question:
http://lists.lustre.org/pipermail/lustre-discuss/2009-August/011433.html
It seems the ideal is for 100% of disk I/Os to have a fragment count of "1" or "0". I don't know why the I/Os are fragmented, since max_sectors_kb looks large enough (~32MB) for the biggest disk I/O size (1MB, according to brw_stats):

# cat /sys/block/sda/queue/max_sectors_kb
32767
# cat /sys/block/sda/queue/max_hw_sectors_kb
32767

                          read         |      write
pages per bulk r/w    rpcs   % cum %   |   rpcs   % cum %
1:                 1624391   6    6    |   133752   3    3
2:                   87238   0    6    |    13637   0    3
4:                  175534   0    7    |     9266   0    3
8:                  357057   1    9    |    15245   0    4
16:                 638252   2   11    |    26643   0    4
32:                1249181   5   16    |    49749   1    6
64:                2266806   9   25    |    98010   2    8
128:               4976083  20   45    |   198864   4   13
256:              13457144  54  100    |  3522333  86  100

                          read         |      write
discontiguous pages   rpcs   % cum %   |   rpcs   % cum %
0:                24628983  99   99    |  3921862  96   96
1:                  197041   0   99    |   143302   3   99
2:                    4679   0   99    |     1592   0   99
3:                     692   0   99    |      523   0   99
4:                     133   0   99    |      100   0   99
5:                      38   0   99    |      120   0  100
6:                      26   0   99    |        0   0  100
7:                      17   0   99    |        0   0  100
8:                      14   0   99    |        0   0  100
9:                       5   0   99    |        0   0  100
10:                      7   0   99    |        0   0  100
11:                      8   0   99    |        0   0  100
12:                      2   0   99    |        0   0  100
13:                      6   0   99    |        0   0  100
14:                      6   0   99    |        0   0  100
15:                      4   0   99    |        0   0  100
16:                      4   0   99    |        0   0  100
17:                      4   0   99    |        0   0  100
18:                      2   0   99    |        0   0  100
19:                      2   0   99    |        0   0  100
20:                      3   0   99    |        0   0  100
21:                      1   0   99    |        0   0  100
22:                      1   0   99    |        0   0  100
23:                      3   0   99    |        0   0  100
24:                      1   0   99    |        0   0  100
25:                      0   0   99    |        0   0  100
26:                      0   0   99    |        0   0  100
27:                      1   0   99    |        0   0  100
28:                      0   0   99    |        0   0  100
29:                      0   0   99    |        0   0  100
30:                      0   0   99    |        0   0  100
31:                      3   0  100    |        0   0  100

                          read         |      write
discontiguous blocks  rpcs   % cum %   |   rpcs   % cum %
0:                24616522  99   99    |  3908288  96   96
1:                  205785   0   99    |   156444   3   99
2:                    5200   0   99    |     1805   0   99
3:                     797   0   99    |      733   0   99
4:                     233   0   99    |      109   0   99
5:                     331   0   99    |      120   0  100
6:                     381   0   99    |        0   0  100
7:                     346   0   99    |        0   0  100
8:                     253   0   99    |        0   0  100
9:                     369   0   99    |        0   0  100
10:                    124   0   99    |        0   0  100
11:                    172   0   99    |        0   0  100
12:                     53   0   99    |        0   0  100
13:                    113   0   99    |        0   0  100
14:                    226   0   99    |        0   0  100
15:                     69   0   99    |        0   0  100
16:                     63   0   99    |        0   0  100
17:                    195   0   99    |        0   0  100
18:                     29   0   99    |        0   0  100
19:                    152   0   99    |        0   0  100
20:                    148   0   99    |        0   0  100
21:                      7   0   99    |        0   0  100
22:                      7   0   99    |        0   0  100
23:                      3   0   99    |        0   0  100
24:                     10   0   99    |        0   0  100
25:                      2   0   99    |        0   0  100
26:                      8   0   99    |        0   0  100
27:                     11   0   99    |        0   0  100
28:                      4   0   99    |        0   0  100
29:                      2   0   99    |        0   0  100
30:                      2   0   99    |        0   0  100
31:                     69   0  100    |        0   0  100

                          read         |      write
disk fragmented I/Os   ios   % cum %   |    ios   % cum %
0:                    9821   0    0    |        0   0    0
1:                11933478  48   48    |   630964  15   15
2:                12726392  51   99    |  3350479  82   97
3:                  155476   0   99    |    84465   2   99
4:                    2962   0   99    |     1103   0   99
5:                     393   0   99    |      364   0   99
6:                     341   0   99    |      123   0   99
7:                     384   0   99    |        1   0  100
8:                     345   0   99    |        0   0  100
9:                     254   0   99    |        0   0  100
10:                    369   0   99    |        0   0  100
11:                    124   0   99    |        0   0  100
12:                    172   0   99    |        0   0  100
13:                     53   0   99    |        0   0  100
14:                    113   0   99    |        0   0  100
15:                    226   0   99    |        0   0  100
16:                     69   0   99    |        0   0  100
17:                     63   0   99    |        0   0  100
18:                    195   0   99    |        0   0  100
19:                     29   0   99    |        0   0  100
20:                    152   0   99    |        0   0  100
21:                    148   0   99    |        0   0  100
22:                      7   0   99    |        0   0  100
23:                      7   0   99    |        0   0  100
24:                      3   0   99    |        0   0  100
25:                     10   0   99    |        0   0  100
26:                      2   0   99    |        0   0  100
27:                      8   0   99    |        0   0  100
28:                     11   0   99    |        0   0  100
29:                      4   0   99    |        0   0  100
30:                      2   0   99    |        0   0  100
31:                     71   0  100    |        0   0  100

                          read         |      write
disk I/Os in flight    ios   % cum %   |    ios   % cum %
1:                10954265  28   28    |  3781021  49   49
2:                 9217023  24   53    |  3329128  43   93
3:                 6063548  15   69    |   272981   3   97
4:                 4147809  10   80    |   121808   1   98
5:                 2974531   7   87    |    30924   0   99
6:                 1985323   5   93    |    19276   0   99
7:                  866099   2   95    |    10509   0   99
8:                  568797   1   97    |     7813   0   99
9:                  373967   0   98    |     5110   0   99
10:                 225063   0   98    |     3844   0   99
11:                 158800   0   99    |     2594   0   99
12:                 112305   0   99    |     1997   0   99
13:                  71542   0   99    |     1351   0   99
14:                  55104   0   99    |     1055   0   99
15:                  44688   0   99    |      721   0   99
16:                  29819   0   99    |      527   0   99
17:                  14854   0   99    |      369   0   99
18:                  10595   0   99    |      286   0   99
19:                   7301   0   99    |      203   0   99
20:                   5284   0   99    |      154   0   99
21:                   3859   0   99    |      113   0   99
22:                   2766   0   99    |       76   0   99
23:                   1887   0   99    |       56   0   99
24:                   1386   0   99    |       44   0   99
25:                   1086   0   99    |       39   0   99
26:                    877   0   99    |       24   0   99
27:                    708   0   99    |       18   0   99
28:                    559   0   99    |       17   0   99
29:                    455   0   99    |       15   0   99
30:                    390   0   99    |       13   0   99
31:                   7047   0  100    |      208   0  100

                          read         |      write
I/O time (1/1000s)     ios   % cum %   |    ios   % cum %
1:                  139469   0    0    |   279599   6    6
2:                  241078   0    1    |   280009   6   13
4:                 2422008   9   11    |  2710995  66   80
8:                 4039296  16   27    |   601942  14   95
16:                4139106  16   44    |   146443   3   98
32:               11264485  45   89    |    39174   0   99
64:                2161652   8   98    |     7816   0   99
128:                369479   1   99    |     1294   0   99
256:                 47904   0   99    |      187   0   99
512:                  6368   0   99    |       29   0   99
1K:                    408   0   99    |        4   0   99
2K:                    316   0   99    |        6   0   99
4K:                     71   0   99    |        1   0  100
8K:                     32   0   99    |        0   0  100
16K:                    12   0  100    |        0   0  100

                          read         |      write
disk I/O size          ios   % cum %   |    ios   % cum %
4K:                1737019   4    4    |   180540   2    2
8K:                 192978   0    5    |    34454   0    2
16K:                339763   0    5    |    37064   0    3
32K:                674382   1    7    |    47011   0    3
64K:               1252812   3   11    |    90929   1    5
128K:              2449225   6   17    |   168972   2    7
256K:              4464264  11   29    |   288737   3   11
512K:             24846133  65   94    |  5997373  78   90
1M:                1951161   5  100    |   747214   9  100

I have 2 questions:
1. Could anyone explain what these parameters mean exactly?
   /sys/block/sda/queue/max_sectors_kb, /sys/block/sda/queue/max_hw_sectors_kb,
   and the "disk fragmented I/Os" and "disk I/O size" entries of brw_stats
2. In which cases will disk I/O be fragmented?

Thanks a lot in advance!

Best Regards
Lu Wang
Lu Wang wrote:
> It seems the ideal is for 100% of disk I/Os to have a fragment count of "1" or
> "0". I don't know why the I/Os are fragmented, since max_sectors_kb looks
> large enough for the biggest disk I/O size (1MB, according to brw_stats):
>
> # cat /sys/block/sda/queue/max_sectors_kb
> 32767
> # cat /sys/block/sda/queue/max_hw_sectors_kb
> 32767

So the drive is allowed 32MB I/Os. Below it is clear that you are seeing
fragmentation, so the question becomes: why are the I/Os being broken up? It is
very unlikely that Lustre is breaking up the I/O willingly, so most likely
something in the I/O stack is restricting the I/O sizes.

What are you using for an OST, and what controller/driver/driver version?

What version of Lustre, and what version of Linux, are you using on the OSS node?

>                           read         |      write
> pages per bulk r/w    rpcs   % cum %   |   rpcs   % cum %
> 128:               4976083  20   45    |   198864   4   13
> 256:              13457144  54  100    |  3522333  86  100

So the clients are doing 1MB RPCs to the server (which is good).

>                           read         |      write
> disk fragmented I/Os   ios   % cum %   |    ios   % cum %
> 0:                    9821   0    0    |        0   0    0
> 1:                11933478  48   48    |   630964  15   15
> 2:                12726392  51   99    |  3350479  82   97
> 3:                  155476   0   99    |    84465   2   99

But most of your write I/Os, and about half of your reads, are being broken in
two.

>                           read         |      write
> disk I/Os in flight    ios   % cum %   |    ios   % cum %
> 1:                10954265  28   28    |  3781021  49   49
> 2:                 9217023  24   53    |  3329128  43   93
> 3:                 6063548  15   69    |   272981   3   97

This is really bad -- it seems that it only ever issues one write at a time to
the disks. Lustre would normally issue up to 31, so there may be something
about your disk or driver preventing multiple outstanding I/Os.

>                           read         |      write
> disk I/O size          ios   % cum %   |    ios   % cum %
> 256K:              4464264  11   29    |   288737   3   11
> 512K:             24846133  65   94    |  5997373  78   90
> 1M:                1951161   5  100    |   747214   9  100

Basically this is saying that nearly all 1MB I/Os are being broken into 512KB
pieces.

> I have 2 questions:
> 1. Could anyone explain what these parameters mean exactly?
>    /sys/block/sda/queue/max_sectors_kb, /sys/block/sda/queue/max_hw_sectors_kb

How large an I/O is allowed to be sent to the disk, and how large an I/O the
disk drive supports.

>    and the "disk fragmented I/Os" and "disk I/O size" entries of brw_stats

How many pieces each ldiskfs write is broken into, and the size of those pieces.

> 2. In which cases will disk I/O be fragmented?
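For reference, a quick way to gather the same numbers on an OSS is sketched below. The sysfs and /proc paths are the ones already quoted in this thread; the loop and the reset hint are illustrative only, so adjust device and OST names to the local setup.

# Sketch only -- device and OST names are examples; adjust as needed.

# Block-layer request size limits, in KB, for every sd* device:
#   max_hw_sectors_kb = ceiling imposed by the driver/HBA
#   max_sectors_kb    = currently allowed request size (tunable, <= the ceiling)
for dev in /sys/block/sd*; do
    echo "$dev: $(cat $dev/queue/max_sectors_kb) / $(cat $dev/queue/max_hw_sectors_kb) KB"
done

# Per-OST brw_stats (the tables discussed above):
cat /proc/fs/lustre/obdfilter/*/brw_stats

# On the Lustre versions I have seen, writing to the file clears the counters,
# which makes it easier to capture a sample for a known workload (verify on
# your version before relying on it):
# echo 0 > /proc/fs/lustre/obdfilter/<fsname>-OST0000/brw_stats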
We are using Lustre 1.8.1.1 on kernel 2.6.18-128.7.1.el5. The disk controller is
a NetStor_iSUM510, and the driver is qla2xxx (8.02.00.06.05.03-k).

We made 2 partitions on each disk volume:

/dev/sda1            4325574520 3911425648 194422312  96% /lustre/ost1
/dev/sda2            4324980788 3898888124 206396204  95% /lustre/ost2
/dev/sdb1            4325574520 3909042320 196805640  96% /lustre/ost3
/dev/sdb2            4324980788 3920306524 184977804  96% /lustre/ost4
/dev/sdc1            4325574520 3868328108 237519852  95% /lustre/ost5
/dev/sdc2            4324980788 3921774384 183509944  96% /lustre/ost6
/dev/sdd1            4325574520 3911662272 194185688  96% /lustre/ost7
/dev/sdd2            4324980788 3884415428 220868900  95% /lustre/ost8

because Lustre cannot support an OST larger than 8TB.

------------------
Lu Wang
2010-04-01

-------------------------------------------------------------
From: Kevin Van Maren
Date: 2010-03-31 19:56:59
To: Lu Wang
Cc: lustre-discuss
Subject: Re: [Lustre-discuss] disk fragmented I/Os

> What are you using for an OST, and what controller/driver/driver version?
>
> What version of Lustre, and what version of Linux, are you using on the OSS node?
Thanks for the "df" output -- at 96% full, the problem is likely that Lustre is
fragmenting the I/O because it cannot allocate contiguous space on the OST. If
possible, free up a bunch of space on the OSTs (i.e., delete old large files)
and see if it improves.

It is still not clear to me why you don't have more outstanding I/Os to the
disk.

8 TiB devices would only have been 1% less capacity than you have with two
partitions. Possibly related: did you ensure the partitions were aligned on the
RAID boundary of the underlying device? RAID alignment is the main reason it is
recommended not to use drive partitions with Lustre. "sfdisk -uS -l /dev/sda"
will show the actual start.

Kevin

Lu Wang wrote:
> We made 2 partitions on each disk volume [...]
> because Lustre cannot support an OST larger than 8TB.
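If it helps, a rough way to do that alignment check is sketched below: take the start sector reported for each partition and see whether it is a multiple of the array's full-stripe size. STRIPE_KB and START are placeholders, not values from this thread -- substitute the real chunk size times the number of data disks for the NetStor volume, and the real start sector.

# Sketch, not a recipe: STRIPE_KB is a placeholder for the real full-stripe
# size (chunk size * number of data disks) of the underlying RAID volume.
STRIPE_KB=512
STRIPE_SECTORS=$((STRIPE_KB * 2))    # 512-byte sectors per full stripe

# Start sectors of the partitions:
sfdisk -uS -l /dev/sda

# A partition is stripe-aligned when its start sector leaves no remainder:
START=34                              # example only; substitute the real start
echo $(( START % STRIPE_SECTORS ))    # 0 => aligned; anything else => misaligned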
Hi,

We partitioned the device /dev/sda with the "parted" command, using a GUID
Partition Table, so the sfdisk result is not correct:

[root@boss32 ~]# sfdisk -uS -l /dev/sda

WARNING: GPT (GUID Partition Table) detected on '/dev/sda'! The util sfdisk
doesn't support GPT. Use GNU Parted.

Disk /dev/sda: 1094112 cylinders, 255 heads, 63 sectors/track
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start        End    #sectors  Id  System
/dev/sda1             1  4294967295  4294967295  ee  EFI GPT
                start: (c,h,s) expected (0,0,2) found (0,0,1)
/dev/sda2             0           -           0   0  Empty
/dev/sda3             0           -           0   0  Empty
/dev/sda4             0           -           0   0  Empty

With "parted" you can see the real layout:

[root@boss32 ~]# parted /dev/sda
GNU Parted 1.8.1
Using /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) p

Model: TOYOU NetStor_iSUM510 (scsi)
Disk /dev/sda: 8999GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start   End     Size    File system  Name  Flags
 1      17.4kB  4500GB  4500GB  ext3
 2      4500GB  8999GB  4499GB  ext3

(parted)

------------------
Lu Wang
2010-04-01

-------------------------------------------------------------
From: Kevin Van Maren
Date: 2010-04-01 11:09:25
To: Lu Wang
Cc: lustre-discuss
Subject: Re: [Lustre-discuss] Re: Re: disk fragmented I/Os

> Possibly related: did you ensure the partitions were aligned on the RAID
> boundary of the underlying device? RAID alignment is the main reason it is
> recommended not to use drive partitions with Lustre. "sfdisk -uS -l /dev/sda"
> will show the actual start.
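Since sfdisk cannot read a GPT label, parted itself can print the table in sector units, which is what the alignment check needs; the command below is standard GNU parted run non-interactively against the device from this thread. For what it is worth, the 17.4kB start shown above appears to correspond to sector 34 (34 x 512 bytes), which is unlikely to fall on a RAID stripe boundary.

# Print the partition table in 512-byte sectors (non-interactive):
parted /dev/sda unit s print

# The Start column can then be checked against the full-stripe size exactly as
# before:  start_sector % (stripe_kb * 2) == 0  means aligned.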
Hi,

We set up a test file system with the same partitioning and the same hardware.
When the file system is empty, the disk I/O is much less fragmented. However,
the number of disk I/Os in flight is still low (mostly "1"). Is there any way
to increase this value through configuration?

# cat /proc/fs/lustre/obdfilter/testfs-OST0000/brw_stats
snapshot_time:         1270229935.155857 (secs.usecs)

                          read         |      write
pages per bulk r/w    rpcs   % cum %   |   rpcs   % cum %
256:                     0   0    0    |     2358 100  100

                          read         |      write
discontiguous pages   rpcs   % cum %   |   rpcs   % cum %
0:                       0   0    0    |     2358 100  100

                          read         |      write
discontiguous blocks  rpcs   % cum %   |   rpcs   % cum %
0:                       0   0    0    |     2358 100  100

                          read         |      write
disk fragmented I/Os   ios   % cum %   |    ios   % cum %
1:                       0   0    0    |     2289  97   97
2:                       0   0    0    |       69   2  100

                          read         |      write
disk I/Os in flight    ios   % cum %   |    ios   % cum %
1:                       0   0    0    |     2222  91   91
2:                       0   0    0    |      180   7   98
3:                       0   0    0    |       14   0   99
4:                       0   0    0    |        4   0   99
5:                       0   0    0    |        3   0   99
6:                       0   0    0    |        2   0   99
7:                       0   0    0    |        1   0   99
8:                       0   0    0    |        1   0  100

                          read         |      write
I/O time (1/1000s)     ios   % cum %   |    ios   % cum %
4:                       0   0    0    |     2195  93   93
8:                       0   0    0    |      150   6   99
16:                      0   0    0    |        9   0   99
32:                      0   0    0    |        4   0  100

                          read         |      write
disk I/O size          ios   % cum %   |    ios   % cum %
8K:                      0   0    0    |        1   0    0
16K:                     0   0    0    |        1   0    0
32K:                     0   0    0    |        0   0    0
64K:                     0   0    0    |        1   0    0
128K:                    0   0    0    |        4   0    0
256K:                    0   0    0    |       12   0    0
512K:                    0   0    0    |       58   2    3
1M:                      0   0    0    |     2350  96  100

------------------
Lu Wang
2010-04-02

-------------------------------------------------------------
From: Kevin Van Maren
Date: 2010-04-01 11:09:25
To: Lu Wang
Cc: lustre-discuss
Subject: Re: [Lustre-discuss] Re: Re: disk fragmented I/Os

> It is still not clear to me why you don't have more outstanding I/Os to the
> disk.
On Fri, 2010-04-02 at 17:45 +0800, Lu Wang wrote:
> Hi,
> We set up a test file system with the same partitioning and the same hardware.
> When the file system is empty, the disk I/O is much less fragmented.

So I think you now have confirmation as to what's causing the disk I/O
fragmentation problem on your production system, yes?

> However, the number of disk I/Os in flight is still low (mostly "1"). Is there
> any way to increase this value through configuration?

I think you are chasing a red herring. The number of disk I/Os in flight is
only an indicator as to what could be wrong when other things are not working
correctly. But as you can see from your brw_stats, the only items that are not
absolutely *perfect* are that 2% of your disk I/Os were fragmented:

>                           read         |      write
> disk fragmented I/Os   ios   % cum %   |    ios   % cum %
> 1:                       0   0    0    |     2289  97   97
> 2:                       0   0    0    |       69   2  100

and 4% of your disk I/Os were not a full 1M:

>                           read         |      write
> disk I/O size          ios   % cum %   |    ios   % cum %
> 8K:                      0   0    0    |        1   0    0
> 16K:                     0   0    0    |        1   0    0
> 32K:                     0   0    0    |        0   0    0
> 64K:                     0   0    0    |        1   0    0
> 128K:                    0   0    0    |        4   0    0
> 256K:                    0   0    0    |       12   0    0
> 512K:                    0   0    0    |       58   2    3
> 1M:                      0   0    0    |     2350  96  100

I would think those two small deviations from perfect are within the realm of
acceptable, yes? If you agree, then you need to stop chasing the disk I/Os in
flight. One likely explanation is simply that the disk(s) are able to drain the
I/Os in flight as fast as the OST(s) are able to push them -- which is good!

Cheers,
b.
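For completeness, if you still want to rule out a device-side limit on outstanding I/Os, these are the usual places to look. The paths below are standard Linux sysfs for a SCSI/FC LUN; the value in the last line is only an example, and whether queue_depth is writable depends on the driver.

cat /sys/block/sda/queue/nr_requests       # block-layer request queue depth
cat /sys/block/sda/device/queue_depth      # per-LUN queue depth (SCSI/FC, e.g. qla2xxx)
cat /sys/block/sda/queue/scheduler         # I/O scheduler in use

# If the LUN queue depth were found to be 1, raising it (where the driver
# allows it) would look like:
# echo 32 > /sys/block/sda/device/queue_depth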
Thanks for your explanation. I am now much clearer about what these parameters
mean.

I have been monitoring the /proc I/O tunables these days. On the Lustre clients
there is a /proc entry called "offset_stats". According to my observation, this
stat refreshes itself every few seconds -- no matter how long you watch
"offset_stats", the result is always about the same size (a few hundred lines).
Example output from our application is below:

cat /proc/fs/lustre/llite/bes3fs-f7f34600/offset_stats | sort

R/W  PID    RANGE START   RANGE END    SMALLEST EXTENT  LARGEST EXTENT  OFFSET
snapshot_time: 1270536172.926198 (secs.usecs)
R 18245 1235222528 1236271104 1048576 1048576 -2097152
R 18245 1235222528 1237319680 2097152 2097152 -3145728
R 18245 1235222528 1238368256 3145728 3145728 -4194304
R 18245 1235222528 1239416832 4194304 4194304 -5242880
R 18245 1235222528 1240465408 5242880 5242880 -6291456
R 18245 1241513984 1243611136 2097152 2097152 5242880
R 18245 4207371795 4207935488 563693 563693 0
R 31286 2344747242 2345359684 612442 612442 0
R 31286 731906048 732954624 1048576 1048576 -2097152
R 31286 731906048 734003200 2097152 2097152 -3145728
R 31286 731906048 735051776 3145728 3145728 -4194304
R 31286 731906048 736100352 4194304 4194304 -5242880
R 31286 731906048 737148928 5242880 5242880 -6291456
R 31286 738197504 740294656 2097152 2097152 5242880
W 17041 428613209 428867584 254375 254375 0
W 17041 4544959606 4545039667 80061 80061 -617354
W 17041 4546233710 4546421474 187764 187764 -391826
W 17041 4546233710 4546625536 391826 391826 0
W 17041 4546813300 4547362320 549020 549020 -860812
W 17041 4546813300 4547674112 860812 860812 391826
W 17041 4548223132 4548428237 205105 205105 -499556
W 17041 4548223132 4548722688 499556 499556 860812
W 17041 4548927793 4549517473 589680 589680 499556
W 17621 4713119986 4713589595 469609 469609 -229134
W 17621 4713818729 4713840455 21726 21726 -578967
W 17621 4713818729 4714397696 578967 578967 229134
W 17621 4714419422 4715446272 92387 934463 578967
W 17621 4715353885 4715850698 496813 496813 -92387
W 17621 4715943085 4716093291 150206 150206 -551763
W 17621 4715943085 4716494848 551763 551763 92387
W 18245 1235222528 1241513984 6291456 6291456 0
W 18245 4202565256 4203013298 448042 448042 -127352
W 18245 4203140650 4203741184 10424 590110 127352
W 18245 4203730760 4204662379 931619 931619 -10424
W 18245 4204672803 4204789760 116957 116957 10424
W 18245 4204672803 4205255964 583161 583161 -116957
W 18245 4205372921 4205838336 465415 465415 116957
W 18245 4205372921 4206327006 954085 954085 -465415
W 18245 4206792421 4206886912 94491 94491 465415
W 18245 4207371795 4207548337 176542 176542 -1048576
W 18245 4207371795 4208420371 1048576 1048576 -563693
W 18245 4209160606 4209746813 586207 586207 1612269
W 30800 2361474386 2361934283 459897 459897 -967342
W 30800 2362901625 2363490304 588679 588679 967342
W 30800 2362901625 2363743699 842074 842074 -588679
W 30800 2364332378 2364538880 206502 206502 588679
W 30800 2364332378 2364835485 503107 503107 -206502
W 30800 2365041987 2365104344 62357 62357 -545469
W 30800 2365041987 2365587456 545469 545469 206502
W 30800 2365649813 2366636032 359449 626770 545469
W 30800 2366276583 2366628031 351448 351448 -359449
W 30854 1589868539 1589981263 112724 112724 -821253
W 30854 1590802516 1591738368 337997 597855 821253
W 30854 1591400371 1591451153 50782 50782 -1048576
W 30854 1591400371 1592448947 1048576 1048576 -337997
W 30854 1592837726 1593835520 415850 581944 1386573
W 30854 1593419670 1593714126 294456 294456 -415850
W 30854 1594129976 1594204276 74300 74300 -1048576
W 30854 1594129976 1594884096 754120 754120 415850
W 30854 1594129976 1595178552 1048576 1048576 -754120
W 30854 1596006972 1596058929 51957 51957 -1048576
W 30854 1596006972 1596981248 974276 974276 1802696
W 30854 1596006972 1597055548 1048576 1048576 -974276
W 30854 1598081781 1598542226 460445 460445 -996619
W 30854 1598081781 1599078400 996619 996619 2022852
W 30916 1838846205 1839110629 264424 264424 -356099
W 30916 1839466728 1839610532 143804 143804 -784152
W 30916 1839466728 1840250880 784152 784152 356099
W 30916 1840394684 1840925257 530573 530573 -904772
W 30916 1840394684 1841299456 904772 904772 784152
W 30916 1841830029 1841946671 116642 116642 -518003
W 30916 1841830029 1842348032 518003 518003 904772
W 30916 1842464674 1843396608 213116 718818 518003
W 30916 1843183492 1843432408 248916 248916 -213116
W 30916 1843645524 1844279233 633709 633709 213116
W 31286 2241375450 2241836184 460734 460734 -136642272
W 31286 2345359680 2345649724 290044 290044 -304832
W 31286 2345359680 2345664512 304832 304832 -4
W 31286 2345954552 2346666941 712389 712389 304828
W 31286 2346666937 2346713088 46151 46151 -4
W 31286 2346666937 2347551880 884943 884943 -46151
W 31286 2347598027 2347761664 163637 163637 -31816608
W 31286 2347598027 2348055783 457756 457756 -163637
W 31286 2377380988 2378017722 636734 636734 -789380
W 31286 2377380988 2378170368 789380 789380 29829108
W 31286 2378807102 2379218944 411842 411842 136970918
W 31286 2378807102 2379414635 607533 607533 -411842
W 31286 2379826477 2379975747 149270 149270 -441043
W 31286 2379826477 2380267520 441043 441043 31770694
W 31286 2380416790 2381128603 711813 711813 441043
W 31286 731906048 738197504 6291456 6291456 0
W 32016 183598180 184549376 951196 951196 0
W 32016 2168397607 2168937339 539732 539732 -57561
W 32016 2168994896 2169503744 508848 508848 57557
W 32016 2168994896 2169906334 911438 911438 -508848
W 32016 2201732201 2202009600 277399 277399 31825867
W 32016 2201732201 2202165235 433034 433034 -277399
W 32016 2203037998 2203058176 20178 20178 0
W 32016 2203037998 2203802011 764013 764013 -20178
W 32016 2203822189 2204106752 284563 284563 20178
W 32016 2203822189 2204136741 314552 314552 -284563
W 32016 2204421304 2205135835 714531 714531 -734024
W 32016 2204421304 2205155328 734024 734024 284563
W 32016 2205869859 2206203904 334045 334045 734024
W 32016 2205869859 2206468833 598974 598974 -334045
W 32016 2206802878 2206963345 160467 160467 -449602
W 32016 2206802878 2207252480 449602 449602 334045
W 32016 2207412947 2208126753 713806 713806 449602
W 5313 0 63 63 63 -13977
W 761 0 63 63 63 -13976

Does that mean our applications seek a lot when they are writing? I think RANGE
means the position within a given file, and EXTENT is the size of each
sequential I/O. Since (SMALLEST EXTENT) = (LARGEST EXTENT) = (RANGE END - RANGE
START) for most records, does that mean the application seeks every time after
it writes something to a file?

----------------
Lu Wang
2010-04-06

-------------------------------------------------------------
From: Brian J. Murrell
Date: 2010-04-05 21:01:33
To: lustre-discuss
Subject: Re: [Lustre-discuss] disk fragmented I/Os

> I think you are chasing a red herring. The number of disk I/Os in flight is
> only an indicator as to what could be wrong when other things are not working
> correctly.
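One way to get a rough feel for that from the data itself is sketched below: it tallies, per process, how many offset_stats records carry a non-zero OFFSET. It only counts the 7-field data lines, takes the column layout from the header above, and treats OFFSET as the gap to the previous I/O, which is a reading of the output rather than a documented definition; the llite instance name is the one from this thread.

awk 'NF == 7 && ($1 == "R" || $1 == "W") {
         total[$1 " pid " $2]++
         if ($7 != 0) nonzero[$1 " pid " $2]++
     }
     END {
         for (k in total)
             printf "%s: %d records, %d with non-zero offset\n",
                    k, total[k], nonzero[k] + 0
     }' /proc/fs/lustre/llite/bes3fs-f7f34600/offset_stats

If most write records for a PID show a non-zero offset, that would support the interpretation that the application is not writing strictly sequentially.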