Greetings all,

I'm looking for some advice on improving disk performance and understanding what Lustre is doing with it. Right now I have a ~28 TB OSS with 4 OSTs on it. There are 4 clients using the native Lustre client - no NFS. If I write to the Lustre volume from the clients I get odd behavior. Typically the writes have a long pause before any data starts hitting the disks. Then 2 or 3 of the clients will write happily but one or two will not. Eventually Lustre will pump out a number of I/O-related errors such as "slow i_mutex 165 seconds, slow direct_io 32 seconds" and so on. Next, the clients that couldn't write will catch up and pass the clients that could. At some point (5 minutes or so) the jobs start failing without any errors. New jobs can be started after these fail, and the pattern repeats. Write speeds are low, around 22 MB/sec per client; the disks shouldn't have any problem handling 4 writes at this speed! This did work using NFS.

When these disks were formatted with XFS, I/O was fast: no problem at all writing 475 MB/sec sustained per RAID controller (locally, not via NFS), and no delays. After configuring for Lustre, the peak sustained write (locally) is 230 MB/sec, and it will write for about 2 minutes before logging about slow I/O. This is without any clients connected.

So far I've done the following:

1. Recompiled the SCSI driver for the RAID controller to use 1 MB blocks (up from 256k).
2. Adjusted MDS and OST thread counts.
3. Tried all I/O schedulers.
4. Tried all possible settings on the RAID controllers for caching and read-ahead.
5. Some minor stuff I forgot about!

Nothing makes a difference - same results under each configuration except for the schedulers. With the deadline scheduler the writes fail faster, with delays around 30 seconds; with all the others the delays range from 100 to 500 seconds.

The system has 4 cores and 4 GB of memory with four 7 TB OSTs. The disks are in RAID 6 split between two controllers with 2 GB of cache each. One controller also hosts the MGS/MDT. top normally shows 2/3 to 3/4 of memory utilized and 25% CPU utilization.

Suggestions?

Thank you,
Dan
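For reference while following the steps above: on 2.6 kernels the I/O scheduler and block-layer read-ahead are runtime-switchable. A small sketch, with /dev/sdb standing in as a placeholder for one of the RAID LUNs:

  # List the available elevators; the active one is shown in brackets.
  cat /sys/block/sdb/queue/scheduler

  # Switch to deadline at runtime.
  echo deadline > /sys/block/sdb/queue/scheduler

  # Query and set block-layer read-ahead, counted in 512-byte sectors.
  blockdev --getra /dev/sdb
  blockdev --setra 8192 /dev/sdb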
Hello Dan,

On Saturday 19 January 2008 01:45:13 Dan wrote:
> [...]

We usually benchmark with ldiskfs first; to figure out what you should be getting, compare ldiskfs directly against xfs:

  mount -t ldiskfs -o mballoc,extents /dev/{device name} /{favorite mount}

Now benchmark it and compare it to xfs. You may also want to play with additional options such as "data=writeback".

It would also be helpful to know which Lustre version you are using. E.g. in lustre-1.4, mballoc and extents are not enabled by default, so it's almost pure ext3, which is terribly slow compared to xfs.

Cheers,
Bernd

--
Bernd Schubert
Q-Leap Networks GmbH
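A minimal sketch of the comparison Bernd suggests, assuming /dev/sdb is a scratchable OST device and /mnt/ost_test an empty directory (both placeholders):

  # Mount the OST backing filesystem directly, with the allocator
  # features Lustre would use.
  mkdir -p /mnt/ost_test
  mount -t ldiskfs -o mballoc,extents /dev/sdb /mnt/ost_test

  # Sequential 8 GB write; conv=fsync makes dd flush to disk before
  # it reports throughput.
  dd if=/dev/zero of=/mnt/ost_test/bench bs=1M count=8192 conv=fsync

  umount /mnt/ost_test

Running the same dd against the device formatted as xfs gives the baseline to compare against.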
On Jan 18, 2008 16:45 -0800, Dan wrote:
> [...]

Are you using Lustre 1.4 or 1.6? Are you mounting your OSTs with "-o extents,mballoc"? We've had Lustre OSS nodes running in excess of 2GB/s with h/w RAID controllers.

Are you using partitions on your RAID device? You shouldn't - that causes unaligned IO to the device and a needless read-modify-write for each RAID stripe.

Is your RAID geometry efficient with 1MB IOs (e.g. 4+1 or 8+1)? If not, then you should consider mounting your OSTs with "-o stripe={raid_stripe}", where raid_stripe = N * raid_chunksize, and N is the number of data disks (N+1 for RAID 5, N+2 for RAID 6).

You should download the lustre-iokit and use sgpdd-survey, obdfilter-survey, and PIOS to determine what is causing the performance bottleneck.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
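To make the stripe arithmetic concrete: an 8+2 RAID 6 with 128 kB chunks has a full stripe of 8 x 128 kB = 1 MB. One hedge on units: on many kernels the ext3/ldiskfs "stripe" mount option is counted in filesystem blocks (4 kB each), not kB, so a 1 MB full stripe would be stripe=256; verify this against your kernel's ext3 documentation. The device and mount point below are placeholders:

  # 8 data disks * 128 kB chunk = 1024 kB full stripe = 256 4-kB blocks.
  mount -t ldiskfs -o extents,mballoc,stripe=256 /dev/sdc /mnt/ost0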
Sorry for the long delay! I'm running Lustre 1.6.4.2.

I'm mounting with default options. When I use -o extents,mballoc it mounts and then the volume hangs. I tried to check it out with ldiskfs but had no luck; I had to reboot the machine (hard boot at that) to get the devices back. Judging by the logs, it mounts with mballoc by default.

I'm not using partitions on the RAID devices. I have two RAID controllers in the system, and all disks on each are grouped into a single RAID 6. The first controller has three volumes: one for the MGS/MDT and two OSTs. The other has just two OSTs.

I attempted using -o stripe={raid_stripe=N*raid_chunksize} but no luck: when mounting the OSTs with the stripe option they hang and never mount. I've tried a couple of stripe sizes. I was a little uncertain of the stripe size calculation, so here we go: my chunk size is 128k and there are 23 disks in the RAID 6 (one hot spare leaves 23). That means 21 data disks? Judging by your formula I take 23 * 128k, which is 2944. Is this even close to what you intended? This stripe size hangs at mount...

I've tried to test with the lustre-iokit but the tests (writes) fail on most OSTs - that is the problem I'm having, after all... frustrating.

Would it make sense to reconfigure the RAID controllers to have separate groups of disks in RAID 6? For performance, is there a recommended maximum size or number of disks for each OST? Lastly, is it worthwhile to consider putting the ext3 journal on another device exported from the RAID controller?

Thank you!!
Dan

> On Jan 18, 2008 16:45 -0800, Dan wrote:
> [...]
On Jan 30, 2008 18:32 -0800, Dan wrote:
> I was a little uncertain of the stripe size calculation, so here we go:
> my chunk size is 128k and there are 23 disks in the RAID 6 (one hot
> spare leaves 23). That means 21 data disks? Judging by your formula I
> take 23 * 128k, which is 2944. Is this even close to what you intended?
> This stripe size hangs at mount...

Hmm, I don't think the mballoc code can efficiently deal with a stripe size larger than the RPC size (which is 1MB), because this will always result in a read-modify-write of the RAID stripe - not enough data can be collected to fill a full stripe.

> I've tried to test with the lustre-iokit but the tests (writes) fail
> on most OSTs - that is the problem I'm having, after all... frustrating.
>
> Would it make sense to reconfigure the RAID controllers to have
> separate groups of disks in RAID 6? For performance, is there a
> recommended maximum size or number of disks for each OST? Lastly, is
> it worthwhile to consider putting the ext3 journal on another device
> exported from the RAID controller?

Having 21 disks in the RAID set is probably too large to be practical because of the high overhead of doing IO of such a large size. Good configurations for such a system might be 2x 8+2 + spare = 21 disks with a 128kB chunk size, or 16+2 + spare = 19 disks with a 64kB chunk size. Both result in a 1MB full stripe, which is what mballoc and Lustre are optimized for by default.

> [...]
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
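A throwaway sketch for sanity-checking candidate layouts against that 1 MB target (the layouts listed are just examples):

  # full_stripe_kB = data_disks * chunk_kB; aim for 1024 to match
  # Lustre's 1 MB RPCs.
  for layout in "8 128" "16 64" "21 128"; do
      set -- $layout
      echo "RAID 6 ${1}+2, ${2} kB chunks -> full stripe $(($1 * $2)) kB"
  done

The 21-data-disk layout overshoots to 2688 kB, which is why every 1 MB RPC turns into a read-modify-write of the RAID stripe.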
Thanks Andreas. I'll reconfigure the RAID and give it another shot today. Would it be reasonable to attribute the stalled writes to this I/O mismatch?

Dan

On Thu, 2008-01-31 at 01:40 -0700, Andreas Dilger wrote:
> [...]
On Jan 31, 2008 08:25 -0800, Dan wrote:
> Thanks Andreas. I'll reconfigure the RAID and give it another shot
> today. Would it be reasonable to attribute the stalled writes to this
> I/O mismatch?

It would definitely hurt performance... Also, placing the MDT on the same RAID 6 is not very desirable. Given that you now have a few spare disks in the system, I'd also recommend a separate RAID 0+1 for the MDT device.

> [...]
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
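Since the lustre-iokit comes up twice in this thread, an obdfilter-survey run of that era looked roughly like the line below; the variable names are recalled from the iokit documentation and should be checked against the header of the installed script:

  # Local-disk survey: drives the OSTs through the obdfilter layer,
  # taking clients and the network out of the picture.
  nobjhi=2 thrhi=32 size=1024 case=disk sh obdfilter-survey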