Hi,

We have been experiencing degraded performance for a few days on a fresh install of Lustre 1.8.5 (on RHEL5, using the Sun ext4 RPMs). The initial bulk load of the data is fine, but once the filesystem has been in use for a while, writes become very slow to individual OSTs. This blocks I/O for a few minutes and then carries on as normal. The slow writes then move to another OST. This can be seen in iostat, and many slow-IO messages appear in the logs (example included).

The OSTs are between 87 and 90% full. Not ideal, but that caused no issues running 1.6.7.2 on the same hardware.

The OSTs are RAID6 on external RAID chassis (Infortrend). Each OST is 5.4T (small). The server is dual AMD (4 cores), 16G RAM, QLogic FC HBA.

I mounted the OSTs as ldiskfs and tried a few write tests. These also show the same behaviour.

While a write operation is blocked there are hundreds of read tps and a very small kB/s read from the RAID, but no writes. As soon as this completes, writes go through at a more expected speed.

Any idea what is going on?

Many thanks,

James.
Example error messages:

Apr 20 04:53:04 oss5r-mgmt kernel: LustreError: dumping log to /tmp/lustre-log.1303271584.3935
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow quota init 286s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal start 39s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 39 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow brw_start 39s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 38 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal start 133s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 44 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow brw_start 133s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 44 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal start 236s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow i_mutex 40s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 2 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 6 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow i_mutex 277s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow direct_io 286s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 3 previous similar messages
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow journal start 285s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 1 previous similar message
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow commitrw commit 285s due to heavy IO load
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: Skipped 1 previous similar message
Apr 20 04:53:40 oss5r-mgmt kernel: Lustre: rho-OST0012: slow parent lock 236s due to heavy IO load
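For anyone trying to reproduce this, the pattern described above can be watched with standard tools. A sketch only; the device name and mount point are assumptions, not taken from James' setup:

```shell
# Per-device stats on the OSS; during a stall the affected OST's device
# shows hundreds of read requests/s with near-zero kB/s written.
iostat -x 2 /dev/sdb

# Per-OST fullness as seen from a Lustre client (mount point assumed):
lfs df -h /mnt/rho
```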
First guess is the increased memory pressure caused by the Lustre 1.8 read cache. Many times "slow" messages are caused by memory allocations taking a long time.

You could try disabling the read cache and see if that clears up the slow messages.

Kevin

On Apr 20, 2011, at 4:29 AM, James Rose <James.Rose at framestore.com> wrote:
> [...]
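For reference, disabling the OSS read cache as suggested would look something like the following, run on each OSS. This is a sketch based on the Lustre 1.8 obdfilter tunables; the parameter names are assumptions and may differ on other versions:

```shell
# Sketch only: Lustre 1.8 obdfilter cache tunables (names assumed).
# Takes effect immediately; no remount needed.
lctl set_param obdfilter.*.read_cache_enable=0

# Optionally also disable the write-through cache (an extra step,
# not part of the suggestion above):
lctl set_param obdfilter.*.writethrough_cache_enable=0

# Confirm the setting:
lctl get_param obdfilter.*.read_cache_enable
```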
Hi Kevin,

Thanks for the suggestion. I will try this out.

For the moment it seems that it may be disk-space related. I have removed some data from the file system, and performance returned to where I would expect it to be as space freed up (currently at 83% full). Since freeing space I have seen two messages on an OSS where the number of threads is tuned to the amount of RAM in the host, and six on an OSS that has the number of threads set higher than it should be. This is a much better situation than the steady stream I was experiencing last night. Maybe disabling the read cache will remove the last few.

I am still very curious what the rapid small reads seen while writing are, as this showed up while mounted as ldiskfs, so not doing regular Lustre operations at all.

Thanks again for your help,

James.

On Wed, 2011-04-20 at 08:48 -0300, Kevin Van Maren wrote:
> [...]
Yes, difficulty finding free disk space can also be a problem, but I could not recall big changes in how that worked since 1.6, other than memory pressure from the read cache pushing out the bitmaps. See http://jira.whamcloud.com/browse/LU-15

Kevin

James Rose wrote:
> [...]
It's an issue with ext[34]. Take a look at http://jira.whamcloud.com/browse/LU-15 for more details.

Jeremy

On Wed, Apr 20, 2011 at 8:28 AM, James Rose <James.Rose at framestore.com> wrote:
> [...]
I think we are in agreement here, but my point is that adding a read cache to Lustre (which is a feature I rely on heavily) isn't the fundamental problem; it just exposed it. The fundamental problem is ext[34] performance on larger disks.

Jeremy

On Wed, Apr 20, 2011 at 8:57 AM, Kevin Van Maren <kevin.van.maren at oracle.com> wrote:
> [...]
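As a rough illustration of the point about free-space search (a toy model, not mballoc or any real ext[34] code): as a device fills, an allocator must examine more block-group bitmaps before it finds a group with enough free blocks, and those bitmap reads would surface as exactly the burst of small reads seen on the ldiskfs mount. All numbers below are made up for the sketch.

```python
import random

def groups_scanned(num_groups, blocks_per_group, frac_full, need, seed=0):
    """Toy model of an extent allocator: scan block groups until one
    has at least `need` free blocks, and return how many group bitmaps
    had to be read.  Each bitmap read stands in for one small disk read."""
    rng = random.Random(seed)
    scanned = 0
    for _ in range(num_groups):
        scanned += 1
        # Each block is free with probability (1 - frac_full).
        free = sum(1 for _ in range(blocks_per_group)
                   if rng.random() > frac_full)
        if free >= need:
            break
    return scanned

# At 50% full a suitable group turns up almost immediately; at 90% full
# the same request can touch hundreds of bitmaps before succeeding.
print(groups_scanned(1000, 128, 0.50, need=32))
print(groups_scanned(1000, 128, 0.90, need=32))
```

At the 87-90% fullness reported earlier in the thread, this effect (tracked for ext[34] as LU-15) is consistent with the read-heavy stalls seen in iostat.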