Scott Barber
2010-Mar-26 01:19 UTC
[Lustre-discuss] filter_grant_incoming()) LBUG in 1.8.1.1
Background:
MDS and OSTs are all running CentOS 5.4 / x86_64 / 2.6.18-128.7.1.el5_lustre.1.8.1.1
2 types of clients
- CentOS 5.4 / x86_64 / 2.6.18-128.7.1.el5_lustre.1.8.1.1
- Ubuntu 8.04.1 / i686 / 2.6.22.19 patchless

A few days ago one of the OSSs hit an LBUG. The syslog looked like:
http://pastie.org/887643

I brought it back up by unmounting the OSTs, restarting the machine and remounting the OSTs. The OST was just fine after that, but this seemed to start a chain reaction with other OSSs. I'd run into the same LBUG and same call trace in the syslog on other OSSs. I kept bringing them back up again and an hour later it would happen again - interestingly never on the same OSS twice. It finally stopped when I unmounted the MDS/MGS, rebooted the MDS server and then remounted it again. We had no issues after that... until this afternoon :(

In researching the issue it looks as though it is bug #19338, which in turn is a duplicate of #20278. It looks as though that bug isn't slated for 1.8 at all. Am I reading that right? There's been no testing that I could tell of the patch on 1.8.x, so I'm leery of trying to patch my servers. Is there something else that I can do? Any more info you need?

Thanks for your help,
Scott Barber
Senior Systems Admin
iMemories.com
Cliff White
2010-Mar-26 05:59 UTC
[Lustre-discuss] filter_grant_incoming()) LBUG in 1.8.1.1
Scott Barber wrote:
> In researching the issue it looks as though it is bug #19338 which in
> turn is a duplicate of #20278. It looks as though that bug isn't
> slated for 1.8 at all. Am I reading that right? There's been no
> testing that I could tell of the patch on 1.8.x so I'm leery of trying
> to patch my servers. Is there something else that I can do? Any more
> info you need?

Hmm. Not sure why that fix was not landed for 1.8. Looks like we may have just missed it. :(
The correct fix is in 20278.
bugzilla.lustre.org/attachment.cgi?id=25139

We'll see about getting it tested/landed. It applies mostly okay to b1_8; further news when available.
cliffw
Cliff White
2010-Mar-26 08:29 UTC
[Lustre-discuss] filter_grant_incoming()) LBUG in 1.8.1.1
Scott Barber wrote:
> There's been no
> testing that I could tell of the patch on 1.8.x so I'm leery of trying
> to patch my servers. Is there something else that I can do? Any more
> info you need?

I've attached a 1.8.x version of the patch to 20278. It builds fine on rhel5. Further tests are in the queue, but are likely to take a while to run. I've also asked for landings/further inspection; you can follow progress in the bug.
cliffw
Scott Barber
2010-Mar-26 14:43 UTC
[Lustre-discuss] filter_grant_incoming()) LBUG in 1.8.1.1
Thanks for your help. I'll watch the bug.

On Fri, Mar 26, 2010 at 1:29 AM, Cliff White <Cliff.White at sun.com> wrote:
> I've attached a 1.8.x version of the patch to 20278. Builds fine on rhel5.
> Further tests are in the queue, but likely to be awhile running.
> I've also asked for landings/further inspection, you can follow progress in
> the bug.
> cliffw