Sorry, I had meant to cc this to the list.

Herbert
Hello!

So are there any other complaints on the OSS node when you mount that OST?
Did you try to run e2fsck on the OST disk itself (while unmounted)? I assume
one of the possible problems is simply on-disk filesystem corruption (and it
might show unhealthy due to that right after mount, too).

Bye,
Oleg

On Nov 18, 2010, at 1:47 PM, Herbert Fruchtl wrote:

> Sorry, I had meant to cc this to the list.
>
> Herbert
>
> From: Herbert Fruchtl <herbert.fruchtl at st-andrews.ac.uk>
> Date: November 18, 2010 12:56:53 PM EST
> To: Kevin Van Maren <Kevin.Van.Maren at oracle.com>
> Subject: Re: [Lustre-discuss] Broken client
>
> Hi Kevin,
>
> That didn't change anything. Unmounting one of the OSTs hung (yes, with an
> LBUG), and I did a hard reboot. It came up again, and the status is as
> before: on the MDT server I can see all files (well, I assume it's all); on
> the client in question some files appear broken. The OST is still "not
> healthy". I am running another lfsck, without much hope. Here's the LBUG:
>
> Nov 18 17:05:16 oss1-fs kernel: LustreError: 8125:0:(lprocfs_status.c:865:lprocfs_free_client_stats()) LBUG
>
> Herbert
>
> Kevin Van Maren wrote:
>> Reboot the server with the unhealthy OST.
>> If you look at the logs, there is likely an LBUG that is causing the problems.
>>
>> Kevin
>>
>> On Nov 18, 2010, at 9:51 AM, Herbert Fruchtl <herbert.fruchtl at st-andrews.ac.uk> wrote:
>>>> It looks like you may have corruption on the MDT or an OST, where the
>>>> objects on an OST can't be found for the directory entry. Have you
>>>> had a crash recently or run Lustre fsck? You might need to do fsck and
>>>> delete (unlink) the "broken" files.
>>>
>>> The files do exist (I can see them on the MDT server) and I don't want to
>>> delete them. There was a crash lately, and I have run an lfsck afterwards
>>> (repeatedly, actually).
>>>
>>>> I suppose it's also possible you're seeing fallout from an earlier LBUG
>>>> or something. Try 'cat /proc/fs/lustre/health_check' on all the servers.
>>>
>>> There seems to be a problem:
>>> [root at master ~]# cat /proc/fs/lustre/health_check
>>> healthy
>>> [root at master ~]# ssh oss1 'cat /proc/fs/lustre/health_check'
>>> device home-OST0005 reported unhealthy
>>> NOT HEALTHY
>>> [root at master ~]# ssh oss2 'cat /proc/fs/lustre/health_check'
>>> healthy
>>> [root at master ~]# ssh oss3 'cat /proc/fs/lustre/health_check'
>>> healthy
>>>
>>> What do I do about the unhealthy OST?
>>>
>>> Herbert
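To make Oleg's suggestion concrete, here is a minimal sketch of an offline
e2fsck pass on the OST's backing device. The device path, mount point, and
remount command below are placeholders and assumptions, not details from this
thread; substitute the real backing device for home-OST0005 on oss1.

  #!/bin/bash
  # Offline check of the ldiskfs filesystem backing an OST, along the lines
  # Oleg suggests. OST_DEV and OST_MNT are assumptions -- substitute the
  # actual device and mount point on the OSS.
  OST_DEV=/dev/sdb1
  OST_MNT=/mnt/lustre/ost0005

  umount "$OST_MNT"        # the OST must be offline while e2fsck runs

  # Read-only pass first: -f forces a full check, -n answers "no" to every
  # repair prompt, so it only reports problems without touching the disk.
  e2fsck -f -n "$OST_DEV"

  # If the report looks sane, run the repair for real (interactive prompts):
  # e2fsck -f "$OST_DEV"

  # Bring the OST back (Lustre server targets mount with type "lustre"):
  mount -t lustre "$OST_DEV" "$OST_MNT"

The read-only -n pass is safe to run any time the target is unmounted; only
the second, repairing run changes anything on disk.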
Thanks guys,

Looks like unmounting the "unhealthy" OST filesystem and running an fsck on
it (which found several errors) solved the problem! I still don't understand
why it looked different from different clients...

Cheers,

Herbert

Oleg Drokin wrote:
> Hello!
>
> So are there any other complaints on the OSS node when you mount that OST?
> Did you try to run e2fsck on the OST disk itself (while unmounted)? I assume
> one of the possible problems is simply on-disk filesystem corruption (and it
> might show unhealthy due to that right after mount, too).
>
> Bye,
> Oleg
--
Herbert Fruchtl
Senior Scientific Computing Officer
School of Chemistry, School of Mathematics and Statistics
University of St Andrews
--
The University of St Andrews is a charity registered in Scotland:
No SC013532
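For anyone finding this thread later: the health check Kevin pointed to is
easy to repeat across all servers after a repair. The node names below are
the ones from Herbert's transcript; substitute your own MGS/MDS/OSS hosts.

  #!/bin/bash
  # Poll /proc/fs/lustre/health_check on every server node. Node names are
  # taken from the transcript earlier in this thread -- adjust as needed.
  for node in master oss1 oss2 oss3; do
      echo "== ${node} =="
      ssh "${node}" 'cat /proc/fs/lustre/health_check'
  done

Any output other than "healthy" names the device that needs attention, as
with "device home-OST0005 reported unhealthy" above.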
Not sure. Could be some clients had data in their cache, and others hit the
error when they tried to get it from the OST. Sorry I misunderstood you -- I
thought you had already run fsck on the OSTs.

Kevin

On Nov 19, 2010, at 9:41 AM, Herbert Fruchtl <herbert.fruchtl at st-andrews.ac.uk> wrote:

> Thanks guys,
>
> Looks like unmounting the "unhealthy" OST filesystem and running an fsck on
> it (which found several errors) solved the problem! I still don't
> understand why it looked different from different clients...
>
> Cheers,
>
> Herbert
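Kevin's cache theory can be tested directly: flush a client's Lustre locks
and kernel caches so its next read has to go back to the OST. A sketch,
assuming a stock client; lru_size=clear is documented Lustre behaviour, but
verify it on your version before relying on it.

  #!/bin/bash
  # Run on a Lustre *client*: discard cached data so subsequent reads hit
  # the OSTs again -- useful to check whether two clients disagree because
  # one is serving stale cache.
  sync                                             # flush dirty pages first
  lctl set_param ldlm.namespaces.*.lru_size=clear  # drop client DLM locks (and the pages they cover)
  echo 3 > /proc/sys/vm/drop_caches                # drop kernel page/dentry/inode caches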
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss