Sorry, I had meant to cc this to the list.

Herbert
Hello!

So are there any other complaints on the OSS node when you mount that OST?
Did you try to run e2fsck on the OST disk itself (while unmounted)? I assume
one of the possible problems is simply on-disk filesystem corruption (and it
might show unhealthy due to that right after mount, too).

Bye,
Oleg

On Nov 18, 2010, at 1:47 PM, Herbert Fruchtl wrote:

> Sorry, I had meant to cc this to the list.
>
> Herbert
>
> From: Herbert Fruchtl <herbert.fruchtl at st-andrews.ac.uk>
> Date: November 18, 2010 12:56:53 PM EST
> To: Kevin Van Maren <Kevin.Van.Maren at oracle.com>
> Subject: Re: [Lustre-discuss] Broken client
>
> Hi Kevin,
>
> That didn't change anything. Unmounting one of the OSTs hung (yes, with an
> LBUG), and I did a hard reboot. It came up again, and the status is as
> before: on the MDT server I can see all files (well, I assume it's all); on
> the client in question some files appear broken. The OST is still "not
> healthy". I am running another lfsck, without much hope. Here's the LBUG:
>
> Nov 18 17:05:16 oss1-fs kernel: LustreError: 8125:0:(lprocfs_status.c:865:lprocfs_free_client_stats()) LBUG
>
> Herbert
>
> Kevin Van Maren wrote:
>> Reboot the server with the unhealthy OST.
>> If you look at the logs, there is likely an LBUG that is causing the problems.
>>
>> Kevin
>>
>> On Nov 18, 2010, at 9:51 AM, Herbert Fruchtl <herbert.fruchtl at st-andrews.ac.uk> wrote:
>>>> It looks like you may have corruption on the MDT or an OST, where the
>>>> objects on an OST can't be found for the directory entry. Have you
>>>> had a crash recently or run Lustre fsck? You might need to do fsck and
>>>> delete (unlink) the "broken" files.
>>>
>>> The files do exist (I can see them on the MDT server) and I don't want to
>>> delete them. There was a crash lately, and I have run an lfsck afterwards
>>> (repeatedly, actually).
>>>
>>>> I suppose it's also possible you're seeing fallout from an earlier LBUG
>>>> or something. Try 'cat /proc/fs/lustre/health_check' on all the servers.
>>>
>>> There seems to be a problem:
>>> [root at master ~]# cat /proc/fs/lustre/health_check
>>> healthy
>>> [root at master ~]# ssh oss1 'cat /proc/fs/lustre/health_check'
>>> device home-OST0005 reported unhealthy
>>> NOT HEALTHY
>>> [root at master ~]# ssh oss2 'cat /proc/fs/lustre/health_check'
>>> healthy
>>> [root at master ~]# ssh oss3 'cat /proc/fs/lustre/health_check'
>>> healthy
>>>
>>> What do I do about the unhealthy OST?
>>>
>>> Herbert
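To make Oleg's suggestion concrete, here is a minimal sketch of an offline
e2fsck pass on the OST's backing device. The device path, mount point, and
remount command below are placeholders and assumptions, not details from this
thread; substitute the real backing device for home-OST0005 on oss1.

  #!/bin/bash
  # Offline check of the ldiskfs filesystem backing an OST, along the lines
  # Oleg suggests. OST_DEV and OST_MNT are assumptions -- substitute the
  # actual device and mount point on the OSS.
  OST_DEV=/dev/sdb1
  OST_MNT=/mnt/lustre/ost0005

  umount "$OST_MNT"        # the OST must be offline while e2fsck runs

  # Read-only pass first: -f forces a full check, -n answers "no" to every
  # repair prompt, so it only reports problems without touching the disk.
  e2fsck -f -n "$OST_DEV"

  # If the report looks sane, run the repair for real (interactive prompts):
  # e2fsck -f "$OST_DEV"

  # Bring the OST back (Lustre server targets mount with type "lustre"):
  mount -t lustre "$OST_DEV" "$OST_MNT"

The read-only -n pass is safe to run any time the target is unmounted; only
the second, repairing run changes anything on disk.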
Thanks guys,

Looks like unmounting the "unhealthy" OST filesystem and running an fsck on
it (which found several errors) solved the problem! I still don't understand
why it looked different from different clients...

Cheers,

Herbert

Oleg Drokin wrote:
> Hello!
>
> So are there any other complaints on the OSS node when you mount that OST?
> Did you try to run e2fsck on the OST disk itself (while unmounted)? I assume
> one of the possible problems is simply on-disk filesystem corruption (and it
> might show unhealthy due to that right after mount, too).
>
> Bye,
> Oleg
--
Herbert Fruchtl
Senior Scientific Computing Officer
School of Chemistry, School of Mathematics and Statistics
University of St Andrews
--
The University of St Andrews is a charity registered in Scotland:
No SC013532
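For anyone finding this thread later: the health check Kevin pointed to is
easy to repeat across all servers after a repair. The node names below are
the ones from Herbert's transcript; substitute your own MGS/MDS/OSS hosts.

  #!/bin/bash
  # Poll /proc/fs/lustre/health_check on every server node. Node names are
  # taken from the transcript earlier in this thread -- adjust as needed.
  for node in master oss1 oss2 oss3; do
      echo "== ${node} =="
      ssh "${node}" 'cat /proc/fs/lustre/health_check'
  done

Any output other than "healthy" names the device that needs attention, as
with "device home-OST0005 reported unhealthy" above.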
Not sure. Could be some clients had data in their cache, and others hit the
error when they tried to get it from the OST. Sorry I misunderstood you -- I
thought you had already run fsck on the OSTs.

Kevin

On Nov 19, 2010, at 9:41 AM, Herbert Fruchtl <herbert.fruchtl at st-andrews.ac.uk> wrote:

> Thanks guys,
>
> Looks like unmounting the "unhealthy" OST filesystem and running an fsck on
> it (which found several errors) solved the problem! I still don't
> understand why it looked different from different clients...
>
> Cheers,
>
> Herbert
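Kevin's cache theory can be tested directly: flush a client's Lustre locks
and kernel caches so its next read has to go back to the OST. A sketch,
assuming a stock client; lru_size=clear is documented Lustre behaviour, but
verify it on your version before relying on it.

  #!/bin/bash
  # Run on a Lustre *client*: discard cached data so subsequent reads hit
  # the OSTs again -- useful to check whether two clients disagree because
  # one is serving stale cache.
  sync                                             # flush dirty pages first
  lctl set_param ldlm.namespaces.*.lru_size=clear  # drop client DLM locks (and the pages they cover)
  echo 3 > /proc/sys/vm/drop_caches                # drop kernel page/dentry/inode caches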
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss