I am having jobs on a cluster client crash. The job creates a small
text file (using cp) and then immediately tries to use it with another
application. The application fails saying the file doesn't exist.

In the client /var/log/messages, I'm seeing

Sep  4 15:58:17 clus039 kernel: LustreError: 15249:0:(file.c:2930:ll_inode_revalidate_fini()) failure -2 inode 75792903

which, I'm led to believe, is never meant to occur :)

Any ideas?

# uname -a
Linux clus039 2.6.18-92.1.26.el5_lustre.1.6.7.2smp #1 SMP Wed May 27 19:06:26 MDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Sep  4 15:54:06 clus039 kernel: LustreError: 14982:0:(file.c:2930:ll_inode_revalidate_fini()) failure -2 inode 75793756
Sep  4 15:58:17 clus039 kernel: LustreError: 15249:0:(file.c:2930:ll_inode_revalidate_fini()) failure -2 inode 75792903

Nothing on the MDS or OSS.

--
Dr Stuart Midgley
sdm900 at gmail.com
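For context, the failing sequence boils down to roughly the following sketch; the paths, file name, and application are hypothetical placeholders, and failure -2 corresponds to -ENOENT ("No such file or directory"):

    # Hypothetical reconstruction of the failing job step; all names are placeholders.
    cp template.txt /lustre/work/job/input.txt    # create a small text file on Lustre
    myapp /lustre/work/job/input.txt              # immediately consume it; intermittently
                                                  # fails with "file doesn't exist" (ENOENT)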
Evening

The file was created on the same node it was accessed from.

The error isn't permanent. When the job crashed, I went and started
investigating, and the file was fine.

No, the file is never unlinked.

How do I go about getting a lustre log?

--
Dr Stuart Midgley
sdm900 at gmail.com


On 04/09/2009, at 11:28 PM, Oleg Drokin wrote:

> Hello!
>
> On Sep 4, 2009, at 5:35 AM, Stu Midgley wrote:
>
>> I am having jobs on a cluster client crash. The job creates a small
>> text file (using cp) and then immediately tries to use it with
>> another application. The application fails saying the file doesn't
>> exist.
>
> That's quite strange for such a sequence of actions.
> Is the file created on one node and accessed on another?
> How permanent is the error? (i.e. does it still happen when you
> later access the file again?)
> Is the file unlinked at any time? Could there be a race with unlink,
> by any chance?
>
>> In the client /var/log/messages, I'm seeing
>> Sep  4 15:58:17 clus039 kernel: LustreError:
>> 15249:0:(file.c:2930:ll_inode_revalidate_fini()) failure -2 inode
>> 75792903
>
> There is bug 16377 about this same message, though it is not clear
> what happened there.
> Perhaps you can gather -1 lustre logs from the MDS, from the client
> that creates the file, and from the client that accesses it and gets
> the error, and attach those to bug 16377?
>
> Bye,
>    Oleg
I'm sorry Oleg, but I suspect I will never be able to run this test.

* I don't have a reproducer. At the time I had this problem, I
  started about 200 jobs simultaneously and about 50 failed with this
  problem. I reran those jobs and they worked just fine.

* I will never get a chance to make the FS quiet. We have way too
  much production work on.

If I do get time to fiddle about and reproduce this problem, I'll
create a bug.

--
Dr Stuart Midgley
sdm900 at gmail.com


On 04/09/2009, at 11:46 PM, Oleg Drokin wrote:

> Hello!
>
> On Sep 4, 2009, at 11:31 AM, Stuart Midgley wrote:
>
>> The file was created on the same node it was accessed from.
>
> Hm, interesting.
>
>> The error isn't permanent. When the job crashed, I went and
>> started investigating and the file was fine.
>
> I think I remember a bug like this that shadow(@sun.com) worked on.
> Turned out it is bug 17545, which has somewhat different symptoms,
> though.
>
>> No, the file is never unlinked.
>> How do I go about getting a lustre log?
>
> Make the system (MDS-wise) as idle as possible (ideally only the
> node with problems should be doing anything on lustre).
> On the MDS and a client, do a cat /proc/sys/lnet/debug and remember
> the value.
> echo -1 > /proc/sys/lnet/debug on both the MDS and the client.
> lctl dk > /dev/null
> Run your reproducer, and immediately after the error happens do
> lctl dk > /tmp/lustre.log on both the MDS and client nodes.
> Then restore the /proc/sys/lnet/debug values on the nodes back to
> what they were.
>
> Thanks.
>
> Bye,
>    Oleg
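For anyone following along, Oleg's log-gathering steps amount to roughly the following, run on both the MDS and the client (a sketch; /tmp/lustre.log is just the example path from his mail):

    # Sketch of the debug-log procedure described above.
    old=$(cat /proc/sys/lnet/debug)     # remember the current debug mask
    echo -1 > /proc/sys/lnet/debug      # enable full (-1) debug logging
    lctl dk > /dev/null                 # flush the existing debug buffer
    # ... run the reproducer; immediately after the error occurs:
    lctl dk > /tmp/lustre.log           # dump the debug log to attach to bug 16377
    echo "$old" > /proc/sys/lnet/debug  # restore the original debug mask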
Further to my previous information (a colleague prompted me to add
this): the file was being created in a new directory, and the parent
of that directory would have had a few hundred directories created in
it "simultaneously". That is, the first thing my job does on startup
is create a temporary working directory, and then create this
temporary working file within that directory. A sketch of this
startup pattern follows at the end of this mail.

On Fri, Sep 4, 2009 at 11:31 PM, Stuart Midgley <sdm900 at gmail.com> wrote:

> Evening
>
> The file was created on the same node it was accessed from.
>
> The error isn't permanent. When the job crashed, I went and started
> investigating, and the file was fine.
>
> No, the file is never unlinked.
>
> How do I go about getting a lustre log?
>
>
> --
> Dr Stuart Midgley
> sdm900 at gmail.com
>
>
> On 04/09/2009, at 11:28 PM, Oleg Drokin wrote:
>
>> Hello!
>>
>> On Sep 4, 2009, at 5:35 AM, Stu Midgley wrote:
>>
>>> I am having jobs on a cluster client crash. The job creates a small
>>> text file (using cp) and then immediately tries to use it with
>>> another application. The application fails saying the file doesn't
>>> exist.
>>
>> That's quite strange for such a sequence of actions.
>> Is the file created on one node and accessed on another?
>> How permanent is the error? (i.e. does it still happen when you
>> later access the file again?)
>> Is the file unlinked at any time? Could there be a race with unlink,
>> by any chance?
>>
>>> In the client /var/log/messages, I'm seeing
>>> Sep  4 15:58:17 clus039 kernel: LustreError:
>>> 15249:0:(file.c:2930:ll_inode_revalidate_fini()) failure -2 inode
>>> 75792903
>>
>> There is bug 16377 about this same message, though it is not clear
>> what happened there.
>> Perhaps you can gather -1 lustre logs from the MDS, from the client
>> that creates the file, and from the client that accesses it and gets
>> the error, and attach those to bug 16377?
>>
>> Bye,
>>    Oleg

--
Dr Stuart Midgley
sdm900 at gmail.com
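To make the startup pattern concrete, here is a minimal sketch of the job prologue described above, with hypothetical paths and names; a few hundred copies of this ran at once, all creating directories under the same parent:

    # Hypothetical job prologue; the parent directory and all names are placeholders.
    # About 200 jobs executed this concurrently against the same parent directory.
    mkdir /lustre/work/runs/job.$$                      # new temporary working directory
    cp template.txt /lustre/work/runs/job.$$/input.txt  # create the small working file
    myapp /lustre/work/runs/job.$$/input.txt            # ~50 of 200 jobs failed here with ENOENT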