rishi pathak
2009-Aug-24 08:54 UTC
[Lustre-discuss] Client complaining about duplicate inode entry after lustre recovery
Hi,

Our lustre fs comprises 15 OST/OSS and 1 MDS with no failover. Clients as well as servers run lustre-1.6 and kernel 2.6.9-18.

Doing an ls -ltr on a directory in the lustre fs throws the following errors (taken from the lustre logs) on the client:

00000008:00020000:0:1251099455.304622:0:724:0:(osc_request.c:2898:osc_set_data_with_check()) ### inconsistent l_ast_data found ns: scratch-OST0005-osc-ffff81201e8dd800 lock: ffff811f9af04000/0xec0d1c36da6992fd lrc: 3/1,0 mode: PR/PR res: 570622/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 100000 remote: 0xb79b445e381bc9e6 expref: -99 pid: 22878
00000008:00040000:0:1251099455.337868:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) ASSERTION(old_inode->i_state & I_FREEING) failed:Found existing inode ffff811f2cf693b8/197272544/1895600178 state 0 in lock: setting data to ffff8118ef8ed5f8/207519777/1771835328
00000000:00040000:0:1251099455.360090:0:724:0:(osc_request.c:2904:osc_set_data_with_check()) LBUG

On the scratch-OST0005 OST it shows:

Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569204: rc -2
Aug 24 10:22:53 yn266 kernel: LustreError: 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) Skipped 19 previous similar messages
Aug 24 12:40:43 yn266 kernel: LustreError: 2737:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569195: rc -2
Aug 24 12:44:59 yn266 kernel: LustreError: 2835:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for resource 569198: rc -2

We are getting these kinds of errors on many clients.

## History ##
Prior to these occurrences, our MDS showed signs of failure: CPU load was shooting above 100 (on a quad-core, quad-socket system) and users were complaining about slow storage performance. We took it offline and ran fsck on the unmounted MDS and OSTs. fsck on the OSTs went fine, though it reported some errors which were fixed. For a data integrity check, the mdsdb and ostdb databases were built and lfsck was run on a client (the client was mounted with abort_recov).

lfsck was run in the following order:
lfsck with no fix - reported dangling inodes and orphaned objects
lfsck with -l (back up orphaned objects)
lfsck with -d and -c (delete orphaned objects and create missing OST objects referenced by the MDS)

After the above operations we were seeing files shown in red and blinking on the clients. Doing a stat on them returned the error 'No such file or directory'.

My question is whether the order in which lfsck was run (and whether lfsck should be run multiple times) is related to the errors we are getting.

--
Regards--
Rishi Pathak
National PARAM Supercomputing Facility
Center for Development of Advanced Computing (C-DAC)
Pune University Campus, Ganesh Khind Road
Pune, Maharashtra
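For reference, a minimal sketch of the database-build and lfsck sequence described above, assuming Lustre 1.6 with the Lustre-patched e2fsprogs; the device paths, database file names, and client mount point are placeholders, and the exact flags should be checked against the local e2fsck/lfsck help output:

  # Build the MDS database from the unmounted (or read-only mounted) MDT device.
  e2fsck -n -v --mdsdb /tmp/mdsdb /dev/mdtdev

  # Build one database per OST, feeding in the MDS database (repeat per OST device).
  e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-0 /dev/ostdev0

  # On a client with the filesystem mounted: first a read-only pass,
  # then back up orphans (-l), then delete orphans and recreate
  # missing OST objects (-d -c), matching the order described above.
  lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-0 /tmp/ostdb-1 /mnt/lustre
  lfsck -l -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-0 /tmp/ostdb-1 /mnt/lustre
  lfsck -d -c -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-0 /tmp/ostdb-1 /mnt/lustre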
Wojciech Turek
2009-Oct-10 00:40 UTC
[Lustre-discuss] Client complaining about duplicate inode entry after lustre recovery
Hi,

Did you get to the bottom of this?

We are having exactly the same problem with our lustre-1.6.6 (rhel4) file systems. Recently it got worse and the MDS crashes quite frequently; when we run e2fsck there are errors that get fixed, yet after some time we still see the same errors in the logs about missing objects, and files get corrupted (showing as ?----------- in ls output). Clients also LBUG quite frequently with this message:

(osc_request.c:2904:osc_set_data_with_check()) LBUG

This looks like a serious lustre problem, but so far I didn't find any clues on it even after a long search through lustre bugzilla.

Our MDSs and OSSs are on UPS, the RAID is behaving OK, and we don't see any errors in the syslog.

I would be grateful for some hints on this one.

Wojciech

2009/8/24 rishi pathak <mailmaverick666 at gmail.com>:
> [...]

--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: +44 1223 763517
Bernd Schubert
2009-Oct-10 14:18 UTC
[Lustre-discuss] Client complaining about duplicate inode entry after lustre recovery
"ASSERTION(old_inode->i_state & I_FREEING)" is the infamous bug17485. You will need to run lfsck to fix it. On Saturday 10 October 2009, Wojciech Turek wrote:> Hi, > > Did you get to the bottom of this? > > We are having exactly the same problem with our lustre-1.6.6 (rhel4) file > systems. Recently it got worst and MDS crashes quite frequently, when we > run e2fsck there are errors that are being fixed. However after some time > we still are seeing the same errors in the logs about missing objects and > files get corrupted (?-----------) Also clients LBUGs quite frequently > with this message (osc_request.c:2904:osc_set_data_with_check()) LBUG > This looks like serious lustre problem but so far I didn''t find any clues > on that even after long search through lustre bugzilla. > > Our MDSs and OSSs are UPSed, RAID is behaving OK, we don''t see any errors > in the syslog. > > I will be grateful for some hints on this one > > Wojciech > > 2009/8/24 rishi pathak <mailmaverick666 at gmail.com> > > > Hi, > > > > Our lustre fs comprises of 15 OST/OSS and 1 MDS with no failover. Client > > as well as servers run lustre-1.6 and kernel 2.6.9-18. > > > > Doing a ls -ltr for a directory in lustre fs throws following > > errors (as got from lustre logs) on client > > > > 00000008:00020000:0:1251099455.304622:0:724:0:(osc_request.c:2898:osc_set > >_data_with_check()) ### inconsistent l_ast_data found ns: > > scratch-OST0005-osc-ffff81201e8dd800 lock: ffff811f9af04 > > 000/0xec0d1c36da6992fd lrc: 3/1,0 mode: PR/PR res: 570622/0 rrc: 2 type: > > EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 100000 > > remote: 0xb79b445e381bc9e6 expref: -99 p > > id: 22878 > > 00000008:00040000:0:1251099455.337868:0:724:0:(osc_request.c:2904:osc_set > >_data_with_check()) ASSERTION(old_inode->i_state & I_FREEING) failed:Found > > existing inode ffff811f2cf693b8/1972725 > > 44/1895600178 state 0 in lock: setting data to > > ffff8118ef8ed5f8/207519777/1771835328 > > 00000000:00040000:0:1251099455.360090:0:724:0:(osc_request.c:2904:osc_set > >_data_with_check()) LBUG > > > > > > On scratch-OST0005 OST it shows > > > > Aug 24 10:22:53 yn266 kernel: LustreError: > > 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for > > resour ce 569204: rc -2 > > Aug 24 10:22:53 yn266 kernel: LustreError: > > 3023:0:(ldlm_resource.c:851:ldlm_resource_add()) Skipped 19 previous > > similar messages > > Aug 24 12:40:43 yn266 kernel: LustreError: > > 2737:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for > > resour ce 569195: rc -2 > > Aug 24 12:44:59 yn266 kernel: LustreError: > > 2835:0:(ldlm_resource.c:851:ldlm_resource_add()) lvbo_init failed for > > resour ce 569198: rc -2 > > > > These kind of errors we are getting for many clients. > > > > ##History ## > > Prior to thsese occurences, our MDS showed signs of failure in way that > > cpu load was shooting above 100 (on a quad core quad socket system) and > > users were complaining about slow storage performance. We took it offline > > and did fsck on unmounted MDS and OSTs. fsck on OSTs went fine but it > > showed some errors which were fixed. For data integrity check, mdsdb and > > ostdb were built and lfsck was run on a client(client was mounted with > > abort_recov). 
> > > > lfsck was run in following order: > > lfsck with no fix - reported dangling inodes and orphaned objects > > lfsck with -l (backup orphaned objects) > > lfsck with -d and -c (delete orphaned objects and create missing OST > > objects referenced by MDS) > > > > After above operations, on clients we were seeing file in red and > > blinking. Doing a stat came out with an error stating ''no such file or > > directory''. > > > > My question is whether the order in which lfsck was run (should lfsck be > > run multiple times) and the errors we are getting are related or not. > > > > > > > > > > -- > > Regards-- > > Rishi Pathak > > National PARAM Supercomputing Facility > > Center for Development of Advanced Computing(C-DAC) > > Pune University Campus,Ganesh Khind Road > > Pune-Maharastra > > > > _______________________________________________ > > Lustre-discuss mailing list > > Lustre-discuss at lists.lustre.org > > http://lists.lustre.org/mailman/listinfo/lustre-discuss >-- Bernd Schubert DataDirect Networks
Wojciech Turek
2009-Oct-10 23:27 UTC
[Lustre-discuss] Client complaining about duplicate inode entry after lustre recovery
Hi Bernd,

Many thanks for your reply. I found this bug last night, and as far as I can see there is no fix for it yet? I am preparing the dbs to run lfsck on the affected file systems. I also found bug 18748, and I must say we have exactly the same problems; it looks like we ran into the problem a few months after CIEMAT did.

As far as I know, seeing this message means that there are files with missing objects. The worst part is that we don't know when and why files lose their objects. It just happens spontaneously, and there aren't any lustre messages that could give us a clue. Users run jobs, and some time after their files were written some of these files get corrupted/lose their objects (?-----); trying to access such a file for the first time triggers the 'lvbo' message.

We have a third lustre file system which runs on different hardware but the same lustre and RHEL versions as the affected ones. I cannot see any problems on that third file system.

Wojciech

2009/10/10 Bernd Schubert <bs_lists at aakef.fastmail.fm>:
> [...]

--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: +44 1223 763517
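To get a rough list of affected files ahead of a full lfsck run, one possibility is a simple scan that stats every file and records the ones that fail, since that is the symptom described above for files with missing OST objects. This is only a sketch under assumed paths (the mount point and output file are placeholders), and a full walk can be expensive on a large filesystem:

  # Record files that the MDS still lists but that fail to stat.
  find /mnt/lustre -type f 2>/dev/null | while read -r f; do
      stat "$f" >/dev/null 2>&1 || echo "$f" >> /tmp/broken-files.txt
  done

  # For any file in the list, lfs getstripe shows which OSTs and object
  # ids it maps to, which may help match it against the lvbo_init messages.
  lfs getstripe /mnt/lustre/path/to/broken-file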
Bernd Schubert
2009-Oct-11 17:34 UTC
[Lustre-discuss] Client complaining about duplicate inode entry after lustre recovery
Hello Wojciech,

Bug 17485 has a patch, which has landed in 1.8, to prevent duplicate references to OST objects from coming up after an MDS failover. But if you create duplicate entries yourself, it won't help, of course. Bug 20412 has such a valid use case for duplicate MDT files and also lots of patches for lfsck, since the default way of fixing such issues wasn't suitable for us.

Hmm, but 18748 came up when I tested lustre-1.6.7 + the patch from bug 17485 at CIEMAT, and somehow filesystem corruption appeared. I'm still not sure what the main culprit for the corruption was - either the initial patch from bug 17485 or the MDS issue in 1.6.7. Unfortunately, I still haven't got a test system with at least 100 clients to reproduce the test. So in principle you shouldn't run into it, at least not with corrupted objects. I guess it will be resolved once you repair the filesystem with e2fsck and lfsck.

I'm only surprised that vanilla 1.6.6 works for you; it has so many bugs...

Cheers,
Bernd

On Sunday 11 October 2009, Wojciech Turek wrote:
> [...]

--
Bernd Schubert
DataDirect Networks