Hi,

My Lustre environment is: 2.6.9-55.0.9.EL_lustre.1.6.3smp

One of my OSSs crashed today. Below you can see the messages it (storage09) sent to syslog (the first three lines). Then it died (my guess is with a kernel panic) and the heartbeat software STONITHed that OSS.

Nov 9 19:08:44 storage09.beowulf.cluster kernel: LDISKFS-fs error (device dm-5): mb_free_blocks: double-free of inode 38887437's block 155560192(bit 10496 in group 4747)
Nov 9 19:08:44 storage09.beowulf.cluster kernel: Remounting filesystem read-only
Nov 9 19:08:44 storage09.beowulf.cluster kernel: LDISKFS-fs error (device dm-5): mb_free_blocks: double-free of inode 38887437's block 155560193(bit 10497 in group 4747)
Nov 9 19:09:13 storage10.beowulf.cluster heartbeat: [21231]: WARN: node storage09: is dead
Nov 9 19:09:13 storage10.beowulf.cluster heartbeat: [21231]: info: Link storage09:eth0 dead.
Nov 9 19:09:13 storage10.beowulf.cluster heartbeat: [21231]: info: Link storage09:eth2 dead.
Nov 9 19:09:13 storage10.beowulf.cluster heartbeat: [32414]: info: Resetting node storage09 with [external/ipmi ]

How serious are LDISKFS-fs errors? Do they indicate data corruption on that block device? Device dm-5 is a DDN LUN, and the DDN S2A9500 controller says everything there is Healthy.

Cheers,

Wojciech Turek

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing Service
email: wjt27 at cam.ac.uk
tel. +44 1223 763517
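[Editorial aside: the mb_free_blocks message itself encodes where the double-free happened. A minimal sketch of pulling the fields apart and sanity-checking them — the regex, the function name, and the 4 KB-block assumption behind BLOCKS_PER_GROUP are mine, not from any Lustre tool:]

```python
import re

# Matches mballoc messages of the form (one syslog line):
#   LDISKFS-fs error (device dm-5): mb_free_blocks: double-free of
#   inode 38887437's block 155560192(bit 10496 in group 4747)
PATTERN = re.compile(
    r"LDISKFS-fs error \(device (?P<dev>\S+)\): mb_free_blocks: "
    r"double-free of inode (?P<inode>\d+)'s block (?P<block>\d+)"
    r"\(bit (?P<bit>\d+) in group (?P<group>\d+)\)"
)

# Assumes the default 4 KB ldiskfs block size (32768 blocks per group).
BLOCKS_PER_GROUP = 32768

def parse_double_free(line):
    """Extract device, inode, block, bit and group from one message,
    or return None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    return {k: v if k == "dev" else int(v) for k, v in fields.items()}

msg = ("Nov 9 19:08:44 storage09.beowulf.cluster kernel: LDISKFS-fs error "
       "(device dm-5): mb_free_blocks: double-free of inode 38887437's "
       "block 155560192(bit 10496 in group 4747)")
info = parse_double_free(msg)

# Cross-check: the absolute block number should equal
# group * blocks_per_group + bit (it does for the message above:
# 4747 * 32768 + 10496 == 155560192).
consistent = info["block"] == info["group"] * BLOCKS_PER_GROUP + info["bit"]
```

[The bit/group cross-check passing suggests the message is internally consistent, i.e. the corruption is in allocation state rather than a mangled log line.]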
I think this is bz13620. The RHEL4 kernel has a bug where two instances of the same inode can co-exist in the cache. You can find the fix in https://bugzilla.lustre.org/show_bug.cgi?id=13620

thanks, Alex

Wojciech Turek wrote:
> One of my OSSs crashed today. [...]
>
> Nov 9 19:08:44 storage09.beowulf.cluster kernel: LDISKFS-fs error
> (device dm-5): mb_free_blocks: double-free of inode 38887437's block
> 155560192(bit 10496 in group 4747)
> [...]
>
> How serious are LDISKFS-fs errors? Do they indicate data corruption
> on that block device?

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
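[Editorial aside: the cache-aliasing bug Alex describes can be pictured with a toy model — purely illustrative Python, not kernel code, and all names here are invented. Two in-memory copies of the same on-disk inode each free the same block; the second free finds the bitmap bit already clear, which is exactly the condition mb_free_blocks reports:]

```python
class ToyGroup:
    """Toy model of one block group's allocation bitmap.
    True means the block is allocated; freeing an already-clear bit is
    the condition ldiskfs reports as a mb_free_blocks double-free."""

    def __init__(self, nblocks):
        self.allocated = [False] * nblocks
        self.errors = []

    def alloc(self, bit):
        self.allocated[bit] = True

    def free(self, bit, inode):
        if not self.allocated[bit]:
            # Second release of the same block: record the error
            # instead of corrupting the bitmap further.
            self.errors.append(f"double-free of inode {inode}'s bit {bit}")
            return
        self.allocated[bit] = False

group = ToyGroup(32768)
group.alloc(10496)

# The RHEL4 bug: two cached instances of the same on-disk inode.
# Each copy believes it still owns the block, so each frees it.
for cached_copy in ("instance A", "instance B"):
    group.free(10496, inode=38887437)
```

[After the loop, `group.errors` holds one double-free record and the bit is clear; on a real OST the filesystem remounts read-only at that point instead.]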
I thought I'd check whether we had that fix, but got 'You are not authorized to access bug #13620'. Any chance of having that fixed?

Jim

On Sun, 11 Nov 2007, Alex Tomas wrote:
> I think this is bz13620. The RHEL4 kernel has a bug where two instances
> of the same inode can co-exist in the cache. You can find the fix in
> https://bugzilla.lustre.org/show_bug.cgi?id=13620
>
> thanks, Alex
> [...]
On Sun, 11 Nov 2007 11:19:56 -0800 (PST), Jim Garlick <garlick at llnl.gov> wrote:
> I thought I'd check if we had that fix but got 'You are not authorized
> to access bug #13620'. Any chance of having that fixed?
> Jim

It's our support case/bug; at the moment we'd prefer not to make it public for various internal reasons. Hopefully this will change in the future.

I've attached the patch we applied against RHEL4 2.6.9-55.0.2, which fixed the double-free problems for us.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 13620-rhel4.5502.patch
Type: text/x-patch
Size: 843 bytes
Desc: not available
Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20071112/7fb45e21/attachment-0002.bin
Hi all,

On 12 November 2007 13:14:09, James Braid wrote:
> I've attached the patch we applied against RHEL4 2.6.9-55.0.2, which
> fixed the double-free problems for us.

Is it planned to be included in a future release?

Thanks,
-- 
Kilian
Kilian CAVALOTTI wrote:
> On 12 November 2007, James Braid wrote:
>> I've attached the patch we applied against RHEL4 2.6.9-55.0.2, which
>> fixed the double-free problems for us.
>
> Is it planned to be included in a future release?

It will be in 1.6.4.

thanks, Alex