Hi,

My Lustre environment is: 2.6.9-55.0.9.EL_lustre.1.6.3smp

One of my OSSs crashed today. Below you can see the messages it (storage09) sent to syslog (the first three lines). Then it died (my guess is with a kernel panic) and the heartbeat software STONITHed that OSS.

Nov 9 19:08:44 storage09.beowulf.cluster kernel: LDISKFS-fs error (device dm-5): mb_free_blocks: double-free of inode 38887437's block 155560192(bit 10496 in group 4747)
Nov 9 19:08:44 storage09.beowulf.cluster kernel: Remounting filesystem read-only
Nov 9 19:08:44 storage09.beowulf.cluster kernel: LDISKFS-fs error (device dm-5): mb_free_blocks: double-free of inode 38887437's block 155560193(bit 10497 in group 4747)
Nov 9 19:09:13 storage10.beowulf.cluster heartbeat: [21231]: WARN: node storage09: is dead
Nov 9 19:09:13 storage10.beowulf.cluster heartbeat: [21231]: info: Link storage09:eth0 dead.
Nov 9 19:09:13 storage10.beowulf.cluster heartbeat: [21231]: info: Link storage09:eth2 dead.
Nov 9 19:09:13 storage10.beowulf.cluster heartbeat: [32414]: info: Resetting node storage09 with [external/ipmi ]

How serious are LDISKFS-fs errors? Do they indicate data corruption on that block device? Device dm-5 is a DDN LUN, and the DDN S2A9500 controller says everything there is Healthy.

Cheers,

Wojciech Turek

Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing Service
email: wjt27 at cam.ac.uk
tel. +44 1223 763517
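[Editorial aside: the mb_free_blocks message itself encodes where the double-free happened. A minimal sketch of pulling the fields apart and sanity-checking them — the regex, the function name, and the 4 KB-block assumption behind BLOCKS_PER_GROUP are mine, not from any Lustre tool:]

```python
import re

# Matches mballoc messages of the form (one syslog line):
#   LDISKFS-fs error (device dm-5): mb_free_blocks: double-free of
#   inode 38887437's block 155560192(bit 10496 in group 4747)
PATTERN = re.compile(
    r"LDISKFS-fs error \(device (?P<dev>\S+)\): mb_free_blocks: "
    r"double-free of inode (?P<inode>\d+)'s block (?P<block>\d+)"
    r"\(bit (?P<bit>\d+) in group (?P<group>\d+)\)"
)

# Assumes the default 4 KB ldiskfs block size (32768 blocks per group).
BLOCKS_PER_GROUP = 32768

def parse_double_free(line):
    """Extract device, inode, block, bit and group from one message,
    or return None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    return {k: v if k == "dev" else int(v) for k, v in fields.items()}

msg = ("Nov 9 19:08:44 storage09.beowulf.cluster kernel: LDISKFS-fs error "
       "(device dm-5): mb_free_blocks: double-free of inode 38887437's "
       "block 155560192(bit 10496 in group 4747)")
info = parse_double_free(msg)

# Cross-check: the absolute block number should equal
# group * blocks_per_group + bit (it does for the message above:
# 4747 * 32768 + 10496 == 155560192).
consistent = info["block"] == info["group"] * BLOCKS_PER_GROUP + info["bit"]
```

[The bit/group cross-check passing suggests the message is internally consistent, i.e. the corruption is in allocation state rather than a mangled log line.]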
I think this is bz13620. The RHEL4 kernel has a bug where two instances of the same inode can co-exist in the cache. You can find the fix in https://bugzilla.lustre.org/show_bug.cgi?id=13620

thanks, Alex

Wojciech Turek wrote:
> One of my OSSs crashed today. [...]
>
> Nov 9 19:08:44 storage09.beowulf.cluster kernel: LDISKFS-fs error
> (device dm-5): mb_free_blocks: double-free of inode 38887437's block
> 155560192(bit 10496 in group 4747)
> [...]
>
> How serious are LDISKFS-fs errors? Do they indicate data corruption
> on that block device?

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss
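[Editorial aside: the cache-aliasing bug Alex describes can be pictured with a toy model — purely illustrative Python, not kernel code, and all names here are invented. Two in-memory copies of the same on-disk inode each free the same block; the second free finds the bitmap bit already clear, which is exactly the condition mb_free_blocks reports:]

```python
class ToyGroup:
    """Toy model of one block group's allocation bitmap.
    True means the block is allocated; freeing an already-clear bit is
    the condition ldiskfs reports as a mb_free_blocks double-free."""

    def __init__(self, nblocks):
        self.allocated = [False] * nblocks
        self.errors = []

    def alloc(self, bit):
        self.allocated[bit] = True

    def free(self, bit, inode):
        if not self.allocated[bit]:
            # Second release of the same block: record the error
            # instead of corrupting the bitmap further.
            self.errors.append(f"double-free of inode {inode}'s bit {bit}")
            return
        self.allocated[bit] = False

group = ToyGroup(32768)
group.alloc(10496)

# The RHEL4 bug: two cached instances of the same on-disk inode.
# Each copy believes it still owns the block, so each frees it.
for cached_copy in ("instance A", "instance B"):
    group.free(10496, inode=38887437)
```

[After the loop, `group.errors` holds one double-free record and the bit is clear; on a real OST the filesystem remounts read-only at that point instead.]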
I thought I'd check whether we had that fix, but got 'You are not authorized to access bug #13620'. Any chance of having that fixed?

Jim

On Sun, 11 Nov 2007, Alex Tomas wrote:
> I think this is bz13620. The RHEL4 kernel has a bug where two instances
> of the same inode can co-exist in the cache. You can find the fix in
> https://bugzilla.lustre.org/show_bug.cgi?id=13620
>
> thanks, Alex
> [...]
On Sun, 11 Nov 2007 11:19:56 -0800 (PST), Jim Garlick <garlick at llnl.gov> wrote:
> I thought I'd check if we had that fix but got 'You are not authorized
> to access bug #13620'. Any chance of having that fixed?
> Jim

It's our support case/bug; at the moment we'd prefer not to make it public for various internal reasons. Hopefully this will change in the future.

I've attached the patch we applied against RHEL4 2.6.9-55.0.2, which fixed the double-free problems for us.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 13620-rhel4.5502.patch
Type: text/x-patch
Size: 843 bytes
Desc: not available
Url: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20071112/7fb45e21/attachment-0002.bin
Hi all,

On 12 November 2007 13:14:09, James Braid wrote:
> I've attached the patch we applied against RHEL4 2.6.9-55.0.2, which
> fixed the double-free problems for us.

Is it planned to be included in a future release?

Thanks,
-- 
Kilian
Kilian CAVALOTTI wrote:
> On 12 November 2007, James Braid wrote:
>> I've attached the patch we applied against RHEL4 2.6.9-55.0.2, which
>> fixed the double-free problems for us.
>
> Is it planned to be included in a future release?

It will be in 1.6.4.

thanks, Alex