Dear All,

What could cause this error?

Kernel: 2.6.9-42.0.10.EL_lustre-1.6.0.1custom-drbd and
2.6.9-55.0.9.EL_lustre.1.6.4.1smp (CentOS 4.4)

After the node froze up, its failover pair took over the resource, but then it froze as well.

Looking back through the logs, I see these "header is corrupted" messages several more times over the last few days. After I turned the node on again, it froze up within 10 minutes.

Mar 20 10:57:19 node2 kernel: LDISKFS-fs: header is corrupted!
Mar 20 10:57:19 node2 kernel: LDISKFS-fs: invalid magic = 0x281e
Mar 20 10:57:19 node2 kernel: LDISKFS-fs: header is corrupted!
Mar 20 10:58:43 node2 kernel: Lustre: hallmark-OST0002: haven't heard from client 078bd69d-b701-7dc9-3360-da43cd285d06 (at 192.168.0.150@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Mar 20 11:03:25 node2 kernel: ------------[ cut here ]------------
Mar 20 11:03:25 node2 kernel: kernel BUG at /usr/src/redhat/BUILD/lustre-1.6.0.1/lustre/ldiskfs/extents.c:1751!
Mar 20 11:03:25 node2 kernel: invalid operand: 0000 [#1]
Mar 20 11:03:25 node2 kernel: SMP
Mar 20 11:03:25 node2 kernel: Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) lquota(U) mdc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) mptctl(U) mptbase(U) drbd(U) nfsd(U) exportfs(U) md5(U) ipv6(U) parport_pc(U) lp(U) parport(U) autofs4(U) i2c_dev(U) i2c_core(U) nfs(U) lockd(U) nfs_acl(U) sunrpc(U) dm_mirror(U) dm_mod(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) hw_random(U) e1000(U) sk98lin(U) floppy(U) ext3(U) jbd(U) aacraid(U) ata_piix(U) libata(U) sd_mod(U) scsi_mod(U)
Mar 20 11:03:25 node2 kernel: CPU: 1
Mar 20 11:03:25 node2 kernel: EIP: 0060:[<fb8ff40a>] Tainted: GF VLI
Mar 20 11:03:25 node2 kernel: EFLAGS: 00010213 (2.6.9-42.0.10.EL_lustre-1.6.0.1custom-drbd)
Mar 20 11:03:25 node2 kernel: EIP is at ldiskfs_ext_remove_space+0x13f/0x2cf [ldiskfs]
Mar 20 11:03:25 node2 kernel: eax: 00007067 ebx: 00000018 ecx: e5658000 edx: 00001001
Mar 20 11:03:25 node2 kernel: esi: f6095e00 edi: 00000002 ebp: f6095e00 esp: e6b4bb60
Mar 20 11:03:25 node2 kernel: ds: 007b es: 007b ss: 0068
Mar 20 11:03:25 node2 kernel: Process ll_ost_io_38 (pid: 25495, threadinfo=e6b4b000 task=e6e77330)
Mar 20 11:03:25 node2 kernel: Stack: 00000000 00000001 f5664304 00000002 f7cede00 ffffffff 00000000 e6b4bb9c
Mar 20 11:03:25 node2 kernel:        f7cede00 f5664304 e8f250fc e8f25028 fb8ffd3c 00000246 f7cede00 e8f250fc
Mar 20 11:03:25 node2 kernel:        e8f25028 e8f250fc 0000003c d190459c e8f25258 fb913b44 00000000 00080000
Mar 20 11:03:25 node2 kernel: Call Trace:
Mar 20 11:03:25 node2 kernel:  [<fb8ffd3c>] ldiskfs_ext_truncate+0x12d/0x176 [ldiskfs]
Mar 20 11:03:25 node2 kernel:  [<fb8f1213>] ldiskfs_truncate+0x112/0x486 [ldiskfs]
Mar 20 11:03:25 node2 kernel:  [<c02d4fd6>] __cond_resched+0x14/0x39
Mar 20 11:03:25 node2 kernel:  [<fb8f1f4a>] ldiskfs_do_update_inode+0x320/0x347 [ldiskfs]
Mar 20 11:03:25 node2 kernel:  [<f8897d43>] journal_get_write_access+0x25/0x2c [jbd]
Mar 20 11:03:25 node2 kernel:  [<c014e3cc>] vmtruncate+0xcb/0xee
Mar 20 11:03:25 node2 kernel:  [<c0173247>] inode_setattr+0x64/0x1b3
Mar 20 11:03:25 node2 kernel:  [<fb8f2129>] ldiskfs_setattr+0x179/0x1c9 [ldiskfs]
Mar 20 11:03:25 node2 kernel:  [<fb93ffb7>] fsfilt_ldiskfs_setattr+0x129/0x212 [fsfilt_ldiskfs]
Mar 20 11:03:25 node2 kernel:  [<fbbab7d2>] filter_setattr_internal+0x65f/0x177a [obdfilter]
Mar 20 11:03:25 node2 kernel:  [<fbba45c0>] filter_fid2dentry+0x654/0x8df [obdfilter]
Mar 20 11:03:25 node2 kernel:  [<fbb9e7ca>] filter_fmd_get+0x263/0x391 [obdfilter]
Mar 20 11:03:25 node2 kernel:  [<fbb9e8ee>] filter_fmd_get+0x387/0x391 [obdfilter]
Mar 20 11:03:25 node2 kernel:  [<fbbad2f1>] filter_setattr+0x260/0x48e [obdfilter]
Mar 20 11:03:25 node2 kernel:  [<fbbb339f>] filter_truncate+0x281/0x316 [obdfilter]
Mar 20 11:03:25 node2 kernel:  [<fb928bd1>] obd_punch+0x3f8/0x48b [ost]
Mar 20 11:03:25 node2 kernel:  [<fb92871f>] ost_punch+0x351/0x40b [ost]
Mar 20 11:03:25 node2 kernel:  [<fb93340a>] ost_handle+0x1e38/0x344c [ost]
Mar 20 11:03:25 node2 kernel:  [<fbef0389>] ptlrpc_server_handle_request+0xb76/0x136f [ptlrpc]
Mar 20 11:03:25 node2 kernel:  [<fbef1acc>] ptlrpc_main+0x7ee/0x9b5 [ptlrpc]
Mar 20 11:03:25 node2 kernel:  [<c011e7f5>] default_wake_function+0x0/0xc
Mar 20 11:03:25 node2 kernel:  [<fbef12d1>] ptlrpc_retry_rqbds+0x0/0xd [ptlrpc]
Mar 20 11:03:25 node2 kernel:  [<c02d693e>] ret_from_fork+0x6/0x14
Mar 20 11:03:25 node2 kernel:  [<fbef12d1>] ptlrpc_retry_rqbds+0x0/0xd [ptlrpc]
Mar 20 11:03:25 node2 kernel:  [<fbef12de>] ptlrpc_main+0x0/0x9b5 [ptlrpc]
Mar 20 11:03:25 node2 kernel:  [<c01041f5>] kernel_thread_helper+0x5/0xb
Mar 20 11:03:25 node2 kernel: Code: 00 75 0b 8b 44 33 14 8b 40 1c 89 44 33 10 8b 4c 33 10 0f b7 41 04 66 39 41 02 76 08 0f 0b d6 06 af 9b 90 fb 66 81 39 0a f3 74 08 <0f> 0b d7 06 af 9b 90 fb 8b 44 33 0c 85 c0 75 1d 8b 54 24 14 89
Mar 20 11:03:25 node2 kernel: <0>Fatal exception: panic in 5 seconds
On Mar 20, 2008 13:48 +0100, Papp Tamás wrote:
> What could cause this error?
>
> Kernel: 2.6.9-42.0.10.EL_lustre-1.6.0.1custom-drbd and
> 2.6.9-55.0.9.EL_lustre.1.6.4.1smp (CentOS 4.4)
>
> After the node froze up, its failover pair took over the resource, but
> then it froze as well.
>
> Looking back through the logs, I see these "header is corrupted" messages
> several more times over the last few days. After I turned the node on
> again, it froze up within 10 minutes.
>
> Mar 20 10:57:19 node2 kernel: LDISKFS-fs: header is corrupted!
> Mar 20 10:57:19 node2 kernel: LDISKFS-fs: invalid magic = 0x281e
> Mar 20 10:57:19 node2 kernel: LDISKFS-fs: header is corrupted!

This means you have on-disk corruption, and an "e2fsck -f" is needed
(while the filesystem is unmounted, of course).

> Mar 20 11:03:25 node2 kernel: ------------[ cut here ]------------
> Mar 20 11:03:25 node2 kernel: kernel BUG at
> /usr/src/redhat/BUILD/lustre-1.6.0.1/lustre/ldiskfs/extents.c:1751!

You have quite an old version of Lustre, and several ldiskfs bugs have
been fixed since then. I don't think it will BUG() on finding disk
errors anymore.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
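For reference, a minimal sketch of the forced check suggested above. The mount point (/mnt/ost0) and backing device (/dev/drbd0) are placeholder names for a DRBD-backed OST, not taken from this thread, so substitute the real ones from your setup:

    # make sure no node has the OST mounted before checking it
    umount /mnt/ost0

    # forced full check of the ldiskfs backing device; run it on the node
    # that currently holds the DRBD resource as primary, and review the
    # reported problems before letting e2fsck repair them
    e2fsck -f /dev/drbd0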
Andreas Dilger wrote:
> This means you have on-disk corruption, and an "e2fsck -f" is needed
> (while the filesystem is unmounted, of course).

Yes, that was my thought as well. I was curious why this could happen (if
it's not a hardware fault), and whether there was a known Lustre bug about
this or something like that. The cluster had been working for months
without errors.

> You have quite an old version of Lustre, and several ldiskfs bugs have
> been fixed since then. I don't think it will BUG() on finding disk
> errors anymore.

Thank you. This weekend we will upgrade to the new version and to x86_64.

tamas