Hi,
We have a Lustre 1.8.1 filesystem running on CentOS 5.4 with kernel
2.6.18-128.1.14.el5_lustre.1.8.1. For the past three days we have been
having kernel panics on our MDS machines (our setup has 2 MDS and 2
OSS). When a panic occurs, the other machine mounts the Lustre MDT
filesystem and becomes the primary MDS, but after recovery it panics as
well. I have attached the dump from the kernel panic and would like to
know whether anyone has seen this kind of problem and can help.
Many thanks in advance.
Regards,
--
Marco Gomes
Systems/HPC-Cluster
Numerical Offshore Tank
Naval and Ocean Engineering Department's Laboratory
Escola Politécnica
University of São Paulo
+55 11 3777 4142 ext. 250
-------------- next part --------------
LustreError: 12942:0:(pack_generic.c:655:lustre_shrink_reply_v2())
ASSERTION(msg->lm_bufcount > segment) failed
LustreError: 12942:0:(pack_generic.c:655:lustre_shrink_reply_v2()) LBUG
LustreError: dumping log to /tmp/lustre-log.1278619163.12942
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...s/root/usr/local/src/aufs.wcvs/aufs/fs/aufs/f_op.c:706
invalid opcode: 0000 [1] SMP
last sysfs file:
/devices/pci0000:00/0000:00:07.0/0000:08:00.0/host4/rport-4:0-0/target4:0:0/4:0:0:1/state
CPU 0
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) crc16(U)
lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U)
lnet(U) lvfs(U) libcfs(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U)
ib_ipoib(U) ib_cm(U) ib_sa(U) ib_uverbs(U) ib_umad(U) iw_nes(U) iw_cxgb3(U)
cxgb3(U) ib_qib(U) mlx4_en(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U) ib_mad(U)
ib_core(U) lockd(U) sunrpc(U) ipoib_helper(U) ipv6(U) xfrm_nalgo(U)
crypto_api(U) dm_mirror(U) dm_log(U) dm_round_robin(U) scsi_dh_rdac(U)
dm_multipath(U) scsi_dh(U) dm_mod(U) video(U) hwmon(U) backlight(U) sbs(U)
i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U)
parport_pc(U) lp(U) parport(U) joydev(U) shpchp(U) sg(U) ehci_hcd(U) uhci_hcd(U)
pcspkr(U) qla2xxx(U) scsi_transport_fc(U) i2c_i801(U) i2c_core(U) ata_piix(U)
libata(U) sd_mod(U) scsi_mod(U) loop(U) squashfs(U) aufs(U) ext3(U) jbd(U)
e1000e(U)
Pid: 13151, comm: Tainted: G 2.6.18-128.1.14.el5_lustre.1.8.1 #1
RIP: 0010:[<ffffffff8808976a>] [<ffffffff8808976a>]
:aufs:aufs_fsync_nondir+0x86/0x380
RSP: 0000:ffff81066fed3e20 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8103320438c0 RCX: 0000000000000000
RDX: ffff8103352240d0 RSI: ffff81033202aa98 RDI: ffff8103490bc9c0
RBP: ffff81066fed3ec8 R08: 000000000000035a R09: ffff81037e612000
R10: 0000000000000080 R11: 0000000000000000 R12: ffff8103490bc9c0
R13: ffff8103490bc9c0 R14: ffff81033202aa98 R15: ffff8103352240d0
FS: 0000000000000000(0000) GS:ffffffff803f7000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b7ffc4dc3a0 CR3: 0000000000201000 CR4: 00000000000006e0
Process (pid: 13151, threadinfo ffff81066fed2000, task ffff81067f8c0100)
Stack: 0000000100000000 ffff81010c44a040 ffff8103490bc9f8 ffffffff800d73e1
ffff81010c476000 0000000000000000 ffff81033202aa98 0000000000100603
ffff81010b2f65f0 0000000000000286 0000000000000282 ffff8103320438c0
Call Trace:
[<ffffffff800d73e1>] cache_flusharray+0x74/0xa3
[<ffffffff88778e18>] :libcfs:tracefile_dump_all_pages+0x288/0x2d0
[<ffffffff8877605b>] :libcfs:libcfs_debug_dumplog_internal+0x8b/0xb0
[<ffffffff88776098>] :libcfs:libcfs_debug_dumplog_thread+0x18/0x40
[<ffffffff8005dfb1>] child_rip+0xa/0x11
[<ffffffff88776080>] :libcfs:libcfs_debug_dumplog_thread+0x0/0x40
[<ffffffff8005dfa7>] child_rip+0x0/0x11
Code: 0f 0b 68 db 37 0a 88 c2 c2 02 49 39 d7 74 18 48 8d ba b8 00
RIP [<ffffffff8808976a>] :aufs:aufs_fsync_nondir+0x86/0x380
RSP <ffff81066fed3e20>
<0>Kernel panic - not syncing: Fatal exception
<0>Dumping qib trace buffer from panic
Done dumping qib trace buffer