RHEL4 Lustre-1.6.6

Does the kernel panic below ring a bell to anyone?

general protection fault: 0000 [1] SMP
CPU 1
Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) lustre(U) lov(U)
mdc(U) lquota(U) osc(U) ksocklnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U)
libcfs(U) ldiskfs(U) drbd(U) ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U)
autofs4(U) i2c_nforce2(U) i2c_amd756(U) i2c_isa(U) i2c_amd8111(U)
i2c_i801(U) i2c_core(U) rdma_ucm(U) ib_sdp(U) rdma_cm(U) iw_cm(U)
ib_addr(U) cpufreq_powersave(U) ib_ipoib(U) md5(U) ipv6(U) ib_usa(U)
mlx4_ib(U) mlx4_core(U) ib_mthca(U) mptctl(U) dm_mirror(U)
dm_round_robin(U) dm_multipath(U) dm_mod(U) sr_mod(U) usb_storage(U)
joydev(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) hw_random(U)
ib_ipath(U) ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U)
ib_core(U) ata_piix(U) libata(U) sg(U) ext3(U) jbd(U) xfs(U) tg3(U)
s2io(U) qla2xxx_conf(U) qla2xxx(U) nfs(U) nfs_acl(U) lockd(U) sunrpc(U)
megaraid_sas(U) mptsas(U) mptscsi(U) mptbase(U) e1000(U) bnx2(U)
sd_mod(U) scsi_mod(U)
Pid: 13546, comm: collectl Tainted: GF 2.6.9-67.0.22.EL_lustre.1.6.6smp
RIP: 0010:[<ffffffff801af8f0>] <ffffffff801af8f0>{proc_pid_status+534}
RSP: 0018:0000010416fc9e48 EFLAGS: 00010203
RAX: 00047d5350f64e05 RBX: 0000010063767080 RCX: 0000000000000002
RDX: 0000000000000001 RSI: ffffffff80321d58 RDI: 000001006074b092
RBP: 000001030ae0b030 R08: 00000000fffffff9 R09: 0000000000000002
R10: 0000000000000000 R11: 0000000000000000 R12: 000001006074b092
R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000006e1f
FS: 0000002a96310e80(0000) GS:ffffffff804fcb00(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000552abd58a8 CR3: 0000000008258000 CR4: 00000000000006e0
Process collectl (pid: 13546, threadinfo 0000010416fc8000, task 0000010411385030)
Stack: 000001000001da80 ffffffff8032c17f 000001006074b000 0000000000000202
       0000000000000001 0000000011385030 0000001000000001 0000000000000202
       395f74646d5f6c6c 0072650065720037
Call Trace:<ffffffff801acfd3>{proc_info_read+85}
           <ffffffff80178dac>{vfs_read+207}
           <ffffffff80179008>{sys_read+69}
           <ffffffff80110236>{system_call+126}

Code: 8b 14 90 31 c0 e8 9c d8 03 00 48 98 49 01 c4 8b 13 b8 20 00
RIP <ffffffff801af8f0>{proc_pid_status+534} RSP <0000010416fc9e48>
 <0>Kernel panic - not syncing: Oops

--
Wojciech Turek
Assistant System Manager
High Performance Computing Service
University of Cambridge
Email: wjt27 at cam.ac.uk
Tel: (+)44 1223 763517
On 2010-01-18, at 19:59, Wojciech Turek wrote:
> RHEL4 Lustre-1.6.6
>
> Does the kernel panic below ring a bell to anyone?
>
> RIP: 0010:[<ffffffff801af8f0>] <ffffffff801af8f0>{proc_pid_status+534}
> Process collectl (pid: 13546, threadinfo 0000010416fc8000, task
> Call Trace:<ffffffff801acfd3>{proc_info_read+85}
>            <ffffffff80178dac>{vfs_read+207}
>            <ffffffff80179008>{sys_read+69}
>            <ffffffff80110236>{system_call+126}

This looks like collectl reading from a /proc entry after it was cleaned
up. I think several such bugs were already fixed.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Thanks, Andreas, for the quick answer. So upgrading to a newer version of
collectl should fix it?

Cheers
Wojciech
On 2010-01-18, at 23:09, Wojciech Turek wrote:
> Thanks, Andreas, for the quick answer. So upgrading to a newer version of
> collectl should fix it?

No, it is a Lustre bug, not a collectl bug. I think a newer version of
Lustre has fixes in lprocfs to avoid such races.

Cheers, Andreas
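A side note on the original oops, offered as a sketch rather than a diagnosis: the last two words of the Stack: dump (395f74646d5f6c6c and 0072650065720037) look like ASCII text rather than addresses. Assuming they are 64-bit words stored little-endian, as on x86_64, the Python below decodes them. The helper names are made up for illustration and are not part of any tool mentioned in the thread.

```python
import struct

# The last two 64-bit words of the "Stack:" dump, copied from the oops.
STACK_WORDS = [0x395f74646d5f6c6c, 0x0072650065720037]


def decode_words(words):
    """Pack 64-bit words as little-endian bytes (assumed x86_64 memory order)."""
    return b"".join(struct.pack("<Q", w) for w in words)


def first_c_string(raw):
    """Return the bytes before the first NUL, decoded as ASCII."""
    return raw.split(b"\x00", 1)[0].decode("ascii", errors="replace")


raw = decode_words(STACK_WORDS)
name = first_c_string(raw)
print(name)  # prints: ll_mdt_97
```

The decoded prefix resembles a Lustre service thread name, which would fit Andreas's point that the fault is on the Lustre side rather than in collectl. That reading is an inference from the dump; nothing in the thread confirms it.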