Wojciech Turek
2007-Nov-19 16:26 UTC
[Lustre-discuss] Kernel BUG causing kernel panic kernel 2.6.9-55.0.9.EL_lustre.1.6.3smp
Dear All,

We are experiencing frequent OSS crashes. We have 4 OSSs, and each OSS serves 6 OSTs to 600 clients. We observe random OSS crashes every 1-2 days; see below the console output captured during a crash. Does this look familiar to any of you? We have seen the same crashes with Lustre 1.6.2.

Nov 18 15:17:21 storage08 heartbeat: [25566]: info: Checking status of STONITH
Nov 18 15:17:21 storage08 heartbeat: [24250]: info: Exiting STONITH-stat process
Kernel BUG at mballoc:3352
invalid operand: 0000 [1] SMP
CPU 0
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) lustre(U) lov(U) mdc(U) lquota(U) ptlrpc(U) obdclass(U) lvfs(U) sg(U) ksocklnd(U) lnet(U) libcfs(U) cxgb3(U) ipmi_si(U) ipmi_devintf(U) ipmi_msghandler(U) md5(U) ipv6(U) autofs4(U) i2c_nforce2(U) i2c_amd756(U) i2c_isa(U) i2c_amd8111(U) i2c_i801(U) i2c_core(U) mptctl(U) dm_mirror(U) dm_round_robin(U) dm_multipath(U) dm_mod(U) sr_mod(U) usb_storage(U) joydev(U) button(U) battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) hw_random(U) qla2400(U) qla2xxx(U) scsi_transport_fc(U) ata_piix(U) ext3(U) jbd(U) xfs(U) tg3(U) s2io(U) nfs(U) nfs_acl(U) lockd(U) sunrpc(U) mptsas(U) mptscsi(U) mptbase(U) megaraid_sas(U) e1000(U) bnx2(U) sd_mod(U)
Pid: 9070, comm: ll_ost_io_151 Tainted: GF 2.6.9-55.0.9.EL_lustre.1.6.3smp
RIP: 0010:[<ffffffffa05e2923>] <ffffffffa05e2923>{:ldiskfs:ldiskfs_mb_generate_from_pa+179}
RSP: 0018:00000100c9721268 EFLAGS: 00010297
RAX: 0000000000002177 RBX: 0000000000000000 RCX: 00000100c9721288
RDX: 0000000000000000 RSI: 0000000000002178 RDI: 0000010077ce42b0
RBP: 0000010077ce4290 R08: 00000100c9721280 R09: 01ff80000007c008
R10: 0000080000000000 R11: ffffffffffffffff R12: 0000010077ce42b0
R13: 000001007fb09000 R14: 0000000000000000 R15: 00000100ad763c28
FS: 0000002a95565b00(0000) GS:ffffffff804a6700(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000002a984c80e8 CR3: 0000000000101000 CR4: 00000000000006e0
Process ll_ost_io_151 (pid: 9070, threadinfo 00000100c9720000, task 00000100c96f1800)
Stack: 0000000000001000 0000000000002177 00000100b5196400 0000000000002178
       0000000000000000 0000000000002177 00000100a56d6ee0 0000000000000000
       0000000000002177 0000000000000000
Call Trace:
       <ffffffffa05e310a>{:ldiskfs:ldiskfs_mb_init_cache+1898}
       <ffffffffa05e3340>{:ldiskfs:ldiskfs_mb_load_buddy+304}
       <ffffffffa05e96e2>{:ldiskfs:ldiskfs_mb_free_blocks+626}
       <ffffffffa0180920>{:jbd:journal_get_write_access+48}
       <ffffffff801589d9>{find_get_page+65}
       <ffffffff801798e7>{__find_get_block_slow+62}
       <ffffffff8017a097>{__find_get_block+162}
       <ffffffffa0180920>{:jbd:journal_get_write_access+48}
       <ffffffffa05c9933>{:ldiskfs:ldiskfs_free_blocks+163}
       <ffffffffa05e165a>{:ldiskfs:ldiskfs_remove_blocks+282}
       <ffffffffa05e0ff4>{:ldiskfs:ldiskfs_ext_remove_space+1508}
       <ffffffffa05ce27c>{:ldiskfs:ldiskfs_mark_inode_dirty+76}
       <ffffffffa05e1f80>{:ldiskfs:ldiskfs_ext_truncate+368}
       <ffffffffa05cfcb5>{:ldiskfs:ldiskfs_truncate+309}
       <ffffffff80167df9>{unmap_mapping_range+339}
       <ffffffffa05ce11a>{:ldiskfs:ldiskfs_mark_iloc_dirty+1034}
       <ffffffff80167ea4>{vmtruncate+162}
       <ffffffff80191c88>{inode_setattr+41}
       <ffffffffa05cf5bc>{:ldiskfs:ldiskfs_setattr+444}
       <ffffffffa062ae72>{:fsfilt_ldiskfs:fsfilt_ldiskfs_setattr+386}
       <ffffffffa064af7b>{:obdfilter:filter_destroy+3131}
       <ffffffffa0456da0>{:ptlrpc:ldlm_completion_ast+0}
       <ffffffff802f069d>{tcp_rcv_established+2099}
       <ffffffffa047bd83>{:ptlrpc:lustre_msg_add_version+83}
       <ffffffffa047d205>{:ptlrpc:lustre_msg_check_version+69}
       <ffffffffa061a25d>{:ost:ost_handle+6397}
       <ffffffff802dfc76>{ip_rcv+1046}
       <ffffffff802c6861>{netif_receive_skb+791}
       <ffffffffa031a9ba>{:cxgb3:lro_flush_session+154}
       <ffffffffa035fb58>{:lnet:lnet_match_blocked_msg+920}
       <ffffffffa0485b4c>{:ptlrpc:ptlrpc_server_handle_request+3036}
       <ffffffffa033cbae>{:libcfs:lcw_update_time+30}
       <ffffffff8013f448>{__mod_timer+293}
       <ffffffffa04881d8>{:ptlrpc:ptlrpc_main+2504}
       <ffffffff80133566>{default_wake_function+0}
       <ffffffffa0486860>{:ptlrpc:ptlrpc_retry_rqbds+0}
       <ffffffffa0486860>{:ptlrpc:ptlrpc_retry_rqbds+0}
       <ffffffff80110de3>{child_rip+8}
       <ffffffffa0487810>{:ptlrpc:ptlrpc_main+0}
       <ffffffff80110ddb>{child_rip+0}
Code: 0f 0b d2 bb 5e a0 ff ff ff ff 18 0d 90 8b 4c 24 20 8d 34 0b
RIP <ffffffffa05e2923>{:ldiskfs:ldiskfs_mb_generate_from_pa+179} RSP <00000100c9721268>
<0>Kernel panic - not syncing: Oops

Best regards,
Wojciech Turek

Mr Wojciech Turek
Assistant System Manager
University of Cambridge High Performance Computing Service
email: wjt27 at cam.ac.uk
tel. +441223763517
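[Editorial note on the trace above: the faulting function, ldiskfs_mb_generate_from_pa, is the ldiskfs counterpart of the mballoc routine that rebuilds a block group's in-memory allocation bitmap from that group's list of preallocated extents, and a "Kernel BUG at mballoc:3352" line means a BUG_ON() consistency check in mballoc.c fired there. The exact check at line 3352 of this ldiskfs version is not shown in the thread, so the following is only a toy user-space sketch of this *class* of check (double-claimed block), with all names and values hypothetical, not the actual ldiskfs source:]

    /*
     * Toy illustration of a mballoc-style consistency check.
     * Hypothetical simplification: walk a group's preallocation
     * list, mark each claimed block in the bitmap, and treat a
     * block claimed twice as a fatal inconsistency, the way a
     * kernel BUG_ON() panics the machine.
     */
    #include <assert.h>
    #include <stdio.h>

    #define GROUP_BLOCKS 128            /* blocks per group (toy value) */

    struct prealloc {                   /* stand-in for a preallocated extent */
            unsigned start;             /* first block within the group */
            unsigned len;               /* number of blocks */
    };

    static unsigned char bitmap[GROUP_BLOCKS];  /* 1 = block in use */

    static void generate_from_pa(const struct prealloc *pa, int npa)
    {
            for (int i = 0; i < npa; i++)
                    for (unsigned b = pa[i].start; b < pa[i].start + pa[i].len; b++) {
                            /* assert() stands in for the kernel's BUG_ON():
                             * no block may be claimed by two preallocations. */
                            assert(bitmap[b] == 0);
                            bitmap[b] = 1;
                    }
    }

    int main(void)
    {
            /* Two overlapping preallocations: blocks 12-15 are claimed
             * twice, so the assert trips, analogous to the panic above. */
            struct prealloc pa[] = { { 8, 8 }, { 12, 4 } };

            generate_from_pa(pa, 2);
            puts("bitmap consistent");  /* not reached in this example */
            return 0;
    }

[Compiling and running this aborts on the first doubly-claimed block, which is the user-space analogue of the OSS panicking rather than continuing with inconsistent allocation state.]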
Johann Lombardi
2007-Nov-19 16:38 UTC
[Lustre-discuss] Kernel BUG causing kernel panic kernel 2.6.9-55.0.9.EL_lustre.1.6.3smp
On Mon, Nov 19, 2007 at 04:26:37PM +0000, Wojciech Turek wrote:
> We are experiencing frequent OSS crashes. We have 4 OSSs, and each OSS
> serves 6 OSTs to 600 clients. We observe random OSS crashes every 1-2
> days; see below the console output captured during a crash. Does this
> look familiar to any of you? We have seen the same crashes with Lustre
> 1.6.2.

Yes, please see bugzilla ticket #13438. The fix has been landed for 1.6.4.

Johann