Aaron Knister
2008-May-28 17:16 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
I'm seeing this bug (14465) under heavy load on my OSSes. If I reboot the MDS it appears to help...any ideas? What's the status on this bug?
Alex Zhuravlev
2008-May-28 20:54 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
Aaron Knister wrote:
> I'm seeing this bug (14465) under heavy load on my OSSes. If I reboot
> the MDS it appears to help...any ideas? What's the status on this bug?

could you attach your dmesg to the bug? as for the status - I'm still not able to reproduce this, nor have I found a possible cause, sorry.

thanks, Alex
Aaron Knister
2008-May-29 00:03 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
Thank you very much for looking into this. I've attached my dmesg to the bug. I looked at line number 1334, which the panic seems to reference. I can't figure out what it's doing, though.

On Wed, May 28, 2008 at 4:54 PM, Alex Zhuravlev <Alex.Zhuravlev@sun.com> wrote:
> could you attach your dmesg to the bug? as for the status - I'm still not
> able to reproduce this, nor have I found a possible cause, sorry.
>
> thanks, Alex
Aaron Knister
2008-May-30 15:29 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
I bumped up debugging, and here's (below) the last bit of debugging info from Lustre that I have on the OSS before it went belly up. My system is totally inoperable. Does anybody have any ideas?

00010000:00001000:4:1212157576.884909:0:6378:0:(ldlm_resource.c:865:ldlm_resource_add_lock()) About to add this lock:
00010000:00001000:4:1212157576.884910:0:6378:0:(ldlm_lock.c:1718:ldlm_lock_dump()) -- Lock dump: ffff81036e61f1c0/0x1fb135e1e3fd2cc6 (rc: 3) (pos: 0) (pid: 6378)
00010000:00001000:4:1212157576.884913:0:6378:0:(ldlm_lock.c:1726:ldlm_lock_dump()) Node: local
00010000:00001000:4:1212157576.884914:0:6378:0:(ldlm_lock.c:1735:ldlm_lock_dump()) Resource: ffff8103849211c0 (3129942/0)
00010000:00001000:4:1212157576.884915:0:6378:0:(ldlm_lock.c:1740:ldlm_lock_dump()) Req mode: PW, grant mode: PW, rc: 3, read: 0, write: 1 flags: 0x80004000
00010000:00001000:4:1212157576.884917:0:6378:0:(ldlm_lock.c:1746:ldlm_lock_dump()) Extent: 0 -> 18446744073709551615 (req 0-18446744073709551615)
00010000:00000040:4:1212157576.884920:0:6378:0:(ldlm_lock.c:615:ldlm_lock_decref_internal()) forcing cancel of local lock
00010000:00000010:4:1212157576.884922:0:6378:0:(ldlm_lockd.c:1357:ldlm_bl_to_thread()) kmalloced 'blwi': 120 at ffff81040e90a340 (tot 49135175)
00002000:00000040:4:1212157576.884925:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) started handle ffff8103766dfc78 (0000000000000000)
00002000:00000040:4:1212157576.884930:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) committing handle ffff8103766dfc78
00002000:00000040:4:1212157576.884931:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) started handle ffff8103766dfc78 (0000000000000000)
00000020:00000040:4:1212157576.884957:0:5557:0:(lustre_handles.c:121:class_handle_unhash_nolock()) removing object ffff81036e61f1c0 with handle 0x1fb135e1e3fd2cc6 from hash
00000100:00000010:4:1212157576.884960:0:5557:0:(client.c:394:ptlrpc_prep_set()) kmalloced 'set': 104 at ffff8104012d38c0 (tot 49135279)
00000100:00000010:4:1212157576.884962:0:5557:0:(client.c:457:ptlrpc_set_destroy()) kfreed 'set': 104 at ffff8104012d38c0 (tot 49135175).
00010000:00000040:4:1212157576.884964:0:5557:0:(ldlm_resource.c:818:ldlm_resource_putref()) putref res: ffff8103849211c0 count: 0
00010000:00000010:4:1212157576.884969:0:5557:0:(ldlm_resource.c:828:ldlm_resource_putref()) kfreed 'res->lr_lvb_data': 40 at ffff810379ded880 (tot 49135135).
00010000:00000010:4:1212157576.885000:0:5557:0:(ldlm_resource.c:829:ldlm_resource_putref()) slab-freed 'res': 224 at ffff8103849211c0 (tot 49135135).
00010000:00000010:4:1212157576.885002:0:5557:0:(ldlm_lockd.c:1657:ldlm_bl_thread_main()) kfreed 'blwi': 120 at ffff81040e90a340 (tot 49134791).
00002000:00000040:4:1212157576.885623:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) committing handle ffff8103766dfc78
00002000:00000002:4:1212157576.885625:0:6378:0:(filter.c:148:f_dput()) putting 3129942: ffff8103599cea98, count = 0
00002000:00080000:4:1212157576.885627:0:6378:0:(filter.c:2689:filter_destroy_precreated()) crew4-OST0001: after destroy: set last_objids[0] = 3129941
00002000:00000002:4:1212157576.885630:0:6378:0:(filter.c:607:filter_update_last_objid()) crew4-OST0001: server last_objid for group 0: 3129941
00002000:00000010:4:1212157576.912615:0:6485:0:(fsfilt-ldiskfs.c:747:fsfilt_ldiskfs_cb_func()) slab-freed 'fcb': 56 at ffff810371404920 (tot 49134335).
00010000:00000040:4:1212157576.912669:0:6378:0:(ldlm_lib.c:1556:target_committed_to_req()) last_committed 17896268, xid 3841
00000100:00000040:4:1212157576.912674:0:6378:0:(connection.c:191:ptlrpc_connection_addref()) connection=ffff8103fbe9e2c0 refcount 10 to 172.18.0.10@o2ib
00000100:00000040:4:1212157576.912678:0:6378:0:(niobuf.c:46:ptl_send_buf()) conn=ffff8103fbe9e2c0 id 12345-172.18.0.10@o2ib
00000400:00000010:4:1212157576.912680:0:6378:0:(lib-lnet.h:247:lnet_md_alloc()) kmalloced 'md': 136 at ffff81040cb6cb80 (tot 9568949).
00000400:00000010:4:1212157576.912683:0:6378:0:(lib-lnet.h:295:lnet_msg_alloc()) kmalloced 'msg': 336 at ffff8104285e1e00 (tot 9569285).
00000100:00000040:4:1212157576.912693:0:6378:0:(connection.c:150:ptlrpc_put_connection()) connection=ffff8103fbe9e2c0 refcount 9 to 172.18.0.10@o2ib
00000100:00000040:4:1212157576.912695:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) RPC PUTting export ffff8103848e9000 : new rpc_count 0
00000100:00000040:4:1212157576.912697:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) PUTting export ffff8103848e9000 : new refcount 4
00000100:00000040:4:1212157576.912699:0:6378:0:(service.c:652:ptlrpc_server_handle_request()) PUTting export ffff8103848e9000 : new refcount 3
00000400:00000010:4:1212157576.912741:0:5351:0:(lib-lnet.h:269:lnet_md_free()) kfreed 'md': 136 at ffff81040cb6cb80 (tot 9569149).
00000400:00000010:4:1212157576.912744:0:5351:0:(lib-lnet.h:312:lnet_msg_free()) kfreed 'msg': 336 at ffff8104285e1e00 (tot 9568813).
Alex Zhuravlev
2008-Jun-02 07:40 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
what about dmesg output? It's unlikely Lustre debug can help here, as the problem seems to be very internal to ldiskfs (the mballoc piece of it).

thanks, Alex

Aaron Knister wrote:
> I bumped up debugging, and here's (below) the last bit of debugging info
> from Lustre that I have on the OSS before it went belly up. My system is
> totally inoperable. Does anybody have any ideas?
> [...]
Aaron Knister
2008-Jun-02 12:01 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
Thanks so much for looking into this. Here's what I got from dmesg. Interestingly enough, every time it panics the CPU listed is #4...do you think that points to a hardware problem?

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...build/BUILD/lustre-ldiskfs-3.0.4/ldiskfs/mballoc.c:1334
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:04:00.1/irq
CPU 4
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) lquota(U) mdc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) netconsole(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) ipv6(U) ib_iser(U) libiscsi(U) scsi_transport_iscsi(U) rdma_ucm(U) ib_ucm(U) ib_srp(U) ib_sdp(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_addr(U) ib_local_sa(U) ib_ipoib(U) ib_sa(U) ib_uverbs(U) ib_umad(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) ib_mthca(U) ib_mad(U) i2c_i801(U) e1000(U) ide_cd(U) ib_core(U) shpchp(U) i2c_core(U) cdrom(U) sg(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) ata_piix(U) ahci(U) megaraid_sas(U) sata_sil(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Pid: 5717, comm: ll_ost_io_11 Tainted: GF 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1
RIP: 0010:[<ffffffff8885a01f>]  [<ffffffff8885a01f>] :ldiskfs:ldiskfs_mb_use_best_found+0xef/0x520
RSP: 0018:ffff81040e4dd320  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff81040e4dd3d0 RCX: 000000000000007f
RDX: ffff810406833000 RSI: ffff8103a8250000 RDI: 0000000000001000
RBP: 0000000000000800 R08: 0000000000000020 R09: 0000000000000010
R10: ffff81037d136ff8 R11: 0000000000008000 R12: ffff81040e4dd470
R13: 0000000000010000 R14: 0000000000000020 R15: 0000000000000800
FS:  00002aaaaaac0220(0000) GS:ffff81042fc20b40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaae2ad0a0 CR3: 000000041c6d1000 CR4: 00000000000006e0
Process ll_ost_io_11 (pid: 5717, threadinfo ffff81040e4dc000, task ffff81041228a040)
Stack: ffff81040e4dd4c0 ffff81040e4dd470 000000000000000b 0000000000010000
 0000000000010000 0000000000000020 ffff81040e4dd4c0 ffffffff8885cd4e
 ffff81038c2b0a38 0000000000000010 ffff81040e4dd4a0 0000000000000000
Call Trace:
 [<ffffffff8885cd4e>] :ldiskfs:ldiskfs_mb_regular_allocator+0x66e/0xd50
 [<ffffffff888443c2>] :ldiskfs:ldiskfs_mark_inode_dirty+0x132/0x160
 [<ffffffff88859664>] :ldiskfs:ldiskfs_mb_initialize_context+0x144/0x160
 [<ffffffff8885e1e6>] :ldiskfs:ldiskfs_mb_new_blocks+0x166/0x280
 [<ffffffff888bf091>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x301/0x640
 [<ffffffff8000a63f>] get_page_from_freelist+0x223/0x3cf
 [<ffffffff882390b0>] :ib_mthca:mthca_arbel_post_send+0x5a2/0x5b4
 [<ffffffff88856d27>] :ldiskfs:ldiskfs_ext_walk_space+0x1b7/0x250
 [<ffffffff888bed90>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x640
 [<ffffffff80054ccc>] mwait_idle+0x0/0x4a
 [<ffffffff888ba9ae>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xfe/0x150
 [<ffffffff888eee08>] :obdfilter:filter_direct_io+0x478/0xce0
 [<ffffffff888f14de>] :obdfilter:filter_commitrw_write+0x184e/0x2570
 [<ffffffff8005a2ea>] cache_alloc_refill+0x106/0x186
 [<ffffffff8003ce86>] lock_timer_base+0x1b/0x3c
 [<ffffffff8889eb46>] :ost:ost_brw_write+0x21b6/0x28c0
 [<ffffffff80088431>] default_wake_function+0x0/0xe
 [<ffffffff888a283e>] :ost:ost_handle+0x2a8e/0x58d8
 [<ffffffff88603c82>] :obdclass:class_handle2object+0xd2/0x160
 [<ffffffff8869f230>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
 [<ffffffff8869cde5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
 [<ffffffff886a4a3b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
 [<ffffffff80060f29>] thread_return+0x0/0xeb
 [<ffffffff8006b6c9>] do_gettimeofday+0x50/0x92
 [<ffffffff8855d056>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003ce86>] lock_timer_base+0x1b/0x3c
 [<ffffffff886a747c>] :ptlrpc:ptlrpc_main+0x7dc/0x950
 [<ffffffff80088431>] default_wake_function+0x0/0xe
 [<ffffffff8005bfb1>] child_rip+0xa/0x11
 [<ffffffff886a6ca0>] :ptlrpc:ptlrpc_main+0x0/0x950
 [<ffffffff8005bfa7>] child_rip+0x0/0x11

Code: 0f 0b 68 50 5c 86 88 c2 36 05 0f 1f 80 00 00 00 00 48 8b 43
RIP  [<ffffffff8885a01f>] :ldiskfs:ldiskfs_mb_use_best_found+0xef/0x520
 RSP <ffff81040e4dd320>
<0>Kernel panic - not syncing: Fatal exception

On Jun 2, 2008, at 3:40 AM, Alex Zhuravlev wrote:
> what about dmesg output? It's unlikely Lustre debug can help here, as the
> problem seems to be very internal to ldiskfs (the mballoc piece of it).
>
> thanks, Alex
> [...]
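[For context on the line being asked about: ldiskfs_mb_use_best_found() is the mballoc routine that takes the free extent the block allocator has settled on and marks it used in the block group's in-memory buddy bitmap; the trace above shows it being reached from ldiskfs_mb_regular_allocator() while an OST write (filter_commitrw_write -> fsfilt_map_nblocks) was allocating blocks. The sketch below only illustrates the kind of internal consistency assertion that a "Kernel BUG at mballoc.c:1334" points at. It is modeled on the ext3/ext4 mballoc code that ldiskfs is generated from; the struct and field names and the exact assertion at that line are assumptions, not the verbatim lustre-ldiskfs 3.0.4 source.]

/*
 * Illustrative sketch only -- not the shipped 1.6.4.3 mballoc.c.
 * Field names and the specific BUG_ON()s are assumptions based on the
 * ext3/ext4 mballoc code that ldiskfs is derived from.
 */
static void ldiskfs_mb_use_best_found(struct ldiskfs_allocation_context *ac,
                                      struct ldiskfs_buddy *e3b)
{
        /* the "best found" extent must belong to the group whose buddy
         * bitmap is currently loaded, and must not be empty */
        BUG_ON(ac->ac_b_ex.fe_group != e3b->bd_group);
        BUG_ON(ac->ac_b_ex.fe_len <= 0);

        /* mark those blocks used in the in-memory buddy bitmap; the
         * marking code asserts as well that the extent does not run past
         * the end of the group, i.e. that the allocator's bookkeeping and
         * the block bitmap it just read have not diverged */
        mb_mark_used(e3b, &ac->ac_b_ex);

        ac->ac_status = AC_STATUS_FOUND;
        /* ... record the winning extent in the allocation context ... */
}

[Tripping a check of this kind generally means the allocator's view of free blocks contradicts the bitmap it just loaded, which fits Alex's point that the problem is internal to mballoc, and also leaves room for a hardware explanation such as bad memory on the OSS.]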
Alex Zhuravlev
2008-Jun-02 12:08 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
could you also send me your mballoc.c please?

thanks, Alex

Aaron Knister wrote:
> Thanks so much for looking into this. Here's what I got from dmesg.
> Interestingly enough, every time it panics the CPU listed is #4...do you
> think that points to a hardware problem?
> [...]
Aaron Knister
2008-Jun-02 12:24 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
I'm not sure how to obtain that. I'm running whatever was distributed with 1.6.4.3 for rhel5-x86_64.

On Jun 2, 2008, at 8:08 AM, Alex Zhuravlev wrote:
> could you also send me your mballoc.c please?
>
> thanks, Alex
> [...]
Alex Zhuravlev
2008-Jun-05 07:18 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
hmm. could you provide me with ldiskfs.ko then?

thanks, Alex

Aaron Knister wrote:
> I'm not sure how to obtain that. I'm running whatever was distributed
> with 1.6.4.3 for rhel5-x86_64.
> [...]
:ldiskfs:ldiskfs_ext_walk_space+0x1b7/0x250 >>> [<ffffffff888bed90>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x640 >>> [<ffffffff80054ccc>] mwait_idle+0x0/0x4a >>> [<ffffffff888ba9ae>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xfe/0x150 >>> [<ffffffff888eee08>] :obdfilter:filter_direct_io+0x478/0xce0 >>> [<ffffffff888f14de>] :obdfilter:filter_commitrw_write+0x184e/0x2570 >>> [<ffffffff8005a2ea>] cache_alloc_refill+0x106/0x186 >>> [<ffffffff8003ce86>] lock_timer_base+0x1b/0x3c >>> [<ffffffff8889eb46>] :ost:ost_brw_write+0x21b6/0x28c0 >>> [<ffffffff80088431>] default_wake_function+0x0/0xe >>> [<ffffffff888a283e>] :ost:ost_handle+0x2a8e/0x58d8 >>> [<ffffffff88603c82>] :obdclass:class_handle2object+0xd2/0x160 >>> [<ffffffff8869f230>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90 >>> [<ffffffff8869cde5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0 >>> [<ffffffff886a4a3b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270 >>> [<ffffffff80060f29>] thread_return+0x0/0xeb >>> [<ffffffff8006b6c9>] do_gettimeofday+0x50/0x92 >>> [<ffffffff8855d056>] :libcfs:lcw_update_time+0x16/0x100 >>> [<ffffffff8003ce86>] lock_timer_base+0x1b/0x3c >>> [<ffffffff886a747c>] :ptlrpc:ptlrpc_main+0x7dc/0x950 >>> [<ffffffff80088431>] default_wake_function+0x0/0xe >>> [<ffffffff8005bfb1>] child_rip+0xa/0x11 >>> [<ffffffff886a6ca0>] :ptlrpc:ptlrpc_main+0x0/0x950 >>> [<ffffffff8005bfa7>] child_rip+0x0/0x11 >>> Code: 0f 0b 68 50 5c 86 88 c2 36 05 0f 1f 80 00 00 00 00 48 8b 43 >>> RIP [<ffffffff8885a01f>] :ldiskfs:ldiskfs_mb_use_best_found+0xef/0x520 >>> RSP <ffff81040e4dd320> >>> <0>Kernel panic - not syncing: Fatal exception >>> On Jun 2, 2008, at 3:40 AM, Alex Zhuravlev wrote: >>>> what about dmesg output? it''s unlikely lustre debug can help here as >>>> the problem >>>> seem to be very internal to ldiskfs (mballoc piece of it) >>>> >>>> thanks, Alex >>>> >>>> Aaron Knister wrote: >>>>> I bumped up debugging, and here''s (below) the last bit of debugging >>>>> info from lustre that I have on the oss before it went belly up. My >>>>> system is totally inoperable. Does anybody have any ideas? 
>>>>> >>>>> 00010000:00001000:4:1212157576.884909:0:6378:0:(ldlm_resource.c:865:ldlm_resource_add_lock()) >>>>> >>>>> About to add this lock: >>>>> 00010000:00001000:4:1212157576.884910:0:6378:0:(ldlm_lock.c:1718:ldlm_lock_dump()) >>>>> >>>>> -- Lock dump: ffff81036e61f1c0/0x1fb135e1e3fd2cc6 (rc: 3) (pos: 0) >>>>> (pid:6378) >>>>> 00010000:00001000:4:1212157576.884913:0:6378:0:(ldlm_lock.c:1726:ldlm_lock_dump()) >>>>> >>>>> Node: local >>>>> 00010000:00001000:4:1212157576.884914:0:6378:0:(ldlm_lock.c:1735:ldlm_lock_dump()) >>>>> >>>>> Resource: ffff8103849211c0 (3129942/0) >>>>> 00010000:00001000:4:1212157576.884915:0:6378:0:(ldlm_lock.c:1740:ldlm_lock_dump()) >>>>> >>>>> Req mode: PW, grant mode: PW, rc: 3, read: 0, write: 1 flags: >>>>> 0x80004000 >>>>> 00010000:00001000:4:1212157576.884917:0:6378:0:(ldlm_lock.c:1746:ldlm_lock_dump()) >>>>> >>>>> Extent: 0 -> 18446744073709551615 (req 0-18446744073709551615) >>>>> 00010000:00000040:4:1212157576.884920:0:6378:0:(ldlm_lock.c:615:ldlm_lock_decref_internal()) >>>>> >>>>> forcing cancel of local lock >>>>> 00010000:00000010:4:1212157576.884922:0:6378:0:(ldlm_lockd.c:1357:ldlm_bl_to_thread()) >>>>> >>>>> kmalloced ''blwi'': 120 at ffff81040e90a340 (tot 49135175) >>>>> 00002000:00000040:4:1212157576.884925:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) >>>>> >>>>> started handle ffff8103766dfc78 (0000000000000000) >>>>> 00002000:00000040:4:1212157576.884930:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) >>>>> >>>>> committing handle ffff8103766dfc78 >>>>> 00002000:00000040:4:1212157576.884931:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) >>>>> >>>>> started handle ffff8103766dfc78 (0000000000000000) >>>>> 00000020:00000040:4:1212157576.884957:0:5557:0:(lustre_handles.c:121:class_handle_unhash_nolock()) >>>>> >>>>> removing object ffff81036e61f1c0 with handle 0x1fb135e1e3fd2cc6 >>>>> from hash >>>>> 00000100:00000010:4:1212157576.884960:0:5557:0:(client.c:394:ptlrpc_prep_set()) >>>>> >>>>> kmalloced ''set'': 104 at ffff8104012d38c0 (tot 49135279) >>>>> 00000100:00000010:4:1212157576.884962:0:5557:0:(client.c:457:ptlrpc_set_destroy()) >>>>> >>>>> kfreed ''set'': 104 at ffff8104012d38c0 (tot 49135175). >>>>> 00010000:00000040:4:1212157576.884964:0:5557:0:(ldlm_resource.c:818:ldlm_resource_putref()) >>>>> >>>>> putref res: ffff8103849211c0 count: 0 >>>>> 00010000:00000010:4:1212157576.884969:0:5557:0:(ldlm_resource.c:828:ldlm_resource_putref()) >>>>> >>>>> kfreed ''res->lr_lvb_data'': 40 at ffff810379ded880 (tot 49135135). >>>>> 00010000:00000010:4:1212157576.885000:0:5557:0:(ldlm_resource.c:829:ldlm_resource_putref()) >>>>> >>>>> slab-freed ''res'': 224 at ffff8103849211c0 (tot 49135135). >>>>> 00010000:00000010:4:1212157576.885002:0:5557:0:(ldlm_lockd.c:1657:ldlm_bl_thread_main()) >>>>> >>>>> kfreed ''blwi'': 120 at ffff81040e90a340 (tot 49134791). 
>>>>> 00002000:00000040:4:1212157576.885623:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) >>>>> >>>>> committing handle ffff8103766dfc78 >>>>> 00002000:00000002:4:1212157576.885625:0:6378:0:(filter.c:148:f_dput()) >>>>> putting 3129942: ffff8103599cea98, count = 0 >>>>> 00002000:00080000:4:1212157576.885627:0:6378:0:(filter.c:2689:filter_destroy_precreated()) >>>>> >>>>> crew4-OST0001: after destroy: set last_objids[0] = 3129941 >>>>> 00002000:00000002:4:1212157576.885630:0:6378:0:(filter.c:607:filter_update_last_objid()) >>>>> >>>>> crew4-OST0001: server last_objid for group 0: 3129941 >>>>> 00002000:00000010:4:1212157576.912615:0:6485:0:(fsfilt-ldiskfs.c:747:fsfilt_ldiskfs_cb_func()) >>>>> >>>>> slab-freed ''fcb'': 56 at ffff810371404920 (tot 49134335). >>>>> 00010000:00000040:4:1212157576.912669:0:6378:0:(ldlm_lib.c:1556:target_committed_to_req()) >>>>> >>>>> last_committed 17896268, xid 3841 >>>>> 00000100:00000040:4:1212157576.912674:0:6378:0:(connection.c:191:ptlrpc_connection_addref()) >>>>> >>>>> connection=ffff8103fbe9e2c0 refcount 10 to 172.18.0.10 >>>>> <http://172.18.0.10>@o2ib >>>>> 00000100:00000040:4:1212157576.912678:0:6378:0:(niobuf.c:46:ptl_send_buf()) >>>>> >>>>> conn=ffff8103fbe9e2c0 id 12345-172.18.0.10 at o2ib >>>>> 00000400:00000010:4:1212157576.912680:0:6378:0:(lib-lnet.h:247:lnet_md_alloc()) >>>>> >>>>> kmalloced ''md'': 136 at ffff81040cb6cb80 (tot 9568949). >>>>> 00000400:00000010:4:1212157576.912683:0:6378:0:(lib-lnet.h:295:lnet_msg_alloc()) >>>>> >>>>> kmalloced ''msg'': 336 at ffff8104285e1e00 (tot 9569285). >>>>> 00000100:00000040:4:1212157576.912693:0:6378:0:(connection.c:150:ptlrpc_put_connection()) >>>>> >>>>> connection=ffff8103fbe9e2c0 refcount 9 to 172.18.0.10 >>>>> <http://172.18.0.10>@o2ib >>>>> 00000100:00000040:4:1212157576.912695:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) >>>>> >>>>> RPC PUTting export ffff8103848e9000 : new rpc_count 0 >>>>> 00000100:00000040:4:1212157576.912697:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) >>>>> >>>>> PUTting export ffff8103848e9000 : new refcount 4 >>>>> 00000100:00000040:4:1212157576.912699:0:6378:0:(service.c:652:ptlrpc_server_handle_request()) >>>>> >>>>> PUTting export ffff8103848e9000 : new refcount 3 >>>>> 00000400:00000010:4:1212157576.912741:0:5351:0:(lib-lnet.h:269:lnet_md_free()) >>>>> >>>>> kfreed ''md'': 136 at ffff81040cb6cb80 (tot 9569149). >>>>> 00000400:00000010:4:1212157576.912744:0:5351:0:(lib-lnet.h:312:lnet_msg_free()) >>>>> >>>>> kfreed ''msg'': 336 at ffff8104285e1e00 (tot 9568813). >>>>> >>>>> >>>>> On Wed, May 28, 2008 at 8:03 PM, Aaron Knister >>>>> <aaron.knister at gmail.com >>>>> <mailto:aaron.knister at gmail.com>> wrote: >>>>> >>>>> Thank you very much for looking into this. I''ve attached my dmesg to >>>>> the bug. i looked at line number 1334 which the panic seems to >>>>> reference. i can''t figure out what its doing though >>>>> >>>>> On Wed, May 28, 2008 at 4:54 PM, Alex Zhuravlev >>>>> <Alex.Zhuravlev at sun.com <mailto:Alex.Zhuravlev at sun.com>> wrote: >>>>> >>>>> Aaron Knister wrote: >>>>> >>>>> I''m seeing this bug (14465) under heavy load on my OSSes. If >>>>> I reboot the MDS it appears to help...any ideas? What''s the >>>>> status on this bug? >>>>> >>>>> >>>>> could you attach your dmesg to the bug? as for the status - I''m >>>>> still not >>>>> able to reproduce this, neither I found possible cause, sorry. 
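Debug output like the log excerpts quoted above is normally captured by raising the Lustre debug mask on the OSS and then dumping the in-kernel debug buffer after the problem has been reproduced. A minimal sketch, assuming a 1.6-era node where the mask is exposed under /proc/sys/lnet; proc paths, mask bits and buffer sizes vary between releases, so treat the exact values as illustrative:

# enable (nearly) all Lustre debug messages -- extremely verbose; assumes
# /proc/sys/lnet/debug exists on this release
sysctl -w lnet.debug=-1          # equivalently: echo -1 > /proc/sys/lnet/debug

# optionally enlarge the in-kernel debug buffer (size in MB)
sysctl -w lnet.debug_mb=256

# ... reproduce the load that triggers the problem ...

# dump (and clear) the kernel debug buffer to a file for later analysis
lctl dk /tmp/lustre-debug-$(hostname).log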
Aaron Knister
2008-Jun-05 12:42 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
Unfortunately the monkeys maintaining this setup (I left the company last week) have managed to destroy one of the 13 terabyte arrays...so they've got bigger issues right now. Thanks for your help though. On Jun 5, 2008, at 3:18 AM, Alex Zhuravlev wrote:> hmm. could you provide me with ldiskfs.ok then? > > thanks, Alex
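The oops reported earlier in the thread is a Kernel BUG at mballoc.c:1334 inside ldiskfs_mb_use_best_found (the "0f 0b" in the Code bytes is the ud2 instruction a BUG_ON produces), and "ldiskfs.ok" above presumably refers to the ldiskfs.ko module binary. Without the matching build tree, one rough way to see which check fired is to disassemble the installed module around the reported offset (ldiskfs_mb_use_best_found+0xef); a sketch, assuming binutils is available, and noting that the file/line annotations only appear if the module was built with debug info:

# find the ldiskfs module that matches the running kernel
MOD=$(modinfo -n ldiskfs)

# disassemble the function named in the panic; the reported RIP is
# 0xef bytes into ldiskfs_mb_use_best_found
objdump -d --line-numbers "$MOD" | grep -A 80 '<ldiskfs_mb_use_best_found>:' | less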
Alex Zhuravlev
2008-Jun-05 12:50 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
we've been investigating one issue which can be a source of your problem. fortunately we can reproduce that. thanks, Alex Aaron Knister wrote:> Unfortunately the monkeys maintaining this setup (I left the company > last week) have managed to destroy one of the 13 terabyte arrays...so > they've got bigger issues right now. Thanks for your help though.
Aaron Knister
2008-Jun-05 12:55 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
For my own curiosity's sake, is there a bugzilla id associated with it? I'm curious to know what the problem is/was. On Jun 5, 2008, at 8:50 AM, Alex Zhuravlev wrote:> we've been investigating one issue which can be a source of your > problem. > fortunately we can reproduce that. > > thanks, Alex
Alex Zhuravlev
2008-Jun-05 13:04 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
https://bugzilla.lustre.org/show_bug.cgi?id=15932 thanks, Alex Aaron Knister wrote:> For my own curiosity's sake, is there a bugzilla id associated with it? > I'm curious to know what the problem is/was.