Aaron Knister
2008-May-28 17:16 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
I'm seeing this bug (14465) under heavy load on my OSSes. If I reboot the MDS it appears to help...any ideas? What's the status on this bug?
Alex Zhuravlev
2008-May-28 20:54 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
Aaron Knister wrote:
> I'm seeing this bug (14465) under heavy load on my OSSes. If I reboot
> the MDS it appears to help...any ideas? What's the status on this bug?

could you attach your dmesg to the bug? as for the status - I'm still not able to reproduce this, nor have I found a possible cause, sorry.

thanks, Alex
Aaron Knister
2008-May-29 00:03 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
Thank you very much for looking into this. I've attached my dmesg to the bug. I looked at line number 1334, which the panic seems to reference. I can't figure out what it's doing, though.

On Wed, May 28, 2008 at 4:54 PM, Alex Zhuravlev <Alex.Zhuravlev@sun.com> wrote:
> could you attach your dmesg to the bug? as for the status - I'm still not
> able to reproduce this, nor have I found a possible cause, sorry.
>
> thanks, Alex
Aaron Knister
2008-May-30 15:29 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
I bumped up debugging, and here's (below) the last bit of debugging info from Lustre that I have on the OSS before it went belly up. My system is totally inoperable. Does anybody have any ideas?

00010000:00001000:4:1212157576.884909:0:6378:0:(ldlm_resource.c:865:ldlm_resource_add_lock()) About to add this lock:
00010000:00001000:4:1212157576.884910:0:6378:0:(ldlm_lock.c:1718:ldlm_lock_dump()) -- Lock dump: ffff81036e61f1c0/0x1fb135e1e3fd2cc6 (rc: 3) (pos: 0) (pid: 6378)
00010000:00001000:4:1212157576.884913:0:6378:0:(ldlm_lock.c:1726:ldlm_lock_dump()) Node: local
00010000:00001000:4:1212157576.884914:0:6378:0:(ldlm_lock.c:1735:ldlm_lock_dump()) Resource: ffff8103849211c0 (3129942/0)
00010000:00001000:4:1212157576.884915:0:6378:0:(ldlm_lock.c:1740:ldlm_lock_dump()) Req mode: PW, grant mode: PW, rc: 3, read: 0, write: 1 flags: 0x80004000
00010000:00001000:4:1212157576.884917:0:6378:0:(ldlm_lock.c:1746:ldlm_lock_dump()) Extent: 0 -> 18446744073709551615 (req 0-18446744073709551615)
00010000:00000040:4:1212157576.884920:0:6378:0:(ldlm_lock.c:615:ldlm_lock_decref_internal()) forcing cancel of local lock
00010000:00000010:4:1212157576.884922:0:6378:0:(ldlm_lockd.c:1357:ldlm_bl_to_thread()) kmalloced 'blwi': 120 at ffff81040e90a340 (tot 49135175)
00002000:00000040:4:1212157576.884925:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) started handle ffff8103766dfc78 (0000000000000000)
00002000:00000040:4:1212157576.884930:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) committing handle ffff8103766dfc78
00002000:00000040:4:1212157576.884931:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) started handle ffff8103766dfc78 (0000000000000000)
00000020:00000040:4:1212157576.884957:0:5557:0:(lustre_handles.c:121:class_handle_unhash_nolock()) removing object ffff81036e61f1c0 with handle 0x1fb135e1e3fd2cc6 from hash
00000100:00000010:4:1212157576.884960:0:5557:0:(client.c:394:ptlrpc_prep_set()) kmalloced 'set': 104 at ffff8104012d38c0 (tot 49135279)
00000100:00000010:4:1212157576.884962:0:5557:0:(client.c:457:ptlrpc_set_destroy()) kfreed 'set': 104 at ffff8104012d38c0 (tot 49135175).
00010000:00000040:4:1212157576.884964:0:5557:0:(ldlm_resource.c:818:ldlm_resource_putref()) putref res: ffff8103849211c0 count: 0
00010000:00000010:4:1212157576.884969:0:5557:0:(ldlm_resource.c:828:ldlm_resource_putref()) kfreed 'res->lr_lvb_data': 40 at ffff810379ded880 (tot 49135135).
00010000:00000010:4:1212157576.885000:0:5557:0:(ldlm_resource.c:829:ldlm_resource_putref()) slab-freed 'res': 224 at ffff8103849211c0 (tot 49135135).
00010000:00000010:4:1212157576.885002:0:5557:0:(ldlm_lockd.c:1657:ldlm_bl_thread_main()) kfreed 'blwi': 120 at ffff81040e90a340 (tot 49134791).
00002000:00000040:4:1212157576.885623:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) committing handle ffff8103766dfc78
00002000:00000002:4:1212157576.885625:0:6378:0:(filter.c:148:f_dput()) putting 3129942: ffff8103599cea98, count = 0
00002000:00080000:4:1212157576.885627:0:6378:0:(filter.c:2689:filter_destroy_precreated()) crew4-OST0001: after destroy: set last_objids[0] = 3129941
00002000:00000002:4:1212157576.885630:0:6378:0:(filter.c:607:filter_update_last_objid()) crew4-OST0001: server last_objid for group 0: 3129941
00002000:00000010:4:1212157576.912615:0:6485:0:(fsfilt-ldiskfs.c:747:fsfilt_ldiskfs_cb_func()) slab-freed 'fcb': 56 at ffff810371404920 (tot 49134335).
00010000:00000040:4:1212157576.912669:0:6378:0:(ldlm_lib.c:1556:target_committed_to_req()) last_committed 17896268, xid 3841
00000100:00000040:4:1212157576.912674:0:6378:0:(connection.c:191:ptlrpc_connection_addref()) connection=ffff8103fbe9e2c0 refcount 10 to 172.18.0.10@o2ib
00000100:00000040:4:1212157576.912678:0:6378:0:(niobuf.c:46:ptl_send_buf()) conn=ffff8103fbe9e2c0 id 12345-172.18.0.10@o2ib
00000400:00000010:4:1212157576.912680:0:6378:0:(lib-lnet.h:247:lnet_md_alloc()) kmalloced 'md': 136 at ffff81040cb6cb80 (tot 9568949).
00000400:00000010:4:1212157576.912683:0:6378:0:(lib-lnet.h:295:lnet_msg_alloc()) kmalloced 'msg': 336 at ffff8104285e1e00 (tot 9569285).
00000100:00000040:4:1212157576.912693:0:6378:0:(connection.c:150:ptlrpc_put_connection()) connection=ffff8103fbe9e2c0 refcount 9 to 172.18.0.10@o2ib
00000100:00000040:4:1212157576.912695:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) RPC PUTting export ffff8103848e9000 : new rpc_count 0
00000100:00000040:4:1212157576.912697:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) PUTting export ffff8103848e9000 : new refcount 4
00000100:00000040:4:1212157576.912699:0:6378:0:(service.c:652:ptlrpc_server_handle_request()) PUTting export ffff8103848e9000 : new refcount 3
00000400:00000010:4:1212157576.912741:0:5351:0:(lib-lnet.h:269:lnet_md_free()) kfreed 'md': 136 at ffff81040cb6cb80 (tot 9569149).
00000400:00000010:4:1212157576.912744:0:5351:0:(lib-lnet.h:312:lnet_msg_free()) kfreed 'msg': 336 at ffff8104285e1e00 (tot 9568813).
Alex Zhuravlev
2008-Jun-02 07:40 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
what about dmesg output? It's unlikely Lustre debug can help here, as the problem seems to be very internal to ldiskfs (the mballoc piece of it).

thanks, Alex

Aaron Knister wrote:
> I bumped up debugging, and here's (below) the last bit of debugging info
> from Lustre that I have on the OSS before it went belly up. My system is
> totally inoperable. Does anybody have any ideas?
> [...]
Aaron Knister
2008-Jun-02 12:01 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
Thanks so much for looking into this. Here's what I got from dmesg. Interestingly enough, every time it panics the CPU listed is #4...do you think that points to a hardware problem?

----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at ...build/BUILD/lustre-ldiskfs-3.0.4/ldiskfs/mballoc.c:1334
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:02.0/0000:01:00.0/0000:02:02.0/0000:04:00.1/irq
CPU 4
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) crc16(U) lustre(U) lov(U) lquota(U) mdc(U) ko2iblnd(U) ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) netconsole(U) autofs4(U) hidp(U) rfcomm(U) l2cap(U) bluetooth(U) sunrpc(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) ipv6(U) ib_iser(U) libiscsi(U) scsi_transport_iscsi(U) rdma_ucm(U) ib_ucm(U) ib_srp(U) ib_sdp(U) rdma_cm(U) ib_cm(U) iw_cm(U) ib_addr(U) ib_local_sa(U) ib_ipoib(U) ib_sa(U) ib_uverbs(U) ib_umad(U) dm_multipath(U) video(U) sbs(U) backlight(U) i2c_ec(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) ib_mthca(U) ib_mad(U) i2c_i801(U) e1000(U) ide_cd(U) ib_core(U) shpchp(U) i2c_core(U) cdrom(U) sg(U) pcspkr(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_mod(U) ata_piix(U) ahci(U) megaraid_sas(U) sata_sil(U) libata(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) ehci_hcd(U) ohci_hcd(U) uhci_hcd(U)
Pid: 5717, comm: ll_ost_io_11 Tainted: GF 2.6.18-53.1.13.el5_lustre.1.6.4.3smp #1
RIP: 0010:[<ffffffff8885a01f>]  [<ffffffff8885a01f>] :ldiskfs:ldiskfs_mb_use_best_found+0xef/0x520
RSP: 0018:ffff81040e4dd320  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff81040e4dd3d0 RCX: 000000000000007f
RDX: ffff810406833000 RSI: ffff8103a8250000 RDI: 0000000000001000
RBP: 0000000000000800 R08: 0000000000000020 R09: 0000000000000010
R10: ffff81037d136ff8 R11: 0000000000008000 R12: ffff81040e4dd470
R13: 0000000000010000 R14: 0000000000000020 R15: 0000000000000800
FS:  00002aaaaaac0220(0000) GS:ffff81042fc20b40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002aaaae2ad0a0 CR3: 000000041c6d1000 CR4: 00000000000006e0
Process ll_ost_io_11 (pid: 5717, threadinfo ffff81040e4dc000, task ffff81041228a040)
Stack: ffff81040e4dd4c0 ffff81040e4dd470 000000000000000b 0000000000010000
 0000000000010000 0000000000000020 ffff81040e4dd4c0 ffffffff8885cd4e
 ffff81038c2b0a38 0000000000000010 ffff81040e4dd4a0 0000000000000000
Call Trace:
 [<ffffffff8885cd4e>] :ldiskfs:ldiskfs_mb_regular_allocator+0x66e/0xd50
 [<ffffffff888443c2>] :ldiskfs:ldiskfs_mark_inode_dirty+0x132/0x160
 [<ffffffff88859664>] :ldiskfs:ldiskfs_mb_initialize_context+0x144/0x160
 [<ffffffff8885e1e6>] :ldiskfs:ldiskfs_mb_new_blocks+0x166/0x280
 [<ffffffff888bf091>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x301/0x640
 [<ffffffff8000a63f>] get_page_from_freelist+0x223/0x3cf
 [<ffffffff882390b0>] :ib_mthca:mthca_arbel_post_send+0x5a2/0x5b4
 [<ffffffff88856d27>] :ldiskfs:ldiskfs_ext_walk_space+0x1b7/0x250
 [<ffffffff888bed90>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x640
 [<ffffffff80054ccc>] mwait_idle+0x0/0x4a
 [<ffffffff888ba9ae>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xfe/0x150
 [<ffffffff888eee08>] :obdfilter:filter_direct_io+0x478/0xce0
 [<ffffffff888f14de>] :obdfilter:filter_commitrw_write+0x184e/0x2570
 [<ffffffff8005a2ea>] cache_alloc_refill+0x106/0x186
 [<ffffffff8003ce86>] lock_timer_base+0x1b/0x3c
 [<ffffffff8889eb46>] :ost:ost_brw_write+0x21b6/0x28c0
 [<ffffffff80088431>] default_wake_function+0x0/0xe
 [<ffffffff888a283e>] :ost:ost_handle+0x2a8e/0x58d8
 [<ffffffff88603c82>] :obdclass:class_handle2object+0xd2/0x160
 [<ffffffff8869f230>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90
 [<ffffffff8869cde5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0
 [<ffffffff886a4a3b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270
 [<ffffffff80060f29>] thread_return+0x0/0xeb
 [<ffffffff8006b6c9>] do_gettimeofday+0x50/0x92
 [<ffffffff8855d056>] :libcfs:lcw_update_time+0x16/0x100
 [<ffffffff8003ce86>] lock_timer_base+0x1b/0x3c
 [<ffffffff886a747c>] :ptlrpc:ptlrpc_main+0x7dc/0x950
 [<ffffffff80088431>] default_wake_function+0x0/0xe
 [<ffffffff8005bfb1>] child_rip+0xa/0x11
 [<ffffffff886a6ca0>] :ptlrpc:ptlrpc_main+0x0/0x950
 [<ffffffff8005bfa7>] child_rip+0x0/0x11

Code: 0f 0b 68 50 5c 86 88 c2 36 05 0f 1f 80 00 00 00 00 48 8b 43
RIP  [<ffffffff8885a01f>] :ldiskfs:ldiskfs_mb_use_best_found+0xef/0x520
 RSP <ffff81040e4dd320>
<0>Kernel panic - not syncing: Fatal exception

On Jun 2, 2008, at 3:40 AM, Alex Zhuravlev wrote:
> what about dmesg output? It's unlikely Lustre debug can help here, as the
> problem seems to be very internal to ldiskfs (the mballoc piece of it).
>
> thanks, Alex
> [...]
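[For context on the line being asked about: ldiskfs_mb_use_best_found() is the mballoc routine that takes the free extent the block allocator has settled on and marks it used in the block group's in-memory buddy bitmap; the trace above shows it being reached from ldiskfs_mb_regular_allocator() while an OST write (filter_commitrw_write -> fsfilt_map_nblocks) was allocating blocks. The sketch below only illustrates the kind of internal consistency assertion that a "Kernel BUG at mballoc.c:1334" points at. It is modeled on the ext3/ext4 mballoc code that ldiskfs is generated from; the struct and field names and the exact assertion at that line are assumptions, not the verbatim lustre-ldiskfs 3.0.4 source.]

/*
 * Illustrative sketch only -- not the shipped 1.6.4.3 mballoc.c.
 * Field names and the specific BUG_ON()s are assumptions based on the
 * ext3/ext4 mballoc code that ldiskfs is derived from.
 */
static void ldiskfs_mb_use_best_found(struct ldiskfs_allocation_context *ac,
                                      struct ldiskfs_buddy *e3b)
{
        /* the "best found" extent must belong to the group whose buddy
         * bitmap is currently loaded, and must not be empty */
        BUG_ON(ac->ac_b_ex.fe_group != e3b->bd_group);
        BUG_ON(ac->ac_b_ex.fe_len <= 0);

        /* mark those blocks used in the in-memory buddy bitmap; the
         * marking code asserts as well that the extent does not run past
         * the end of the group, i.e. that the allocator's bookkeeping and
         * the block bitmap it just read have not diverged */
        mb_mark_used(e3b, &ac->ac_b_ex);

        ac->ac_status = AC_STATUS_FOUND;
        /* ... record the winning extent in the allocation context ... */
}

[Tripping a check of this kind generally means the allocator's view of free blocks contradicts the bitmap it just loaded, which fits Alex's point that the problem is internal to mballoc, and also leaves room for a hardware explanation such as bad memory on the OSS.]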
Alex Zhuravlev
2008-Jun-02 12:08 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
could you also send me your mballoc.c please?

thanks, Alex

Aaron Knister wrote:
> Thanks so much for looking into this. Here's what I got from dmesg.
> Interestingly enough, every time it panics the CPU listed is #4...do you
> think that points to a hardware problem?
> [...]
Aaron Knister
2008-Jun-02 12:24 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
I'm not sure how to obtain that. I'm running whatever was distributed with 1.6.4.3 for rhel5-x86_64.

On Jun 2, 2008, at 8:08 AM, Alex Zhuravlev wrote:
> could you also send me your mballoc.c please?
>
> thanks, Alex
> [...]
Alex Zhuravlev
2008-Jun-05 07:18 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
hmm. could you provide me with ldiskfs.ko then?

thanks, Alex

Aaron Knister wrote:
> I'm not sure how to obtain that. I'm running whatever was distributed
> with 1.6.4.3 for rhel5-x86_64.
> [...]
:ldiskfs:ldiskfs_ext_walk_space+0x1b7/0x250 >>> [<ffffffff888bed90>] :fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0x0/0x640 >>> [<ffffffff80054ccc>] mwait_idle+0x0/0x4a >>> [<ffffffff888ba9ae>] :fsfilt_ldiskfs:fsfilt_map_nblocks+0xfe/0x150 >>> [<ffffffff888eee08>] :obdfilter:filter_direct_io+0x478/0xce0 >>> [<ffffffff888f14de>] :obdfilter:filter_commitrw_write+0x184e/0x2570 >>> [<ffffffff8005a2ea>] cache_alloc_refill+0x106/0x186 >>> [<ffffffff8003ce86>] lock_timer_base+0x1b/0x3c >>> [<ffffffff8889eb46>] :ost:ost_brw_write+0x21b6/0x28c0 >>> [<ffffffff80088431>] default_wake_function+0x0/0xe >>> [<ffffffff888a283e>] :ost:ost_handle+0x2a8e/0x58d8 >>> [<ffffffff88603c82>] :obdclass:class_handle2object+0xd2/0x160 >>> [<ffffffff8869f230>] :ptlrpc:lustre_swab_ptlrpc_body+0x0/0x90 >>> [<ffffffff8869cde5>] :ptlrpc:lustre_swab_buf+0xc5/0xf0 >>> [<ffffffff886a4a3b>] :ptlrpc:ptlrpc_server_handle_request+0xb0b/0x1270 >>> [<ffffffff80060f29>] thread_return+0x0/0xeb >>> [<ffffffff8006b6c9>] do_gettimeofday+0x50/0x92 >>> [<ffffffff8855d056>] :libcfs:lcw_update_time+0x16/0x100 >>> [<ffffffff8003ce86>] lock_timer_base+0x1b/0x3c >>> [<ffffffff886a747c>] :ptlrpc:ptlrpc_main+0x7dc/0x950 >>> [<ffffffff80088431>] default_wake_function+0x0/0xe >>> [<ffffffff8005bfb1>] child_rip+0xa/0x11 >>> [<ffffffff886a6ca0>] :ptlrpc:ptlrpc_main+0x0/0x950 >>> [<ffffffff8005bfa7>] child_rip+0x0/0x11 >>> Code: 0f 0b 68 50 5c 86 88 c2 36 05 0f 1f 80 00 00 00 00 48 8b 43 >>> RIP [<ffffffff8885a01f>] :ldiskfs:ldiskfs_mb_use_best_found+0xef/0x520 >>> RSP <ffff81040e4dd320> >>> <0>Kernel panic - not syncing: Fatal exception >>> On Jun 2, 2008, at 3:40 AM, Alex Zhuravlev wrote: >>>> what about dmesg output? it''s unlikely lustre debug can help here as >>>> the problem >>>> seem to be very internal to ldiskfs (mballoc piece of it) >>>> >>>> thanks, Alex >>>> >>>> Aaron Knister wrote: >>>>> I bumped up debugging, and here''s (below) the last bit of debugging >>>>> info from lustre that I have on the oss before it went belly up. My >>>>> system is totally inoperable. Does anybody have any ideas? 
>>>>> >>>>> 00010000:00001000:4:1212157576.884909:0:6378:0:(ldlm_resource.c:865:ldlm_resource_add_lock()) >>>>> >>>>> About to add this lock: >>>>> 00010000:00001000:4:1212157576.884910:0:6378:0:(ldlm_lock.c:1718:ldlm_lock_dump()) >>>>> >>>>> -- Lock dump: ffff81036e61f1c0/0x1fb135e1e3fd2cc6 (rc: 3) (pos: 0) >>>>> (pid:6378) >>>>> 00010000:00001000:4:1212157576.884913:0:6378:0:(ldlm_lock.c:1726:ldlm_lock_dump()) >>>>> >>>>> Node: local >>>>> 00010000:00001000:4:1212157576.884914:0:6378:0:(ldlm_lock.c:1735:ldlm_lock_dump()) >>>>> >>>>> Resource: ffff8103849211c0 (3129942/0) >>>>> 00010000:00001000:4:1212157576.884915:0:6378:0:(ldlm_lock.c:1740:ldlm_lock_dump()) >>>>> >>>>> Req mode: PW, grant mode: PW, rc: 3, read: 0, write: 1 flags: >>>>> 0x80004000 >>>>> 00010000:00001000:4:1212157576.884917:0:6378:0:(ldlm_lock.c:1746:ldlm_lock_dump()) >>>>> >>>>> Extent: 0 -> 18446744073709551615 (req 0-18446744073709551615) >>>>> 00010000:00000040:4:1212157576.884920:0:6378:0:(ldlm_lock.c:615:ldlm_lock_decref_internal()) >>>>> >>>>> forcing cancel of local lock >>>>> 00010000:00000010:4:1212157576.884922:0:6378:0:(ldlm_lockd.c:1357:ldlm_bl_to_thread()) >>>>> >>>>> kmalloced ''blwi'': 120 at ffff81040e90a340 (tot 49135175) >>>>> 00002000:00000040:4:1212157576.884925:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) >>>>> >>>>> started handle ffff8103766dfc78 (0000000000000000) >>>>> 00002000:00000040:4:1212157576.884930:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) >>>>> >>>>> committing handle ffff8103766dfc78 >>>>> 00002000:00000040:4:1212157576.884931:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) >>>>> >>>>> started handle ffff8103766dfc78 (0000000000000000) >>>>> 00000020:00000040:4:1212157576.884957:0:5557:0:(lustre_handles.c:121:class_handle_unhash_nolock()) >>>>> >>>>> removing object ffff81036e61f1c0 with handle 0x1fb135e1e3fd2cc6 >>>>> from hash >>>>> 00000100:00000010:4:1212157576.884960:0:5557:0:(client.c:394:ptlrpc_prep_set()) >>>>> >>>>> kmalloced ''set'': 104 at ffff8104012d38c0 (tot 49135279) >>>>> 00000100:00000010:4:1212157576.884962:0:5557:0:(client.c:457:ptlrpc_set_destroy()) >>>>> >>>>> kfreed ''set'': 104 at ffff8104012d38c0 (tot 49135175). >>>>> 00010000:00000040:4:1212157576.884964:0:5557:0:(ldlm_resource.c:818:ldlm_resource_putref()) >>>>> >>>>> putref res: ffff8103849211c0 count: 0 >>>>> 00010000:00000010:4:1212157576.884969:0:5557:0:(ldlm_resource.c:828:ldlm_resource_putref()) >>>>> >>>>> kfreed ''res->lr_lvb_data'': 40 at ffff810379ded880 (tot 49135135). >>>>> 00010000:00000010:4:1212157576.885000:0:5557:0:(ldlm_resource.c:829:ldlm_resource_putref()) >>>>> >>>>> slab-freed ''res'': 224 at ffff8103849211c0 (tot 49135135). >>>>> 00010000:00000010:4:1212157576.885002:0:5557:0:(ldlm_lockd.c:1657:ldlm_bl_thread_main()) >>>>> >>>>> kfreed ''blwi'': 120 at ffff81040e90a340 (tot 49134791). 
>>>>> 00002000:00000040:4:1212157576.885623:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) >>>>> >>>>> committing handle ffff8103766dfc78 >>>>> 00002000:00000002:4:1212157576.885625:0:6378:0:(filter.c:148:f_dput()) >>>>> putting 3129942: ffff8103599cea98, count = 0 >>>>> 00002000:00080000:4:1212157576.885627:0:6378:0:(filter.c:2689:filter_destroy_precreated()) >>>>> >>>>> crew4-OST0001: after destroy: set last_objids[0] = 3129941 >>>>> 00002000:00000002:4:1212157576.885630:0:6378:0:(filter.c:607:filter_update_last_objid()) >>>>> >>>>> crew4-OST0001: server last_objid for group 0: 3129941 >>>>> 00002000:00000010:4:1212157576.912615:0:6485:0:(fsfilt-ldiskfs.c:747:fsfilt_ldiskfs_cb_func()) >>>>> >>>>> slab-freed ''fcb'': 56 at ffff810371404920 (tot 49134335). >>>>> 00010000:00000040:4:1212157576.912669:0:6378:0:(ldlm_lib.c:1556:target_committed_to_req()) >>>>> >>>>> last_committed 17896268, xid 3841 >>>>> 00000100:00000040:4:1212157576.912674:0:6378:0:(connection.c:191:ptlrpc_connection_addref()) >>>>> >>>>> connection=ffff8103fbe9e2c0 refcount 10 to 172.18.0.10 >>>>> <http://172.18.0.10>@o2ib >>>>> 00000100:00000040:4:1212157576.912678:0:6378:0:(niobuf.c:46:ptl_send_buf()) >>>>> >>>>> conn=ffff8103fbe9e2c0 id 12345-172.18.0.10 at o2ib >>>>> 00000400:00000010:4:1212157576.912680:0:6378:0:(lib-lnet.h:247:lnet_md_alloc()) >>>>> >>>>> kmalloced ''md'': 136 at ffff81040cb6cb80 (tot 9568949). >>>>> 00000400:00000010:4:1212157576.912683:0:6378:0:(lib-lnet.h:295:lnet_msg_alloc()) >>>>> >>>>> kmalloced ''msg'': 336 at ffff8104285e1e00 (tot 9569285). >>>>> 00000100:00000040:4:1212157576.912693:0:6378:0:(connection.c:150:ptlrpc_put_connection()) >>>>> >>>>> connection=ffff8103fbe9e2c0 refcount 9 to 172.18.0.10 >>>>> <http://172.18.0.10>@o2ib >>>>> 00000100:00000040:4:1212157576.912695:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) >>>>> >>>>> RPC PUTting export ffff8103848e9000 : new rpc_count 0 >>>>> 00000100:00000040:4:1212157576.912697:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) >>>>> >>>>> PUTting export ffff8103848e9000 : new refcount 4 >>>>> 00000100:00000040:4:1212157576.912699:0:6378:0:(service.c:652:ptlrpc_server_handle_request()) >>>>> >>>>> PUTting export ffff8103848e9000 : new refcount 3 >>>>> 00000400:00000010:4:1212157576.912741:0:5351:0:(lib-lnet.h:269:lnet_md_free()) >>>>> >>>>> kfreed ''md'': 136 at ffff81040cb6cb80 (tot 9569149). >>>>> 00000400:00000010:4:1212157576.912744:0:5351:0:(lib-lnet.h:312:lnet_msg_free()) >>>>> >>>>> kfreed ''msg'': 336 at ffff8104285e1e00 (tot 9568813). >>>>> >>>>> >>>>> On Wed, May 28, 2008 at 8:03 PM, Aaron Knister >>>>> <aaron.knister at gmail.com >>>>> <mailto:aaron.knister at gmail.com>> wrote: >>>>> >>>>> Thank you very much for looking into this. I''ve attached my dmesg to >>>>> the bug. i looked at line number 1334 which the panic seems to >>>>> reference. i can''t figure out what its doing though >>>>> >>>>> On Wed, May 28, 2008 at 4:54 PM, Alex Zhuravlev >>>>> <Alex.Zhuravlev at sun.com <mailto:Alex.Zhuravlev at sun.com>> wrote: >>>>> >>>>> Aaron Knister wrote: >>>>> >>>>> I''m seeing this bug (14465) under heavy load on my OSSes. If >>>>> I reboot the MDS it appears to help...any ideas? What''s the >>>>> status on this bug? >>>>> >>>>> >>>>> could you attach your dmesg to the bug? as for the status - I''m >>>>> still not >>>>> able to reproduce this, neither I found possible cause, sorry. 
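Debug output like the log excerpts quoted above is normally captured by raising the Lustre debug mask on the OSS and then dumping the in-kernel debug buffer after the problem has been reproduced. A minimal sketch, assuming a 1.6-era node where the mask is exposed under /proc/sys/lnet; proc paths, mask bits and buffer sizes vary between releases, so treat the exact values as illustrative:

# enable (nearly) all Lustre debug messages -- extremely verbose; assumes
# /proc/sys/lnet/debug exists on this release
sysctl -w lnet.debug=-1          # equivalently: echo -1 > /proc/sys/lnet/debug

# optionally enlarge the in-kernel debug buffer (size in MB)
sysctl -w lnet.debug_mb=256

# ... reproduce the load that triggers the problem ...

# dump (and clear) the kernel debug buffer to a file for later analysis
lctl dk /tmp/lustre-debug-$(hostname).log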
Aaron Knister
2008-Jun-05 12:42 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
Unfortunately the monkeys maintaining this setup (I left the company last week) have managed to destroy one of the 13 terabyte arrays...so they've got bigger issues right now. Thanks for your help though. On Jun 5, 2008, at 3:18 AM, Alex Zhuravlev wrote:> hmm. could you provide me with ldiskfs.ok then? > > thanks, Alex
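The oops reported earlier in the thread is a Kernel BUG at mballoc.c:1334 inside ldiskfs_mb_use_best_found (the "0f 0b" in the Code bytes is the ud2 instruction a BUG_ON produces), and "ldiskfs.ok" above presumably refers to the ldiskfs.ko module binary. Without the matching build tree, one rough way to see which check fired is to disassemble the installed module around the reported offset (ldiskfs_mb_use_best_found+0xef); a sketch, assuming binutils is available, and noting that the file/line annotations only appear if the module was built with debug info:

# find the ldiskfs module that matches the running kernel
MOD=$(modinfo -n ldiskfs)

# disassemble the function named in the panic; the reported RIP is
# 0xef bytes into ldiskfs_mb_use_best_found
objdump -d --line-numbers "$MOD" | grep -A 80 '<ldiskfs_mb_use_best_found>:' | less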
Alex Zhuravlev
2008-Jun-05 12:50 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
we've been investigating one issue which can be a source of your problem. fortunately we can reproduce that. thanks, Alex Aaron Knister wrote:> Unfortunately the monkeys maintaining this setup (I left the company > last week) have managed to destroy one of the 13 terabyte arrays...so > they've got bigger issues right now. Thanks for your help though.
Aaron Knister
2008-Jun-05 12:55 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
For my own curiosity's sake, is there a bugzilla id associated with it? I'm curious to know what the problem is/was. On Jun 5, 2008, at 8:50 AM, Alex Zhuravlev wrote:> we've been investigating one issue which can be a source of your > problem. > fortunately we can reproduce that. > > thanks, Alex
Alex Zhuravlev
2008-Jun-05 13:04 UTC
[Lustre-discuss] ldiskfs kernel bug in ldiskfs_mb_use_best_found
https://bugzilla.lustre.org/show_bug.cgi?id=15932 thanks, Alex Aaron Knister wrote:> For my own curiosity's sake, is there a bugzilla id associated with it? > I'm curious to know what the problem is/was.