Hi,

Last night one of our MDT servers panicked with the following errors:

LustreError: 26312:0:(llog_server.c:433:llog_origin_handle_cancel()) cancel 124 llog-records failed: -22
Lustre: diinofs-OST000a-osc: Connection to service diinofs-OST000a via nid 10.4.22.213@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: 4588:0:(quota_master.c:1576:mds_quota_recovery()) Not all osts are active, abort quota recovery
Lustre: diinofs-OST000a-osc: Connection restored to service diinofs-OST000a using nid 10.4.22.213@tcp.
Lustre: MDS diinofs-MDT0000: diinofs-OST000a_UUID now active, resetting orphans
LustreError: 6823:0:(llog_server.c:433:llog_origin_handle_cancel()) cancel 124 llog-records failed: -22
Lustre: diinofs-OST000b-osc: Connection to service diinofs-OST000b via nid 10.4.22.213@tcp was lost; in progress operations using this service will wait for recovery to complete.
Lustre: diinofs-OST000b-osc: Connection restored to service diinofs-OST000b using nid 10.4.22.213@tcp.
Lustre: MDS diinofs-MDT0000: diinofs-OST000b_UUID now active, resetting orphans
LustreError: 6822:0:(llog_server.c:433:llog_origin_handle_cancel()) cancel 124 llog-records failed: -22
process `syslogd' is using obsolete setsockopt SO_BSDCOMPAT
LustreError: 6844:0:(mds_reint.c:1528:mds_orphan_add_link()) ASSERTION(inode->i_nlink == 1) failed:dir nlink == 0
LustreError: 6844:0:(mds_reint.c:1528:mds_orphan_add_link()) LBUG
Lustre: 6844:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack for process 6844
ll_mdt_07 R running task 0 6844 1 6845 6843 (L-TLB)
 0000003000000030 ffff810113b876b0 ffff810113b875d0 0000000000000001
 0000000000000002 ffffffff802ecde9 3830326563646539 3338333033323635
 ffff810008afb000 0000000000000000 0000000000000000 ffffffff8032875e
Call Trace:
[<ffffffff8032875e>] scrup+0x60/0xc5
[<ffffffff803287df>] lf+0x1c/0x3a
[<ffffffff8032c51a>] vt_console_print+0x213/0x228
[<ffffffff8032c51a>] vt_console_print+0x213/0x228
[<ffffffff80228520>] printk+0x4e/0x56
[<ffffffff80228520>] printk+0x4e/0x56
[<ffffffff802431d5>] kallsyms_lookup+0xe7/0x1af
[<ffffffff802431d5>] kallsyms_lookup+0xe7/0x1af
[<ffffffff8020ac42>] printk_address+0x9f/0xac
[<ffffffff8032875e>] scrup+0x60/0xc5
[<ffffffff802403b7>] module_text_address+0x33/0x3b
[<ffffffff80238a51>] kernel_text_address+0x1a/0x26
[<ffffffff80238a51>] kernel_text_address+0x1a/0x26
[<ffffffff8020ae48>] show_trace+0x1f9/0x21f
[<ffffffff8020af67>] _show_stack+0xe2/0xf1
[<ffffffff88234b8a>] :libcfs:lbug_with_loc+0x7a/0xc0
[<ffffffff88535021>] :mds:mds_orphan_add_link+0x641/0x7e0
[<ffffffff884aa02d>] :ldiskfs:__ldiskfs_journal_stop+0x2d/0x60
[<ffffffff8029b1ea>] dnotify_parent+0x1c/0x6e
[<ffffffff80282ec5>] dput+0x23/0x152
[<ffffffff8853fae5>] :mds:mds_reint_unlink+0x1665/0x23b0
[<ffffffff8835608a>] :ptlrpc:ptlrpc_set_destroy+0x26a/0x2d0
[<ffffffff8835bbc3>] :ptlrpc:ptl_send_buf+0x3f3/0x5b0
[<ffffffff80282ec5>] dput+0x23/0x152
[<ffffffff88534909>] :mds:mds_reint_rec+0x1d9/0x2b0
[<ffffffff88559cfc>] :mds:mds_unlink_unpack+0x28c/0x3b0
[<ffffffff88558c8f>] :mds:mds_update_unpack+0x20f/0x2b0
[<ffffffff8832572f>] :ptlrpc:ldlm_reprocess_all+0x17f/0x210
[<ffffffff8852745a>] :mds:mds_reint+0x35a/0x420
[<ffffffff88364eb9>] :ptlrpc:lustre_pack_reply+0x29/0xb0
[<ffffffff8836028f>] :ptlrpc:lustre_msg_get_opc+0x4f/0x100
[<ffffffff8852dd2b>] :mds:mds_handle+0x247b/0x4c30
[<ffffffff802206c2>] find_busiest_group+0x20d/0x634
[<ffffffff882c8c29>] :obdclass:class_handle2object+0xd9/0x160
[<ffffffff8835f99f>] :ptlrpc:lustre_msg_get_conn_cnt+0x4f/0x100
[<ffffffff88369dea>] :ptlrpc:ptlrpc_check_req+0x1a/0x110
[<ffffffff8836bf39>] :ptlrpc:ptlrpc_server_handle_request+0x999/0x1030
[<ffffffff8020d490>] do_gettimeofday+0x50/0x94
[<ffffffff8823d326>] :libcfs:lcw_update_time+0x16/0x100
[<ffffffff8021f971>] __wake_up_common+0x3e/0x68
[<ffffffff8836ee1b>] :ptlrpc:ptlrpc_main+0xd0b/0xe80
[<ffffffff802213f2>] default_wake_function+0x0/0xe
[<ffffffff8020a6a4>] child_rip+0xa/0x12
[<ffffffff8836e110>] :ptlrpc:ptlrpc_main+0x0/0xe80
[<ffffffff8020a69a>] child_rip+0x0/0x12

LustreError: dumping log to /tmp/lustre-log.1226867690.6844
Lustre: 6855:0:(ldlm_lib.c:525:target_handle_reconnect()) diinofs-MDT0000: ec04ce81-b4df-720d-47a0-ca676c240a24 reconnecting
Lustre: 6855:0:(ldlm_lib.c:525:target_handle_reconnect()) Skipped 60 previous similar messages
Lustre: 6855:0:(ldlm_lib.c:760:target_handle_connect()) diinofs-MDT0000: refuse reconnection from ec04ce81-b4df-720d-47a0-ca676c240a24@10.4.31.103@tcp to 0xffff81021a606000; still busy with 2 active RPCs
Lustre: 6855:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 50 previous similar messages
LustreError: 6855:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-16) req@ffff8102214e5400 x8763679/t0 o38->ec04ce81-b4df-720d-47a0-ca676c240a24@NET_0x200000a041f67_UUID:0/0 lens 304/200 e 0 to 0 dl 1226868090 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 6855:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 46 previous similar messages
Lustre: 6864:0:(ldlm_lib.c:525:target_handle_reconnect()) diinofs-MDT0000: bed5bd81-726a-ebda-b9c7-2dfe270f90c0 reconnecting
Lustre: 6864:0:(ldlm_lib.c:525:target_handle_reconnect()) Skipped 13 previous similar messages
Lustre: 6864:0:(ldlm_lib.c:760:target_handle_connect()) diinofs-MDT0000: refuse reconnection from bed5bd81-726a-ebda-b9c7-2dfe270f90c0@10.4.31.114@tcp to 0xffff8101192bc000; still busy with 3 active RPCs
Lustre: 6864:0:(ldlm_lib.c:760:target_handle_connect()) Skipped 13 previous similar messages
LustreError: 6864:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-16) req@ffff8101139edc00 x12195715/t0 o38->bed5bd81-726a-ebda-b9c7-2dfe270f90c0@NET_0x200000a041f72_UUID:0/0 lens 304/200 e 0 to 0 dl 1226868167 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 6864:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 13 previous similar messages

What does it mean? Now we can't reboot either of the MDTs without panics (the metadata is DRBD-synced). I see an "LBUG" error, but that shouldn't panic the system after a complete cluster reboot(?). We are running Lustre 1.6.5.1 on Debian Etch with a vanilla 2.6.18 kernel, and the clients are running patchless on Debian Etch 2.6.18 as well. The strange thing is that it had been working for about 8 months, but now we are suddenly getting a lot of small errors from both the OSSes and the MDTs. The panic is the most urgent thing to solve, though. Any ideas?

--
Regards,
Timh Bergström
Server Operations Manager
Diino AB - www.diino.com
:wq
I solved it by mounting the MDT first and then the OSTs, so that they all got reconnected, and then waiting out the recovery period before mounting the clients. I'm still curious as to what it actually was, though.
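In case someone else hits the same thing, the sequence was roughly what is sketched below. The device names, mount points and the MGS nid are only placeholders, not our real paths, and the /proc locations are as I remember them on our 1.6 servers, so check them against your own layout:

  # On the MDS node: bring the DRBD-backed MDT up first
  # (/dev/drbd0 and the mount points are placeholders)
  mount -t lustre /dev/drbd0 /mnt/diinofs-mdt

  # On each OSS node: mount the OSTs so they can reconnect to the MDT
  mount -t lustre /dev/sdb1 /mnt/diinofs-ost000a
  mount -t lustre /dev/sdc1 /mnt/diinofs-ost000b

  # Wait for recovery to finish on the servers before touching the clients;
  # the per-target status files should eventually report COMPLETE
  cat /proc/fs/lustre/mds/diinofs-MDT0000/recovery_status        # on the MDS
  cat /proc/fs/lustre/obdfilter/diinofs-OST*/recovery_status     # on the OSSes

  # Only after that, remount the clients (<mgs-nid> is a placeholder)
  mount -t lustre <mgs-nid>@tcp:/diinofs /mnt/diinofs

The point was simply to have the MDT and every OST connected and fully through recovery before any client was allowed to mount again.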
//T

2008/11/17 Timh Bergström <timh.bergstrom@diino.net>:
> What does it mean? Now we can't reboot either of the MDTs without
> panics (the metadata is DRBD-synced). I see an "LBUG" error, but that
> shouldn't panic the system after a complete cluster reboot(?). We are
> running Lustre 1.6.5.1 on Debian Etch with a vanilla 2.6.18 kernel,
> and the clients are running patchless on Debian Etch 2.6.18 as well.
> The strange thing is that it had been working for about 8 months, but
> now we are suddenly getting a lot of small errors from both the OSSes
> and the MDTs. The panic is the most urgent thing to solve, though.
> Any ideas?
--
Timh Bergström
System Operations Manager
Diino AB - www.diino.com
:wq