Hi, I'm using a mixed environment of a 1.8.0.1 MDS and 1.6.6 OSSs (we had a problem with the QLogic drivers and rolled back to 1.6.6). My MDS becomes unresponsive each day at 4-5 am local time, with no kernel panic or error messages beforehand. Some errors and an LBUG appear in the log after force-booting the MDS and mounting the MDT; after that the log stays clean until the next morning:

Jan 4 06:27:32 tech-mds kernel: LustreError: 6290:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT0000: denying connection for new client 192.114.101.31@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 34 clients in recovery for 337s
Jan 4 06:27:32 tech-mds kernel: LustreError: 6290:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) req@ffff81006f99cc00 x1323646107950586/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1262579352 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:27:41 tech-mds kernel: Lustre: 6280:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) technion-MDT0000: 33 recoverable clients remain
Jan 4 06:27:57 tech-mds kernel: LustreError: 6284:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT0000: denying connection for new client 192.114.101.31@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 312s
Jan 4 06:27:57 tech-mds kernel: LustreError: 6284:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) req@ffff81011c69d400 x1323646107950600/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1262579377 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:28:22 tech-mds kernel: LustreError: 6302:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT0000: denying connection for new client 192.114.101.31@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 287s
Jan 4 06:28:22 tech-mds kernel: LustreError: 6302:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) req@ffff81006fa4e000 x1323646107950612/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1262579402 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:28:47 tech-mds kernel: LustreError: 6305:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT0000: denying connection for new client 192.114.101.31@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 262s
Jan 4 06:28:47 tech-mds kernel: LustreError: 6305:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) req@ffff81011c69d800 x1323646107950624/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1262579427 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:29:01 tech-mds ntpd[5999]: synchronized to 132.68.238.40, stratum 2
Jan 4 06:29:01 tech-mds ntpd[5999]: kernel time sync enabled 0001
Jan 4 06:29:12 tech-mds kernel: LustreError: 6278:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT0000: denying connection for new client 192.114.101.31@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 237s
Jan 4 06:29:12 tech-mds kernel: LustreError: 6278:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) req@ffff81007053ac00 x1323646107950636/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1262579452 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:29:37 tech-mds kernel: LustreError: 6293:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT0000: denying connection for new client 192.114.101.31@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 212s
Jan 4 06:29:37 tech-mds kernel: LustreError: 6293:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) req@ffff81006f8a7000 x1323646107950648/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1262579477 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:30:02 tech-mds kernel: LustreError: 6277:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT0000: denying connection for new client 192.114.101.31@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 187s
Jan 4 06:30:02 tech-mds kernel: LustreError: 6277:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) req@ffff81010bb61000 x1323646107950660/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1262579502 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:30:27 tech-mds kernel: LustreError: 6300:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT0000: denying connection for new client 192.114.101.31@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 33 clients in recovery for 162s
Jan 4 06:30:52 tech-mds kernel: LustreError: 6281:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) req@ffff81006f8fd400 x1323646107950684/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1262579552 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:30:52 tech-mds kernel: LustreError: 6281:0:(ldlm_lib.c:1826:target_send_reply_msg()) Skipped 1 previous similar message
Jan 4 06:31:11 tech-mds kernel: Lustre: 6264:0:(ldlm_lib.c:538:target_handle_reconnect()) MGS: ca34b32b-6fd6-b367-9c76-870c8c944b50 reconnecting
Jan 4 06:31:11 tech-mds kernel: Lustre: 6305:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) technion-MDT0000: 32 recoverable clients remain
Jan 4 06:31:17 tech-mds kernel: LustreError: 6285:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT0000: denying connection for new client 192.114.101.31@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 32 clients in recovery for 112s
Jan 4 06:31:17 tech-mds kernel: LustreError: 6285:0:(ldlm_lib.c:884:target_handle_connect()) Skipped 1 previous similar message
Jan 4 06:31:19 tech-mds kernel: Lustre: 6263:0:(ldlm_lib.c:538:target_handle_reconnect()) MGS: c26bf58b-6583-5577-e6b8-f2ff1d0e5df8 reconnecting
Jan 4 06:31:19 tech-mds kernel: Lustre: 6299:0:(ldlm_lib.c:815:target_handle_connect()) technion-MDT0000: refuse reconnection from c6e1cf14-2820-92bb-4471-e48c5a5a0cbf@192.114.101.25@tcp to 0xffff81006fc1e000; still busy with 2 active RPCs
Jan 4 06:31:19 tech-mds kernel: Lustre: 6288:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) technion-MDT0000: 31 recoverable clients remain
Jan 4 06:31:32 tech-mds kernel: Lustre: 6302:0:(ldlm_lib.c:538:target_handle_reconnect()) technion-MDT0000: 5887f548-0db2-2b71-ff4c-0063614c0686 reconnecting
Jan 4 06:31:32 tech-mds kernel: Lustre: 6302:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 2 previous similar messages
Jan 4 06:31:32 tech-mds kernel: LustreError: 6280:0:(service.c:612:ptlrpc_check_req()) @@@ DROPPING req from old connection 203 < 204 req@ffff8100d3ecc450 x1323646281069438/t0 o101->5887f548-0db2-2b71-ff4c-0063614c0686@NET_0x20000c0726514_UUID:0/0 lens 296/0 e 0 to 0 dl 1262579747 ref 1 fl Interpret:/0/0 rc 0/0
Jan 4 06:31:36 tech-mds kernel: Lustre: 6283:0:(ldlm_lib.c:538:target_handle_reconnect()) technion-MDT0000: c6e1cf14-2820-92bb-4471-e48c5a5a0cbf reconnecting
Jan 4 06:31:36 tech-mds kernel: Lustre: 6283:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 1 previous similar message
Jan 4 06:31:39 tech-mds kernel: Lustre: 6283:0:(ldlm_lib.c:538:target_handle_reconnect()) technion-MDT0000: 410e0e8a-b08b-f77d-a88e-a216da983909 reconnecting
Jan 4 06:31:39 tech-mds kernel: Lustre: 6283:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 1 previous similar message
Jan 4 06:31:47 tech-mds kernel: Lustre: 6302:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) technion-MDT0000: 30 recoverable clients remain
Jan 4 06:31:48 tech-mds kernel: Lustre: 6292:0:(ldlm_lib.c:538:target_handle_reconnect()) technion-MDT0000: 8383ca9c-fdbf-1edf-06d9-0fb98f7e1472 reconnecting
Jan 4 06:31:52 tech-mds kernel: Lustre: 6306:0:(ldlm_lib.c:815:target_handle_connect()) technion-MDT0000: refuse reconnection from ec65e3e4-19af-a532-f0c3-ae73899a251a@192.114.101.30@tcp to 0xffff81006fc88000; still busy with 2 active RPCs
Jan 4 06:31:57 tech-mds kernel: Lustre: 6281:0:(ldlm_lib.c:538:target_handle_reconnect()) technion-MDT0000: 58b52546-23b2-4857-cd8c-c172d4f64069 reconnecting
Jan 4 06:31:57 tech-mds kernel: Lustre: 6281:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 4 previous similar messages
Jan 4 06:31:57 tech-mds kernel: Lustre: 6291:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) technion-MDT0000: 28 recoverable clients remain
Jan 4 06:31:57 tech-mds kernel: Lustre: 6291:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) Skipped 1 previous similar message
Jan 4 06:32:07 tech-mds kernel: LustreError: 6305:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) req@ffff810054eb3000 x1323646107950720/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1262579627 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:32:07 tech-mds kernel: LustreError: 6305:0:(ldlm_lib.c:1826:target_send_reply_msg()) Skipped 4 previous similar messages
Jan 4 06:32:13 tech-mds kernel: Lustre: 6304:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) technion-MDT0000: 27 recoverable clients remain
Jan 4 06:32:15 tech-mds kernel: Lustre: 6301:0:(ldlm_lib.c:538:target_handle_reconnect()) technion-MDT0000: 9ba11766-2e56-35b9-957e-5d186169b9c8 reconnecting
Jan 4 06:32:15 tech-mds kernel: Lustre: 6301:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 1 previous similar message
Jan 4 06:32:17 tech-mds kernel: LustreError: 6164:0:(socklnd.c:1639:ksocknal_destroy_conn()) Completing partial receive from 12345-192.114.101.24@tcp, ip 192.114.101.24:1022, with error
Jan 4 06:32:17 tech-mds kernel: LustreError: 6164:0:(events.c:229:request_in_callback()) event type 1, status -5, service mds
Jan 4 06:32:17 tech-mds kernel: LustreError: 6289:0:(pack_generic.c:871:lustre_unpack_msg()) message length 0 too small for magic/version check
Jan 4 06:32:17 tech-mds kernel: LustreError: 6289:0:(service.c:1102:ptlrpc_server_handle_req_in()) error unpacking request: ptl 12 from 12345-192.114.101.24@tcp xid 1323646241338075
Jan 4 06:32:31 tech-mds kernel: Lustre: 6283:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) technion-MDT0000: 25 recoverable clients remain
Jan 4 06:32:31 tech-mds kernel: Lustre: 6283:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) Skipped 1 previous similar message
Jan 4 06:32:32 tech-mds kernel: LustreError: 6291:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT0000: denying connection for new client 192.114.101.31@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 25 clients in recovery for 37s
Jan 4 06:32:32 tech-mds kernel: LustreError: 6291:0:(ldlm_lib.c:884:target_handle_connect()) Skipped 2 previous similar messages
Jan 4 06:32:55 tech-mds kernel: Lustre: 6295:0:(ldlm_lib.c:538:target_handle_reconnect()) technion-MDT0000: 1d2eebf8-db26-7093-3a42-f7f0ca8a6b1b reconnecting
Jan 4 06:32:55 tech-mds kernel: Lustre: 6295:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 5 previous similar messages
Jan 4 06:33:02 tech-mds kernel: LustreError: 6164:0:(socklnd.c:1639:ksocknal_destroy_conn()) Completing partial receive from 12345-192.114.101.9@tcp, ip 192.114.101.9:1023, with error
Jan 4 06:33:02 tech-mds kernel: LustreError: 6164:0:(events.c:229:request_in_callback()) event type 1, status -5, service mds
Jan 4 06:33:02 tech-mds kernel: LustreError: 6299:0:(pack_generic.c:871:lustre_unpack_msg()) message length 0 too small for magic/version check
Jan 4 06:33:02 tech-mds kernel: LustreError: 6299:0:(service.c:1102:ptlrpc_server_handle_req_in()) error unpacking request: ptl 12 from 12345-192.114.101.9@tcp xid 1323646400066660
Jan 4 06:33:08 tech-mds kernel: Lustre: MGS: haven't heard from client 7f39d026-7d8e-6127-a73a-e0e30f4a0cbf (at 192.114.101.24@tcp) in 193 seconds. I think it's dead, and I am evicting it.
Jan 4 06:33:09 tech-mds kernel: Lustre: 6298:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) technion-MDT0000: 21 recoverable clients remain
Jan 4 06:33:09 tech-mds kernel: Lustre: 6298:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) Skipped 3 previous similar messages
Jan 4 06:33:10 tech-mds kernel: Lustre: technion-MDT0000: recovery period over; 21 clients never reconnected after 375s (35 clients did)
Jan 4 06:33:19 tech-mds kernel: LustreError: 6263:0:(mgs_handler.c:572:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
Jan 4 06:33:20 tech-mds kernel: LustreError: 6263:0:(mgs_handler.c:572:mgs_handle()) lustre_mgs: operation 400 on unconnected MGS
Jan 4 06:33:26 tech-mds kernel: Lustre: 6281:0:(ldlm_lib.c:815:target_handle_connect()) technion-MDT0000: refuse reconnection from 7535d83e-42c3-217f-e06c-f503c9eac0fe@192.114.101.4@tcp to 0xffff81006fc80000; still busy with 2 active RPCs
Jan 4 06:33:26 tech-mds kernel: LustreError: 6264:0:(service.c:612:ptlrpc_check_req()) @@@ DROPPING req from old connection 298 < 299 req@ffff810070eb7850 x1323646478059177/t0 o400->7eb753db-5828-0ada-8a05-fa96abac87b2@NET_0x20000c0726504_UUID:0/0 lens 192/0 e 0 to 0 dl 1262579612 ref 1 fl Interpret:H/0/0 rc 0/0
Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) ASSERTION(hlist_unhashed(hnode)) failed
Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
Jan 4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
Jan 4 06:33:31 tech-mds kernel: ll_mgs_02 R running task 0 6357 1 6340 (L-TLB)
Jan 4 06:33:31 tech-mds kernel: ffff810110dfde50 ffffffff80063097 ffff810070f28000 0000000000000082
Jan 4 06:33:31 tech-mds kernel: 0000008100002000 ffff810070e325b0 ffff810070ebb148 0000000000000001
Jan 4 06:33:31 tech-mds kernel: ffff810070e325a8 0000000000000000 ffff810110dfde10 ffffffff8008882b
Jan 4 06:33:31 tech-mds kernel: Call Trace:
Jan 4 06:33:31 tech-mds kernel: [<ffffffff80063097>] thread_return+0x62/0xfe
Jan 4 06:33:31 tech-mds kernel: [<ffffffff8008882b>] __wake_up_common+0x3e/0x68
Jan 4 06:33:31 tech-mds kernel: [<ffffffff886682e8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Jan 4 06:33:31 tech-mds kernel: [<ffffffff8008a3f6>] default_wake_function+0x0/0xe
Jan 4 06:33:31 tech-mds kernel: [<ffffffff800b491a>] audit_syscall_exit+0x31b/0x336
Jan 4 06:33:31 tech-mds kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jan 4 06:33:31 tech-mds kernel: [<ffffffff886670d0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Jan 4 06:33:31 tech-mds kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jan 4 06:33:31 tech-mds kernel:
Jan 4 06:33:31 tech-mds kernel: LustreError: dumping log to /tmp/lustre-log.1262579611.6357
Jan 4 06:34:35 tech-mds kernel: Lustre: 6264:0:(ldlm_lib.c:538:target_handle_reconnect()) MGS: 055e7f6a-94fb-97e0-2117-bc6afa3f8b10 reconnecting
Jan 4 06:34:35 tech-mds kernel: Lustre: 6276:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) technion-MDT0000: 15 recoverable clients remain
Jan 4 06:34:35 tech-mds kernel: Lustre: 6276:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) Skipped 5 previous similar messages
Jan 4 06:34:35 tech-mds kernel: Lustre: 6264:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 21 previous similar messages
Jan 4 06:34:37 tech-mds kernel: LustreError: 6287:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-16) req@ffff8100d3eb0800 x1323646107950792/t0 o38-><?>@<?>:0/0 lens 368/264 e 0 to 0 dl 1262579777 ref 1 fl Interpret:/0/0 rc -16/0
Jan 4 06:34:37 tech-mds kernel: LustreError: 6287:0:(ldlm_lib.c:1826:target_send_reply_msg()) Skipped 8 previous similar messages
Jan 4 06:35:02 tech-mds kernel: LustreError: 6290:0:(ldlm_lib.c:884:target_handle_connect()) technion-MDT0000: denying connection for new client 192.114.101.31@tcp (ab671897-b1e2-76d3-b661-7b87e82d23e7): 14 clients in recovery for 187s
Jan 4 06:35:02 tech-mds kernel: LustreError: 6290:0:(ldlm_lib.c:884:target_handle_connect()) Skipped 5 previous similar messages
Jan 4 06:35:21 tech-mds kernel: LustreError: 6263:0:(service.c:612:ptlrpc_check_req()) @@@ DROPPING req from old connection 296 < 297 req@ffff810070e3f850 x1323645340887548/t0 o400->8c347320-a2f7-aa5a-14a1-35d466efdc70@NET_0x20000c0726522_UUID:0/0 lens 192/0 e 0 to 0 dl 1262579727 ref 1 fl Interpret:H/0/0 rc 0/0
Jan 4 06:36:51 tech-mds kernel: Lustre: 0:0:(watchdog.c:153:lcw_cb()) Watchdog triggered for pid 6357: it was inactive for 200.00s
Jan 4 06:36:51 tech-mds kernel: Lustre: 0:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
Jan 4 06:36:51 tech-mds kernel: ll_mgs_02 D ffff81000237e980 0 6357 1 6340 (L-TLB)
Jan 4 06:36:51 tech-mds kernel: ffff810110dfd9d0 0000000000000046 0000000000000000 0000000000000000
Jan 4 06:36:51 tech-mds kernel: ffff810110dfd990 0000000000000009 ffff81011e687820 ffff8101023ca080
Jan 4 06:36:51 tech-mds kernel: 00000057eeacb9c0 000000000000167b ffff81011e687a08 00000001000000e1
Jan 4 06:36:51 tech-mds kernel: Call Trace:
Jan 4 06:36:51 tech-mds kernel: [<ffffffff8008a3f6>] default_wake_function+0x0/0xe
Jan 4 06:36:51 tech-mds kernel: [<ffffffff884fab26>] :libcfs:lbug_with_loc+0xc6/0xd0
Jan 4 06:36:51 tech-mds kernel: [<ffffffff88502c70>] :libcfs:tracefile_init+0x0/0x110
Jan 4 06:36:51 tech-mds kernel: [<ffffffff88597702>] :obdclass:lustre_hash_findadd_unique_hnode+0x1a2/0x380
Jan 4 06:36:51 tech-mds kernel: [<ffffffff8859897e>] :obdclass:lustre_hash_add_unique+0x7e/0x230
Jan 4 06:36:51 tech-mds kernel: [<ffffffff8862941f>] :ptlrpc:target_handle_connect+0x250f/0x2880
Jan 4 06:36:51 tech-mds kernel: [<ffffffff8865e900>] :ptlrpc:lustre_msg_set_conn_cnt+0xc0/0x120
Jan 4 06:36:51 tech-mds kernel: [<ffffffff88653d78>] :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
Jan 4 06:36:51 tech-mds kernel: [<ffffffff888c1cce>] :mgs:mgs_handle+0x4ee/0x1540
Jan 4 06:36:51 tech-mds kernel: [<ffffffff88664db3>] :ptlrpc:ptlrpc_server_handle_request+0xa93/0x1160
Jan 4 06:36:51 tech-mds kernel: [<ffffffff80063097>] thread_return+0x62/0xfe
Jan 4 06:36:51 tech-mds kernel: [<ffffffff8008882b>] __wake_up_common+0x3e/0x68
Jan 4 06:36:51 tech-mds kernel: [<ffffffff886682e8>] :ptlrpc:ptlrpc_main+0x1218/0x13e0
Jan 4 06:36:51 tech-mds kernel: [<ffffffff8008a3f6>] default_wake_function+0x0/0xe
Jan 4 06:36:51 tech-mds kernel: [<ffffffff800b491a>] audit_syscall_exit+0x31b/0x336
Jan 4 06:36:51 tech-mds kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jan 4 06:36:51 tech-mds kernel: [<ffffffff886670d0>] :ptlrpc:ptlrpc_main+0x0/0x13e0
Jan 4 06:36:51 tech-mds kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jan 4 06:36:51 tech-mds kernel:
Jan 4 06:36:51 tech-mds kernel: LustreError: dumping log to /tmp/lustre-log.1262579811.6357
Jan 4 06:37:01 tech-mds kernel: Lustre: 6306:0:(ldlm_lib.c:538:target_handle_reconnect()) technion-MDT0000: 5887f548-0db2-2b71-ff4c-0063614c0686 reconnecting
Jan 4 06:37:01 tech-mds kernel: Lustre: 6306:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 6 previous similar messages
Jan 4 06:37:02 tech-mds kernel: Lustre: 6304:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) technion-MDT0000: 10 recoverable clients remain
Jan 4 06:37:02 tech-mds kernel: Lustre: 6304:0:(ldlm_lib.c:1718:target_queue_last_replay_reply()) Skipped 4 previous similar messages
Jan 4 06:38:10 tech-mds kernel: Lustre: technion-MDT0000: recovery period over; 10 clients never reconnected after 675s (35 clients did)
Jan 4 06:38:10 tech-mds kernel: LustreError: 6275:0:(handler.c:1554:mds_handle()) operation 101 on unconnected MDS from 12345-192.114.101.5@tcp
Jan 4 06:38:10 tech-mds kernel: LustreError: 6303:0:(handler.c:1554:mds_handle()) operation 101 on unconnected MDS from 12345-192.114.101.17@tcp
Jan 4 06:38:11 tech-mds kernel: LustreError: 6281:0:(handler.c:1554:mds_handle()) operation 101 on unconnected MDS from 12345-192.114.101.11@tcp
Jan 4 06:38:12 tech-mds kernel: LustreError: 6296:0:(handler.c:1554:mds_handle()) operation 101 on unconnected MDS from 12345-192.114.101.6@tcp
Jan 4 06:38:12 tech-mds kernel: LustreError: 6296:0:(handler.c:1554:mds_handle()) Skipped 1 previous similar message
Jan 4 06:38:14 tech-mds kernel: Lustre: 6301:0:(quota_master.c:1680:mds_quota_recovery()) Only 13/10 OSTs are active, abort quota recovery
Jan 4 06:38:14 tech-mds kernel: Lustre: technion-MDT0000: recovery complete: rc 0
Jan 4 06:38:14 tech-mds kernel: Lustre: technion-MDT0000: sending delayed replies to recovered clients
Jan 4 06:38:14 tech-mds kernel: LustreError: 6276:0:(mds_open.c:664:reconstruct_open()) Re-opened file
Jan 4 06:38:14 tech-mds kernel: LustreError: 6139:0:(handler.c:416:mds_destroy_export()) ASSERTION(list_empty(&exp->exp_mds_data.med_open_head)) failed
Jan 4 06:38:14 tech-mds kernel: LustreError: 6139:0:(handler.c:416:mds_destroy_export()) LBUG
Jan 4 06:38:14 tech-mds kernel: Lustre: 6139:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6139
Jan 4 06:38:14 tech-mds kernel: obd_zombid R running task 0 6139 1 6156 6026 (L-TLB)
Jan 4 06:38:14 tech-mds kernel: ffffffff88505ab5 ffffffff8895f8f8 ffff81006fc20000 ffff810071e29f00
Jan 4 06:38:14 tech-mds kernel: 00002b35469e6010 ffffffff8892604f ffff81011bc447a0 ffff81006fc20000
Jan 4 06:38:14 tech-mds kernel: ffff81011ab58078 ffff810071e29f00 00002b35469e6010 ffff81006fc20000
Jan 4 06:38:14 tech-mds kernel: Call Trace:
Jan 4 06:38:14 tech-mds kernel: [<ffffffff8892604f>] :mds:mds_destroy_export+0x9f/0x120
Jan 4 06:38:14 tech-mds kernel: [<ffffffff8859d3bc>] :obdclass:class_export_destroy+0x20c/0x2c0
Jan 4 06:38:15 tech-mds kernel: [<ffffffff8859bac1>] :obdclass:obd_zombi_impexp_check+0x11/0xc0
Jan 4 06:38:15 tech-mds kernel: [<ffffffff8859d4f2>] :obdclass:obd_zombie_impexp_cull+0x82/0xa0
Jan 4 06:38:15 tech-mds kernel: [<ffffffff885a226c>] :obdclass:obd_zombie_impexp_thread+0x1ec/0x290
Jan 4 06:38:15 tech-mds kernel: [<ffffffff8008a3f6>] default_wake_function+0x0/0xe
Jan 4 06:38:15 tech-mds kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
Jan 4 06:38:15 tech-mds kernel: [<ffffffff885a2080>] :obdclass:obd_zombie_impexp_thread+0x0/0x290
Jan 4 06:38:15 tech-mds kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
Jan 4 06:38:16 tech-mds kernel:
Jan 4 06:38:16 tech-mds kernel: LustreError: dumping log to /tmp/lustre-log.1262579894.6139
Jan 4 06:38:16 tech-mds kernel: LustreError: 6298:0:(handler.c:1554:mds_handle()) operation 101 on unconnected MDS from 12345-192.114.101.10@tcp
Jan 4 06:38:16 tech-mds kernel: LustreError: 6298:0:(handler.c:1554:mds_handle()) Skipped 4 previous similar messages
Jan 4 06:38:16 tech-mds kernel: LustreError: 6288:0:(mds_open.c:664:reconstruct_open()) Re-opened file
Jan 4 06:38:16 tech-mds kernel: Lustre: MDS technion-MDT0000: technion-OST0008_UUID now active, resetting orphans
Jan 4 06:38:16 tech-mds kernel: Lustre: MDS technion-MDT0000: technion-OST000a_UUID now active, resetting orphans
Jan 4 06:38:21 tech-mds kernel: Lustre: MDS technion-MDT0000: technion-OST0002_UUID now active, resetting orphans
Jan 4 06:38:21 tech-mds kernel: Lustre: Skipped 5 previous similar messages
Jan 4 06:38:26 tech-mds kernel: Lustre: MDS technion-MDT0000: technion-OST0001_UUID now active, resetting orphans
Jan 4 06:38:31 tech-mds kernel: Lustre: MDS technion-MDT0000: technion-OST0000_UUID now active, resetting orphans
Jan 4 06:38:41 tech-mds kernel: LustreError: 6392:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18531070: cookie 0xdcb9c7fd999ea709 req@ffff8100d3ed0000 x1323646224495072/t0 o35->5d1ee8c1-f826-9ab3-89bf-342c4f9e242d@NET_0x20000c0726512_UUID:0/0 lens 408/976 e 0 to 0 dl 1262579964 ref 1 fl Interpret:/0/0 rc 0/0
Jan 4 06:38:41 tech-mds kernel: LustreError: 6398:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18531068: cookie 0xdcb9c7fd999e9dfc req@ffff8100dc7c8c00 x1323646224495073/t0 o35->5d1ee8c1-f826-9ab3-89bf-342c4f9e242d@NET_0x20000c0726512_UUID:0/0 lens 408/976 e 0 to 0 dl 1262579927 ref 1 fl Interpret:/0/0 rc 0/0
Jan 4 06:38:41 tech-mds kernel: LustreError: 6415:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18508458: cookie 0xdcb9c7fd9983617e req@ffff8100d4bfb400 x1323646224495345/t0 o35->5d1ee8c1-f826-9ab3-89bf-342c4f9e242d@NET_0x20000c0726512_UUID:0/0 lens 408/976 e 0 to 0 dl 1262579927 ref 1 fl Interpret:/0/0 rc 0/0
Jan 4 06:38:41 tech-mds kernel: LustreError: 6415:0:(mds_open.c:1665:mds_close()) Skipped 271 previous similar messages
Jan 4 06:38:42 tech-mds kernel: LustreError: 6409:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18498078: cookie 0xdcb9c7fd99273a35 req@ffff810054d2e800 x1323646224496303/t0 o35->5d1ee8c1-f826-9ab3-89bf-342c4f9e242d@NET_0x20000c0726512_UUID:0/0 lens 408/976 e 0 to 0 dl 1262579928 ref 1 fl Interpret:/0/0 rc 0/0
Jan 4 06:38:42 tech-mds kernel: LustreError: 6409:0:(mds_open.c:1665:mds_close()) Skipped 957 previous similar messages
Jan 4 06:38:44 tech-mds kernel: LustreError: 6413:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18464618: cookie 0xdcb9c7fd9893064a req@ffff8100d39f3400 x1323646224498078/t0 o35->5d1ee8c1-f826-9ab3-89bf-342c4f9e242d@NET_0x20000c0726512_UUID:0/0 lens 408/976 e 0 to 0 dl 1262579930 ref 1 fl Interpret:/0/0 rc 0/0
Jan 4 06:38:44 tech-mds kernel: LustreError: 6413:0:(mds_open.c:1665:mds_close()) Skipped 1774 previous similar messages
Jan 4 06:38:48 tech-mds kernel: LustreError: 6423:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 18437710: cookie 0xdcb9c7fd9817e589 req@ffff8100d45b5c00 x1323646224499484/t0 o35->5d1ee8c1-f826-9ab3-89bf-342c4f9e242d@NET_0x20000c0726512_UUID:0/0 lens 408/976 e 0 to 0 dl 1262579934 ref 1 fl Interpret:/0/0 rc 0/0
Jan 4 06:38:48 tech-mds kernel: LustreError: 6423:0:(mds_open.c:1665:mds_close()) Skipped 1405 previous similar messages
Jan 4 06:38:53 tech-mds kernel: LustreError: 6422:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error (-116) req@ffff810054d38000 x1323646224500886/t0 o35->5d1ee8c1-f826-9ab3-89bf-342c4f9e242d@NET_0x20000c0726512_UUID:0/0 lens 408/976 e 0 to 0 dl 1262579939 ref 1 fl Interpret:/0/0 rc -116/0
Jan 4 06:38:53 tech-mds kernel: LustreError: 6422:0:(ldlm_lib.c:1826:target_send_reply_msg()) Skipped 5838 previous similar messages
Jan 4 06:38:56 tech-mds kernel: LustreError: 6420:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino 13567564: cookie 0xde1fda06cd4d058c req@ffff810055378800 x1323646224501408/t0 o35->5d1ee8c1-f826-9ab3-89bf-342c4f9e242d@NET_0x20000c0726512_UUID:0/0 lens 408/976 e 0 to 0 dl 1262579942 ref 1 fl Interpret:/0/0 rc 0/0
Jan 4 06:38:56 tech-mds kernel: LustreError: 6420:0:(mds_open.c:1665:mds_close()) Skipped 1923 previous similar messages

--
David Cohen
On 2010-01-04, at 03:02, David Cohen wrote:
> I'm using a mixed environment of a 1.8.0.1 MDS and 1.6.6 OSSs (we had a
> problem with the QLogic drivers and rolled back to 1.6.6).
> My MDS becomes unresponsive each day at 4-5 am local time, with no
> kernel panic or error messages beforehand.

Judging by the time, I'd guess this is "slocate" or "mlocate" running on all of your clients at the same time. This used to be a source of extremely high load back in the old days, but I thought that Lustre was in the exclude list in newer versions of *locate. Looking at the installed mlocate on my system, that doesn't seem to be the case... strange.

> Some errors and an LBUG appear in the log after force-booting the
> MDS and mounting the MDT; after that the log stays clean until the
> next morning:
>
> Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) ASSERTION(hlist_unhashed(hnode)) failed
> Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
> Jan 4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
> Jan 4 06:33:31 tech-mds kernel: ll_mgs_02 R running task 0 6357 1 6340 (L-TLB)
> Jan 4 06:33:31 tech-mds kernel: Call Trace:
> Jan 4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
> Jan 4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
> Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
> Jan 4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
> Jan 4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
> Jan 4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
> Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
> Jan 4 06:33:31 tech-mds kernel: child_rip+0x0/0x11

It shouldn't LBUG during recovery, however.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
On Monday 04 January 2010 20:42:12 Andreas Dilger wrote:
> On 2010-01-04, at 03:02, David Cohen wrote:
> > I'm using a mixed environment of a 1.8.0.1 MDS and 1.6.6 OSSs (we had a
> > problem with the QLogic drivers and rolled back to 1.6.6).
> > My MDS becomes unresponsive each day at 4-5 am local time, with no
> > kernel panic or error messages beforehand.

It was indeed the *locate update; a simple edit of /etc/updatedb.conf on the clients and the system is stable again.
Many Thanks.

> Judging by the time, I'd guess this is "slocate" or "mlocate" running
> on all of your clients at the same time. This used to be a source of
> extremely high load back in the old days, but I thought that Lustre
> was in the exclude list in newer versions of *locate. Looking at the
> installed mlocate on my system, that doesn't seem to be the case...
> strange.
>
> > Some errors and an LBUG appear in the log after force-booting the
> > MDS and mounting the MDT; after that the log stays clean until the
> > next morning:
> >
> > Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) ASSERTION(hlist_unhashed(hnode)) failed
> > Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
> > Jan 4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
> > Jan 4 06:33:31 tech-mds kernel: ll_mgs_02 R running task 0 6357 1 6340 (L-TLB)
> > Jan 4 06:33:31 tech-mds kernel: Call Trace:
> > Jan 4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
> > Jan 4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
> > Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
> > Jan 4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
> > Jan 4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
> > Jan 4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
> > Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
> > Jan 4 06:33:31 tech-mds kernel: child_rip+0x0/0x11
>
> It shouldn't LBUG during recovery, however.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.

--
David Cohen
Grid Computing
Physics Department
Technion - Israel Institute of Technology
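(For reference, the updatedb change mentioned above is normally a one-word addition to the PRUNEFS line in /etc/updatedb.conf on each client, so that the nightly updatedb run skips Lustre mounts instead of walking the whole filesystem from every node at once. The exact default list varies between distributions and between slocate and mlocate, so the value below is only illustrative:

    # /etc/updatedb.conf on each Lustre client:
    # append "lustre" to whatever filesystem types are already being pruned
    PRUNEFS = "auto afs iso9660 sfs udf lustre"

Alternatively, the Lustre mount point itself can be added to PRUNEPATHS.)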
Brian J. Murrell
2010-Jan-06 15:51 UTC
[Lustre-discuss] MDS crashes daily at the same hour
On Wed, 2010-01-06 at 11:25 +0200, David Cohen wrote:
> It was indeed the *locate update; a simple edit of /etc/updatedb.conf
> on the clients and the system is stable again.

Great. But as Andreas said previously, load should not have caused the LBUG that you got. Could you open a bug on our bugzilla about that? Please attach to that bug an excerpt from the tech-mds log that covers a window from 12 hours before the LBUG to an hour after.

Thanx,
b.
On 2010-01-06, at 04:25, David Cohen wrote:
> On Monday 04 January 2010 20:42:12 Andreas Dilger wrote:
>> On 2010-01-04, at 03:02, David Cohen wrote:
>>> I'm using a mixed environment of a 1.8.0.1 MDS and 1.6.6 OSSs (we had a
>>> problem with the QLogic drivers and rolled back to 1.6.6).
>>> My MDS becomes unresponsive each day at 4-5 am local time, with no
>>> kernel panic or error messages beforehand.
>
> It was indeed the *locate update; a simple edit of /etc/updatedb.conf
> on the clients and the system is stable again.

I asked the upstream Fedora/RHEL maintainer of mlocate to add "lustre" to the exception list in updatedb.conf, and he has already done so for Fedora. There is also a bug filed for RHEL5 to do the same, if anyone is interested in following it:

https://bugzilla.redhat.com/show_bug.cgi?id=557712

>> Judging by the time, I'd guess this is "slocate" or "mlocate" running
>> on all of your clients at the same time. This used to be a source of
>> extremely high load back in the old days, but I thought that Lustre
>> was in the exclude list in newer versions of *locate. Looking at the
>> installed mlocate on my system, that doesn't seem to be the case...
>> strange.
>>
>>> Some errors and an LBUG appear in the log after force-booting the
>>> MDS and mounting the MDT; after that the log stays clean until the
>>> next morning:
>>>
>>> Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) ASSERTION(hlist_unhashed(hnode)) failed
>>> Jan 4 06:33:31 tech-mds kernel: LustreError: 6357:0:(class_hash.c:225:lustre_hash_findadd_unique_hnode()) LBUG
>>> Jan 4 06:33:31 tech-mds kernel: Lustre: 6357:0:(linux-debug.c:222:libcfs_debug_dumpstack()) showing stack for process 6357
>>> Jan 4 06:33:31 tech-mds kernel: ll_mgs_02 R running task 0 6357 1 6340 (L-TLB)
>>> Jan 4 06:33:31 tech-mds kernel: Call Trace:
>>> Jan 4 06:33:31 tech-mds kernel: thread_return+0x62/0xfe
>>> Jan 4 06:33:31 tech-mds kernel: __wake_up_common+0x3e/0x68
>>> Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x1218/0x13e0
>>> Jan 4 06:33:31 tech-mds kernel: default_wake_function+0x0/0xe
>>> Jan 4 06:33:31 tech-mds kernel: audit_syscall_exit+0x31b/0x336
>>> Jan 4 06:33:31 tech-mds kernel: child_rip+0xa/0x11
>>> Jan 4 06:33:31 tech-mds kernel: :ptlrpc:ptlrpc_main+0x0/0x13e0
>>> Jan 4 06:33:31 tech-mds kernel: child_rip+0x0/0x11
>>
>> It shouldn't LBUG during recovery, however.
>>
>> Cheers, Andreas
>> --
>> Andreas Dilger
>> Sr. Staff Engineer, Lustre Group
>> Sun Microsystems of Canada, Inc.
>
> --
> David Cohen
> Grid Computing
> Physics Department
> Technion - Israel Institute of Technology

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Christopher J. Walker
2010-Jan-23 22:12 UTC
[Lustre-discuss] MDS crashes daily at the same hour
Brian J. Murrell wrote:
> On Wed, 2010-01-06 at 11:25 +0200, David Cohen wrote:
>> It was indeed the *locate update; a simple edit of /etc/updatedb.conf
>> on the clients and the system is stable again.

I've just encountered the same thing - the MDS crashing at the same time several times this week. It's just after the *locate update - I've added lustre to the excluded filesystems, and so far so good.

> Great. But as Andreas said previously, load should not have caused the
> LBUG that you got. Could you open a bug on our bugzilla about that?
> Please attach to that bug an excerpt from the tech-mds log that covers
> a window from 12 hours before the LBUG to an hour after.

I don't see an LBUG in my logs, but there are several call traces. Would it be useful if I filed a bug too? Or I could add to David's bug if you'd prefer - if so, can you let me know the bug number, as I can't find it in bugzilla. Would you like /tmp/lustre-log.* too?

Chris
On 2010-01-23, at 15:12, Christopher J. Walker wrote:
> I don't see an LBUG in my logs, but there are several call traces.
> Would it be useful if I filed a bug too? Or I could add to David's bug
> if you'd prefer - if so, can you let me know the bug number, as I
> can't find it in bugzilla. Would you like /tmp/lustre-log.* too?

If they are call traces due to the watchdog timer, then this is somewhat expected under extremely high load.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Brian J. Murrell
2010-Jan-25 13:51 UTC
[Lustre-discuss] MDS crashes daily at the same hour
On Sun, 2010-01-24 at 22:54 -0700, Andreas Dilger wrote:
> If they are call traces due to the watchdog timer, then this is somewhat
> expected under extremely high load.

Andreas,

Do you know, do adaptive timeouts take care of setting the timeout appropriately on watchdogs?

b.
On Mon, Jan 25, 2010 at 08:51:59AM -0500, Brian J. Murrell wrote:
> Do you know, do adaptive timeouts take care of setting the timeout
> appropriately on watchdogs?

Yes, the watchdog timer is updated based on the estimated RPC service time (multiplied by a factor, which is usually 2).

Johann
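(For anyone who wants to see those estimates on a live system: the adaptive-timeout tunables and the per-service estimates are exposed through /proc and lctl. The exact parameter names below are from memory of the 1.6/1.8 series, so treat them as a sketch rather than gospel:

    # global adaptive-timeout tunables, in seconds (at_max = 0 means AT is disabled)
    cat /proc/sys/lustre/at_min /proc/sys/lustre/at_max /proc/sys/lustre/at_history

    # current RPC service-time estimates as seen from a client
    lctl get_param -n mdc.*.timeouts osc.*.timeouts

Since the watchdog is armed from the service estimate, a heavily loaded server stretches its own watchdog interval rather than tripping it at a fixed value.)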
Brian J. Murrell
2010-Jan-25 16:11 UTC
[Lustre-discuss] MDS crashes daily at the same hour
On Mon, 2010-01-25 at 15:09 +0100, Johann Lombardi wrote:
> Yes, the watchdog timer is updated based on the estimated RPC service
> time (multiplied by a factor, which is usually 2).

Ahhh. Great. It would be interesting to know which Lustre release the poster seeing the stack traces was using.

b.
Christopher J. Walker
2010-Jan-25 16:45 UTC
[Lustre-discuss] MDS crashes daily at the same hour
Brian J. Murrell wrote:
> On Sun, 2010-01-24 at 22:54 -0700, Andreas Dilger wrote:
>> If they are call traces due to the watchdog timer, then this is somewhat
>> expected under extremely high load.
>
> Andreas,
>
> Do you know, do adaptive timeouts take care of setting the timeout
> appropriately on watchdogs?

I don't think this is quite what you are asking, but here are some details of our setup. We have a mixture of 1.6.7.2 clients and 1.8.1.1 clients. The 1.6.7.2 clients were not using adaptive timeouts when the problem occurred [1]. At least one of the 1.6 machines gets regularly swamped with network traffic, leading to packet loss. It was 40 1.8.1.1 clients running updatedb that caused the problem.

Chris

[1] One machine is the interface to the outside world and runs 1.6.7.2. I see packet loss to this machine at times and have observed Lustre hanging for a while. I suspect the problem is that it is occasionally overloaded with network packets, Lustre packets are then lost (probably at the router), followed by a timeout and recovery. I've now enabled adaptive timeouts on this machine, and will install a 10GigE card too.
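(A footnote to the footnote: on the 1.6.x series adaptive timeouts are present but disabled by default, and they are turned on by giving at_max a non-zero value; 1.8.x ships with them enabled. A quick, non-persistent way to check and enable this on a 1.6.7.2 node would be something like the following - the value 600 is just an example:

    # 0 means adaptive timeouts are off and the static obd_timeout is used
    cat /proc/sys/lustre/at_max

    # enable adaptive timeouts on this node until the modules are reloaded
    echo 600 > /proc/sys/lustre/at_max

There is also an lctl conf_param form for setting this filesystem-wide from the MGS; check the manual for your release for the exact syntax.)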