Please bear with me while I try to explain this problem. It's very strange. I have a 33TB Lustre file system with 5.6 underlying LUNs. The interconnect is InfiniBand. Any write I/O to the disk causes the mount to hang, and the underlying disk starts doing lots and lots of tiny reads for about 10 minutes. There is only one client mounted, and this small I/O continues after killing the client. I am running Lustre 1.6.3 with the latest RHEL5 kernel on RHEL5. I cannot find any suggestive error messages and am at a loss. It's a production filesystem and it's pretty much unusable.

-Aaron
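For what it's worth, a storm of tiny reads like this is easiest to quantify with iostat (a sketch; /dev/sdb is a placeholder for whichever block device backs the affected OST):

    # watch per-device rates and request sizes at one-second intervals;
    # avgrq-sz (in sectors) stays tiny during the small-read storm
    iostat -x 1 /dev/sdb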
Make that six 5.6TB underlying LUNs.

On Dec 13, 2007, at 11:24 AM, Aaron Knister wrote:
> Please bear with me while I try to explain this problem. It's very strange. [...]

Aaron Knister
Associate Systems Administrator/Web Designer
Center for Research on Environment and Water
(301) 595-7001
aaron@iges.org
Hello!

On Dec 13, 2007, at 11:24 AM, Aaron Knister wrote:
> I have a 33TB lustre file system with 5.6 underlying LUNs. The interconnect is infiniband. Any write I/O to the disk causes the mount to hang, and the underlying disk starts doing lots and lots of tiny reads for about 10 minutes. [...] It's a production filesystem and it's pretty much unusable.

This is the native client, not an NFS client, I presume? Also, is the write an append (to the end of files) or a rewrite (into the middle of existing files)? How do you do the write?

You can enable some debug logging on your OSTs and/or clients (echo +vfstrace +rpctrace +inode >/proc/sys/lnet/debug), replicate the problem, then do lctl dk >somewhere (important, as the debug buffer is cleared by this) and inspect the file where you dumped the log to see what those reads are and where they come from.

Bye,
Oleg
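Spelled out, the capture sequence Oleg describes would look something like this (a sketch; the debug mask and the lctl dk dump are from his message, the output path is an arbitrary choice):

    # on the OSS and/or client: add VFS, RPC and inode tracing to the debug mask
    echo +vfstrace +rpctrace +inode > /proc/sys/lnet/debug

    # ... reproduce the hanging write ...

    # dump the kernel debug buffer to a file (this also empties the buffer)
    lctl dk > /tmp/lustre-debug.log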
On the client I see this --

00000100:00100000:1:1197564425.523422:0:4024:0:(client.c:1339:ptlrpc_free_committed()) data-MDT0000-mdc-ffff8108275ec800: skip recheck: last_committed 54963341
00000100:00100000:1:1197564425.523425:0:4024:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:87702303-a9bb-4c4f-6494-392157b9b9f7:4024:370:192.168.64.70@o2ib:400
00000100:00100000:1:1197564425.523434:0:4024:0:(client.c:1339:ptlrpc_free_committed()) data-OST0000-osc-ffff8108275ec800: skip recheck: last_committed 13452811
00000100:00100000:1:1197564425.523437:0:4024:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:87702303-a9bb-4c4f-6494-392157b9b9f7:4024:371:192.168.64.71@o2ib:400
00000100:00100000:1:1197564425.523440:0:4024:0:(client.c:1339:ptlrpc_free_committed()) data-OST0001-osc-ffff8108275ec800: skip recheck: last_committed 13219554
00000100:00100000:1:1197564425.523443:0:4024:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:87702303-a9bb-4c4f-6494-392157b9b9f7:4024:372:192.168.64.71@o2ib:400
00000100:00100000:1:1197564425.523470:0:4025:0:(client.c:733:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:87702303-a9bb-4c4f-6494-392157b9b9f7:4025:376:192.168.64.71@o2ib:8
00000100:00100000:1:1197564425.523517:0:4024:0:(client.c:1339:ptlrpc_free_committed()) data-OST0002-osc-ffff8108275ec800: skip recheck: last_committed 12117104
00000100:00100000:1:1197564425.523520:0:4024:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:87702303-a9bb-4c4f-6494-392157b9b9f7:4024:373:192.168.64.71@o2ib:400
00000100:00100000:1:1197564425.523524:0:4024:0:(client.c:1339:ptlrpc_free_committed()) data-OST0003-osc-ffff8108275ec800: skip recheck: last_committed 11979465
00000100:00100000:1:1197564425.523532:0:4024:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:87702303-a9bb-4c4f-6494-392157b9b9f7:4024:374:192.168.64.71@o2ib:400
00000100:00100000:1:1197564425.523535:0:4024:0:(client.c:1339:ptlrpc_free_committed()) data-OST0004-osc-ffff8108275ec800: skip recheck: last_committed 13572310
00000100:00100000:1:1197564425.523537:0:4024:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:87702303-a9bb-4c4f-6494-392157b9b9f7:4024:375:192.168.64.71@o2ib:400
00000100:02020000:1:1197564425.524253:0:4025:0:(client.c:581:ptlrpc_check_status()) 11-0: an error occurred while communicating with 192.168.64.71@o2ib. The ost_connect operation failed with -16
00000100:00100000:1:1197564425.524256:0:4025:0:(client.c:1339:ptlrpc_free_committed()) data-OST0005-osc-ffff8108275ec800: skip recheck: last_committed 13074662
00000100:00080000:1:1197564425.524265:0:4025:0:(import.c:759:ptlrpc_connect_interpret()) ffff810827423200 data-OST0005_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:1:1197564425.524268:0:4025:0:(import.c:801:ptlrpc_connect_interpret()) recovery of data-OST0005_UUID on 192.168.64.71@o2ib failed (-16)
00000100:00100000:1:1197564425.524276:0:4025:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:87702303-a9bb-4c4f-6494-392157b9b9f7:4025:376:192.168.64.71@o2ib:8

and on the server --

00000100:00100000:0:1197564324.597517:0:6891:0:(events.c:259:request_in_callback()) @@@ incoming req req@ffff810402a64a00 x344/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
00000100:00100000:2:1197564324.597537:0:7120:0:(service.c:1048:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_125:87702303-a9bb-4c4f-6494-392157b9b9f7+6:4024:x343:12345-192.168.64.102@o2ib:400
00000100:00100000:2:1197564324.597541:0:7120:0:(service.c:1076:ptlrpc_server_handle_request()) request x343 opc 400 from 12345-192.168.64.102@o2ib processed in 33us (63us total) trans 0 rc 0/0
00000100:00100000:3:1197564324.597542:0:7119:0:(service.c:1048:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_124:87702303-a9bb-4c4f-6494-392157b9b9f7+6:4024:x342:12345-192.168.64.102@o2ib:400
00000100:00100000:1:1197564324.597543:0:7004:0:(service.c:1048:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_09:87702303-a9bb-4c4f-6494-392157b9b9f7+6:4024:x341:12345-192.168.64.102@o2ib:400
00000100:00100000:3:1197564324.597547:0:7119:0:(service.c:1076:ptlrpc_server_handle_request()) request x342 opc 400 from 12345-192.168.64.102@o2ib processed in 39us (78us total) trans 0 rc 0/0
00000100:00100000:1:1197564324.597548:0:7004:0:(service.c:1076:ptlrpc_server_handle_request()) request x341 opc 400 from 12345-192.168.64.102@o2ib processed in 42us (96us total) trans 0 rc 0/0
00000100:00100000:0:1197564324.597550:0:7001:0:(service.c:1032:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_06:87702303-a9bb-4c4f-6494-392157b9b9f7+31:4025:x344:12345-192.168.64.102@o2ib:8
00000010:00000002:0:1197564324.597553:0:7001:0:(ost_handler.c:1489:ost_handle()) connect
00010000:00000400:0:1197564324.597565:0:7001:0:(ldlm_lib.c:515:target_handle_reconnect()) data-OST0005: 87702303-a9bb-4c4f-6494-392157b9b9f7 reconnecting
00010000:00000400:0:1197564324.597568:0:7001:0:(ldlm_lib.c:744:target_handle_connect()) data-OST0005: refuse reconnection from 87702303-a9bb-4c4f-6494-392157b9b9f7@192.168.64.102@o2ib to 0xffff810401c96000; still busy with 9 active RPCs
00010000:00020000:0:1197564324.598221:0:7001:0:(ldlm_lib.c:1458:target_send_reply_msg()) @@@ processing error (-16) req@ffff810402a64a00 x344/t0 o8->87702303-a9bb-4c4f-6494-392157b9b9f7@:0/0 lens 304/200 e 0 to 0 dl 1197564424 ref 1 fl Interpret:/0/0 rc -16/0
00000100:00100000:0:1197564324.598232:0:7001:0:(service.c:1048:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_06:87702303-a9bb-4c4f-6494-392157b9b9f7+31:4025:x344:12345-192.168.64.102@o2ib:8
00000100:00100000:0:1197564324.598237:0:7001:0:(service.c:1076:ptlrpc_server_handle_request()) request x344 opc 8 from 12345-192.168.64.102@o2ib processed in 689us (721us total) trans 0 rc -16/-16
00000100:00100000:2:1197564336.503728:0:6894:0:(client.c:925:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:8a71c591-7623-dd81-fbc5-47fb75e4b360:6894:114:192.168.64.70@o2ib:400
00000100:00100000:2:1197564336.503851:0:6894:0:(events.c:104:reply_in_callback()) @@@ unlink req@ffff8104029dc400 x114/t0 o400->MGS@MGC192.168.64.70@o2ib_0:26/25 lens 128/256 e 0 to 6 dl 1197564343 ref 1 fl Interpret:RN/8/0 rc 0/0
00000100:00100000:2:1197564336.503860:0:6894:0:(client.c:1158:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:8a71c591-7623-dd81-fbc5-47fb75e4b360:6894:114:192.168.64.70@o2ib:400
Debug log: 269 lines, 269 kept, 0 dropped.

I don't see any insight here.

On Dec 13, 2007, at 11:36 AM, Oleg Drokin wrote:
> echo +vfstrace +rpctrace +inode >/proc/sys/lnet/debug
Hello!

On Dec 13, 2007, at 11:48 AM, Aaron Knister wrote:
> On the client I see this --

This shows no activity aside from the fact that the client is disconnected from OST5.

> and on the server --

This one shows that the server does not allow the client to reconnect because it is still busy processing other requests from that client. That's the reason for the "mount hang".

This is all I can tell from the logs you provided. If the logs actually span further back, there might be more useful info. Since there was a disconnection, perhaps dmesg on the client and server contains more info about the reasons for it. Also, if you do sysrq-t on the server, you will see what is going on with those server threads that are supposedly still processing client requests.

Bye,
Oleg
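A sketch of that collection step (the output file names are arbitrary; sysrq-t only works if the sysrq interface is enabled):

    # on both client and server: save recent kernel messages
    dmesg > /tmp/dmesg-$(hostname).log

    # on the OSS: enable sysrq, then dump a stack trace of every task
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/sysrq-t-$(hostname).log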
Thanks for your help! I have some more information from the lctl dk --

10000000:01000000:3:1197576228.177725:0:8816:0:(mgc_request.c:1130:mgc_process_log()) Can't get cfg lock: -108
10000000:01000000:1:1197576228.177727:0:8511:0:(mgc_request.c:558:mgc_blocking_ast()) Lock res 0x61746164 (data)
00000100:00020000:3:1197576228.177728:0:8816:0:(client.c:710:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8103dba84c00 x390/t0 o501->MGS@MGC192.168.64.70@o2ib_0:26/25 lens 200/304 e 0 to 11 dl 0 ref 1 fl Rpc:/8/0 rc 0/0
10000000:01000000:1:1197576228.177729:0:8511:0:(mgc_request.c:583:mgc_blocking_ast()) log data-OST0000: original grant failed, will requeue later
10000000:01000000:3:1197576228.177731:0:8816:0:(mgc_request.c:1182:mgc_process_log()) MGC192.168.64.70@o2ib: configuration from log 'data-OST0000' failed (-108).
00000100:00080000:1:1197576236.900462:0:8444:0:(pinger.c:143:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or recovery disabled: 0/1)

This is on the OSS.

Also on the OSS --

00010000:00000400:2:1197576684.886679:0:8597:0:(ldlm_lib.c:515:target_handle_reconnect()) data-OST0005: 532a7ed7-8e93-e086-885a-b064e46adb12 reconnecting
00010000:00000400:2:1197576684.886683:0:8597:0:(ldlm_lib.c:744:target_handle_connect()) data-OST0005: refuse reconnection from 532a7ed7-8e93-e086-885a-b064e46adb12@192.168.64.102@o2ib to 0xffff8103cc9e3000; still busy with 9 active RPCs
00000100:00100000:1:1197576684.886683:0:8599:0:(service.c:1032:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_55:532a7ed7-8e93-e086-885a-b064e46adb12+6:3962:x868:12345-192.168.64.102@o2ib:400
00000010:00000002:1:1197576684.886687:0:8599:0:(ost_handler.c:1598:ost_handle()) @@@ ping req@ffff81042f7a3c00 x868/t0 o400->532a7ed7-8e93-e086-885a-b064e46adb12@NET_0x50000c0a84066_UUID:0/0 lens 128/0 e 0 to 0 dl 1197576784 ref 1 fl Interpret:/0/0 rc 0/0
00010000:00020000:2:1197576684.886688:0:8597:0:(ldlm_lib.c:1458:target_send_reply_msg()) @@@ processing error (-16) req@ffff8104167fe850 x871/t0 o8->532a7ed7-8e93-e086-885a-b064e46adb12@NET_0x50000c0a84066_UUID:0/0 lens 304/200 e 0 to 0 dl 1197576784 ref 1 fl Interpret:/0/0 rc -16/0

On the client it shows --

00000100:00080000:0:1197576416.143577:0:3964:0:(recover.c:54:ptlrpc_initiate_recovery()) data-OST0004_UUID: starting recovery
00000100:00080000:0:1197576416.143585:0:3964:0:(import.c:381:ptlrpc_connect_import()) ffff81082f49a000 data-OST0004_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:0:1197576416.143590:0:3964:0:(import.c:275:import_select_connection()) data-OST0004-osc-ffff81082ae12400: connect to NID 192.168.64.71@o2ib last attempt 4296998987
00000100:00080000:0:1197576416.143597:0:3964:0:(import.c:339:import_select_connection()) data-OST0004-osc-ffff81082ae12400: import ffff81082f49a000 using connection 192.168.64.71@o2ib/192.168.64.71@o2ib
00000100:02020000:0:1197576416.143864:0:3963:0:(client.c:581:ptlrpc_check_status()) 11-0: an error occurred while communicating with 192.168.64.71@o2ib. The ost_connect operation failed with -16
00000100:00080000:0:1197576416.144314:0:3963:0:(import.c:759:ptlrpc_connect_interpret()) ffff81082f49a000 data-OST0004_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:0:1197576416.144316:0:3963:0:(import.c:801:ptlrpc_connect_interpret()) recovery of data-OST0004_UUID on 192.168.64.71@o2ib failed (-16)

I'm at a loss.

On Dec 13, 2007, at 11:59 AM, Oleg Drokin wrote:
> This one shows that the server does not allow the client to reconnect because it is still busy processing other requests from that client. [...]
Don't ask me how, but out of the blue it resolved itself. I have no idea what went wrong...

On Dec 13, 2007, at 3:12 PM, Aaron Knister wrote:
> Thanks for your help! I have some more information from the lctl dk [...]
Just kidding... I spoke WAY too soon. It's acting up again.

On Dec 13, 2007, at 6:51 PM, Aaron Knister wrote:
> Don't ask me how, but out of the blue it resolved itself. I have no idea what went wrong... [...]
On Dec 13, 2007, at 6:52 PM, Aaron Knister wrote:
> Just kidding... I spoke WAY too soon. It's acting up again.

Unfortunately, all of the logs you have provided so far are from the middle of the problem, after the Lustre client was already evicted. How about the point when it all started? Also, sysrq-t from the OSSes, to find out what they are doing, would be useful.

Bye,
Oleg
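To capture the point where it all starts rather than the middle, one approach is to enlarge and empty the debug buffer immediately before reproducing (a sketch; lctl clear and /proc/sys/lnet/debug_mb are the usual 1.6-era knobs, but the 256MB size here is a guess):

    # on the client and OSS, just before triggering the hang:
    echo 256 > /proc/sys/lnet/debug_mb    # enlarge the debug ring buffer (size is a guess)
    lctl clear                            # empty it so the dump begins at the problem
    echo +vfstrace +rpctrace +inode > /proc/sys/lnet/debug

    # ... trigger the write that hangs ...

    lctl dk > /tmp/debug-from-start.log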
I have some logs from the MDS/OSS startup. One thing I see consistently on the OSTs having this problem is "still busy with 9 active RPCs". The number is always 9 when it hangs up. I've attached some of the dk logs. Thank you so much for your help.

00010000:00000400:0:1198000430.377418:0:5835:0:(ldlm_lib.c:517:target_handle_reconnect()) data-OST0004: 5ca85557-1b99-442d-d494-f1096d3fa4c4 reconnecting
00010000:00000400:0:1198000430.377496:0:5835:0:(ldlm_lib.c:746:target_handle_connect()) data-OST0004: refuse reconnection from 5ca85557-1b99-442d-d494-f1096d3fa4c4@192.168.64.102@o2ib to 0xffff810405037000; still busy with 9 active RPCs
00010000:00020000:0:1198000430.377787:0:5835:0:(ldlm_lib.c:1460:target_send_reply_msg()) @@@ processing error (-16) req@ffff810407a35400 x35081142/t0 o8->5ca85557-1b99-442d-d494-f1096d3fa4c4@NET_0x50000c0a84066_UUID:0/0 lens 304/200 e 0 to 0 dl 1198000530 ref 1 fl Interpret:/0/0 rc -16/0
00010000:00000400:0:1198000455.377760:0:5839:0:(ldlm_lib.c:517:target_handle_reconnect()) data-OST0004: 5ca85557-1b99-442d-d494-f1096d3fa4c4 reconnecting
00010000:00000400:0:1198000455.378250:0:5839:0:(ldlm_lib.c:746:target_handle_connect()) data-OST0004: refuse reconnection from 5ca85557-1b99-442d-d494-f1096d3fa4c4@192.168.64.102@o2ib to 0xffff810405037000; still busy with 9 active RPCs
00010000:00020000:0:1198000455.378955:0:5839:0:(ldlm_lib.c:1460:target_send_reply_msg()) @@@ processing error (-16) req@ffff810423ceba00 x35081148/t0 o8->5ca85557-1b99-442d-d494-f1096d3fa4c4@NET_0x50000c0a84066_UUID:0/0 lens 304/200 e 0 to 0 dl 1198000555 ref 1 fl Interpret:/0/0 rc -16/0
Debug log: 6 lines, 6 kept, 0 dropped.

[Attachment: dklogs.zip (application/zip, 38189 bytes): http://lists.lustre.org/pipermail/lustre-discuss/attachments/20071218/84b3dcf2/attachment-0002.zip]

On Dec 15, 2007, at 11:32 PM, Oleg Drokin wrote:
> Unfortunately, all of the logs you have provided so far are from the middle of the problem, after the Lustre client was already evicted. [...] Also, sysrq-t from the OSSes, to find out what they are doing, would be useful.
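Since the "still busy with 9 active RPCs" count never seems to drain, it may help to identify which requests those are. A rough way to do that (a sketch based on the "Handling RPC"/"Handled RPC" pairing visible in these dk logs; /tmp/oss-debug.log is a placeholder for wherever the OSS dump was saved):

    # xids the OSS started handling
    grep 'Handling RPC' /tmp/oss-debug.log | grep -oE 'x[0-9]+' | sort -u > /tmp/started
    # xids it finished handling
    grep 'Handled RPC' /tmp/oss-debug.log | grep -oE 'x[0-9]+' | sort -u > /tmp/finished
    # started but never finished -- candidates for the nine stuck RPCs
    comm -23 /tmp/started /tmp/finished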