Please bear with me while I try to explain this problem. It's very strange. I have a 33TB Lustre file system with 5.6 underlying LUNs. The interconnect is InfiniBand. Any write I/O to the disk causes the mount to hang, and the underlying disk starts doing lots and lots of tiny reads for about 10 minutes. There is only one client mounted, and this small I/O continues after killing the client. I am running Lustre 1.6.3 with the latest RHEL5 kernel on RHEL5. I cannot find any suggestive error messages and am at a loss. It's a production filesystem and it's pretty much unusable.

-Aaron
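For what it's worth, a storm of tiny reads like this is easiest to quantify with iostat (a sketch; /dev/sdb is a placeholder for whichever block device backs the affected OST):

    # watch per-device rates and request sizes at one-second intervals;
    # avgrq-sz (in sectors) stays tiny during the small-read storm
    iostat -x 1 /dev/sdb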
Make that six 5.6TB underlying LUNs.

On Dec 13, 2007, at 11:24 AM, Aaron Knister wrote:
> Please bear with me while I try to explain this problem. It's very strange. [...]

Aaron Knister
Associate Systems Administrator/Web Designer
Center for Research on Environment and Water
(301) 595-7001
aaron@iges.org
Hello!

On Dec 13, 2007, at 11:24 AM, Aaron Knister wrote:
> I have a 33TB lustre file system with 5.6 underlying LUNs. The interconnect is infiniband. Any write I/O to the disk causes the mount to hang, and the underlying disk starts doing lots and lots of tiny reads for about 10 minutes. [...] It's a production filesystem and it's pretty much unusable.

This is the native client, not an NFS client, I presume? Also, is the write an append (to the end of files) or a rewrite (into the middle of existing files)? How do you do the write?

You can enable some debug logging on your OSTs and/or clients (echo +vfstrace +rpctrace +inode >/proc/sys/lnet/debug), replicate the problem, then do lctl dk >somewhere (important, as the debug buffer is cleared by this) and inspect the file where you dumped the log to see what those reads are and where they come from.

Bye,
Oleg
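Spelled out, the capture sequence Oleg describes would look something like this (a sketch; the debug mask and the lctl dk dump are from his message, the output path is an arbitrary choice):

    # on the OSS and/or client: add VFS, RPC and inode tracing to the debug mask
    echo +vfstrace +rpctrace +inode > /proc/sys/lnet/debug

    # ... reproduce the hanging write ...

    # dump the kernel debug buffer to a file (this also empties the buffer)
    lctl dk > /tmp/lustre-debug.log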
On the client I see this --

00000100:00100000:1:1197564425.523422:0:4024:0:(client.c:1339:ptlrpc_free_committed()) data-MDT0000-mdc-ffff8108275ec800: skip recheck: last_committed 54963341
00000100:00100000:1:1197564425.523425:0:4024:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:87702303-a9bb-4c4f-6494-392157b9b9f7:4024:370:192.168.64.70@o2ib:400
00000100:00100000:1:1197564425.523434:0:4024:0:(client.c:1339:ptlrpc_free_committed()) data-OST0000-osc-ffff8108275ec800: skip recheck: last_committed 13452811
00000100:00100000:1:1197564425.523437:0:4024:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:87702303-a9bb-4c4f-6494-392157b9b9f7:4024:371:192.168.64.71@o2ib:400
00000100:00100000:1:1197564425.523440:0:4024:0:(client.c:1339:ptlrpc_free_committed()) data-OST0001-osc-ffff8108275ec800: skip recheck: last_committed 13219554
00000100:00100000:1:1197564425.523443:0:4024:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:87702303-a9bb-4c4f-6494-392157b9b9f7:4024:372:192.168.64.71@o2ib:400
00000100:00100000:1:1197564425.523470:0:4025:0:(client.c:733:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:87702303-a9bb-4c4f-6494-392157b9b9f7:4025:376:192.168.64.71@o2ib:8
00000100:00100000:1:1197564425.523517:0:4024:0:(client.c:1339:ptlrpc_free_committed()) data-OST0002-osc-ffff8108275ec800: skip recheck: last_committed 12117104
00000100:00100000:1:1197564425.523520:0:4024:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:87702303-a9bb-4c4f-6494-392157b9b9f7:4024:373:192.168.64.71@o2ib:400
00000100:00100000:1:1197564425.523524:0:4024:0:(client.c:1339:ptlrpc_free_committed()) data-OST0003-osc-ffff8108275ec800: skip recheck: last_committed 11979465
00000100:00100000:1:1197564425.523532:0:4024:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:87702303-a9bb-4c4f-6494-392157b9b9f7:4024:374:192.168.64.71@o2ib:400
00000100:00100000:1:1197564425.523535:0:4024:0:(client.c:1339:ptlrpc_free_committed()) data-OST0004-osc-ffff8108275ec800: skip recheck: last_committed 13572310
00000100:00100000:1:1197564425.523537:0:4024:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:87702303-a9bb-4c4f-6494-392157b9b9f7:4024:375:192.168.64.71@o2ib:400
00000100:02020000:1:1197564425.524253:0:4025:0:(client.c:581:ptlrpc_check_status()) 11-0: an error occurred while communicating with 192.168.64.71@o2ib. The ost_connect operation failed with -16
00000100:00100000:1:1197564425.524256:0:4025:0:(client.c:1339:ptlrpc_free_committed()) data-OST0005-osc-ffff8108275ec800: skip recheck: last_committed 13074662
00000100:00080000:1:1197564425.524265:0:4025:0:(import.c:759:ptlrpc_connect_interpret()) ffff810827423200 data-OST0005_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:1:1197564425.524268:0:4025:0:(import.c:801:ptlrpc_connect_interpret()) recovery of data-OST0005_UUID on 192.168.64.71@o2ib failed (-16)
00000100:00100000:1:1197564425.524276:0:4025:0:(client.c:949:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd-recov:87702303-a9bb-4c4f-6494-392157b9b9f7:4025:376:192.168.64.71@o2ib:8

and on the server --

00000100:00100000:0:1197564324.597517:0:6891:0:(events.c:259:request_in_callback()) @@@ incoming req req@ffff810402a64a00 x344/t0 o8-><?>@<?>:0/0 lens 304/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
00000100:00100000:2:1197564324.597537:0:7120:0:(service.c:1048:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_125:87702303-a9bb-4c4f-6494-392157b9b9f7+6:4024:x343:12345-192.168.64.102@o2ib:400
00000100:00100000:2:1197564324.597541:0:7120:0:(service.c:1076:ptlrpc_server_handle_request()) request x343 opc 400 from 12345-192.168.64.102@o2ib processed in 33us (63us total) trans 0 rc 0/0
00000100:00100000:3:1197564324.597542:0:7119:0:(service.c:1048:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_124:87702303-a9bb-4c4f-6494-392157b9b9f7+6:4024:x342:12345-192.168.64.102@o2ib:400
00000100:00100000:1:1197564324.597543:0:7004:0:(service.c:1048:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_09:87702303-a9bb-4c4f-6494-392157b9b9f7+6:4024:x341:12345-192.168.64.102@o2ib:400
00000100:00100000:3:1197564324.597547:0:7119:0:(service.c:1076:ptlrpc_server_handle_request()) request x342 opc 400 from 12345-192.168.64.102@o2ib processed in 39us (78us total) trans 0 rc 0/0
00000100:00100000:1:1197564324.597548:0:7004:0:(service.c:1076:ptlrpc_server_handle_request()) request x341 opc 400 from 12345-192.168.64.102@o2ib processed in 42us (96us total) trans 0 rc 0/0
00000100:00100000:0:1197564324.597550:0:7001:0:(service.c:1032:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_06:87702303-a9bb-4c4f-6494-392157b9b9f7+31:4025:x344:12345-192.168.64.102@o2ib:8
00000010:00000002:0:1197564324.597553:0:7001:0:(ost_handler.c:1489:ost_handle()) connect
00010000:00000400:0:1197564324.597565:0:7001:0:(ldlm_lib.c:515:target_handle_reconnect()) data-OST0005: 87702303-a9bb-4c4f-6494-392157b9b9f7 reconnecting
00010000:00000400:0:1197564324.597568:0:7001:0:(ldlm_lib.c:744:target_handle_connect()) data-OST0005: refuse reconnection from 87702303-a9bb-4c4f-6494-392157b9b9f7@192.168.64.102@o2ib to 0xffff810401c96000; still busy with 9 active RPCs
00010000:00020000:0:1197564324.598221:0:7001:0:(ldlm_lib.c:1458:target_send_reply_msg()) @@@ processing error (-16) req@ffff810402a64a00 x344/t0 o8->87702303-a9bb-4c4f-6494-392157b9b9f7@:0/0 lens 304/200 e 0 to 0 dl 1197564424 ref 1 fl Interpret:/0/0 rc -16/0
00000100:00100000:0:1197564324.598232:0:7001:0:(service.c:1048:ptlrpc_server_handle_request()) Handled RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_06:87702303-a9bb-4c4f-6494-392157b9b9f7+31:4025:x344:12345-192.168.64.102@o2ib:8
00000100:00100000:0:1197564324.598237:0:7001:0:(service.c:1076:ptlrpc_server_handle_request()) request x344 opc 8 from 12345-192.168.64.102@o2ib processed in 689us (721us total) trans 0 rc -16/-16
00000100:00100000:2:1197564336.503728:0:6894:0:(client.c:925:ptlrpc_send_new_req()) Sending RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:8a71c591-7623-dd81-fbc5-47fb75e4b360:6894:114:192.168.64.70@o2ib:400
00000100:00100000:2:1197564336.503851:0:6894:0:(events.c:104:reply_in_callback()) @@@ unlink req@ffff8104029dc400 x114/t0 o400->MGS@MGC192.168.64.70@o2ib_0:26/25 lens 128/256 e 0 to 6 dl 1197564343 ref 1 fl Interpret:RN/8/0 rc 0/0
00000100:00100000:2:1197564336.503860:0:6894:0:(client.c:1158:ptlrpc_check_set()) Completed RPC pname:cluuid:pid:xid:nid:opc ptlrpcd:8a71c591-7623-dd81-fbc5-47fb75e4b360:6894:114:192.168.64.70@o2ib:400
Debug log: 269 lines, 269 kept, 0 dropped.

I don't see any insight here.

On Dec 13, 2007, at 11:36 AM, Oleg Drokin wrote:
> echo +vfstrace +rpctrace +inode >/proc/sys/lnet/debug
Hello!

On Dec 13, 2007, at 11:48 AM, Aaron Knister wrote:
> On the client I see this --

This shows no activity aside from the fact that the client is disconnected from OST5.

> and on the server --

This one shows that the server does not allow the client to reconnect because it is still busy processing other requests from that client. That's the reason for the "mount hang".

This is all I can tell from the logs you provided. If the logs actually span further back, there might be more useful info. Since there was a disconnection, perhaps dmesg on the client and server contains more info about the reasons for it. Also, if you do sysrq-t on the server, you will see what is going on with those server threads that are supposedly still processing client requests.

Bye,
Oleg
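A sketch of that collection step (the output file names are arbitrary; sysrq-t only works if the sysrq interface is enabled):

    # on both client and server: save recent kernel messages
    dmesg > /tmp/dmesg-$(hostname).log

    # on the OSS: enable sysrq, then dump a stack trace of every task
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    dmesg > /tmp/sysrq-t-$(hostname).log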
Thanks for your help! I have some more information from the lctl dk --

10000000:01000000:3:1197576228.177725:0:8816:0:(mgc_request.c:1130:mgc_process_log()) Can't get cfg lock: -108
10000000:01000000:1:1197576228.177727:0:8511:0:(mgc_request.c:558:mgc_blocking_ast()) Lock res 0x61746164 (data)
00000100:00020000:3:1197576228.177728:0:8816:0:(client.c:710:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8103dba84c00 x390/t0 o501->MGS@MGC192.168.64.70@o2ib_0:26/25 lens 200/304 e 0 to 11 dl 0 ref 1 fl Rpc:/8/0 rc 0/0
10000000:01000000:1:1197576228.177729:0:8511:0:(mgc_request.c:583:mgc_blocking_ast()) log data-OST0000: original grant failed, will requeue later
10000000:01000000:3:1197576228.177731:0:8816:0:(mgc_request.c:1182:mgc_process_log()) MGC192.168.64.70@o2ib: configuration from log 'data-OST0000' failed (-108).
00000100:00080000:1:1197576236.900462:0:8444:0:(pinger.c:143:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or recovery disabled: 0/1)

This is on the OSS.

Also on the OSS --

00010000:00000400:2:1197576684.886679:0:8597:0:(ldlm_lib.c:515:target_handle_reconnect()) data-OST0005: 532a7ed7-8e93-e086-885a-b064e46adb12 reconnecting
00010000:00000400:2:1197576684.886683:0:8597:0:(ldlm_lib.c:744:target_handle_connect()) data-OST0005: refuse reconnection from 532a7ed7-8e93-e086-885a-b064e46adb12@192.168.64.102@o2ib to 0xffff8103cc9e3000; still busy with 9 active RPCs
00000100:00100000:1:1197576684.886683:0:8599:0:(service.c:1032:ptlrpc_server_handle_request()) Handling RPC pname:cluuid+ref:pid:xid:nid:opc ll_ost_55:532a7ed7-8e93-e086-885a-b064e46adb12+6:3962:x868:12345-192.168.64.102@o2ib:400
00000010:00000002:1:1197576684.886687:0:8599:0:(ost_handler.c:1598:ost_handle()) @@@ ping req@ffff81042f7a3c00 x868/t0 o400->532a7ed7-8e93-e086-885a-b064e46adb12@NET_0x50000c0a84066_UUID:0/0 lens 128/0 e 0 to 0 dl 1197576784 ref 1 fl Interpret:/0/0 rc 0/0
00010000:00020000:2:1197576684.886688:0:8597:0:(ldlm_lib.c:1458:target_send_reply_msg()) @@@ processing error (-16) req@ffff8104167fe850 x871/t0 o8->532a7ed7-8e93-e086-885a-b064e46adb12@NET_0x50000c0a84066_UUID:0/0 lens 304/200 e 0 to 0 dl 1197576784 ref 1 fl Interpret:/0/0 rc -16/0

On the client it shows --

00000100:00080000:0:1197576416.143577:0:3964:0:(recover.c:54:ptlrpc_initiate_recovery()) data-OST0004_UUID: starting recovery
00000100:00080000:0:1197576416.143585:0:3964:0:(import.c:381:ptlrpc_connect_import()) ffff81082f49a000 data-OST0004_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:0:1197576416.143590:0:3964:0:(import.c:275:import_select_connection()) data-OST0004-osc-ffff81082ae12400: connect to NID 192.168.64.71@o2ib last attempt 4296998987
00000100:00080000:0:1197576416.143597:0:3964:0:(import.c:339:import_select_connection()) data-OST0004-osc-ffff81082ae12400: import ffff81082f49a000 using connection 192.168.64.71@o2ib/192.168.64.71@o2ib
00000100:02020000:0:1197576416.143864:0:3963:0:(client.c:581:ptlrpc_check_status()) 11-0: an error occurred while communicating with 192.168.64.71@o2ib. The ost_connect operation failed with -16
00000100:00080000:0:1197576416.144314:0:3963:0:(import.c:759:ptlrpc_connect_interpret()) ffff81082f49a000 data-OST0004_UUID: changing import state from CONNECTING to DISCONN
00000100:00080000:0:1197576416.144316:0:3963:0:(import.c:801:ptlrpc_connect_interpret()) recovery of data-OST0004_UUID on 192.168.64.71@o2ib failed (-16)

I'm at a loss.

On Dec 13, 2007, at 11:59 AM, Oleg Drokin wrote:
> This one shows that the server does not allow the client to reconnect because it is still busy processing other requests from that client. [...]
Don't ask me how, but out of the blue it resolved itself. I have no idea what went wrong...

On Dec 13, 2007, at 3:12 PM, Aaron Knister wrote:
> Thanks for your help! I have some more information from the lctl dk [...]
Just kidding... I spoke WAY too soon. It's acting up again.

On Dec 13, 2007, at 6:51 PM, Aaron Knister wrote:
> Don't ask me how, but out of the blue it resolved itself. I have no idea what went wrong... [...]
On Dec 13, 2007, at 6:52 PM, Aaron Knister wrote:
> Just kidding... I spoke WAY too soon. It's acting up again.

Unfortunately, all of the logs you have provided so far are from the middle of the problem, after the Lustre client was already evicted. How about the point when it all started? Also, sysrq-t from the OSSes, to find out what they are doing, would be useful.

Bye,
Oleg
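To capture the point where it all starts rather than the middle, one approach is to enlarge and empty the debug buffer immediately before reproducing (a sketch; lctl clear and /proc/sys/lnet/debug_mb are the usual 1.6-era knobs, but the 256MB size here is a guess):

    # on the client and OSS, just before triggering the hang:
    echo 256 > /proc/sys/lnet/debug_mb    # enlarge the debug ring buffer (size is a guess)
    lctl clear                            # empty it so the dump begins at the problem
    echo +vfstrace +rpctrace +inode > /proc/sys/lnet/debug

    # ... trigger the write that hangs ...

    lctl dk > /tmp/debug-from-start.log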
I have some logs from the MDS/OSS startup. One thing I see consistently on the OSTs having this problem is "still busy with 9 active RPCs". The number is always 9 when it hangs up. I've attached some of the dk logs. Thank you so much for your help.

00010000:00000400:0:1198000430.377418:0:5835:0:(ldlm_lib.c:517:target_handle_reconnect()) data-OST0004: 5ca85557-1b99-442d-d494-f1096d3fa4c4 reconnecting
00010000:00000400:0:1198000430.377496:0:5835:0:(ldlm_lib.c:746:target_handle_connect()) data-OST0004: refuse reconnection from 5ca85557-1b99-442d-d494-f1096d3fa4c4@192.168.64.102@o2ib to 0xffff810405037000; still busy with 9 active RPCs
00010000:00020000:0:1198000430.377787:0:5835:0:(ldlm_lib.c:1460:target_send_reply_msg()) @@@ processing error (-16) req@ffff810407a35400 x35081142/t0 o8->5ca85557-1b99-442d-d494-f1096d3fa4c4@NET_0x50000c0a84066_UUID:0/0 lens 304/200 e 0 to 0 dl 1198000530 ref 1 fl Interpret:/0/0 rc -16/0
00010000:00000400:0:1198000455.377760:0:5839:0:(ldlm_lib.c:517:target_handle_reconnect()) data-OST0004: 5ca85557-1b99-442d-d494-f1096d3fa4c4 reconnecting
00010000:00000400:0:1198000455.378250:0:5839:0:(ldlm_lib.c:746:target_handle_connect()) data-OST0004: refuse reconnection from 5ca85557-1b99-442d-d494-f1096d3fa4c4@192.168.64.102@o2ib to 0xffff810405037000; still busy with 9 active RPCs
00010000:00020000:0:1198000455.378955:0:5839:0:(ldlm_lib.c:1460:target_send_reply_msg()) @@@ processing error (-16) req@ffff810423ceba00 x35081148/t0 o8->5ca85557-1b99-442d-d494-f1096d3fa4c4@NET_0x50000c0a84066_UUID:0/0 lens 304/200 e 0 to 0 dl 1198000555 ref 1 fl Interpret:/0/0 rc -16/0
Debug log: 6 lines, 6 kept, 0 dropped.

[Attachment: dklogs.zip (application/zip, 38189 bytes): http://lists.lustre.org/pipermail/lustre-discuss/attachments/20071218/84b3dcf2/attachment-0002.zip]

On Dec 15, 2007, at 11:32 PM, Oleg Drokin wrote:
> Unfortunately, all of the logs you have provided so far are from the middle of the problem, after the Lustre client was already evicted. [...] Also, sysrq-t from the OSSes, to find out what they are doing, would be useful.
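Since the "still busy with 9 active RPCs" count never seems to drain, it may help to identify which requests those are. A rough way to do that (a sketch based on the "Handling RPC"/"Handled RPC" pairing visible in these dk logs; /tmp/oss-debug.log is a placeholder for wherever the OSS dump was saved):

    # xids the OSS started handling
    grep 'Handling RPC' /tmp/oss-debug.log | grep -oE 'x[0-9]+' | sort -u > /tmp/started
    # xids it finished handling
    grep 'Handled RPC' /tmp/oss-debug.log | grep -oE 'x[0-9]+' | sort -u > /tmp/finished
    # started but never finished -- candidates for the nine stuck RPCs
    comm -23 /tmp/started /tmp/finished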