thr3ads.net - Lustre discuss - [Lustre-discuss] "unexpectedly long timeout" [Nov 2008]

Brock Palen

2008-Nov-05 20:53 UTC

[Lustre-discuss] "unexpectedly long timeout"

This is a new error I have never seen before. Googling didn't find much
other than an error involving IB. This node has IB, but Lustre runs over TCP.

Nov  5 02:19:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c:305:ptlrpc_unregister_bulk()) @@@ Unexpectedly long timeout: desc 000001041802f600  req@0000010005151a00 x1071812/t0 o4->nobackup-OST000c_UUID@10.164.3.245@tcp:6/4 lens 384/480 e 0 to 100 dl 1225842598 ref 2 fl Rpc:X/0/0 rc 0/0
Nov  5 02:19:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c:305:ptlrpc_unregister_bulk()) Skipped 1 previous similar message
Nov  5 02:29:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c:305:ptlrpc_unregister_bulk()) @@@ Unexpectedly long timeout: desc 000001041802f600  req@0000010005151a00 x1071812/t0 o4->nobackup-OST000c_UUID@10.164.3.245@tcp:6/4 lens 384/480 e 0 to 100 dl 1225842598 ref 2 fl Rpc:X/0/0 rc 0/0

On the OSS that provides OST000c, the only errors I see from that
node are the usual "can't hear from node" messages:

Nov  4 18:46:02 oss2 kernel: Lustre: 6426:0:(ost_handler.c:1270:ost_brw_write()) nobackup-OST000c: ignoring bulk IO comm error with 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e@NET_0x200000aa4029c_UUID id 12345-10.164.2.156@tcp - client will retry
Nov  4 18:49:42 oss2 kernel: Lustre: nobackup-OST000c: haven't heard from client 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e (at 10.164.2.156@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Nov  4 18:49:42 oss2 kernel: Lustre: nobackup-OST000d: haven't heard from client 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e (at 10.164.2.156@tcp) in 227 seconds. I think it's dead, and I am evicting it.

Any thoughts?


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

Yuriy Umanets

2008-Nov-09 09:31 UTC

Brock Palen wrote:
> New error I have never seen before, googling didn't find much other
> than an error involving IB. This node has IB, but lustre runs over TCP.
> [...]
> Any thoughts?

Brock,

This usually means network issues. Basically, it happens when lnet does
not receive the bulk unlink callback for a long time because of a
sluggish network, and Lustre has to wait in a loop for an abnormally
long time (300s) to make sure that the bulk is unregistered. It really
can't do this any other way: an rpc waiting for bulk unregistering can't
be freed, returned to the rq pool, etc. That is, it can't be used for
any other purpose before the unregistering is done, because its buffers
are still mapped by lnet and network data may arrive at any moment. It
would be a disaster if that happened after the rpc had already been
freed.
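
As an illustration only, the waiting pattern described above can be
sketched as follows. This is not the Lustre code: the function name
wait_for_unlink, its parameters, and the bounded number of slices are
all hypothetical, chosen so the sketch terminates. It waits for a
completion callback in fixed slices and emits a warning each time a
slice expires, and only reports the request as safe to free once the
callback has arrived.

```python
import threading


def wait_for_unlink(poll_unlinked, slice_seconds, max_slices, warn=print):
    """Wait for the unlink callback in fixed slices, warning after each expired slice.

    poll_unlinked(timeout) is a hypothetical callable that returns True once
    the callback has arrived. Returns True when the request may be freed,
    False if this demo gave up after max_slices slices.
    """
    for attempt in range(1, max_slices + 1):
        if poll_unlinked(slice_seconds):
            return True  # buffers are no longer mapped; freeing is safe
        warn(f"Unexpectedly long timeout: no unlink callback after "
             f"{attempt * slice_seconds} s")
    # Assumption for the demo only: give up eventually so the sketch terminates.
    return False


if __name__ == "__main__":
    unlinked = threading.Event()
    # Simulate a callback that arrives late, delivered from another thread.
    threading.Timer(0.25, unlinked.set).start()
    print(wait_for_unlink(unlinked.wait, slice_seconds=0.1, max_slices=50))
```

The caller is stuck inside the loop for as long as the callback is
missing, which is why a slow network shows up as repeated warnings
rather than as an immediate error.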

The bad thing here is that such a blocked rpc also blocks the other rpcs
in its set from being processed. While I can't suggest how to fix this,
because it is not directly a Lustre issue but rather a network one, I
can assure you that we're working on it.

While we can't keep such a long bulk unregister from happening again, we
can make Lustre not block on it and let the other rpcs proceed. The
delay would then only be visible if such a long unlink happens around
umount time, which would make you wait longer than usual for the umount,
with all sorts of messages saying that some rpcs are still on the
sending list.
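
A rough sketch of the non-blocking idea, under stated assumptions: the
Request class, its polls_until_unlinked counter, and check_set are
hypothetical names made up for this example and do not correspond to
Lustre data structures. Each pass over the set completes whatever can be
completed and simply skips a request whose bulk is still registered,
instead of waiting on it.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    polls_until_unlinked: int = 0  # hypothetical stand-in for a late unlink callback
    done: bool = False

    def bulk_unlinked(self) -> bool:
        """Return True once the (simulated) unlink callback has arrived."""
        if self.polls_until_unlinked > 0:
            self.polls_until_unlinked -= 1
            return False
        return True


def check_set(requests):
    """One pass over the set; returns the names of requests completed in this pass."""
    completed = []
    for req in requests:
        if req.done:
            continue
        if not req.bulk_unlinked():
            continue  # still registered: leave it pending, do not hold up the others
        req.done = True
        completed.append(req.name)
    return completed


if __name__ == "__main__":
    pending = [Request("slow", polls_until_unlinked=2), Request("fast")]
    while not all(r.done for r in pending):
        print(check_set(pending))
```

With this shape, the slow request only delays the moment at which the
whole set is finished, which matches the remark that the wait would
mainly be noticed at umount time.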

Please look at bug 17310: an issue with ptlrpc_unregister_reply() was
discussed there, and patches around that issue were provided.

Thanks.

-- 
umka

Yuriy Umanets

2008-Nov-09 09:41 UTC

Yuriy Umanets wrote:
> This usually means network issues.
> [...]

Brock,

I have actually filed a bug for this. Please refer to
https://bugzilla.lustre.org/show_bug.cgi?id=17631. I hope to work this
out shortly.

Thanks.

-- 
umka
