New error I have never seen before, googling didn''t fine much other than an error involving IB. This node has IB, but lustre runs over TCP. Nov 5 02:19:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c: 305:ptlrpc_unregister_bulk()) @@@ Unexpectedly long timeout: desc 000001041802f600 req at 0000010005151a00 x1071812/t0 o4->nobackup- OST000c_UUID at 10.164.3.245@tcp:6/4 lens 384/480 e 0 to 100 dl 1225842598 ref 2 fl Rpc:X/0/0 rc 0/0Nov 5 02:19:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c:305:ptlrpc_unregister_bulk()) Skipped 1 previous similar message Nov 5 02:29:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c: 305:ptlrpc_unregister_bulk()) @@@ Unexpectedly long timeout: desc 000001041802f600 req at 0000010005151a00 x1071812/t0 o4->nobackup- OST000c_UUID at 10.164.3.245@tcp:6/4 lens 384/480 e 0 to 100 dl 1225842598 ref 2 fl Rpc:X/0/0 rc 0/0 On the OSS that provides OST000c The only errors I see from that node are the usual, ''can''t hear from node'' Nov 4 18:46:02 oss2 kernel: Lustre: 6426:0:(ost_handler.c: 1270:ost_brw_write()) nobackup-OST000c: ignoring bulk IO comm error with 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e at NET_0x200000aa4029c_UUID id 12345-10.164.2.156 at tcp - client will retry Nov 4 18:49:42 oss2 kernel: Lustre: nobackup-OST000c: haven''t heard from client 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e (at 10.164.2.156 at tcp) in 227 seconds. I think it''s dead, and I am evicting it. Nov 4 18:49:42 oss2 kernel: Lustre: nobackup-OST000d: haven''t heard from client 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e (at 10.164.2.156 at tcp) in 227 seconds. I think it''s dead, and I am evicting it. Any thoughts? Brock Palen www.umich.edu/~brockp Center for Advanced Computing brockp at umich.edu (734)936-1985
Brock Palen wrote:> New error I have never seen before, googling didn''t fine much other > than an error involving IB. This node has IB, but lustre runs over TCP. > > Nov 5 02:19:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c: > 305:ptlrpc_unregister_bulk()) @@@ Unexpectedly long timeout: desc > 000001041802f600 req at 0000010005151a00 x1071812/t0 o4->nobackup- > OST000c_UUID at 10.164.3.245@tcp:6/4 lens 384/480 e 0 to 100 dl > 1225842598 ref 2 fl Rpc:X/0/0 rc 0/0Nov 5 02:19:54 nyx668 kernel: > Lustre: 4329:0:(niobuf.c:305:ptlrpc_unregister_bulk()) Skipped 1 > previous similar message > Nov 5 02:29:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c: > 305:ptlrpc_unregister_bulk()) @@@ Unexpectedly long timeout: desc > 000001041802f600 req at 0000010005151a00 x1071812/t0 o4->nobackup- > OST000c_UUID at 10.164.3.245@tcp:6/4 lens 384/480 e 0 to 100 dl > 1225842598 ref 2 fl Rpc:X/0/0 rc 0/0 > > On the OSS that provides OST000c The only errors I see from that > node are the usual, ''can''t hear from node'' > > Nov 4 18:46:02 oss2 kernel: Lustre: 6426:0:(ost_handler.c: > 1270:ost_brw_write()) nobackup-OST000c: ignoring bulk IO comm error > with 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e at NET_0x200000aa4029c_UUID id > 12345-10.164.2.156 at tcp - client will retry > Nov 4 18:49:42 oss2 kernel: Lustre: nobackup-OST000c: haven''t heard > from client 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e (at > 10.164.2.156 at tcp) in 227 seconds. I think it''s dead, and I am > evicting it. > Nov 4 18:49:42 oss2 kernel: Lustre: nobackup-OST000d: haven''t heard > from client 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e (at > 10.164.2.156 at tcp) in 227 seconds. I think it''s dead, and I am > evicting it. > > Any thoughts? > >Brock, This usually means network issues. Basically is happens when lnet can''t receive bulk unlink callback long time due to sluggish network and Lustre needs to wait in loop abnormally long time (300s) to make sure that bulk is unregistered. It really can''t do it another way because an rpcs waiting for bulk unregistering can''t be freed, returned to rq pool, etc., that is, can''t be used for another purposes before this is done because its buffers still get mapped by lnet and network data may come any next moment. It would be disaster if this happened when rpc is already freed. Bad thing here is that, such an blocked rpc also blocks other rpcs in set from processing. While I can''t suggest how to fix this, because this is not directly to Lustre issues and rather related to network issues, I can assure you that we''re working on this issue. While we can''t make it not happen again with such a long bulk unregister, we can make lustre do not block on this and let other rpcs proceed so this could only be visible if such a long unlink happens about umout time that would make you wait longer for it than usually with all sorts messages that some rpcs are still on sending list. Please look at bug 17310, there was discussed some issue with ptlrpc_unregister_reply() as well as patches provided wotj that issue around. Thanks.> Brock Palen > www.umich.edu/~brockp > Center for Advanced Computing > brockp at umich.edu > (734)936-1985 > > > > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss >-- umka
Yuriy Umanets wrote:> Brock Palen wrote: > >> New error I have never seen before, googling didn''t fine much other >> than an error involving IB. This node has IB, but lustre runs over TCP. >> >> Nov 5 02:19:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c: >> 305:ptlrpc_unregister_bulk()) @@@ Unexpectedly long timeout: desc >> 000001041802f600 req at 0000010005151a00 x1071812/t0 o4->nobackup- >> OST000c_UUID at 10.164.3.245@tcp:6/4 lens 384/480 e 0 to 100 dl >> 1225842598 ref 2 fl Rpc:X/0/0 rc 0/0Nov 5 02:19:54 nyx668 kernel: >> Lustre: 4329:0:(niobuf.c:305:ptlrpc_unregister_bulk()) Skipped 1 >> previous similar message >> Nov 5 02:29:54 nyx668 kernel: Lustre: 4329:0:(niobuf.c: >> 305:ptlrpc_unregister_bulk()) @@@ Unexpectedly long timeout: desc >> 000001041802f600 req at 0000010005151a00 x1071812/t0 o4->nobackup- >> OST000c_UUID at 10.164.3.245@tcp:6/4 lens 384/480 e 0 to 100 dl >> 1225842598 ref 2 fl Rpc:X/0/0 rc 0/0 >> >> On the OSS that provides OST000c The only errors I see from that >> node are the usual, ''can''t hear from node'' >> >> Nov 4 18:46:02 oss2 kernel: Lustre: 6426:0:(ost_handler.c: >> 1270:ost_brw_write()) nobackup-OST000c: ignoring bulk IO comm error >> with 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e at NET_0x200000aa4029c_UUID id >> 12345-10.164.2.156 at tcp - client will retry >> Nov 4 18:49:42 oss2 kernel: Lustre: nobackup-OST000c: haven''t heard >> from client 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e (at >> 10.164.2.156 at tcp) in 227 seconds. I think it''s dead, and I am >> evicting it. >> Nov 4 18:49:42 oss2 kernel: Lustre: nobackup-OST000d: haven''t heard >> from client 0d8e8d79-bfac-9d81-a345-39aaf2d4bc0e (at >> 10.164.2.156 at tcp) in 227 seconds. I think it''s dead, and I am >> evicting it. >> >> Any thoughts? >> >> >> > Brock, > > This usually means network issues. Basically is happens when lnet can''t > receive bulk unlink callback long time due to sluggish network and > Lustre needs to wait in loop abnormally long time (300s) to make sure > that bulk is unregistered. It really can''t do it another way because an > rpcs waiting for bulk unregistering can''t be freed, returned to rq pool, > etc., that is, can''t be used for another purposes before this is done > because its buffers still get mapped by lnet and network data may come > any next moment. It would be disaster if this happened when rpc is > already freed. > > Bad thing here is that, such an blocked rpc also blocks other rpcs in > set from processing. While I can''t suggest how to fix this, because this > is not directly to Lustre issues and rather related to network issues, I > can assure you that we''re working on this issue. > > While we can''t make it not happen again with such a long bulk > unregister, we can make lustre do not block on this and let other rpcs > proceed so this could only be visible if such a long unlink happens > about umout time that would make you wait longer for it than usually > with all sorts messages that some rpcs are still on sending list. > > Please look at bug 17310, there was discussed some issue with > ptlrpc_unregister_reply() as well as patches provided wotj that issue > around. > > Thanks. >Brock, I have actually created the bug for this thing. Please refer to https://bugzilla.lustre.org/show_bug.cgi?id=17631. Hope to work this out shortly. Thanks>> Brock Palen >> www.umich.edu/~brockp >> Center for Advanced Computing >> brockp at umich.edu >> (734)936-1985 >> >> >> >> _______________________________________________ >> Lustre-discuss mailing list >> Lustre-discuss at lists.lustre.org >> http://lists.lustre.org/mailman/listinfo/lustre-discuss >> >> > > >-- umka