Rick Wagner
2011-Jul-11 22:39 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
Hi,

We are seeing intermittent client evictions from a new Lustre installation that we are testing. The errors occur on writes from a parallel job running on 32 client nodes, each with 16 tasks writing a single HDF5 file of ~40MB (512 tasks total). Occasionally, one node will be evicted from an OST, and the code running on the client will experience an IO error. The directory with the data has a stripe count of 1, and a comparable amount is read in at the start of the job. Sometimes the evictions occur the first time a write is attempted, sometimes after a successful write. There are about 15 minutes before the first write attempt and between subsequent ones.

The client and server errors are attached. In the server errors, XXX.XXX.118.141 refers to the client that gets evicted. In the client errors, here are the server names to match with the NIDs:

lustre-oss-0-2: 172.25.33.248
lustre-oss-2-0: 172.25.33.246
lustre-oss-2-2: 172.25.32.118

I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.

We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID. The filesystem is in testing, so there is no other load on it, and when watching the load during writes, the OSS machines hardly notice. The servers are running version 1.8.5, and the clients 1.8.4.

Any advice, or pointers to possible bugs, would be appreciated.

Thanks,
Rick

Attachment: server-errors.txt (http://lists.lustre.org/pipermail/lustre-discuss/attachments/20110711/39358645/attachment.txt)
Attachment: client.txt (http://lists.lustre.org/pipermail/lustre-discuss/attachments/20110711/39358645/attachment-0001.txt)
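The errno assumption in the message above is easy to verify on a client node; a minimal sketch in Python (used here only for illustration), relying on the fact that Lustre/LNET logs kernel error codes negated:

```python
import errno
import os

# Lustre/LNET report kernel errnos negated, so the -113 and -107 seen
# in the logs correspond to errno values 113 and 107 from errno.h.
# On Linux these are EHOSTUNREACH and ENOTCONN respectively.
for code in (113, 107):
    print(-code, errno.errorcode[code], os.strerror(code))
```

On a Linux box this confirms -113 is EHOSTUNREACH ("No route to host") and -107 is ENOTCONN ("Transport endpoint is not connected"), matching the assumption in the message.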
Kevin Van Maren
2011-Jul-12 16:10 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
Rick Wagner wrote:
> Hi,
>
> We are seeing intermittent client evictions from a new Lustre installation that we are testing. The errors occur on writes from a parallel job running on 32 client nodes, each with 16 tasks writing a single HDF5 file of ~40MB (512 tasks total). Occasionally, one node will be evicted from an OST, and the code running on the client will experience an IO error.

Yes, evictions are very bad. Worse than an IO error, however, is the knowledge that a write that previously "succeeded" never made it out of the client cache to disk (eviction forces the client to drop any dirty cache on the floor).

> The directory with the data has a stripe count of 1, and a comparable amount is read in at the start of the job. Sometimes the evictions occur the first time a write is attempted, sometimes after a successful write. There is about 15 minutes before the first and subsequent write attempts.

So you have 512 processes on 32 nodes writing to a single file, which exists on a single OST. Have you tuned any of the network or Lustre tunables? For example, max_dirty_mb, max_rpcs_in_flight, socket buffer sizes? What size are the RPCs and the application IO sizes?

> The client and server errors are attached. In the server errors, XXX.XXX.118.141 refers to the client that gets evicted. In the client errors, here are the server names to match with the NIDs:
> lustre-oss-0-2: 172.25.33.248
> lustre-oss-2-0: 172.25.33.246
> lustre-oss-2-2: 172.25.32.118
> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>
> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem.

You are using gigabit Ethernet for Lustre? These errors are indicating issues with IP traffic.

When you say you have never seen IP traffic have a problem, do you mean "ssh" and "ping" work, or have you stress-tested the network outside Lustre (run network tests from 32 clients to a single server)?

> But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>
> The filesystem is in testing, so there is no other load on it, and when watching the load during writes, the OSS machines hardly notice. The servers are running version 1.8.5, and the clients 1.8.4.
>
> Any advice, or pointers to possible bugs would be appreciated.

You have provided no information about your network (NICs/drivers, switches, MTU, settings, etc.), but it sounds like you are having network issues which are exhibiting themselves under load. It is possible a NIC or the switch is getting overwhelmed by the Lustre traffic, and getting stuck long enough for TCP to time out. Are the NICs or the switch reporting dropped packets? Any error counters on any links? Are pause frames enabled on the NICs and the switch? Is that some sort of socket BW test reporting _10_Mb?

Kevin

> Thanks,
> Rick
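The stress test Kevin suggests (all clients hammering one server at once, rather than stopping at a target bandwidth) could be scripted; a hypothetical sketch where the hostnames, client count, and iperf flags are illustrative assumptions, not taken from this thread:

```python
# Hypothetical many-clients-to-one-server stress test generator.
# Emits one ssh/iperf command per client, all targeting the same OSS
# simultaneously; "client-NN" and the iperf flags are assumptions.
clients = ["client-%02d" % i for i in range(32)]
server = "lustre-oss-2-0"

for host in clients:
    # 4 parallel TCP streams per client, 60 seconds, launched in parallel
    print('ssh %s "iperf -c %s -P 4 -t 60" &' % (host, server))
print("wait  # let every client finish before reading the numbers")
```

The point of the exercise is sustained, concurrent load on a single server NIC/switch port, which is closer to the failing write pattern than one-at-a-time bandwidth checks.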
Rick Wagner
2011-Jul-12 17:44 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
Hi Kevin,

Thanks very much for the reply; answers to your questions are below.

On Jul 12, 2011, at 9:10 AM, Kevin Van Maren wrote:
> Rick Wagner wrote:
>> We are seeing intermittent client evictions from a new Lustre installation that we are testing. The errors occur on writes from a parallel job running on 32 client nodes, each with 16 tasks writing a single HDF5 file of ~40MB (512 tasks total). Occasionally, one node will be evicted from an OST, and the code running on the client will experience an IO error.
>
> Yes, evictions are very bad. Worse than an IO error, however, is the knowledge that a write that previously "succeeded" never made it out of the client cache to disk (eviction forces the client to drop any dirty cache on the floor).
>
>> The directory with the data has a stripe count of 1, and a comparable amount is read in at the start of the job. Sometimes the evictions occur the first time a write is attempted, sometimes after a successful write. There is about 15 minutes before the first and subsequent write attempts.
>
> So you have 512 processes on 32 nodes writing to a single file, which exists on a single OST.

No, each task is writing its own HDF5 file of ~40MB; the total amount of data per write is 20GB. This avoids the need for synchronizing writes to a single file.

> Have you tuned any of the network or Lustre tunables? For example, max_dirty_mb, max_rpcs_in_flight? socket buffer sizes?

For network tuning, this is what we have on the clients and servers:

net.core.somaxconn = 10000
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_congestion_control = htcp
net.ipv4.ip_forward = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1

On the Lustre side, max_rpcs_in_flight = 8 and max_dirty_mb = 32. We don't have as much experience tuning Lustre, so we've tended to use the defaults.

> What size are the RPCs, application IO sizes?

As I mentioned above, each task is writing a single file, in 5 consecutive 8MB chunks. There are a few other files written by the master tasks, but the failure hasn't occurred on that particular node (yet).

>> The client and server errors are attached. In the server errors, XXX.XXX.118.141 refers to the client that gets evicted. In the client errors, here are the server names to match with the NIDs:
>> lustre-oss-0-2: 172.25.33.248
>> lustre-oss-2-0: 172.25.33.246
>> lustre-oss-2-2: 172.25.32.118
>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>>
>> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem.
>
> You are using gigabit Ethernet for Lustre?

The servers are using bonded Myricom 10Gb/s cards. On the client side, the nodes have Mellanox QDR InfiniBand HCAs, but we use a Mellanox BridgeX BX 4010, and the clients have virtual 10Gb/s NICs; hence the use of the tcp driver. We do have a problem with setting the MTU on the client side, so currently the servers are using an MTU of 9000 and the clients 1500, which means more work for the central 10Gb/s switch and the bridge.

> These errors are indicating issues with IP traffic. When you say you have never seen IP traffic have a problem, you mean "ssh" and "ping" work, or have you stress-tested the network outside Lustre (run network tests from 32 clients to a single server)?

You're right about how I defined IP functionality being available, but that's a good point about stressing the fabric. We've run simultaneous iperf tests, but only until we reach a desired bandwidth. Given our goals, I think the multiple tests will be necessary.

>> But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>>
>> The filesystem is in testing, so there is no other load on it, and when watching the load during writes, the OSS machines hardly notice. The servers are running version 1.8.5, and the clients 1.8.4.
>>
>> Any advice, or pointers to possible bugs would be appreciated.
>
> You have provided no information about your network (NICs/drivers, switches, MTU, settings, etc.), but it sounds like you are having network issues which are exhibiting themselves under load. It is possible a NIC or the switch is getting overwhelmed by the Lustre traffic, and getting stuck long enough for TCP to time out.

I've tried to fill in the network information above.

> Are the NICs or the switch reporting dropped packets? Any error counters on any links? Are pause frames enabled on the NICs and the switch?

Looking at the servers and clients, we see no dropped packets on the servers, while the clients do have a few (~10 dropped received packets on each client, out of 1e9). There are no errors on either side, so those packets should have been resent if they were coming over LNET. I will be following up with our networking group to find out if the bridge or switch are showing dropped packets, and I'll ask about the pause frames.

> Is that some sort of socket BW test reporting _10_Mb?

I'm not sure what this question refers to. Below, I've posted some of the IOR results we've seen from this system, using 32 and 64 client nodes, 1 task per node. As you can see, the system should have more than enough capacity to handle the IO that I describe above. We've also done an IOR run using 512 tasks, which succeeded, but I don't have the numbers. It's really the intermittency that drives us crazy.
Thanks again,
Rick

Participating tasks: 32
Summary:
    api                = POSIX
    test filename      = testFile
    access             = file-per-process
    clients            = 32 (1 per node)
    repetitions        = 1
    xfersize           = 1 MiB
    blocksize          = 32 GiB
    aggregate filesize = 1024 GiB

access   bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  iter
------   ---------  ----------  ---------  --------  --------  --------  ----
write    8419       33554432    1024.00    0.051805  124.51    58.80     0
read     10135      33554432    1024.00    0.002511  103.46    64.72     0

Max Write: 8418.70 MiB/sec (8827.65 MB/sec)
Max Read:  10134.86 MiB/sec (10627.18 MB/sec)

Participating tasks: 64
Summary:
    api                = POSIX
    test filename      = testFile
    access             = file-per-process
    clients            = 64 (1 per node)
    repetitions        = 1
    xfersize           = 1 MiB
    blocksize          = 32 GiB
    aggregate filesize = 2048 GiB

access   bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  iter
------   ---------  ----------  ---------  --------  --------  --------  ----
write    8554       33554432    1024.00    0.067592  245.12    165.60    0
read     12284      33554432    1024.00    0.008495  170.73    109.98    0

Max Write: 8553.74 MiB/sec (8969.25 MB/sec)
Max Read:  12283.52 MiB/sec (12880.20 MB/sec)
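As a quick cross-check of those figures, the job sizes quoted in this thread tally up as follows; a minimal sketch (the variable names are mine, the numbers are from the messages above):

```python
# Workload figures from this thread: 32 nodes x 16 tasks, one ~40 MB
# HDF5 file per task, each written in five consecutive 8 MB chunks.
nodes = 32
tasks_per_node = 16
file_mb = 40
chunk_mb = 8

total_tasks = nodes * tasks_per_node             # 512 writers
data_per_write_gb = total_tasks * file_mb / 1024.0
burst_per_node_mb = tasks_per_node * chunk_mb    # if every task flushes one chunk at once

print(total_tasks)        # 512
print(data_per_write_gb)  # 20.0 GB per write phase, as stated above
print(burst_per_node_mb)  # 128 MB leaving each node per chunk round
```

The 20 GB per write phase matches the figure stated earlier in the thread and is small next to the IOR bandwidth shown above, which supports the point that this looks like a network reliability problem rather than a capacity one.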
Isaac Huang
2011-Jul-12 18:01 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
> Hi,
> ......
> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>
> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.

I'd suggest enabling console logging of network errors, by 'echo +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages should show up in 'dmesg' when you have LNET connectivity problems.

- Isaac

______________________________________________________________________
This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.

Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.

Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.

The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
______________________________________________________________________
Rick Wagner
2011-Jul-12 18:06 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
On Jul 12, 2011, at 11:01 AM, Isaac Huang wrote:
> On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
>> Hi,
>> ......
>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>>
>> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>
> I'd suggest enabling console logging of network errors, by 'echo +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages should show up in 'dmesg' when you have LNET connectivity problems.

Thanks, Isaac, I have put that in place. We have that in the sysctl configuration, as part of lnet.debug, and thought that was sufficient. But so far, dmesg and /var/log/messages have looked very similar.

[root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk
warning error emerg console
[root@lustre-oss-0-2 ~]# sysctl -a | grep neterr
lnet.debug = ioctl neterror net warning error emerg ha config console

--Rick
Isaac Huang
2011-Jul-12 18:32 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
On Tue, Jul 12, 2011 at 11:06:40AM -0700, Rick Wagner wrote:
> On Jul 12, 2011, at 11:01 AM, Isaac Huang wrote:
>> On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
>>> Hi,
>>> ......
>>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>>>
>>> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>>
>> I'd suggest enabling console logging of network errors, by 'echo +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages should show up in 'dmesg' when you have LNET connectivity problems.
>
> Thanks, Isaac, I have put that in place. We have that in the sysctl configuration, as part of lnet.debug, and thought that was sufficient. But so far, dmesg and /var/log/messages have looked very similar.
>
> [root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk
> warning error emerg console

You should be able to see 'neterror' in 'cat /proc/sys/lnet/printk' output after 'echo +neterror > /proc/sys/lnet/printk', otherwise it's a bug. This is different from lnet.debug.

> [root@lustre-oss-0-2 ~]# sysctl -a | grep neterr
> lnet.debug = ioctl neterror net warning error emerg ha config console

- Isaac
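As the exchange above shows, /proc/sys/lnet/printk accepts '+flag' and '-flag' writes that add or remove entries from the current mask, which is why 'echo +neterror' extends the list shown by 'cat'. A small Python sketch that models (but does not touch) that add/remove behavior; the apply_mask helper is hypothetical, for illustration only:

```python
# Model of the +flag/-flag semantics of /proc/sys/lnet/printk:
# writing "+name" adds a flag to the current set, "-name" removes it.
# Nothing here touches /proc; this just illustrates the behavior.
def apply_mask(current, change):
    flags = current.split()
    name = change[1:]
    if change.startswith("+") and name not in flags:
        flags.append(name)
    elif change.startswith("-") and name in flags:
        flags.remove(name)
    return " ".join(flags)

before = "warning error emerg console"
print(apply_mask(before, "+neterror"))  # warning error emerg console neterror
```

This matches the observed transition in the thread: from "warning error emerg console" to a mask that includes neterror.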
Rick Wagner
2011-Jul-12 18:34 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
On Jul 12, 2011, at 11:32 AM, Isaac Huang wrote:
> On Tue, Jul 12, 2011 at 11:06:40AM -0700, Rick Wagner wrote:
>> On Jul 12, 2011, at 11:01 AM, Isaac Huang wrote:
>>> On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
>>>> Hi,
>>>> ......
>>>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>>>>
>>>> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>>>
>>> I'd suggest enabling console logging of network errors, by 'echo +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages should show up in 'dmesg' when you have LNET connectivity problems.
>>
>> Thanks, Isaac, I have put that in place. We have that in the sysctl configuration, as part of lnet.debug, and thought that was sufficient. But so far, dmesg and /var/log/messages have looked very similar.
>>
>> [root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk
>> warning error emerg console
>
> You should be able to see 'neterror' in 'cat /proc/sys/lnet/printk' output after 'echo +neterror > /proc/sys/lnet/printk', otherwise it's a bug. This is different from lnet.debug.

I am; sorry if the previous post was misleading.

[root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk
neterror warning error emerg console
[root@lustre-oss-0-2 ~]#

--Rick

> [root@lustre-oss-0-2 ~]# sysctl -a | grep neterr
> lnet.debug = ioctl neterror net warning error emerg ha config console
>
> - Isaac
Rick Wagner
2011-Jul-12 20:56 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
On Jul 12, 2011, at 11:34 AM, Rick Wagner wrote:
> On Jul 12, 2011, at 11:32 AM, Isaac Huang wrote:
>> On Tue, Jul 12, 2011 at 11:06:40AM -0700, Rick Wagner wrote:
>>> On Jul 12, 2011, at 11:01 AM, Isaac Huang wrote:
>>>> On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
>>>>> Hi,
>>>>> ......
>>>>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>>>>>
>>>>> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>>>>
>>>> I'd suggest enabling console logging of network errors, by 'echo +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages should show up in 'dmesg' when you have LNET connectivity problems.
>>>
>>> Thanks, Isaac, I have put that in place. We have that in the sysctl configuration, as part of lnet.debug, and thought that was sufficient. But so far, dmesg and /var/log/messages have looked very similar.
>>>
>>> [root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk
>>> warning error emerg console
>>
>> You should be able to see 'neterror' in 'cat /proc/sys/lnet/printk' output after 'echo +neterror > /proc/sys/lnet/printk', otherwise it's a bug. This is different from lnet.debug.
>
> I am; sorry if the previous post was misleading.
>
> [root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk
> neterror warning error emerg console
> [root@lustre-oss-0-2 ~]#

Isaac, I think your suggestion led to more information. I re-ran the client code and experienced the same problems. However, this time there were additional messages, in particular one about "No usable routes".
Lustre: 24086:0:(socklnd_cb.c:922:ksocknal_launch_packet()) No usable routes to 12345-XXX.XXX.118.137@tcp

In our configuration, the cluster I'm running on uses TCP, and should not have traffic going over a router. There is a second test system which does have an LNET router configured, so this is what we have for our LNET configuration on the servers:

[root@lustre-oss-0-2 ~]# cat /etc/modprobe.d/lnet.conf
options lnet networks=tcp(bond0) routes="o2ib XXX.XXX.81.18@tcp"

For the servers, all of the traffic goes out the bond0 interface. The dmesg output is below. It shows "peer not alive" for the client that I saw fail. When I logged onto that client, it reported an I/O error when performing an LNET ping to the server that I've posted the messages for, and then they both recovered.

Thanks again,
Rick

From dmesg on lustre-oss-2-0:

Lustre: 24086:0:(socklnd_cb.c:922:ksocknal_launch_packet()) No usable routes to 12345-XXX.XXX.118.137@tcp
LustreError: 24086:0:(events.c:381:server_bulk_callback()) event type 2, status -113, desc ffff8104d5950000
LustreError: 24502:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(1048576) req@ffff8101cbfb3000 x1373631114768318/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 0 to 0 dl 1310502335 ref 1 fl Interpret:/0/0 rc 0/0
Lustre: 24502:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0020: ignoring bulk IO comm error with ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID id 12345-XXX.XXX.118.137@tcp - client will retry
Lustre: 24349:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0023: ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting
Lustre: 24349:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 2 previous similar messages
Lustre: 24307:0:(ldlm_lib.c:803:target_handle_connect()) phase1-OST0023: exp ffff810292b50e00 already connecting
LustreError: 24307:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-114) req@ffff81033c17bc50 x1373631114768595/t0 o8->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 368/264 e 0 to 0 dl 1310502459 ref 1 fl Interpret:/0/0 rc -114/0
LustreError: 24307:0:(ldlm_lib.c:1919:target_send_reply_msg()) Skipped 338 previous similar messages
Lustre: 24086:0:(socklnd_cb.c:922:ksocknal_launch_packet()) No usable routes to 12345-XXX.XXX.118.137@tcp
LustreError: 24086:0:(events.c:381:server_bulk_callback()) event type 2, status -113, desc ffff810226052000
LustreError: 24553:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(581632) req@ffff81024b96b000 x1373631114821718/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 0 to 0 dl 1310502365 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: 24553:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0023: ignoring bulk IO comm error with ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID id 12345-XXX.XXX.118.137@tcp - client will retry
Lustre: 24527:0:(lib-move.c:1033:lnet_post_send_locked()) Dropping message for 12345-XXX.XXX.118.137@tcp: peer not alive
Lustre: 24527:0:(lib-move.c:2445:LNetPut()) Error sending PUT to 12345-XXX.XXX.118.137@tcp: -113
Lustre: 24536:0:(lib-move.c:1033:lnet_post_send_locked()) Dropping message for 12345-XXX.XXX.118.137@tcp: peer not alive
Lustre: 24536:0:(lib-move.c:1033:lnet_post_send_locked()) Skipped 3 previous similar messages
Lustre: 24536:0:(lib-move.c:2445:LNetPut()) Error sending PUT to 12345-XXX.XXX.118.137@tcp: -113
Lustre: 24536:0:(lib-move.c:2445:LNetPut()) Skipped 3 previous similar messages
Lustre: 24496:0:(lib-move.c:1033:lnet_post_send_locked()) Dropping message for 12345-XXX.XXX.118.137@tcp: peer not alive
Lustre: 24496:0:(lib-move.c:1033:lnet_post_send_locked()) Skipped 3 previous similar messages
Lustre: 24496:0:(lib-move.c:2445:LNetPut()) Error sending PUT to 12345-XXX.XXX.118.137@tcp: -113
Lustre: 24496:0:(lib-move.c:2445:LNetPut()) Skipped 3 previous similar messages
LustreError: 0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at XXX.XXX.118.137@tcp ns: filter-phase1-OST0023_UUID lock: ffff810154ed5000/0x891870f2e8a00d6b lrc: 3/0,0 mode: PR/PR res: 6449/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0x202026369bcdd0b6 expref: 17 pid: 18170 timeout 6508323681
LustreError: 24280:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8105ff468c00 x1372975847776577/t0 o105->@NET_0x20000c6ca7689_UUID:15/16 lens 344/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: 24280:0:(ldlm_lockd.c:607:ldlm_handle_ast_error()) ### client (nid XXX.XXX.118.137@tcp) returned 0 from completion AST ns: filter-phase1-OST0023_UUID lock: ffff810233131e00/0x891870f2e8a00f4e lrc: 3/0,0 mode: PW/PW res: 6451/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x0 remote: 0x202026369bcdecc4 expref: 15 pid: 18171 timeout 0
LustreError: 24483:0:(ost_handler.c:1060:ost_brw_write()) @@@ Eviction on bulk GET req@ffff81061cce1800 x1373631114821716/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 3 to 0 dl 1310502547 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: 24483:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0023: ignoring bulk IO comm error with ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID id 12345-XXX.XXX.118.137@tcp - client will retry
Lustre: 24483:0:(ost_handler.c:1224:ost_brw_write()) Skipped 1 previous similar message
Lustre: 24436:0:(lib-move.c:1033:lnet_post_send_locked()) Dropping message for 12345-XXX.XXX.118.137@tcp: peer not alive
Lustre: 24436:0:(lib-move.c:1033:lnet_post_send_locked()) Skipped 3 previous similar messages
Lustre: 24436:0:(lib-move.c:2445:LNetPut()) Error sending PUT to 12345-XXX.XXX.118.137@tcp: -113
Lustre: 24436:0:(lib-move.c:2445:LNetPut()) Skipped 3 previous similar messages
Lustre: 18190:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0021: ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting
Lustre: 18190:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 5 previous similar messages
Lustre: Service thread pid 24433 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 24433, comm: ll_ost_io_00
Call Trace:
 [<ffffffff800638ab>] schedule_timeout+0x8a/0xad
 [<ffffffff80097cb7>] process_timeout+0x0/0x5
 [<ffffffff889fd3fa>] ost_brw_write+0x139a/0x2480 [ost]
 [<ffffffff88761658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
 [<ffffffff8872c8b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
 [<ffffffff88765b05>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
 [<ffffffff8008cf93>] default_wake_function+0x0/0xe
 [<ffffffff88765bc8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
 [<ffffffff88a0108e>] ost_handle+0x2bae/0x55b0 [ost]
 [<ffffffff8877515a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
 [<ffffffff8008cf93>] default_wake_function+0x0/0xe
 [<ffffffff88776817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff887758e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
Lustre: Service thread pid 24438 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
LustreError: dumping log to /tmp/lustre-log.1310502559.24433
Pid: 24438, comm: ll_ost_io_05
Call Trace:
 [<ffffffff800638ab>] schedule_timeout+0x8a/0xad
 [<ffffffff80097cb7>] process_timeout+0x0/0x5
 [<ffffffff889fd3fa>] ost_brw_write+0x139a/0x2480 [ost]
 [<ffffffff88761658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
 [<ffffffff8872c8b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
 [<ffffffff88765b05>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
 [<ffffffff8008cf93>] default_wake_function+0x0/0xe
 [<ffffffff88765bc8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
 [<ffffffff88a0108e>] ost_handle+0x2bae/0x55b0 [ost]
 [<ffffffff8877515a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
 [<ffffffff88776817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff887758e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
LustreError: dumping log to /tmp/lustre-log.1310502560.24438
Lustre: 24406:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0020: ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting
Lustre: 24406:0:(ldlm_lib.c:874:target_handle_connect()) phase1-OST0020: refuse reconnection from ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@XXX.XXX.118.137@tcp to 0xffff810496028400; still busy with 2 active RPCs
LustreError: 24406:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@ffff81025c5a2050 x1373631114823077/t0 o8->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 368/264 e 0 to 0 dl 1310502660 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 24438:0:(ost_handler.c:1064:ost_brw_write()) @@@ Reconnect on bulk GET req@ffff81061cce1400 x1373631114821715/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 4 to 0 dl 1310502755 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: 24438:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0020: ignoring bulk IO comm error with ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID id 12345-XXX.XXX.118.137@tcp - client will retry
Lustre: Service thread pid 24438 completed after 201.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LustreError: 24433:0:(ost_handler.c:1064:ost_brw_write()) @@@ Reconnect on bulk GET req@ffff810242d00000 x1373631114821719/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 4 to 0 dl 1310502755 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: Service thread pid 24433 completed after 201.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: 24373:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0020: ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting