Rick Wagner
2011-Jul-11 22:39 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
Hi,

We are seeing intermittent client evictions from a new Lustre installation that we are testing. The errors occur on writes from a parallel job running on 32 client nodes, each with 16 tasks writing a single HDF5 file of ~40MB (512 tasks total). Occasionally, one node will be evicted from an OST, and the code running on the client will experience an IO error. The directory with the data has a stripe count of 1, and a comparable amount is read in at the start of the job. Sometimes the evictions occur the first time a write is attempted, sometimes after a successful write. There are about 15 minutes before the first write attempt and between subsequent ones.

The client and server errors are attached. In the server errors, XXX.XXX.118.141 refers to the client that gets evicted. In the client errors, here are the server names to match with the NIDs:

lustre-oss-0-2: 172.25.33.248
lustre-oss-2-0: 172.25.33.246
lustre-oss-2-2: 172.25.32.118

I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.

We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID. The filesystem is in testing, so there is no other load on it, and when watching the load during writes, the OSS machines hardly notice. The servers are running version 1.8.5, and the clients 1.8.4.

Any advice, or pointers to possible bugs, would be appreciated.

Thanks,
Rick

Attachment: server-errors.txt (http://lists.lustre.org/pipermail/lustre-discuss/attachments/20110711/39358645/attachment.txt)
Attachment: client.txt (http://lists.lustre.org/pipermail/lustre-discuss/attachments/20110711/39358645/attachment-0001.txt)
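The errno assumption in the message above is easy to verify on a client node; a minimal sketch in Python (used here only for illustration), relying on the fact that Lustre/LNET logs kernel error codes negated:

```python
import errno
import os

# Lustre/LNET report kernel errnos negated, so the -113 and -107 seen
# in the logs correspond to errno values 113 and 107 from errno.h.
# On Linux these are EHOSTUNREACH and ENOTCONN respectively.
for code in (113, 107):
    print(-code, errno.errorcode[code], os.strerror(code))
```

On a Linux box this confirms -113 is EHOSTUNREACH ("No route to host") and -107 is ENOTCONN ("Transport endpoint is not connected"), matching the assumption in the message.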
Kevin Van Maren
2011-Jul-12 16:10 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
Rick Wagner wrote:
> Hi,
>
> We are seeing intermittent client evictions from a new Lustre installation that we are testing. The errors occur on writes from a parallel job running on 32 client nodes, each with 16 tasks writing a single HDF5 file of ~40MB (512 tasks total). Occasionally, one node will be evicted from an OST, and the code running on the client will experience an IO error.

Yes, evictions are very bad. Worse than an IO error, however, is the knowledge that a write that previously "succeeded" never made it out of the client cache to disk (eviction forces the client to drop any dirty cache on the floor).

> The directory with the data has a stripe count of 1, and a comparable amount is read in at the start of the job. Sometimes the evictions occur the first time a write is attempted, sometimes after a successful write. There is about 15 minutes before the first and subsequent write attempts.

So you have 512 processes on 32 nodes writing to a single file, which exists on a single OST. Have you tuned any of the network or Lustre tunables? For example, max_dirty_mb, max_rpcs_in_flight, socket buffer sizes? What size are the RPCs and the application IO sizes?

> The client and server errors are attached. In the server errors, XXX.XXX.118.141 refers to the client that gets evicted. In the client errors, here are the server names to match with the NIDs:
> lustre-oss-0-2: 172.25.33.248
> lustre-oss-2-0: 172.25.33.246
> lustre-oss-2-2: 172.25.32.118
> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>
> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem.

You are using gigabit Ethernet for Lustre? These errors are indicating issues with IP traffic.

When you say you have never seen IP traffic have a problem, do you mean "ssh" and "ping" work, or have you stress-tested the network outside Lustre (run network tests from 32 clients to a single server)?

> But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>
> The filesystem is in testing, so there is no other load on it, and when watching the load during writes, the OSS machines hardly notice. The servers are running version 1.8.5, and the clients 1.8.4.
>
> Any advice, or pointers to possible bugs would be appreciated.

You have provided no information about your network (NICs/drivers, switches, MTU, settings, etc.), but it sounds like you are having network issues which are exhibiting themselves under load. It is possible a NIC or the switch is getting overwhelmed by the Lustre traffic, and getting stuck long enough for TCP to time out. Are the NICs or the switch reporting dropped packets? Any error counters on any links? Are pause frames enabled on the NICs and the switch? Is that some sort of socket BW test reporting _10_Mb?

Kevin

> Thanks,
> Rick
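The stress test Kevin suggests (all clients hammering one server at once, rather than stopping at a target bandwidth) could be scripted; a hypothetical sketch where the hostnames, client count, and iperf flags are illustrative assumptions, not taken from this thread:

```python
# Hypothetical many-clients-to-one-server stress test generator.
# Emits one ssh/iperf command per client, all targeting the same OSS
# simultaneously; "client-NN" and the iperf flags are assumptions.
clients = ["client-%02d" % i for i in range(32)]
server = "lustre-oss-2-0"

for host in clients:
    # 4 parallel TCP streams per client, 60 seconds, launched in parallel
    print('ssh %s "iperf -c %s -P 4 -t 60" &' % (host, server))
print("wait  # let every client finish before reading the numbers")
```

The point of the exercise is sustained, concurrent load on a single server NIC/switch port, which is closer to the failing write pattern than one-at-a-time bandwidth checks.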
Rick Wagner
2011-Jul-12 17:44 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
Hi Kevin,

Thanks very much for the reply; answers to your questions are below.

On Jul 12, 2011, at 9:10 AM, Kevin Van Maren wrote:
> Rick Wagner wrote:
>> We are seeing intermittent client evictions from a new Lustre installation that we are testing. The errors occur on writes from a parallel job running on 32 client nodes, each with 16 tasks writing a single HDF5 file of ~40MB (512 tasks total). Occasionally, one node will be evicted from an OST, and the code running on the client will experience an IO error.
>
> Yes, evictions are very bad. Worse than an IO error, however, is the knowledge that a write that previously "succeeded" never made it out of the client cache to disk (eviction forces the client to drop any dirty cache on the floor).
>
>> The directory with the data has a stripe count of 1, and a comparable amount is read in at the start of the job. Sometimes the evictions occur the first time a write is attempted, sometimes after a successful write. There is about 15 minutes before the first and subsequent write attempts.
>
> So you have 512 processes on 32 nodes writing to a single file, which exists on a single OST.

No, each task is writing its own HDF5 file of ~40MB; the total amount of data per write is 20GB. This avoids the need for synchronizing writes to a single file.

> Have you tuned any of the network or Lustre tunables? For example, max_dirty_mb, max_rpcs_in_flight? socket buffer sizes?

For network tuning, this is what we have on the clients and servers:

net.core.somaxconn = 10000
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_congestion_control = htcp
net.ipv4.ip_forward = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1

On the Lustre side, max_rpcs_in_flight = 8 and max_dirty_mb = 32. We don't have as much experience tuning Lustre, so we've tended to use the defaults.

> What size are the RPCs, application IO sizes?

As I mentioned above, each task is writing a single file, in 5 consecutive 8MB chunks. There are a few other files written by the master tasks, but the failure hasn't occurred on that particular node (yet).

>> The client and server errors are attached. In the server errors, XXX.XXX.118.141 refers to the client that gets evicted. In the client errors, here are the server names to match with the NIDs:
>> lustre-oss-0-2: 172.25.33.248
>> lustre-oss-2-0: 172.25.33.246
>> lustre-oss-2-2: 172.25.32.118
>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>>
>> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem.
>
> You are using gigabit Ethernet for Lustre?

The servers are using bonded Myricom 10Gb/s cards. On the client side, the nodes have Mellanox QDR InfiniBand HCAs, but we use a Mellanox BridgeX BX 4010, and the clients have virtual 10Gb/s NICs; hence the use of the tcp driver. We do have a problem with setting the MTU on the client side, so currently the servers are using an MTU of 9000 and the clients 1500, which means more work for the central 10Gb/s switch and the bridge.

> These errors are indicating issues with IP traffic. When you say you have never seen IP traffic have a problem, you mean "ssh" and "ping" work, or have you stress-tested the network outside Lustre (run network tests from 32 clients to a single server)?

You're right about how I defined IP functionality being available, but that's a good point about stressing the fabric. We've run simultaneous iperf tests, but only until we reach a desired bandwidth. Given our goals, I think the multiple tests will be necessary.

>> But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>>
>> The filesystem is in testing, so there is no other load on it, and when watching the load during writes, the OSS machines hardly notice. The servers are running version 1.8.5, and the clients 1.8.4.
>>
>> Any advice, or pointers to possible bugs would be appreciated.
>
> You have provided no information about your network (NICs/drivers, switches, MTU, settings, etc.), but it sounds like you are having network issues which are exhibiting themselves under load. It is possible a NIC or the switch is getting overwhelmed by the Lustre traffic, and getting stuck long enough for TCP to time out.

I've tried to fill in the network information above.

> Are the NICs or the switch reporting dropped packets? Any error counters on any links? Are pause frames enabled on the NICs and the switch?

Looking at the servers and clients, we see no dropped packets on the servers, while the clients do have a few (~10 dropped received packets on each client, out of 1e9). There are no errors on either side, so those packets should have been resent if they were coming over LNET. I will be following up with our networking group to find out if the bridge or switch are showing dropped packets, and I'll ask about the pause frames.

> Is that some sort of socket BW test reporting _10_Mb?

I'm not sure what this question refers to. Below, I've posted some of the IOR results we've seen from this system, using 32 and 64 client nodes, 1 task per node. As you can see, the system should have more than enough capacity to handle the IO that I describe above. We've also done an IOR run using 512 tasks, which succeeded, but I don't have the numbers. It's really the intermittency that drives us crazy.
Thanks again,
Rick

Participating tasks: 32
Summary:
    api                = POSIX
    test filename      = testFile
    access             = file-per-process
    clients            = 32 (1 per node)
    repetitions        = 1
    xfersize           = 1 MiB
    blocksize          = 32 GiB
    aggregate filesize = 1024 GiB

access   bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  iter
------   ---------  ----------  ---------  --------  --------  --------  ----
write    8419       33554432    1024.00    0.051805  124.51    58.80     0
read     10135      33554432    1024.00    0.002511  103.46    64.72     0

Max Write: 8418.70 MiB/sec (8827.65 MB/sec)
Max Read:  10134.86 MiB/sec (10627.18 MB/sec)

Participating tasks: 64
Summary:
    api                = POSIX
    test filename      = testFile
    access             = file-per-process
    clients            = 64 (1 per node)
    repetitions        = 1
    xfersize           = 1 MiB
    blocksize          = 32 GiB
    aggregate filesize = 2048 GiB

access   bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  iter
------   ---------  ----------  ---------  --------  --------  --------  ----
write    8554       33554432    1024.00    0.067592  245.12    165.60    0
read     12284      33554432    1024.00    0.008495  170.73    109.98    0

Max Write: 8553.74 MiB/sec (8969.25 MB/sec)
Max Read:  12283.52 MiB/sec (12880.20 MB/sec)
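As a quick cross-check of those figures, the job sizes quoted in this thread tally up as follows; a minimal sketch (the variable names are mine, the numbers are from the messages above):

```python
# Workload figures from this thread: 32 nodes x 16 tasks, one ~40 MB
# HDF5 file per task, each written in five consecutive 8 MB chunks.
nodes = 32
tasks_per_node = 16
file_mb = 40
chunk_mb = 8

total_tasks = nodes * tasks_per_node             # 512 writers
data_per_write_gb = total_tasks * file_mb / 1024.0
burst_per_node_mb = tasks_per_node * chunk_mb    # if every task flushes one chunk at once

print(total_tasks)        # 512
print(data_per_write_gb)  # 20.0 GB per write phase, as stated above
print(burst_per_node_mb)  # 128 MB leaving each node per chunk round
```

The 20 GB per write phase matches the figure stated earlier in the thread and is small next to the IOR bandwidth shown above, which supports the point that this looks like a network reliability problem rather than a capacity one.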
Isaac Huang
2011-Jul-12 18:01 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
> Hi,
> ......
> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>
> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.

I'd suggest enabling console logging of network errors, by 'echo +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages should show up in 'dmesg' when you have LNET connectivity problems.

- Isaac

______________________________________________________________________
This email may contain privileged or confidential information, which should only be used for the purpose for which it was sent by Xyratex. No further rights or licenses are granted to use such information. If you are not the intended recipient of this message, please notify the sender by return and delete it. You may not use, copy, disclose or rely on the information contained in it.

Internet email is susceptible to data corruption, interception and unauthorised amendment for which Xyratex does not accept liability. While we have taken reasonable precautions to ensure that this email is free of viruses, Xyratex does not accept liability for the presence of any computer viruses in this email, nor for any losses caused as a result of viruses.

Xyratex Technology Limited (03134912), Registered in England & Wales, Registered Office, Langstone Road, Havant, Hampshire, PO9 1SA.

The Xyratex group of companies also includes, Xyratex Ltd, registered in Bermuda, Xyratex International Inc, registered in California, Xyratex (Malaysia) Sdn Bhd registered in Malaysia, Xyratex Technology (Wuxi) Co Ltd registered in The People's Republic of China and Xyratex Japan Limited registered in Japan.
______________________________________________________________________
Rick Wagner
2011-Jul-12 18:06 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
On Jul 12, 2011, at 11:01 AM, Isaac Huang wrote:
> On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
>> Hi,
>> ......
>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>>
>> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>
> I'd suggest enabling console logging of network errors, by 'echo +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages should show up in 'dmesg' when you have LNET connectivity problems.

Thanks, Isaac, I have put that in place. We have that in the sysctl configuration, as part of lnet.debug, and thought that was sufficient. But so far, dmesg and /var/log/messages have looked very similar.

[root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk
warning error emerg console
[root@lustre-oss-0-2 ~]# sysctl -a | grep neterr
lnet.debug = ioctl neterror net warning error emerg ha config console

--Rick
Isaac Huang
2011-Jul-12 18:32 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
On Tue, Jul 12, 2011 at 11:06:40AM -0700, Rick Wagner wrote:
> On Jul 12, 2011, at 11:01 AM, Isaac Huang wrote:
>> On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
>>> Hi,
>>> ......
>>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>>>
>>> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>>
>> I'd suggest enabling console logging of network errors, by 'echo +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages should show up in 'dmesg' when you have LNET connectivity problems.
>
> Thanks, Isaac, I have put that in place. We have that in the sysctl configuration, as part of lnet.debug, and thought that was sufficient. But so far, dmesg and /var/log/messages have looked very similar.
>
> [root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk
> warning error emerg console

You should be able to see 'neterror' in 'cat /proc/sys/lnet/printk' output after 'echo +neterror > /proc/sys/lnet/printk', otherwise it's a bug. This is different from lnet.debug.

> [root@lustre-oss-0-2 ~]# sysctl -a | grep neterr
> lnet.debug = ioctl neterror net warning error emerg ha config console

- Isaac
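As the exchange above shows, /proc/sys/lnet/printk accepts '+flag' and '-flag' writes that add or remove entries from the current mask, which is why 'echo +neterror' extends the list shown by 'cat'. A small Python sketch that models (but does not touch) that add/remove behavior; the apply_mask helper is hypothetical, for illustration only:

```python
# Model of the +flag/-flag semantics of /proc/sys/lnet/printk:
# writing "+name" adds a flag to the current set, "-name" removes it.
# Nothing here touches /proc; this just illustrates the behavior.
def apply_mask(current, change):
    flags = current.split()
    name = change[1:]
    if change.startswith("+") and name not in flags:
        flags.append(name)
    elif change.startswith("-") and name in flags:
        flags.remove(name)
    return " ".join(flags)

before = "warning error emerg console"
print(apply_mask(before, "+neterror"))  # warning error emerg console neterror
```

This matches the observed transition in the thread: from "warning error emerg console" to a mask that includes neterror.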
Rick Wagner
2011-Jul-12 18:34 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
On Jul 12, 2011, at 11:32 AM, Isaac Huang wrote:
> On Tue, Jul 12, 2011 at 11:06:40AM -0700, Rick Wagner wrote:
>> On Jul 12, 2011, at 11:01 AM, Isaac Huang wrote:
>>> On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
>>>> Hi,
>>>> ......
>>>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>>>>
>>>> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>>>
>>> I'd suggest enabling console logging of network errors, by 'echo +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages should show up in 'dmesg' when you have LNET connectivity problems.
>>
>> Thanks, Isaac, I have put that in place. We have that in the sysctl configuration, as part of lnet.debug, and thought that was sufficient. But so far, dmesg and /var/log/messages have looked very similar.
>>
>> [root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk
>> warning error emerg console
>
> You should be able to see 'neterror' in 'cat /proc/sys/lnet/printk' output after 'echo +neterror > /proc/sys/lnet/printk', otherwise it's a bug. This is different from lnet.debug.

I am; sorry if the previous post was misleading.

[root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk
neterror warning error emerg console
[root@lustre-oss-0-2 ~]#

--Rick

> [root@lustre-oss-0-2 ~]# sysctl -a | grep neterr
> lnet.debug = ioctl neterror net warning error emerg ha config console
>
> - Isaac
Rick Wagner
2011-Jul-12 20:56 UTC
[Lustre-discuss] Client Eviction Preceded by EHOSTUNREACH and then ENOTCONN?
On Jul 12, 2011, at 11:34 AM, Rick Wagner wrote:
> On Jul 12, 2011, at 11:32 AM, Isaac Huang wrote:
>> On Tue, Jul 12, 2011 at 11:06:40AM -0700, Rick Wagner wrote:
>>> On Jul 12, 2011, at 11:01 AM, Isaac Huang wrote:
>>>> On Mon, Jul 11, 2011 at 03:39:34PM -0700, Rick Wagner wrote:
>>>>> Hi,
>>>>> ......
>>>>> I am assuming that -113 is EHOSTUNREACH and -107 is ENOTCONN, and that the error codes from errno.h are being used.
>>>>>
>>>>> We've been experiencing similar problems for a while, and we've never seen IP traffic have a problem. But, clients will begin to have trouble communicating with the Lustre server (seen because an LNET ping will return an I/O error), and things will only recover when an LNET ping is performed from the server to the client NID.
>>>>
>>>> I'd suggest enabling console logging of network errors, by 'echo +neterror > /proc/sys/lnet/printk'. Then some detailed debug messages should show up in 'dmesg' when you have LNET connectivity problems.
>>>
>>> Thanks, Isaac, I have put that in place. We have that in the sysctl configuration, as part of lnet.debug, and thought that was sufficient. But so far, dmesg and /var/log/messages have looked very similar.
>>>
>>> [root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk
>>> warning error emerg console
>>
>> You should be able to see 'neterror' in 'cat /proc/sys/lnet/printk' output after 'echo +neterror > /proc/sys/lnet/printk', otherwise it's a bug. This is different from lnet.debug.
>
> I am; sorry if the previous post was misleading.
>
> [root@lustre-oss-0-2 ~]# cat /proc/sys/lnet/printk
> neterror warning error emerg console
> [root@lustre-oss-0-2 ~]#

Isaac, I think your suggestion led to more information. I re-ran the client code and experienced the same problems. However, this time there were additional messages, in particular one about "No usable routes".
Lustre: 24086:0:(socklnd_cb.c:922:ksocknal_launch_packet()) No usable routes to 12345-XXX.XXX.118.137@tcp

In our configuration, the cluster I'm running on uses TCP, and should not have traffic going over a router. There is a second test system which does have an LNET router configured, so this is what we have for our LNET configuration on the servers:

[root@lustre-oss-0-2 ~]# cat /etc/modprobe.d/lnet.conf
options lnet networks=tcp(bond0) routes="o2ib XXX.XXX.81.18@tcp"

For the servers, all of the traffic goes out the bond0 interface. The dmesg output is below. It shows "peer not alive" for the client that I saw fail. When I logged onto that client, it reported an I/O error when performing an LNET ping to the server that I've posted the messages for, and then they both recovered.

Thanks again,
Rick

From dmesg on lustre-oss-2-0:

Lustre: 24086:0:(socklnd_cb.c:922:ksocknal_launch_packet()) No usable routes to 12345-XXX.XXX.118.137@tcp
LustreError: 24086:0:(events.c:381:server_bulk_callback()) event type 2, status -113, desc ffff8104d5950000
LustreError: 24502:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(1048576) req@ffff8101cbfb3000 x1373631114768318/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 0 to 0 dl 1310502335 ref 1 fl Interpret:/0/0 rc 0/0
Lustre: 24502:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0020: ignoring bulk IO comm error with ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID id 12345-XXX.XXX.118.137@tcp - client will retry
Lustre: 24349:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0023: ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting
Lustre: 24349:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 2 previous similar messages
Lustre: 24307:0:(ldlm_lib.c:803:target_handle_connect()) phase1-OST0023: exp ffff810292b50e00 already connecting
LustreError: 24307:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-114) req@ffff81033c17bc50 x1373631114768595/t0 o8->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 368/264 e 0 to 0 dl 1310502459 ref 1 fl Interpret:/0/0 rc -114/0
LustreError: 24307:0:(ldlm_lib.c:1919:target_send_reply_msg()) Skipped 338 previous similar messages
Lustre: 24086:0:(socklnd_cb.c:922:ksocknal_launch_packet()) No usable routes to 12345-XXX.XXX.118.137@tcp
LustreError: 24086:0:(events.c:381:server_bulk_callback()) event type 2, status -113, desc ffff810226052000
LustreError: 24553:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 0(581632) req@ffff81024b96b000 x1373631114821718/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 0 to 0 dl 1310502365 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: 24553:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0023: ignoring bulk IO comm error with ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID id 12345-XXX.XXX.118.137@tcp - client will retry
Lustre: 24527:0:(lib-move.c:1033:lnet_post_send_locked()) Dropping message for 12345-XXX.XXX.118.137@tcp: peer not alive
Lustre: 24527:0:(lib-move.c:2445:LNetPut()) Error sending PUT to 12345-XXX.XXX.118.137@tcp: -113
Lustre: 24536:0:(lib-move.c:1033:lnet_post_send_locked()) Dropping message for 12345-XXX.XXX.118.137@tcp: peer not alive
Lustre: 24536:0:(lib-move.c:1033:lnet_post_send_locked()) Skipped 3 previous similar messages
Lustre: 24536:0:(lib-move.c:2445:LNetPut()) Error sending PUT to 12345-XXX.XXX.118.137@tcp: -113
Lustre: 24536:0:(lib-move.c:2445:LNetPut()) Skipped 3 previous similar messages
Lustre: 24496:0:(lib-move.c:1033:lnet_post_send_locked()) Dropping message for 12345-XXX.XXX.118.137@tcp: peer not alive
Lustre: 24496:0:(lib-move.c:1033:lnet_post_send_locked()) Skipped 3 previous similar messages
Lustre: 24496:0:(lib-move.c:2445:LNetPut()) Error sending PUT to 12345-XXX.XXX.118.137@tcp: -113
Lustre: 24496:0:(lib-move.c:2445:LNetPut()) Skipped 3 previous similar messages
LustreError: 0:0:(ldlm_lockd.c:308:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at XXX.XXX.118.137@tcp ns: filter-phase1-OST0023_UUID lock: ffff810154ed5000/0x891870f2e8a00d6b lrc: 3/0,0 mode: PR/PR res: 6449/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x20 remote: 0x202026369bcdd0b6 expref: 17 pid: 18170 timeout 6508323681
LustreError: 24280:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff8105ff468c00 x1372975847776577/t0 o105->@NET_0x20000c6ca7689_UUID:15/16 lens 344/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
LustreError: 24280:0:(ldlm_lockd.c:607:ldlm_handle_ast_error()) ### client (nid XXX.XXX.118.137@tcp) returned 0 from completion AST ns: filter-phase1-OST0023_UUID lock: ffff810233131e00/0x891870f2e8a00f4e lrc: 3/0,0 mode: PW/PW res: 6451/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->1048575) flags: 0x0 remote: 0x202026369bcdecc4 expref: 15 pid: 18171 timeout 0
LustreError: 24483:0:(ost_handler.c:1060:ost_brw_write()) @@@ Eviction on bulk GET req@ffff81061cce1800 x1373631114821716/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 3 to 0 dl 1310502547 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: 24483:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0023: ignoring bulk IO comm error with ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID id 12345-XXX.XXX.118.137@tcp - client will retry
Lustre: 24483:0:(ost_handler.c:1224:ost_brw_write()) Skipped 1 previous similar message
Lustre: 24436:0:(lib-move.c:1033:lnet_post_send_locked()) Dropping message for 12345-XXX.XXX.118.137@tcp: peer not alive
Lustre: 24436:0:(lib-move.c:1033:lnet_post_send_locked()) Skipped 3 previous similar messages
Lustre: 24436:0:(lib-move.c:2445:LNetPut()) Error sending PUT to 12345-XXX.XXX.118.137@tcp: -113
Lustre: 24436:0:(lib-move.c:2445:LNetPut()) Skipped 3 previous similar messages
Lustre: 18190:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0021: ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting
Lustre: 18190:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 5 previous similar messages
Lustre: Service thread pid 24433 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
Pid: 24433, comm: ll_ost_io_00
Call Trace:
 [<ffffffff800638ab>] schedule_timeout+0x8a/0xad
 [<ffffffff80097cb7>] process_timeout+0x0/0x5
 [<ffffffff889fd3fa>] ost_brw_write+0x139a/0x2480 [ost]
 [<ffffffff88761658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
 [<ffffffff8872c8b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
 [<ffffffff88765b05>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
 [<ffffffff8008cf93>] default_wake_function+0x0/0xe
 [<ffffffff88765bc8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
 [<ffffffff88a0108e>] ost_handle+0x2bae/0x55b0 [ost]
 [<ffffffff8877515a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
 [<ffffffff8008cf93>] default_wake_function+0x0/0xe
 [<ffffffff88776817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff887758e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
Lustre: Service thread pid 24438 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:
LustreError: dumping log to /tmp/lustre-log.1310502559.24433
Pid: 24438, comm: ll_ost_io_05
Call Trace:
 [<ffffffff800638ab>] schedule_timeout+0x8a/0xad
 [<ffffffff80097cb7>] process_timeout+0x0/0x5
 [<ffffffff889fd3fa>] ost_brw_write+0x139a/0x2480 [ost]
 [<ffffffff88761658>] ptlrpc_send_reply+0x5c8/0x5e0 [ptlrpc]
 [<ffffffff8872c8b0>] target_committed_to_req+0x40/0x120 [ptlrpc]
 [<ffffffff88765b05>] lustre_msg_get_version+0x35/0xf0 [ptlrpc]
 [<ffffffff8008cf93>] default_wake_function+0x0/0xe
 [<ffffffff88765bc8>] lustre_msg_check_version_v2+0x8/0x20 [ptlrpc]
 [<ffffffff88a0108e>] ost_handle+0x2bae/0x55b0 [ost]
 [<ffffffff8877515a>] ptlrpc_server_handle_request+0x97a/0xdf0 [ptlrpc]
 [<ffffffff88776817>] ptlrpc_main+0xf37/0x10f0 [ptlrpc]
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff887758e0>] ptlrpc_main+0x0/0x10f0 [ptlrpc]
 [<ffffffff8005dfa7>] child_rip+0x0/0x11
LustreError: dumping log to /tmp/lustre-log.1310502560.24438
Lustre: 24406:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0020: ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting
Lustre: 24406:0:(ldlm_lib.c:874:target_handle_connect()) phase1-OST0020: refuse reconnection from ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@XXX.XXX.118.137@tcp to 0xffff810496028400; still busy with 2 active RPCs
LustreError: 24406:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-16) req@ffff81025c5a2050 x1373631114823077/t0 o8->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 368/264 e 0 to 0 dl 1310502660 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 24438:0:(ost_handler.c:1064:ost_brw_write()) @@@ Reconnect on bulk GET req@ffff81061cce1400 x1373631114821715/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 4 to 0 dl 1310502755 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: 24438:0:(ost_handler.c:1224:ost_brw_write()) phase1-OST0020: ignoring bulk IO comm error with ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID id 12345-XXX.XXX.118.137@tcp - client will retry
Lustre: Service thread pid 24438 completed after 201.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
LustreError: 24433:0:(ost_handler.c:1064:ost_brw_write()) @@@ Reconnect on bulk GET req@ffff810242d00000 x1373631114821719/t0 o4->ebb92dfb-9c0c-c870-a01a-0f0e7333c71b@NET_0x20000c6ca7689_UUID:0/0 lens 448/416 e 4 to 0 dl 1310502755 ref 1 fl Interpret:/2/0 rc 0/0
Lustre: Service thread pid 24433 completed after 201.00s. This indicates the system was overloaded (too many service threads, or there were not enough hardware resources).
Lustre: 24373:0:(ldlm_lib.c:574:target_handle_reconnect()) phase1-OST0020: ebb92dfb-9c0c-c870-a01a-0f0e7333c71b reconnecting