Hi all,
I have a weird problem on one of my OSSs, though I've seen it once on
the other OSS. Things will be humming along nicely, when suddenly I get
lots of messages like this:
Jan 29 15:26:16 venus kernel: Lustre:
898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to
12345-192.168.1.26@tcp
Jan 29 15:26:16 venus kernel: Lustre:
1090:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to
12345-192.168.1.26@tcp
In the 50-odd minutes before I noticed it, this produced over 10 million
such lines in /var/log/messages. Performance degrades on all clients
during this time. On the client node in question, I/O is disrupted until
I unmount and remount the OSTs. Then the problem goes away for a week or
so. The part of the logs where this error starts is at the end of this
mail. Oh, and I can ping/ssh the machine in question from the server in
question at the time of the problem, so it doesn't seem to be a general
networking problem.
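
To put a number on the flood per peer, something like the following could be run over the log. This is only a minimal sketch: the function name `count_no_route` is made up for illustration, and it assumes each message sits on a single line in /var/log/messages (the excerpts in this mail are wrapped) and that syslog's "last message repeated N times" line refers to the message immediately before it.

```python
import re
from collections import Counter

# Matches the LNET socklnd message and captures the peer NID that follows it.
ROUTE_RE = re.compile(r"No usable routes to (\S+)")
# Matches syslog's compression line for consecutive identical messages.
REPEAT_RE = re.compile(r"last message repeated (\d+) times?")


def count_no_route(lines):
    """Count 'No usable routes' messages per peer NID.

    Assumption: a 'last message repeated N times' line directly after a
    'No usable routes' line adds N more occurrences for that same NID.
    """
    counts = Counter()
    last_nid = None  # NID of the previous line, if it was a route error
    for line in lines:
        route = ROUTE_RE.search(line)
        if route:
            last_nid = route.group(1)
            counts[last_nid] += 1
            continue
        repeat = REPEAT_RE.search(line)
        if repeat and last_nid is not None:
            counts[last_nid] += int(repeat.group(1))
            continue
        # Any other message breaks the chain for repeat attribution.
        last_nid = None
    return counts
```

Usage would be along the lines of opening /var/log/messages, passing the file object to `count_no_route`, and printing `most_common()` on the result to see which peers account for the bulk of the lines.
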
A bit of info regarding my setup, in case it is relevant: I have a
shared MDS/MGS and two OSSs, all on Dell servers. The OSS that's giving
me the most headaches has 8GB RAM, 4 x 4TB OSTs and an Intel 10Gb NIC.
The other OSS has 4GB RAM, 1 x 4.8TB OST and two Intel gigabit NICs that
are bonded using the 802.3ad dynamic link aggregation protocol. I have
about 200 clients connecting to this file system. I have another Lustre
system, comprising Intel-based component servers, that acts as a mirror.
That system has been running fine. All the servers are running CentOS
5.4 64-bit and Lustre 1.8.1.1. The clients are running the SUSE 11
Lustre kernel.
So, does anybody know what's going on here? Or have any pointers as to
how I can debug this?
Any and all help appreciated,
Deon
/var/log/messages just before the flood starts:
Jan 29 14:28:54 venus kernel: Lustre:
4254:0:(socklnd_cb.c:2173:ksocknal_find_timed_out_conn()) A connection
with 12345-192.168.0.99@tcp (192.168.0.99:1023) timed out; the network
or node may be down.
Jan 29 14:31:42 venus kernel: Lustre: galaxy-OST0000: haven't heard from
client 0afbaa24-aa3c-07d6-5752-10300b3997ba (at 192.168.0.99@tcp) in 227
seconds. I think it's dead, and I am evicting it.
Jan 29 14:31:42 venus kernel: Lustre: galaxy-OST0003: haven't heard from
client 0afbaa24-aa3c-07d6-5752-10300b3997ba (at 192.168.0.99@tcp) in 227
seconds. I think it's dead, and I am evicting it.
Jan 29 14:31:42 venus kernel: Lustre: Skipped 1 previous similar message
Jan 29 14:38:45 venus kernel: Lustre:
4254:0:(socklnd_cb.c:2173:ksocknal_find_timed_out_conn()) A connection
with 12345-192.168.1.26@tcp (192.168.1.26:1023) timed out; the network
or node may be down.
Jan 29 14:40:50 venus kernel: Lustre:
4251:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting
192.168.0.16/1023 -> 192.168.1.26/988
Jan 29 14:40:50 venus kernel: Lustre:
4251:0:(acceptor.c:95:lnet_connect_console_error()) Connection to
192.168.1.26@tcp at host 192.168.1.26 was unreachable: the network or
that node may be down, or Lustre may be misconfigured.
Jan 29 14:40:50 venus kernel: Lustre:
4251:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1
len 296 192.168.0.16@tcp->192.168.1.26@tcp
Jan 29 14:40:54 venus kernel: Lustre:
1090:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
x1325271916821767 sent from galaxy-OST0001 to NID 192.168.1.26@tcp 7s
ago has timed out (limit 7s).
Jan 29 14:40:54 venus kernel: req@ffff81016c500000
x1325271916821767/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768854
ref 1 fl Rpc:/0/0 rc 0/0
Jan 29 14:40:57 venus kernel: Lustre:
4253:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting
192.168.0.16/1023 -> 192.168.1.26/988
Jan 29 14:40:57 venus kernel: Lustre:
4253:0:(acceptor.c:95:lnet_connect_console_error()) Connection to
192.168.1.26@tcp at host 192.168.1.26 was unreachable: the network or
that node may be down, or Lustre may be misconfigured.
Jan 29 14:40:57 venus kernel: Lustre:
4253:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1
len 296 192.168.0.16@tcp->192.168.1.26@tcp
Jan 29 14:40:59 venus kernel: Lustre:
898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to
12345-192.168.1.26@tcp
Jan 29 14:40:59 venus kernel: LustreError:
898:0:(events.c:66:request_out_callback()) @@@ type 4, status -5
req@ffff8102b94ff000 x1325271916822084/t0 o106->@:15/16 lens 296/424 e 0
to 1 dl 1264768866 ref 2 fl Rpc:/0/0 rc 0/0
Jan 29 14:40:59 venus kernel: LustreError:
898:0:(events.c:66:request_out_callback()) Skipped 3929552 previous
similar messages
Jan 29 14:40:59 venus kernel: Lustre:
898:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
x1325271916822084 sent from galaxy-OST0002 to NID 192.168.1.26@tcp 0s
ago has failed due to network error (limit 7s).
Jan 29 14:40:59 venus kernel: req@ffff8102b94ff000
x1325271916822084/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768866
ref 1 fl Rpc:/0/0 rc 0/0
Jan 29 14:40:59 venus kernel: Lustre:
898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to
12345-192.168.1.26@tcp
Jan 29 14:40:59 venus last message repeated 648 times
Jan 29 14:41:02 venus kernel: Lustre:
4252:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting
192.168.0.16/1023 -> 192.168.1.26/988
Jan 29 14:41:02 venus kernel: Lustre:
4252:0:(acceptor.c:95:lnet_connect_console_error()) Connection to
192.168.1.26@tcp at host 192.168.1.26 was unreachable: the network or
that node may be down, or Lustre may be misconfigured.
Jan 29 14:41:02 venus kernel: Lustre:
4252:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1
len 296 192.168.0.16@tcp->192.168.1.26@tcp
Jan 29 14:41:02 venus kernel: Lustre:
4252:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1
len 296 192.168.0.16@tcp->192.168.1.26@tcp
Jan 29 14:41:06 venus kernel: Lustre:
898:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
x1325271916822084 sent from galaxy-OST0002 to NID 192.168.1.26@tcp 7s
ago has timed out (limit 7s).
Jan 29 14:41:06 venus kernel: req@ffff8102b94ff000
x1325271916822084/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768866
ref 1 fl Rpc:/2/0 rc 0/0
Jan 29 14:41:06 venus kernel: Lustre:
898:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 650 previous
similar messages
Jan 29 14:41:09 venus kernel: Lustre:
4250:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting
192.168.0.16/1023 -> 192.168.1.26/988
Jan 29 14:41:09 venus kernel: Lustre:
4250:0:(acceptor.c:95:lnet_connect_console_error()) Connection to
192.168.1.26@tcp at host 192.168.1.26 was unreachable: the network or
that node may be down, or Lustre may be misconfigured.
Jan 29 14:41:09 venus kernel: Lustre:
4250:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1
len 296 192.168.0.16@tcp->192.168.1.26@tcp
Jan 29 14:41:09 venus kernel: Lustre:
4250:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1
len 296 192.168.0.16@tcp->192.168.1.26@tcp
Jan 29 14:41:13 venus kernel: Lustre:
898:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request
x1325271916822084 sent from galaxy-OST0002 to NID 192.168.1.26@tcp 7s
ago has timed out (limit 7s).
Jan 29 14:41:13 venus kernel: req@ffff8102b94ff000
x1325271916822084/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768873
ref 1 fl Rpc:/2/0 rc 0/0
Jan 29 14:41:13 venus kernel: Lustre:
898:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 1 previous
similar message
Jan 29 14:41:13 venus kernel: Lustre:
898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to
12345-192.168.1.26@tcp
Jan 29 14:41:15 venus last message repeated 84370 times
Jan 29 14:41:15 venus kernel: Lustre:
1090:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to
12345-192.168.1.26@tcp
Jan 29 14:41:15 venus kernel: Lustre:
898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to
12345-192.168.1.26@tcp
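
For reading the numeric codes in the log above: the "Error -113" and "status -5" values look like negated Linux errno numbers. A small sketch for decoding them follows; the table is hard-coded from the standard Linux x86 errno values rather than taken from Lustre, and the helper name `decode_kernel_errno` is hypothetical.

```python
# Standard Linux (x86/x86_64) errno values; hard-coded so the lookup
# gives the same answer regardless of the platform running the script.
LINUX_ERRNO = {
    5: ("EIO", "Input/output error"),
    110: ("ETIMEDOUT", "Connection timed out"),
    111: ("ECONNREFUSED", "Connection refused"),
    113: ("EHOSTUNREACH", "No route to host"),
}


def decode_kernel_errno(code):
    """Map a (possibly negated) kernel return code to its errno name.

    Kernel code conventionally returns -errno, so the sign is dropped
    before the lookup. Unknown values come back labelled as such.
    """
    return LINUX_ERRNO.get(abs(code), ("UNKNOWN", "not in this table"))
```

On that reading, -113 from libcfs_sock_connect() would be EHOSTUNREACH ("No route to host") and the -5 in request_out_callback() would be EIO, which may be worth keeping in mind given that ping and ssh to the same host succeed at the time.
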
--
Deon Borman
IT Supervisor
BlackGinger
--