Dear CFS support,

We have a large cluster installation running Lustre 1.6.1: 550 clients, one
MDS/MGS, and one OSS with a single OST. The back-end storage is a Dell
MD1000. The clients have 1Gb Ethernet connections; the servers have 10Gb
Ethernet connections.

We tested the bandwidth of the back-end storage before putting Lustre on it,
using dd and iozone, and measured 350 MB/s for writes and 500 MB/s for
reads. With Lustre on the same back-end storage we get 250 MB/s write and
350 MB/s read.

We are currently experiencing a lot of difficulties caused by clients losing
their connection to the Lustre file system. Sometimes a client recovers its
connection, but the jobs that were running on it are usually dead; at other
times the client simply hangs waiting to reconnect to the file system. Both
the servers and the clients run the same kernel, 2.6.9-55.EL_lustre-1.6.1smp.

I suspect that the clients are timing out for some reason and that the
server is then dropping its connection to them. Could you give us
suggestions on how we can tune Lustre parameters such as the OST and MDS
thread counts and the OST, MDS, and client timeouts? I've searched the
mailing list for clues, but all I found was a couple of threads with similar
errors saying that all of this was fixed in the 1.6.1 release.
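In case it helps to see what we had in mind, this is the kind of tuning we
were planning to experiment with. It is only a sketch: the module options
and the /proc entry below are what we understand the 1.6 interfaces to be,
and the thread counts and the 300-second timeout are placeholder guesses we
have not tested.

# MDS/OSS side: service thread counts via module options in /etc/modprobe.conf
#   options mds mds_num_threads=512
#   options ost ost_num_threads=512
# All nodes: raise the RPC/obd timeout from the default 100 seconds
echo 300 > /proc/sys/lustre/timeout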
FYI, we start Lustre on the clients through an rc.local script:
------------/etc/rc.d/rc.local-------
...
# Mount lustre
modprobe -r lustre lnet
echo "options lnet networks=tcp0(eth1)" >> /etc/modprobe.conf
modprobe lnet
lctl network configure
modprobe lustre
mount /home
...
-----------------------------------------
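As an aside, while writing this up we noticed that the ">>" above appends
another "options lnet" line to /etc/modprobe.conf on every boot. A guarded
append, roughly like the following sketch, would avoid the duplicates:

grep -q 'options lnet networks=tcp0(eth1)' /etc/modprobe.conf || \
    echo 'options lnet networks=tcp0(eth1)' >> /etc/modprobe.conf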
-------/etc/fstab------
...
10.143.245.3@tcp0:/home-md  /home  lustre  noauto,defaults,nosuid,nodev  0 0
...
-------------------------
On the clients I get the following errors:
------------------------------ Client 1 ------------------------------
Aug 23 09:03:40 bindloe02 kernel: LustreError: 21239:0:(client.c:
962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187856120,
100s ago) req@00000102280ae400 x3295/t0 o101->home-md-
MDT0000_UUID@10.143.245.3@tcp:12 lens 432/768 ref 1 fl Rpc:/0/0 rc 0/-22
Aug 23 09:03:40 bindloe02 kernel: Lustre: home-md-MDT0000-
mdc-00000102280b8c00: Connection to service home-md-MDT0000 via nid
10.142.10.3@tcp1 was lost; in progress operations using this service
will wait for recovery to complete.
Aug 23 09:03:40 bindloe02 kernel: LustreError: 21239:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -4
Aug 23 09:03:40 bindloe02 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:04:30 bindloe02 last message repeated 2 times
Aug 23 09:05:45 bindloe02 last message repeated 3 times
Aug 23 09:06:54 bindloe02 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:08:15 bindloe02 last message repeated 4 times
Aug 23 09:09:30 bindloe02 last message repeated 3 times
Aug 23 09:10:20 bindloe02 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:10:20 bindloe02 kernel: LustreError: Skipped 1 previous
similar message
Aug 23 09:12:00 bindloe02 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:12:00 bindloe02 kernel: LustreError: Skipped 3 previous
similar messages
Aug 23 09:14:55 bindloe02 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:14:55 bindloe02 kernel: LustreError: Skipped 6 previous
similar messages
Aug 23 09:20:26 bindloe02 kernel: LustreError: 21259:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -4
Aug 23 09:20:45 bindloe02 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:20:45 bindloe02 kernel: LustreError: Skipped 12 previous
similar messages
Aug 23 09:31:10 bindloe02 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:31:10 bindloe02 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 09:36:35 bindloe02 kernel: LustreError: 167-0: This client was
evicted by home-md-MDT0000; in progress operations using this service
will fail.
Aug 23 09:36:35 bindloe02 kernel: LustreError: 11741:0:
(ldlm_request.c:890:ldlm_cli_cancel_req()) Got rc -5 from cancel RPC:
canceling anyway
Aug 23 09:36:35 bindloe02 kernel: LustreError: 21191:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -5
Aug 23 09:36:35 bindloe02 kernel: Lustre: home-md-MDT0000-
mdc-00000102280b8c00: Connection restored to service home-md-MDT0000
using nid 10.142.10.3@tcp1.
Aug 23 09:40:49 bindloe02 kernel: LustreError: 21341:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -17
Aug 23 09:40:49 bindloe02 kernel: LustreError: 21341:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 1 previous similar message
Aug 23 09:56:43 bindloe02 kernel: LustreError: 21428:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 09:56:43 bindloe02 kernel: LustreError: 21428:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 1 previous similar message
Aug 23 09:56:43 bindloe02 kernel: LustreError: 21428:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 09:57:04 bindloe02 kernel: LustreError: 21430:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 09:57:04 bindloe02 kernel: LustreError: 21430:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 1 previous similar message
Aug 23 09:57:38 bindloe02 kernel: LustreError: 21475:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 09:57:38 bindloe02 kernel: LustreError: 21475:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 4 previous similar messages
Aug 23 09:57:50 bindloe02 kernel: LustreError: 21478:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 09:58:11 bindloe02 kernel: LustreError: 21522:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 09:58:11 bindloe02 kernel: LustreError: 21522:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 4 previous similar messages
Aug 23 09:58:43 bindloe02 kernel: LustreError: 21570:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 09:58:43 bindloe02 kernel: LustreError: 21570:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 3 previous similar messages
Aug 23 10:02:06 bindloe02 kernel: LustreError: 21604:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:02:06 bindloe02 kernel: LustreError: 21604:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 5 previous similar messages
Aug 23 10:17:38 bindloe02 kernel: LustreError: 21711:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:17:38 bindloe02 kernel: LustreError: 21711:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 2 previous similar messages
Aug 23 10:17:39 bindloe02 kernel: LustreError: 21747:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:17:39 bindloe02 kernel: LustreError: 21783:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:17:39 bindloe02 kernel: LustreError: 21783:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 12 previous similar messages
Aug 23 10:17:39 bindloe02 kernel: LustreError: 21855:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:17:39 bindloe02 kernel: LustreError: 21855:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 4 previous similar messages
Aug 23 10:17:40 bindloe02 kernel: LustreError: 21927:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:17:40 bindloe02 kernel: LustreError: 21927:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 292 previous similar messages
Aug 23 10:17:42 bindloe02 kernel: LustreError: 21999:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:17:42 bindloe02 kernel: LustreError: 21999:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 25 previous similar messages
Aug 23 10:17:45 bindloe02 kernel: LustreError: 22071:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:17:45 bindloe02 kernel: LustreError: 22071:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 70 previous similar messages
Aug 23 10:17:50 bindloe02 kernel: LustreError: 22071:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:17:50 bindloe02 kernel: LustreError: 22071:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 135 previous similar messages
Aug 23 10:18:19 bindloe02 kernel: LustreError: 22107:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:18:19 bindloe02 kernel: LustreError: 22107:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 158 previous similar messages
Aug 23 10:19:22 bindloe02 kernel: LustreError: 22325:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:19:22 bindloe02 kernel: LustreError: 22325:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 610 previous similar messages
Aug 23 10:20:07 bindloe02 kernel: LustreError: 22396:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:22:30 bindloe02 kernel: LustreError: 22433:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:26:38 bindloe02 kernel: LustreError: 22522:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 23 10:26:38 bindloe02 kernel: LustreError: 22522:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 2 previous similar messages
------------------------------ Client 2 ------------------------------
Aug 23 10:21:02 bindloe01 kernel: LustreError: 15638:0:(client.c:
962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187860762,
100s ago) req@0000010216714600 x107/t0 o36->home-md-
MDT0000_UUID@10.143.245.3@tcp:12 lens 336/296 ref 1 fl Rpc:/0/0 rc 0/-22
Aug 23 10:21:02 bindloe01 kernel: Lustre: home-md-MDT0000-
mdc-0000010006952800: Connection to service home-md-MDT0000 via nid
10.142.10.3@tcp1 was lost; in progress operations using this service
will wait for recovery to complete.
Aug 23 10:21:08 bindloe01 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 10:21:52 bindloe01 last message repeated 2 times
Aug 23 10:23:07 bindloe01 last message repeated 3 times
Aug 23 10:24:22 bindloe01 last message repeated 3 times
Aug 23 10:25:37 bindloe01 last message repeated 3 times
Aug 23 10:27:17 bindloe01 last message repeated 3 times
Aug 23 10:27:17 bindloe01 kernel: LustreError: Skipped 1 previous
similar message
Aug 23 10:27:53 bindloe01 kernel: LustreError: 15735:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -4
Aug 23 10:28:57 bindloe01 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 10:28:57 bindloe01 kernel: LustreError: Skipped 3 previous
similar messages
Aug 23 10:31:52 bindloe01 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 10:31:52 bindloe01 kernel: LustreError: Skipped 5 previous
similar messages
Aug 23 10:37:42 bindloe01 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 10:37:42 bindloe01 kernel: LustreError: Skipped 13 previous
similar messages
Aug 23 10:48:07 bindloe01 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 10:48:07 bindloe01 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 10:58:56 bindloe01 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 10:58:56 bindloe01 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 11:00:56 bindloe01 kernel: LustreError: 15935:0:(llite_lib.c:
1399:ll_statfs_internal()) mdc_statfs fails: rc = -4
Aug 23 11:08:57 bindloe01 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 11:08:57 bindloe01 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 11:16:26 bindloe01 kernel: LustreError: 16339:0:(llite_lib.c:
1399:ll_statfs_internal()) mdc_statfs fails: rc = -4
------------------------------ Client 3 ------------------------------
Aug 22 22:09:08 node-i14 kernel: Lustre: Lustre Client File System;
info@clusterfs.com
Aug 22 22:09:08 node-i14 kernel: Lustre: Binding irq 169 to CPU 0
with cmd: echo 1 > /proc/irq/169/smp_affinity
Aug 22 22:09:08 node-i14 kernel: Lustre: Client home-md-client has
started
Aug 22 22:28:46 node-i14 kernel: LustreError: 15463:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 22 22:28:46 node-i14 kernel: LustreError: 15465:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 22 22:28:46 node-i14 kernel: LustreError: 15465:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 3 previous similar messages
Aug 22 22:28:46 node-i14 kernel: LustreError: 15468:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 22 22:28:46 node-i14 kernel: LustreError: 15468:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 46 previous similar messages
Aug 22 22:28:46 node-i14 kernel: LustreError: 15473:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 22 22:28:46 node-i14 kernel: LustreError: 15473:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 38 previous similar messages
Aug 22 22:28:46 node-i14 kernel: LustreError: 15478:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -2
Aug 22 22:28:46 node-i14 kernel: LustreError: 15478:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 52 previous similar messages
Aug 22 22:33:49 node-i14 kernel: LustreError: 15626:0:(client.c:
962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187818329,
100s ago) req@00000102137ef400 x1628/t0 o101->home-md-
MDT0000_UUID@10.143.245.3@tcp:12 lens 456/768 ref 1 fl Rpc:P/0/0 rc
0/-22
Aug 22 22:33:49 node-i14 kernel: Lustre: home-md-MDT0000-
mdc-00000100069b6c00: Connection to service home-md-MDT0000 via nid
10.143.245.3@tcp was lost; in progress operations using this service
will wait for recovery to complete.
Aug 22 22:33:49 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 22 22:37:09 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 22 22:40:04 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 22 22:40:04 node-i14 kernel: LustreError: Skipped 1 previous
similar message
Aug 22 22:41:44 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 22 22:41:44 node-i14 kernel: LustreError: Skipped 3 previous
similar messages
Aug 22 22:44:39 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 22 22:44:39 node-i14 kernel: LustreError: Skipped 6 previous
similar messages
Aug 22 22:50:29 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 22 22:50:29 node-i14 kernel: LustreError: Skipped 13 previous
similar messages
Aug 22 23:00:54 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 22 23:00:54 node-i14 kernel: LustreError: Skipped 24 previous
similar messages
Aug 22 23:11:19 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 22 23:11:19 node-i14 kernel: LustreError: Skipped 24 previous
similar messages
Aug 22 23:21:44 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 22 23:21:44 node-i14 kernel: LustreError: Skipped 24 previous
similar messages
Aug 22 23:32:12 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 22 23:32:12 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 22 23:42:34 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 22 23:42:34 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 22 23:52:59 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 22 23:52:59 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 00:03:24 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 00:03:24 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 00:14:32 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 00:14:32 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 00:24:39 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 00:24:39 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 00:35:04 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 00:35:04 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 00:46:02 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 00:46:02 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 00:56:19 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 00:56:19 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 01:06:19 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 01:06:19 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 01:16:44 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 01:16:44 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 01:27:09 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 01:27:09 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 01:37:30 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 01:37:30 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 01:47:34 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 01:47:34 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 01:57:59 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 01:57:59 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 02:08:24 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 02:08:24 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 02:18:49 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 02:18:49 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 02:29:14 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 02:29:14 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 02:39:26 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 02:39:26 node-i14 kernel: LustreError: Skipped 20 previous
similar messages
Aug 23 02:49:39 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 02:49:39 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 02:59:39 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 02:59:39 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 03:10:04 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 03:10:04 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 03:20:23 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 03:20:23 node-i14 kernel: LustreError: Skipped 20 previous
similar messages
Aug 23 03:30:29 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 03:30:29 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 03:40:54 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 03:40:54 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 03:51:53 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 03:51:53 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 04:02:09 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 04:02:09 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 04:12:34 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 04:12:34 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 04:22:59 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 04:22:59 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 04:33:55 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 04:33:55 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 04:44:15 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 04:44:15 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 04:54:40 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 04:54:40 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 05:05:05 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 05:05:05 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 05:15:34 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 05:15:34 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 05:25:55 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 05:25:55 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 05:36:20 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 05:36:20 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 05:46:45 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 05:46:45 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 05:56:45 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 05:56:45 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 06:07:10 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 06:07:10 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 06:17:35 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 06:17:35 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 06:28:00 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 06:28:00 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 06:38:25 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 06:38:25 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 06:48:50 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 06:48:50 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 06:59:15 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 06:59:15 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 07:09:15 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 07:09:15 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 07:19:40 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 07:19:40 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 07:30:05 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 07:30:05 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 07:40:30 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 07:40:30 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 07:51:18 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 07:51:18 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 08:01:20 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 08:01:20 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 08:11:45 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 08:11:45 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 08:22:10 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 08:22:10 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 08:32:35 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 08:32:35 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 08:43:00 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 08:43:00 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 08:53:00 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 08:53:00 node-i14 kernel: LustreError: Skipped 21 previous
similar messages
Aug 23 09:03:25 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:03:25 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 09:13:50 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:13:50 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 09:24:15 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:24:15 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 09:35:15 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:35:15 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 09:45:30 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:45:30 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 09:55:55 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 09:55:55 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 09:59:13 node-i14 kernel: LustreError: 15626:0:(mdc_locks.c:
420:mdc_enqueue()) ldlm_cli_enqueue: -4
Aug 23 09:59:13 node-i14 kernel: LustreError: 15626:0:(mdc_locks.c:
420:mdc_enqueue()) Skipped 21 previous similar messages
Aug 23 10:06:20 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 10:06:20 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 10:16:45 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 10:16:45 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 10:27:10 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 10:27:10 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 10:37:35 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 10:37:35 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 10:48:00 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 10:48:00 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 10:58:56 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 10:58:56 node-i14 kernel: LustreError: Skipped 22 previous
similar messages
Aug 23 11:09:15 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 11:09:15 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
Aug 23 11:19:40 node-i14 kernel: LustreError: 11-0: an error ocurred
while communicating with (no nid) The mds_connect operation failed
with -16
Aug 23 11:19:40 node-i14 kernel: LustreError: Skipped 23 previous
similar messages
------------------------------ MDS/MGS ------------------------------
Aug 23 10:20:43 storage03 kernel: LustreError: 18968:0:(client.c:
962:ptlrpc_expire_one_request()) Skipped 22821480 previous similar
messages
Aug 23 10:21:02 storage03 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb
()) Watchdog triggered for pid 19007: it was inactive for 100s
Aug 23 10:21:02 storage03 kernel: Lustre: 0:0:(linux-debug.c:
168:libcfs_debug_dumpstack()) showing stack for process 19007
Aug 23 10:21:02 storage03 kernel: ll_mdt_119 S
0000000102da4f34 0 19007 1 19008 19006 (L-TLB)
Aug 23 10:21:02 storage03 kernel: 000001011a2c7438 0000000000000046
0000010086272b80 0000010100000073
Aug 23 10:21:02 storage03 kernel: 0000010086272bc0
00000000a03d99c0 0000010001043a20 000000001a2c7468
Aug 23 10:21:02 storage03 kernel: 000001009b9d0800
0000000000005f72
Aug 23 10:21:02 storage03 kernel: Call Trace:<ffffffffa03b5c24>
{:ptlrpc:ldlm_run_cp_ast_work+356}
Aug 23 10:21:02 storage03 kernel: <ffffffff8013f57c>
{__mod_timer+293} ll_mdt_75 S 0000000000000000 0 18960
1 18961 18959 (L-TLB)
Aug 23 10:21:02 storage03 kernel: 00000100769453e8 0000000000000046
0000000000000013 0000000000000086
Aug 23 10:21:02 storage03 kernel: 0000000000007530
0000000000000000 000001000105ba20 00000003abc5bf5a
Aug 23 10:21:02 storage03 kernel: 00000100a66c2800
00000000000007a1
Aug 23 10:21:02 storage03 kernel: Call Trace:<ffffffff8013f57c>
{__mod_timer+293} <ffffffff80320877>{schedule_timeout+367}
Aug 23 10:21:02 storage03 kernel: <ffffffff8013ffac>
{process_timeout+0} <ffffffffa03c83f4>{:ptlrpc:ldlm_completion_ast+1188}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03b9e78>
{:ptlrpc:ldlm_resource_get+456} <ffffffff8013369a>
{default_wake_function+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03c7d00>
{:ptlrpc:ldlm_expired_completion_wait+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03c7cf0>
{:ptlrpc:interrupted_completion_wait+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03b862b>
{:ptlrpc:ldlm_lock_enqueue+1355}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03c7d00>
{:ptlrpc:ldlm_expired_completion_wait+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03c7cf0>
{:ptlrpc:interrupted_completion_wait+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03caec0>
{:ptlrpc:ldlm_blocking_ast+0} <ffffffffa03c8c08>
{:ptlrpc:ldlm_cli_enqueue_local+1192}
Aug 23 10:21:02 storage03 kernel: <ffffffffa05e2a9f>
{:mds:enqueue_ordered_locks+1135}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03c7f50>
{:ptlrpc:ldlm_completion_ast+0} <ffffffffa05e3f20>
{:mds:mds_get_parent_child_locked+1408}
Aug 23 10:21:02 storage03 kernel: <ffffffffa0452cc4>
{:ksocklnd:ksocknal_queue_tx_locked+1236}
Aug 23 10:21:02 storage03 kernel: <ffffffffa05cc58f>
{:mds:mds_getattr_lock+1695} <ffffffffa0450ef4>
{:ksocklnd:ksocknal_alloc_tx+468}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03ebbf7>
{:ptlrpc:lustre_msg_get_flags+87}
Aug 23 10:21:02 storage03 kernel: <ffffffffa05d8b31>
{:mds:mds_intent_policy+1601} <ffffffffa03089f2>{:lnet:LNetMDBind+690}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03bb1f3>
{:ptlrpc:ldlm_resource_putref+435}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03b81d6>
{:ptlrpc:ldlm_lock_enqueue+246} <ffffffffa03b3005>
{:ptlrpc:ldlm_lock_create+1477}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03ee590>
{:ptlrpc:lustre_swab_ldlm_request+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03d1aa2>
{:ptlrpc:ldlm_handle_enqueue+3426}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03cf540>
{:ptlrpc:ldlm_server_blocking_ast+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03cfbd0>
{:ptlrpc:ldlm_server_completion_ast+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa05d3b18>
{:mds:mds_handle+18424} <ffffffffa030cb58>
{:lnet:lnet_match_blocked_msg+920}
Aug 23 10:21:02 storage03 kernel: <ffffffff80320877>
{schedule_timeout+367}
Aug 23 10:21:02 storage03 kernel: <ffffffffa035bf40>
{:obdclass:class_handle2object+224}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03ed110>
{:ptlrpc:lustre_swab_ptlrpc_body+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03eb03d>
{:ptlrpc:lustre_swab_buf+205} <ffffffffa030860c>{:lnet:LNetMDAttach+764}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03f22ec>
{:ptlrpc:ptlrpc_server_handle_request+3036}
Aug 23 10:21:02 storage03 kernel: <ffffffffa02dbbbe>
{:libcfs:lcw_update_time+30} <ffffffff8013ffac>{process_timeout+0}
<ffffffffa03e00e4>{:ptlrpc:ptlrpc_set_wait+932}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03ebdb5>
{:ptlrpc:lustre_msg_set_flags+69}
Aug 23 10:21:02 storage03 kernel: <ffffffff8013f57c>
{__mod_timer+293}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03f48da>
{:ptlrpc:ptlrpc_main+2410} <ffffffffa03f3000>
{:ptlrpc:ptlrpc_retry_rqbds+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03f3000>
{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffff8013369a>
{default_wake_function+0}<ffffffffa03f3000>{:ptlrpc:ptlrpc_retry_rqbds
+0}
Aug 23 10:21:02 storage03 kernel: <ffffffff80110de3>{child_rip
+8} <ffffffffa03f3f70>{:ptlrpc:ptlrpc_main+0}<ffffffffa03de320>
{:ptlrpc:ptlrpc_expired_set+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03dc380>
{:ptlrpc:ptlrpc_interrupted_set+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03cec73>
{:ptlrpc:__ldlm_add_waiting_lock+147}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03de320>
{:ptlrpc:ptlrpc_expired_set+0} <ffffffffa03dc380>
{:ptlrpc:ptlrpc_interrupted_set+0}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03b354b>
{:ptlrpc:ldlm_send_and_maybe_create_set+27}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03b658a>
{:ptlrpc:ldlm_run_bl_ast_work+490}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03b4e73>
{:ptlrpc:ldlm_add_ast_work_item+179}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03b3ee5>
{:ptlrpc:ldlm_granted_list_add_lock+325}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03d25db>
{:ptlrpc:ldlm_handle_bl_callback+427}
Aug 23 10:21:02 storage03 kernel: <ffffffffa03d974c>
{:ptlrpc:ldlm_process_inodebits_lock+748}
Aug 23 10:21:03 storage03 kernel: <ffffffffa03b9e78>
{:ptlrpc:ldlm_resource_get+456} <ffffffffa03b85b4>
{:ptlrpc:ldlm_lock_enqueue+1236}
Aug 23 10:21:03 storage03 kernel: <ffffffffa03b248a>
{:ptlrpc:ldlm_lock_remove_from_lru+138}
Aug 23 10:21:03 storage03 kernel: <ffffffffa03caec0>
{:ptlrpc:ldlm_blocking_ast+0} <ffffffffa03c8a93>
{:ptlrpc:ldlm_cli_enqueue_local+819}
Aug 23 10:21:03 storage03 kernel: <ffffffffa05e2a9f>
{:mds:enqueue_ordered_locks+1135}
Aug 23 10:21:03 storage03 kernel: <ffffffffa03c7f50>
{:ptlrpc:ldlm_completion_ast+0} <ffffffffa05e72f9>{:mds:mds_reint_link
+1657}
Aug 23 10:21:03 storage03 kernel: <ffffffffa02f32da>
{:lvfs:upcall_cache_get_entry+522}
Aug 23 10:21:03 storage03 kernel: <ffffffffa030c0b0>
{:lnet:lnet_send+2544} <ffffffffa05ea54a>{:mds:mds_reint_rec+458}
Aug 23 10:21:03 storage03 kernel: <ffffffffa05fcc3c>
{:mds:mds_link_unpack+636} <ffffffffa05fdb9b>{:mds:mds_update_unpack
+507}
Aug 23 10:21:03 storage03 kernel: <ffffffffa05ce871>
{:mds:mds_reint+1089} <ffffffffa05cf133>{:mds:mds_msg_check_version+387}
Aug 23 10:21:03 storage03 kernel: <ffffffffa05d1e88>
{:mds:mds_handle+11112} <ffffffffa030cb58>
{:lnet:lnet_match_blocked_msg+920}
Aug 23 10:21:03 storage03 kernel: <ffffffffa035bf40>
{:obdclass:class_handle2object+224}
Aug 23 10:21:03 storage03 kernel: <ffffffffa03ed110>
{:ptlrpc:lustre_swab_ptlrpc_body+0}
Aug 23 10:21:03 storage03 kernel: <ffffffffa03eb03d>
{:ptlrpc:lustre_swab_buf+205} <ffffffff80131a57>{recalc_task_prio+337}
Aug 23 10:21:03 storage03 kernel: <ffffffffa03f22ec>
{:ptlrpc:ptlrpc_server_handle_request+3036}
Aug 23 10:21:03 storage03 kernel: <ffffffffa02dbbbe>
{:libcfs:lcw_update_time+30}
Aug 23 10:21:03 storage03 kernel: <ffffffff80110ddb>{child_rip+0}
Aug 23 10:21:03 storage03 kernel: LustreError: dumping log to /tmp/
lustre-log.1187860862.19007
Aug 23 10:21:03 storage03 kernel: LustreError: 19007:0:
(ldlm_request.c:60:ldlm_expired_completion_wait()) ### lock timed out
(enqueued at 1187860762, 100s ago); not entering recovery in server
code, just going back to sleep ns: mds-home-md-MDT0000_UUID lock:
0000010062575980/0x6d867fda339b07ce lrc: 3/1,0 mode: --/CR res:
32671088/1775612723 bits 0x3 rrc: 4 type: IBT flags: 4004000 remote:
0x0 expref: -99 pid 19007
Aug 23 10:21:03 storage03 kernel: <ffffffff8013f57c>{__mod_timer+293}
Aug 23 10:21:03 storage03 kernel: <ffffffffa03f48da>
{:ptlrpc:ptlrpc_main+2410} <ffffffffa03f3000>
{:ptlrpc:ptlrpc_retry_rqbds+0}
Aug 23 10:21:03 storage03 kernel: <ffffffffa03f3000>
{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa03f3000>
{:ptlrpc:ptlrpc_retry_rqbds+0}
Aug 23 10:21:03 storage03 kernel: <ffffffff80110de3>{child_rip
+8} <ffffffffa03f3f70>{:ptlrpc:ptlrpc_main+0}
Aug 23 10:21:03 storage03 kernel: <ffffffff80110ddb>{child_rip+0}
Aug 23 10:21:08 storage03 kernel: LustreError: dumping log to /tmp/
lustre-log.1187860868.18960
Aug 23 10:22:40 storage03 kernel: Lustre: MGS: haven't heard from
client f03c5e89-73c4-4d6c-1389-38fe5cfbb771 (at 10.142.9.20@tcp) in
227 seconds. I think it's dead, and I am evicting it.
Aug 23 10:22:40 storage03 kernel: Lustre: Skipped 3 previous similar
messages
Aug 23 10:24:06 storage03 kernel: LustreError: 0:0:(ldlm_lockd.c:
214:waiting_locks_callback()) ### lock callback timer expired:
evicting client 482df67f-be1c-4469-bf41-
d85560b363b0@NET_0x200010a8e0a68_UUID nid 10.142.10.104@tcp1 ns: mds-
home-md-MDT0000_UUID lock: 000001010e31ca40/0x6d867fda339b0b08 lrc:
1/0,0 mode: CR/CR res: 32320082/1777829066 bits 0x3 rrc: 2 type: IBT
flags: 4000020 remote: 0xfda6fd9685772b07 expref: 177 pid 18934
Aug 23 10:26:02 storage03 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb
()) Watchdog triggered for pid 18922: it was inactive for 100s
Aug 23 10:26:02 storage03 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb
()) Skipped 1 previous similar message
Aug 23 10:26:02 storage03 kernel: Lustre: 0:0:(linux-debug.c:
168:libcfs_debug_dumpstack()) showing stack for process 18922
Aug 23 10:26:02 storage03 kernel: Lustre: 0:0:(linux-debug.c:
168:libcfs_debug_dumpstack()) Skipped 1 previous similar message
Aug 23 10:26:02 storage03 kernel: ll_mdt_48 S
0000000000000000 0 18922 1 18923 18921 (L-TLB)
Aug 23 10:26:02 storage03 kernel: 0000010048045438 0000000000000046
0000010086272b80 0000010048045468
Aug 23 10:26:02 storage03 kernel: 0000010086272bc0
ffffffffa03d99c0 00000100480454f8 0000000348045468
Aug 23 10:26:02 storage03 kernel: 0000010086991030
0000000000001404
Aug 23 10:26:02 storage03 kernel: Call Trace:<ffffffffa03d99c0>
{:ptlrpc:ptlrpc_prep_set+320} <ffffffffa03b5c24>
{:ptlrpc:ldlm_run_cp_ast_work+356}
Aug 23 10:26:02 storage03 kernel: <ffffffff8013f57c>
{__mod_timer+293} <ffffffff80320877>{schedule_timeout+367}
Aug 23 10:26:02 storage03 kernel: <ffffffff8013ffac>
{process_timeout+0} <ffffffffa03c83f4>{:ptlrpc:ldlm_completion_ast+1188}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03b9e78>
{:ptlrpc:ldlm_resource_get+456} <ffffffff8013369a>
{default_wake_function+0}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03c7d00>
{:ptlrpc:ldlm_expired_completion_wait+0}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03c7cf0>
{:ptlrpc:interrupted_completion_wait+0}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03b862b>
{:ptlrpc:ldlm_lock_enqueue+1355}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03c7d00>
{:ptlrpc:ldlm_expired_completion_wait+0}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03c7cf0>
{:ptlrpc:interrupted_completion_wait+0}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03caec0>
{:ptlrpc:ldlm_blocking_ast+0} <ffffffffa03c8c08>
{:ptlrpc:ldlm_cli_enqueue_local+1192}
Aug 23 10:26:02 storage03 kernel: <ffffffffa05e2a9f>
{:mds:enqueue_ordered_locks+1135}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03c7f50>
{:ptlrpc:ldlm_completion_ast+0} <ffffffffa05e3f20>
{:mds:mds_get_parent_child_locked+1408}
Aug 23 10:26:02 storage03 kernel: <ffffffff801336eb>
{__wake_up_common+67} <ffffffffa05cc58f>{:mds:mds_getattr_lock+1695}
Aug 23 10:26:02 storage03 kernel: <ffffffffa0450ef4>
{:ksocklnd:ksocknal_alloc_tx+468}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03ebbf7>
{:ptlrpc:lustre_msg_get_flags+87}
Aug 23 10:26:02 storage03 kernel: <ffffffffa05d8b31>
{:mds:mds_intent_policy+1601} <ffffffffa03089f2>{:lnet:LNetMDBind+690}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03bb1f3>
{:ptlrpc:ldlm_resource_putref+435}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03b81d6>
{:ptlrpc:ldlm_lock_enqueue+246} <ffffffffa03b3005>
{:ptlrpc:ldlm_lock_create+1477}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03ee590>
{:ptlrpc:lustre_swab_ldlm_request+0}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03d1aa2>
{:ptlrpc:ldlm_handle_enqueue+3426}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03cf540>
{:ptlrpc:ldlm_server_blocking_ast+0}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03cfbd0>
{:ptlrpc:ldlm_server_completion_ast+0}
Aug 23 10:26:02 storage03 kernel: <ffffffffa05d3b18>
{:mds:mds_handle+18424} <ffffffffa0022bda>
{:e1000:e1000_alloc_rx_buffers_ps+512}
Aug 23 10:26:02 storage03 kernel: <ffffffffa030cb58>
{:lnet:lnet_match_blocked_msg+920}
Aug 23 10:26:02 storage03 kernel: <ffffffffa035bf40>
{:obdclass:class_handle2object+224}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03ed110>
{:ptlrpc:lustre_swab_ptlrpc_body+0}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03eb03d>
{:ptlrpc:lustre_swab_buf+205} <ffffffff80131a57>{recalc_task_prio+337}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03f22ec>
{:ptlrpc:ptlrpc_server_handle_request+3036}
Aug 23 10:26:02 storage03 kernel: <ffffffffa02dbbbe>
{:libcfs:lcw_update_time+30} <ffffffff8013f57c>{__mod_timer+293}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03f48da>
{:ptlrpc:ptlrpc_main+2410} <ffffffff8013369a>{default_wake_function+0}
Aug 23 10:26:02 storage03 kernel: <ffffffffa03f3000>
{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa03f3000>
{:ptlrpc:ptlrpc_retry_rqbds+0}
Aug 23 10:26:02 storage03 kernel: <ffffffff80110de3>{child_rip
+8} <ffffffffa03f3f70>{:ptlrpc:ptlrpc_main+0}
Aug 23 10:26:02 storage03 kernel: <ffffffff80110ddb>{child_rip+0}
Aug 23 10:26:02 storage03 kernel: LustreError: dumping log to /tmp/
lustre-log.1187861162.18922
Aug 23 10:26:02 storage03 kernel: LustreError: 18922:0:
(ldlm_request.c:60:ldlm_expired_completion_wait()) ### lock timed out
(enqueued at 1187861062, 100s ago); not entering recovery in server
code, just going back to sleep ns: mds-home-md-MDT0000_UUID lock:
000001009c086080/0x6d867fda339b3ea1 lrc: 3/1,0 mode: --/CR res:
32671088/1775612723 bits 0x3 rrc: 4 type: IBT flags: 4004000 remote:
0x0 expref: -99 pid 18922
Aug 23 10:27:16 storage03 kernel: Lustre: 18966:0:(ldlm_lib.c:
724:target_handle_connect()) home-md-MDT0000: refuse reconnection
from 2b962b5b-ac4d-368e-e3c1-04795a6a3139@10.142.2.50@tcp to
0x000001005bcbe000/2
Aug 23 10:27:16 storage03 kernel: Lustre: 18966:0:(ldlm_lib.c:
724:target_handle_connect()) Skipped 434 previous similar messages
Aug 23 10:27:16 storage03 kernel: LustreError: 18966:0:(ldlm_lib.c:
1395:target_send_reply_msg()) @@@ processing error (-16)
req@00000100c1a5f800 x52340/t0 o38->2b962b5b-ac4d-368e-
e3c1-04795a6a3139@NET_0x200000a8e0232_UUID:-1 lens 304/200 ref 0 fl
Interpret:/0/0 rc -16/0
Aug 23 10:27:16 storage03 kernel: LustreError: 18966:0:(ldlm_lib.c:
1395:target_send_reply_msg()) Skipped 434 previous similar messages
Aug 23 10:27:26 storage03 kernel: Lustre: There was an unexpected
network error while writing to 10.143.6.51: -110.
Aug 23 10:27:41 storage03 kernel: Lustre: 18915:0:(ldlm_lib.c:
502:target_handle_reconnect()) home-md-MDT0000: 325cee47-c08d-a6f6-
f6cc-1d9d50429640 reconnecting
Aug 23 10:27:41 storage03 kernel: Lustre: 18915:0:(ldlm_lib.c:
502:target_handle_reconnect()) Skipped 439 previous similar messages
Aug 23 10:28:23 storage03 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb
()) Watchdog triggered for pid 19008: it was inactive for 100s
Aug 23 10:28:23 storage03 kernel: Lustre: 0:0:(linux-debug.c:
168:libcfs_debug_dumpstack()) showing stack for process 19008
Aug 23 10:28:23 storage03 kernel: ll_mdt_120 S
ffffffff8031f460 0 19008 1 19009 19007 (L-TLB)
Aug 23 10:28:23 storage03 kernel: 00000100b3ea3438 0000000000000046
0000010086272b80 00000100b3ea3468
Aug 23 10:28:23 storage03 kernel: 0000010086272bc0
ffffffffa03d99c0 00000100b3ea34f8 00000003b3ea3468
Aug 23 10:28:23 storage03 kernel: 000001009b9d0030
0000000000005262
Aug 23 10:28:23 storage03 kernel: Call Trace:<ffffffffa03d99c0>
{:ptlrpc:ptlrpc_prep_set+320} <ffffffffa03b5c24>
{:ptlrpc:ldlm_run_cp_ast_work+356}
Aug 23 10:28:23 storage03 kernel: <ffffffff8013f57c>
{__mod_timer+293} <ffffffff80320877>{schedule_timeout+367}
Aug 23 10:28:23 storage03 kernel: <ffffffff8013ffac>
{process_timeout+0} <ffffffffa03c83f4>{:ptlrpc:ldlm_completion_ast+1188}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03b9e78>
{:ptlrpc:ldlm_resource_get+456} <ffffffff8013369a>
{default_wake_function+0}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03c7d00>
{:ptlrpc:ldlm_expired_completion_wait+0}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03c7cf0>
{:ptlrpc:interrupted_completion_wait+0}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03b862b>
{:ptlrpc:ldlm_lock_enqueue+1355}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03c7d00>
{:ptlrpc:ldlm_expired_completion_wait+0}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03c7cf0>
{:ptlrpc:interrupted_completion_wait+0}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03caec0>
{:ptlrpc:ldlm_blocking_ast+0} <ffffffffa03c8c08>
{:ptlrpc:ldlm_cli_enqueue_local+1192}
Aug 23 10:28:23 storage03 kernel: <ffffffffa05e2a9f>
{:mds:enqueue_ordered_locks+1135}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03c7f50>
{:ptlrpc:ldlm_completion_ast+0} <ffffffffa05e3f20>
{:mds:mds_get_parent_child_locked+1408}
Aug 23 10:28:23 storage03 kernel: <ffffffff801336eb>
{__wake_up_common+67} <ffffffffa05cc58f>{:mds:mds_getattr_lock+1695}
Aug 23 10:28:23 storage03 kernel: <ffffffffa0450ef4>
{:ksocklnd:ksocknal_alloc_tx+468}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03ebbf7>
{:ptlrpc:lustre_msg_get_flags+87}
Aug 23 10:28:23 storage03 kernel: <ffffffffa05d8b31>
{:mds:mds_intent_policy+1601} <ffffffffa03bb1f3>
{:ptlrpc:ldlm_resource_putref+435}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03b81d6>
{:ptlrpc:ldlm_lock_enqueue+246} <ffffffffa03b3005>
{:ptlrpc:ldlm_lock_create+1477}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03ee590>
{:ptlrpc:lustre_swab_ldlm_request+0}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03d1aa2>
{:ptlrpc:ldlm_handle_enqueue+3426}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03cf540>
{:ptlrpc:ldlm_server_blocking_ast+0}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03cfbd0>
{:ptlrpc:ldlm_server_completion_ast+0}
Aug 23 10:28:23 storage03 kernel: <ffffffffa05d3b18>
{:mds:mds_handle+18424} <ffffffffa030cb58>
{:lnet:lnet_match_blocked_msg+920}
Aug 23 10:28:23 storage03 kernel: <ffffffffa035bf40>
{:obdclass:class_handle2object+224}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03ed110>
{:ptlrpc:lustre_swab_ptlrpc_body+0}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03eb03d>
{:ptlrpc:lustre_swab_buf+205} <ffffffffa03f22ec>
{:ptlrpc:ptlrpc_server_handle_request+3036}
Aug 23 10:28:23 storage03 kernel: <ffffffffa02dbbbe>
{:libcfs:lcw_update_time+30} <ffffffff8013f57c>{__mod_timer+293}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03f48da>
{:ptlrpc:ptlrpc_main+2410} <ffffffff80179504>{end_buffer_async_write+0}
Aug 23 10:28:23 storage03 kernel: <ffffffff8013369a>
{default_wake_function+0} <ffffffffa03f3000>
{:ptlrpc:ptlrpc_retry_rqbds+0}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03f3000>
{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffff80110de3>{child_rip+8}
Aug 23 10:28:23 storage03 kernel: <ffffffffa03f3f70>
{:ptlrpc:ptlrpc_main+0} <ffffffff80110ddb>{child_rip+0}
Aug 23 10:28:23 storage03 kernel:
Aug 23 10:28:23 storage03 kernel: LustreError: dumping log to /tmp/
lustre-log.1187861303.19008
Aug 23 10:28:23 storage03 kernel: LustreError: 19008:0:
(ldlm_request.c:60:ldlm_expired_completion_wait()) ### lock timed out
(enqueued at 1187861203, 100s ago); not entering recovery in server
code, just going back to sleep ns: mds-home-md-MDT0000_UUID lock:
00000100302ac740/0x6d867fda339b4696 lrc: 3/1,0 mode: --/CR res:
32671088/1775612723 bits 0x3 rrc: 5 type: IBT flags: 4004000 remote:
0x0 expref: -99 pid 19008
Aug 23 10:28:27 storage03 kernel: Lustre: home-md-OST0001-osc:
Connection to service home-md-OST0001 via nid 10.143.245.2@tcp was
lost; in progress operations using this service will wait for
recovery to complete.
Aug 23 10:28:37 storage03 kernel: Lustre: 14387:0:(lib-move.c:
1647:lnet_parse_put()) Dropping PUT from 12345-10.143.245.2@tcp
portal 4 match 2375236 offset 0 length 272: 2
Aug 23 10:28:37 storage03 kernel: Lustre: 14387:0:(lib-move.c:
1647:lnet_parse_put()) Skipped 8 previous similar messages
Aug 23 10:28:37 storage03 kernel: Lustre: 14398:0:(quota_master.c:
1101:mds_quota_recovery()) Not all osts are active, abort quota recovery
Aug 23 10:28:37 storage03 kernel: Lustre: home-md-OST0001-osc:
Connection restored to service home-md-OST0001 using nid
10.143.245.2@tcp.
Aug 23 10:28:37 storage03 kernel: Lustre: MDS home-md-MDT0000: home-
md-OST0001_UUID now active, resetting orphans
Aug 23 10:29:52 storage03 kernel: Lustre: home-md-MDT0000: haven't
heard from client 3c353018-50c0-df10-cf89-35e17382a109 (at
10.142.6.51@tcp) in 227 seconds. I think it's dead, and I am evicting
it.
Aug 23 10:29:52 storage03 kernel: Lustre: Skipped 30 previous similar
messages
Aug 23 10:30:08 storage03 kernel: LustreError: 18968:0:(events.c:
55:request_out_callback()) @@@ type 4, status -5
req@0000010129910e00 x2357264/t0 o104->??H?????
@NET_0x200000a8e0230_UUID:15 lens 232/128 ref 2 fl Rpc:/0/0 rc 0/-22
Aug 23 10:30:08 storage03 kernel: LustreError: 18968:0:(events.c:
55:request_out_callback()) Skipped 21228492 previous similar messages
Aug 23 10:30:44 storage03 kernel: LustreError: 18960:0:(client.c:
962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187861438, 6s
ago) req@0000010029324e00 x2375189/t0 o104->??H?????
@NET_0x200000a8e0920_UUID:15 lens 232/128 ref 30 fl Rpc:/0/0 rc 0/-22
Aug 23 10:30:44 storage03 kernel: LustreError: 18960:0:(client.c:
962:ptlrpc_expire_one_request()) Skipped 20493872 previous similar
messages
Aug 23 10:32:37 storage03 kernel: Lustre: home-md-OST0001-osc:
Connection to service home-md-OST0001 via nid 10.143.245.2@tcp was
lost; in progress operations using this service will wait for
recovery to complete.
Aug 23 10:32:46 storage03 kernel: Lustre: 14387:0:(lib-move.c:
1647:lnet_parse_put()) Dropping PUT from 12345-10.143.245.2@tcp
portal 4 match 2375318 offset 0 length 128: 2
Aug 23 10:32:46 storage03 kernel: Lustre: 14398:0:(quota_master.c:
1101:mds_quota_recovery()) Not all osts are active, abort quota recovery
Aug 23 10:32:46 storage03 kernel: Lustre: home-md-OST0001-osc:
Connection restored to service home-md-OST0001 using nid
10.143.245.2@tcp.
Aug 23 10:32:46 storage03 kernel: Lustre: MDS home-md-MDT0000: home-
md-OST0001_UUID now active, resetting orphans
Aug 23 10:36:47 storage03 kernel: Lustre: home-md-OST0001-osc:
Connection to service home-md-OST0001 via nid 10.143.245.2@tcp was
lost; in progress operations using this service will wait for
recovery to complete.
Aug 23 10:37:12 storage03 kernel: Lustre: 14398:0:(quota_master.c:
1101:mds_quota_recovery()) Not all osts are active, abort quota recovery
Aug 23 10:37:12 storage03 kernel: Lustre: home-md-OST0001-osc:
Connection restored to service home-md-OST0001 using nid
10.143.245.2@tcp.
Aug 23 10:37:13 storage03 kernel: Lustre: MDS home-md-MDT0000: home-
md-OST0001_UUID now active, resetting orphans
Aug 23 10:37:16 storage03 kernel: Lustre: 19002:0:(ldlm_lib.c:
724:target_handle_connect()) home-md-MDT0000: refuse reconnection
from 325cee47-c08d-a6f6-f6cc-1d9d50429640@10.142.2.53@tcp to
0x00000100c7672000/2
Aug 23 10:37:16 storage03 kernel: Lustre: 19002:0:(ldlm_lib.c:
724:target_handle_connect()) Skipped 464 previous similar messages
Aug 23 10:37:16 storage03 kernel: LustreError: 19002:0:(ldlm_lib.c:
1395:target_send_reply_msg()) @@@ processing error (-16)
req@0000010126811850 x52438/t0 o38->325cee47-c08d-a6f6-
f6cc-1d9d50429640@NET_0x200000a8e0235_UUID:-1 lens 304/200 ref 0 fl
Interpret:/0/0 rc -16/0
Aug 23 10:37:16 storage03 kernel: LustreError: 19002:0:(ldlm_lib.c:
1395:target_send_reply_msg()) Skipped 464 previous similar messages
Aug 23 10:37:41 storage03 kernel: Lustre: 18976:0:(ldlm_lib.c:
502:target_handle_reconnect()) home-md-MDT0000:
581834bb-609c-58c9-9c0a-2c272dd8db14 reconnecting
Aug 23 10:37:41 storage03 kernel: Lustre: 18976:0:(ldlm_lib.c:
502:target_handle_reconnect()) Skipped 464 previous similar messages
Aug 23 10:40:44 storage03 kernel: LustreError: 14492:0:(client.c:
962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187862038, 6s
ago) req@0000010117384a00 x2286224/t0 o104->??H?????
@NET_0x200000a8e0124_UUID:15 lens 232/128 ref 71 fl Rpc:/0/0 rc 0/-22
Aug 23 10:40:44 storage03 kernel: LustreError: 14492:0:(client.c:
962:ptlrpc_expire_one_request()) Skipped 17962807 previous similar
messages
Aug 23 10:40:58 storage03 kernel: Lustre: home-md-OST0001-osc:
Connection to service home-md-OST0001 via nid 10.143.245.2@tcp was
lost; in progress operations using this service will wait for
recovery to complete.
------------------------------------------------------------------------
We have lots of log files in /tmp on MDS/MGS available upon request.
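If it helps, we can convert the binary dumps to text before sending them.
Below is a minimal sketch of how we would do that, assuming lctl
debug_file is the right tool for this on 1.6.1; the file name is only an
example taken from the OSS log further down:
-------convert binary debug dump to text (sketch)-------
# Convert a binary Lustre debug dump to readable text before sending it.
# The file name is only an example; our dumps differ per timestamp/pid.
lctl debug_file /tmp/lustre-log.1187859722.15618 /tmp/lustre-log.1187859722.15618.txt

# Compress the resulting text file before attaching it to mail.
gzip /tmp/lustre-log.1187859722.15618.txt
---------------------------------------------------------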
------------------------------------------ OSS ------------------------------------------
Aug 23 09:56:15 storage02 kernel: LustreError: 15091:0:(events.c:
55:request_out_callback()) @@@ type 4, status -5
req@00000101155e4400 x1262654/t0 o104->@NET_0x200000a8e0618_UUID:15
lens 232/128 ref 10 fl Rpc:/0/0 rc 0/-22
Aug 23 09:56:15 storage02 kernel: LustreError: 15091:0:(events.c:
55:request_out_callback()) Skipped 12249742 previous similar messages
Aug 23 10:00:37 storage02 kernel: LustreError: 15328:0:(client.c:
962:ptlrpc_expire_one_request()) @@@ network error (sent at
1187859637, 0s ago) req@00000101155e4400 x1262654/t0 o104-
>@NET_0x200000a8e0618_UUID:15 lens 232/128 ref 1 fl Rpc:/0/0 rc 0/-22
Aug 23 10:00:37 storage02 kernel: LustreError: 15328:0:(client.c:
962:ptlrpc_expire_one_request()) Skipped 9053263 previous similar
messages
Aug 23 10:01:12 storage02 kernel: LustreError: 0:0:(ldlm_lockd.c:
214:waiting_locks_callback()) ### lock callback timer expired:
evicting client b4f65a5f-89e5-5b3b-
e3b4-0d2cce26a2d0@NET_0x200000a8e0221_UUID nid 10.142.2.33@tcp ns:
filter-home-md-OST0001_UUID lock: 000001009d736500/0xac36b10d5a5d77f8
lrc: 3/0,0 mode: PW/PW res: 1673624/0 rrc: 3 type: EXT [0-
>18446744073709551615] (req 0->18446744073709551615) flags: 20
remote: 0xcba7e12fdda9fc2f expref: 4 pid: 15564
Aug 23 10:02:02 storage02 kernel: Lustre: 0:0:(watchdog.c:130:lcw_cb
()) Watchdog triggered for pid 15618: it was inactive for 100s
Aug 23 10:02:02 storage02 kernel: Lustre: 0:0:(linux-debug.c:
168:libcfs_debug_dumpstack()) showing stack for process 15618
Aug 23 10:02:02 storage02 kernel: ll_ost_251 S
0000000000000000 0 15618 1 15619 15617 (L-TLB)
Aug 23 10:02:02 storage02 kernel: 000001011f11b6e8 0000000000000046
0000000000000013 00000000001d0e3f
Aug 23 10:02:02 storage02 kernel: 0000000000007530
0000000000000000 0000000000000000 00000000240fce52
Aug 23 10:02:02 storage02 kernel: 0000010114fae800
0000000000000b7b
Aug 23 10:02:02 storage02 kernel: Call Trace:<ffffffff8013f57c>
{__mod_timer+293} <ffffffff80320877>{schedule_timeout+367}
Aug 23 10:02:02 storage02 kernel: <ffffffff8013ffac>
{process_timeout+0} <ffffffffa03fe0e4>{:ptlrpc:ptlrpc_set_wait+932}
Aug 23 10:02:02 storage02 kernel: <ffffffffa0409db5>
{:ptlrpc:lustre_msg_set_flags+69}
Aug 23 10:02:02 storage02 kernel: <ffffffff8013369a>
{default_wake_function+0} <ffffffffa03fc320>
{:ptlrpc:ptlrpc_expired_set+0}
Aug 23 10:02:02 storage02 kernel: <ffffffffa03fa380>
{:ptlrpc:ptlrpc_interrupted_set+0}
Aug 23 10:02:02 storage02 kernel: <ffffffffa03ecc73>
{:ptlrpc:__ldlm_add_waiting_lock+147}
Aug 23 10:02:02 storage02 kernel: <ffffffffa03fc320>
{:ptlrpc:ptlrpc_expired_set+0} <ffffffffa03fa380>
{:ptlrpc:ptlrpc_interrupted_set+0}
Aug 23 10:02:02 storage02 kernel: <ffffffffa03d154b>
{:ptlrpc:ldlm_send_and_maybe_create_set+27}
Aug 23 10:02:02 storage02 kernel: <ffffffffa03d458a>
{:ptlrpc:ldlm_run_bl_ast_work+490}
Aug 23 10:02:02 storage02 kernel: <ffffffffa03e59c6>
{:ptlrpc:ldlm_process_extent_lock+1286}
Aug 23 10:02:02 storage02 kernel: <ffffffffa03d7ea5>
{:ptlrpc:ldlm_resource_get+501} <ffffffffa03d65b4>
{:ptlrpc:ldlm_lock_enqueue+1236}
Aug 23 10:02:02 storage02 kernel: <ffffffffa03d1005>
{:ptlrpc:ldlm_lock_create+1477} <ffffffffa03ee170>
{:ptlrpc:ldlm_server_glimpse_ast+0}
Aug 23 10:02:02 storage02 kernel: <ffffffffa03efaa2>
{:ptlrpc:ldlm_handle_enqueue+3426}
Aug 23 10:02:02 storage02 kernel: <ffffffffa03ed540>
{:ptlrpc:ldlm_server_blocking_ast+0}
Aug 23 10:02:02 storage02 kernel: <ffffffffa03edbd0>
{:ptlrpc:ldlm_server_completion_ast+0}
Aug 23 10:02:02 storage02 kernel: <ffffffffa059718d>
{:ost:ost_handle+18525} <ffffffff802df496>{ip_rcv+1046}
Aug 23 10:02:02 storage02 kernel: <ffffffff802c6081>
{netif_receive_skb+791} <ffffffffa0296dda>{:cxgb3:lro_flush_session+154}
Aug 23 10:02:02 storage02 kernel: <ffffffffa02feb58>
{:lnet:lnet_match_blocked_msg+920}
Aug 23 10:02:02 storage02 kernel: <ffffffffa04102ec>
{:ptlrpc:ptlrpc_server_handle_request+3036}
Aug 23 10:02:02 storage02 kernel: <ffffffffa02dbbbe>
{:libcfs:lcw_update_time+30} <ffffffff8013f57c>{__mod_timer+293}
Aug 23 10:02:03 storage02 kernel: <ffffffffa04128da>
{:ptlrpc:ptlrpc_main+2410} <ffffffff8013369a>{default_wake_function+0}
Aug 23 10:02:03 storage02 kernel: <ffffffffa0411000>
{:ptlrpc:ptlrpc_retry_rqbds+0} <ffffffffa0411000>
{:ptlrpc:ptlrpc_retry_rqbds+0}
Aug 23 10:02:03 storage02 kernel: <ffffffff80110de3>{child_rip
+8} <ffffffffa0411f70>{:ptlrpc:ptlrpc_main+0}
Aug 23 10:02:03 storage02 kernel: <ffffffff80110ddb>{child_rip+0}
Aug 23 10:02:03 storage02 kernel: LustreError: dumping log to /tmp/
lustre-log.1187859722.15618
Aug 23 10:02:28 storage02 kernel: Lustre: home-md-OST0001: haven't
heard from client ffc67dee-7cb3-b41a-dfb4-cb2144aae537 (at
10.142.8.36@tcp) in 227 seconds. I think it's dead, and I am evicting
it.
Aug 23 10:02:28 storage02 kernel: Lustre: Skipped 3 previous similar
messages
Aug 23 10:03:45 storage02 kernel: Lustre: home-md-OST0001: haven't
heard from client 61a48488-d9f3-f017-103e-0ae054cfface (at
10.142.8.13@tcp) in 227 seconds. I think it's dead, and I am evicting
it.
Aug 23 10:03:45 storage02 kernel: Lustre: Skipped 66 previous similar
messages
Aug 23 10:06:31 storage02 kernel: Lustre: home-md-OST0001: haven't
heard from client 933e5c84-cf62-54fc-1152-bb507d26d17f (at
10.142.10.101@tcp1) in 227 seconds. I think it's dead, and I am
evicting it.
Aug 23 10:06:31 storage02 kernel: Lustre: Skipped 39 previous similar
messages
Aug 23 10:06:41 storage02 kernel: LustreError: 15090:0:(events.c:
55:request_out_callback()) @@@ type 4, status -5
req@000001003104ac00 x1904191/t0 o104->@NET_0x200000a8e0221_UUID:15
lens 232/128 ref 10 fl Rpc:/0/0 rc 0/-22
Aug 23 10:06:41 storage02 kernel: LustreError: 15090:0:(events.c:
55:request_out_callback()) Skipped 12348046 previous similar messages
Aug 23 10:10:42 storage02 kernel: LustreError: 15328:0:(client.c:
962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187860222, 20s
ago) req@00000101155e4400 x1262654/t0 o104-
>@NET_0x200000a8e0618_UUID:15 lens 232/128 ref 4 fl Rpc:/0/0 rc 0/-22
Aug 23 10:10:42 storage02 kernel: LustreError: 15328:0:(client.c:
962:ptlrpc_expire_one_request()) Skipped 11863941 previous similar
messages
Aug 23 10:11:19 storage02 kernel: Lustre: 15605:0:(ldlm_lib.c:
502:target_handle_reconnect()) home-md-OST0001: home-md-mdtlov_UUID
reconnecting
Aug 23 10:11:19 storage02 kernel: Lustre: 15605:0:(ldlm_lib.c:
502:target_handle_reconnect()) Skipped 1 previous similar message
Aug 23 10:11:19 storage02 kernel: Lustre: home-md-OST0001: received
MDS connection from 10.143.245.3@tcp
Aug 23 10:11:19 storage02 kernel: Lustre: 15657:0:(recov_thread.c:
578:llog_repl_connect()) llcd 00000100315f4000:000001011a4f5280 not
empty
Aug 23 10:11:19 storage02 kernel: Lustre: 15547:0:(filter.c:
2640:filter_destroy_precreated()) home-md-OST0001: deleting orphan
objects from 2349111 to 2349155
Aug 23 10:11:44 storage02 kernel: LustreError: 166-1:
MGC10.143.245.3@tcp: Connection to service MGS via nid
10.143.245.3@tcp was lost; in progress operations using this service
will fail.
Aug 23 10:13:24 storage02 kernel: LustreError: 15145:0:(client.c:
520:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@000001002609a600
x1904353/t0 o400->MGS@MGC10.143.245.3@tcp_0:26 lens 128/128 ref 1 fl
Rpc:N/0/0 rc 0/0
Aug 23 10:13:24 storage02 kernel: Lustre: MGC10.143.245.3@tcp:
Reactivating import
Aug 23 10:13:24 storage02 kernel: Lustre: MGC10.143.245.3@tcp:
Connection restored to service MGS using nid 10.143.245.3@tcp.
Aug 23 10:15:10 storage02 kernel: Lustre: home-md-OST0001: haven't
heard from client 287bed05-3741-adba-04b1-df4388012e55 (at
10.142.4.65@tcp) in 227 seconds. I think it's dead, and I am evicting
it.
Aug 23 10:17:00 storage02 kernel: LustreError: 15091:0:(events.c:
55:request_out_callback()) @@@ type 4, status -5
req@00000101155e4400 x1262654/t0 o104->@NET_0x200000a8e0618_UUID:15
lens 232/128 ref 10 fl Rpc:/0/0 rc 0/-22
Aug 23 10:17:00 storage02 kernel: LustreError: 15091:0:(events.c:
55:request_out_callback()) Skipped 9755063 previous similar messages
Aug 23 10:20:59 storage02 kernel: LustreError: 15618:0:(client.c:
962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187860839, 20s
ago) req@000001003104ac00 x1904191/t0 o104-
>@NET_0x200000a8e0221_UUID:15 lens 232/128 ref 3 fl Rpc:/0/0 rc 0/-22
Aug 23 10:20:59 storage02 kernel: LustreError: 15618:0:(client.c:
962:ptlrpc_expire_one_request()) Skipped 11526274 previous similar
messages
Aug 23 10:22:40 storage02 kernel: Lustre: home-md-OST0001: haven't
heard from client 18f2f2a2-a580-c26c-7068-c80d12412606 (at
10.142.9.20@tcp) in 227 seconds. I think it's dead, and I am evicting
it.
Aug 23 10:22:40 storage02 kernel: Lustre: Skipped 1 previous similar
message
Aug 23 10:27:37 storage02 kernel: LustreError: 15090:0:(events.c:
55:request_out_callback()) @@@ type 4, status -5
req@000001003104ac00 x1904191/t0 o104->@NET_0x200000a8e0221_UUID:15
lens 232/128 ref 10 fl Rpc:/0/0 rc 0/-22
Aug 23 10:27:37 storage02 kernel: LustreError: 15090:0:(events.c:
55:request_out_callback()) Skipped 18522685 previous similar messages
Aug 23 10:28:37 storage02 kernel: Lustre: 15511:0:(ldlm_lib.c:
502:target_handle_reconnect()) home-md-OST0001: home-md-mdtlov_UUID
reconnecting
Aug 23 10:28:37 storage02 kernel: Lustre: home-md-OST0001: received
MDS connection from 10.143.245.3@tcp
Aug 23 10:28:37 storage02 kernel: Lustre: 15603:0:(recov_thread.c:
578:llog_repl_connect()) llcd 000001009c8d0000:000001011a4f5280 not
empty
Aug 23 10:28:37 storage02 kernel: Lustre: 15581:0:(filter.c:
2640:filter_destroy_precreated()) home-md-OST0001: deleting orphan
objects from 2349171 to 2349238
Aug 23 10:30:11 storage02 kernel: Lustre: home-md-OST0001: haven't
heard from client edd30842-5900-8b5e-b45b-a7097c41707f (at
10.142.6.53@tcp) in 227 seconds. I think it's dead, and I am evicting
it.
Aug 23 10:30:11 storage02 kernel: Lustre: Skipped 15 previous similar
messages
Aug 23 10:31:07 storage02 kernel: LustreError: 15328:0:(client.c:
962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187861447, 20s
ago) req@00000101155e4400 x1262654/t0 o104-
>@NET_0x200000a8e0618_UUID:15 lens 232/128 ref 3 fl Rpc:/0/0 rc 0/-22
Aug 23 10:31:07 storage02 kernel: LustreError: 15328:0:(client.c:
962:ptlrpc_expire_one_request()) Skipped 20459836 previous similar
messages
Aug 23 10:32:46 storage02 kernel: Lustre: home-md-OST0001: haven't
heard from client 60a05b8c-57e4-d62d-5144-298734d9fff5 (at
10.142.5.22@tcp) in 244 seconds. I think it's dead, and I am evicting
it.
Aug 23 10:32:46 storage02 kernel: Lustre: Skipped 15 previous similar
messages
Aug 23 10:32:46 storage02 kernel: Lustre: 15294:0:(ldlm_lib.c:
502:target_handle_reconnect()) home-md-OST0001: home-md-mdtlov_UUID
reconnecting
Aug 23 10:32:46 storage02 kernel: Lustre: home-md-OST0001: received
MDS connection from 10.143.245.3@tcp
Aug 23 10:32:46 storage02 kernel: Lustre: 15267:0:(filter.c:
2640:filter_destroy_precreated()) home-md-OST0001: deleting orphan
objects from 2349178 to 2349298
Aug 23 10:32:47 storage02 kernel: Lustre: 15915:0:(ldlm_lib.c:
502:target_handle_reconnect()) home-md-OST0001: e3047b56-eeac-2754-
ce90-64884f8f572d reconnecting
Aug 23 10:36:55 storage02 kernel: Lustre: Host 10.143.245.3 reset our
connection while we were sending data; it may have rebooted.
Aug 23 10:36:55 storage02 kernel: LustreError: 166-1:
MGC10.143.245.3@tcp: Connection to service MGS via nid
10.143.245.3@tcp was lost; in progress operations using this service
will fail.
Aug 23 10:37:12 storage02 kernel: Lustre: 15689:0:(ldlm_lib.c:
502:target_handle_reconnect()) home-md-OST0001: home-md-mdtlov_UUID
reconnecting
Aug 23 10:37:13 storage02 kernel: Lustre: home-md-OST0001: received
MDS connection from 10.143.245.3@tcp
Aug 23 10:37:45 storage02 kernel: LustreError: 15090:0:(events.c:
55:request_out_callback()) @@@ type 4, status -5
req@00000101155e4400 x1262654/t0 o104->@NET_0x200000a8e0618_UUID:15
lens 232/128 ref 10 fl Rpc:/0/0 rc 0/-22
Aug 23 10:37:45 storage02 kernel: LustreError: 15090:0:(events.c:
55:request_out_callback()) Skipped 20466448 previous similar messages
Aug 23 10:38:35 storage02 kernel: Lustre: MGC10.143.245.3@tcp:
Reactivating import
Aug 23 10:38:35 storage02 kernel: Lustre: MGC10.143.245.3@tcp:
Connection restored to service MGS using nid 10.143.245.3@tcp.
Aug 23 10:41:04 storage02 kernel: Lustre: 15666:0:(ldlm_lib.c:
502:target_handle_reconnect()) home-md-OST0001: home-md-mdtlov_UUID
reconnecting
Aug 23 10:41:04 storage02 kernel: Lustre: home-md-OST0001: received
MDS connection from 10.143.245.3@tcp
Aug 23 10:41:04 storage02 kernel: Lustre: 15555:0:(recov_thread.c:
578:llog_repl_connect()) llcd 000001002c1df000:000001011a4f5280 not
empty
Aug 23 10:41:04 storage02 kernel: Lustre: 15496:0:(filter.c:
2640:filter_destroy_precreated()) home-md-OST0001: deleting orphan
objects from 2349181 to 2349433
Aug 23 10:41:24 storage02 kernel: LustreError: 15618:0:(client.c:
962:ptlrpc_expire_one_request()) @@@ timeout (sent at 1187862064, 20s
ago) req@000001003104ac00 x1904191/t0 o104-
>@NET_0x200000a8e0221_UUID:15 lens 232/128 ref 2 fl Rpc:/0/0 rc 0/-22
Aug 23 10:41:24 storage02 kernel: LustreError: 15618:0:(client.c:
962:ptlrpc_expire_one_request()) Skipped 20442611 previous similar
messages
Aug 23 10:42:20 storage02 kernel: Lustre: home-md-OST0001: haven't
heard from client 942f63cc-c058-fc48-7c8d-8d6d3042059d (at
10.142.2.49@tcp) in 255 seconds. I think it's dead, and I am evicting
it.
------------------------------------------------------------------------
Please tell us what could be causing these errors. It is essential for
us to have Lustre running stably.
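In the meantime we are considering raising the obd timeout on both the
servers and the clients, since many of the messages above report 100s
timeouts and evictions after ~227s. The sketch below is only what we
would try; we are not sure whether /proc/sys/lustre/timeout or lctl
conf_param is the recommended method on 1.6.1, and the value 300 is just
a guess, so please correct us if there is a better way:
-------raising the Lustre timeout (sketch, please confirm)-------
# Temporary change, run on every server and client (lost on reboot).
# The value 300 is only a guess on our part.
echo 300 > /proc/sys/lustre/timeout

# Persistent change for the whole filesystem, run once on the MGS,
# assuming conf_param accepts sys.timeout on this version.
lctl conf_param home-md.sys.timeout=300
-----------------------------------------------------------------
We would then verify the new value on a client with
cat /proc/sys/lustre/timeout before rerunning the jobs.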
Regards
Wojciech Turek
Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: wjt27@cam.ac.uk