Hello,
Occasionally when we put a client, typically a head node, under very
heavy load, it freezes all operations on the Lustre mount and requires a
hard reboot before the mount is usable again. The symptoms look similar
to the statahead problem observed by others, but I was under the
impression that this wouldn't be an issue in 1.6.5.1, the version that
we're running. On the client, the messages in the log file are:
Aug 19 12:42:47 herologin1 kernel: LustreError: 11-0: an error occurred
while communicating with 10.242.42.204@tcp. The mds_statfs operation
failed with -107
Aug 19 12:42:47 herologin1 kernel: Lustre:
circelfs-MDT0000-mdc-ffff81021eabdc00: Connection to service
circelfs-MDT0000 via nid 10.242.42.204@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Aug 19 12:42:47 herologin1 kernel: LustreError: 167-0: This client was
evicted by circelfs-MDT0000; in progress operations using this service
will fail.
Aug 19 12:42:47 herologin1 kernel: LustreError:
7067:0:(llite_lib.c:1549:ll_statfs_internal()) mdc_statfs fails: rc = -5
while on the MGS/MDS the messages are:
Aug 19 12:41:11 circe1 kernel: Lustre: MGS: haven't heard from client
1be0f382-ff65-f231-d348-9d2523654fbb (at 10.242.40.14@tcp) in 1127
seconds. I think it's dead, and I am evicting it.
Aug 19 12:41:11 circe1 kernel: Lustre: Skipped 2 previous similar messages
Aug 19 12:41:50 circe1 kernel: Lustre: circelfs-OST001f: haven't heard
from client b7bbe482-a8c5-0d21-4f44-713aa2aa4f81 (at 10.242.40.14@tcp)
in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:41:51 circe1 kernel: Lustre: circelfs-OST0019: haven't heard
from client b7bbe482-a8c5-0d21-4f44-713aa2aa4f81 (at 10.242.40.14@tcp)
in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:41:52 circe1 kernel: Lustre: circelfs-OST001a: haven't heard
from client b7bbe482-a8c5-0d21-4f44-713aa2aa4f81 (at 10.242.40.14@tcp)
in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:42:15 circe1 kernel: Lustre: circelfs-MDT0000: haven't heard
from client b7bbe482-a8c5-0d21-4f44-713aa2aa4f81 (at 10.242.40.14@tcp)
in 1127 seconds. I think it's dead, and I am evicting it.
Aug 19 12:42:15 circe1 kernel: Lustre: Skipped 4 previous similar messages
Aug 19 12:42:47 circe1 kernel: LustreError:
7735:0:(handler.c:1515:mds_handle()) operation 41 on unconnected MDS
from 12345-10.242.40.14@tcp
Aug 19 12:42:47 circe1 kernel: LustreError:
7735:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error
(-107) req@ffff810038817a00 x12738961/t0 o41-><?>@<?>:0/0 lens 128/0
e 0 to 0 dl 1219164667 ref 1 fl Interpret:/0/0 rc -107/0
Aug 19 12:42:47 circe1 kernel: LustreError:
7735:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 41 previous
similar messages
(In the logs above, 10.242.40.14 = herologin1.) Should I try the
echo 0 > /proc/fs/lustre/llite/*/statahead_max
solution that fixed the statahead problem?
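If that setting is worth trying, here is a minimal sketch of how it might be
applied when more than one Lustre filesystem is mounted on the client. In bash,
a redirection whose glob matches several files fails as an ambiguous redirect,
so the sketch loops over the matches instead. The function name
set_statahead_max and its optional second argument (a base directory, used only
to try the function outside /proc) are hypothetical; the only path taken from
the above is /proc/fs/lustre/llite/*/statahead_max.

```shell
# Hypothetical helper (not part of Lustre): write a value to the
# statahead_max file of every llite mount found under the base directory.
set_statahead_max() {
    value="$1"
    base="${2:-/proc/fs/lustre/llite}"   # default is the path quoted above
    for f in "$base"/*/statahead_max; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        # Record the previous value so it can be restored by hand later.
        printf 'previous %s: %s\n' "$f" "$(cat "$f")"
        echo "$value" > "$f"
    done
}

# Example call, run as root on the client:
#   set_statahead_max 0
```

Printing the previous value first is only a convenience so that the change can
be reverted if it makes no difference.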
Many thanks,
Chris