Evan,
Have you logged this with HP support? On later versions of SFS we've
got some extra Lustre fixes for client-eviction bugs, plus recommendations
on increasing the timeout, which may avoid the client eviction.
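In the meantime, the usual workaround is to raise the obd timeout on both
clients and servers, e.g. (assuming your 1.4.x build exposes it under
/proc/sys/lustre/timeout, and 300s is just an illustrative value -- please
double-check the path and the recommended value with HP support first):

    cat /proc/sys/lustre/timeout        # current value, in seconds
    echo 300 > /proc/sys/lustre/timeout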
thanks
-therese
-----Original Message-----
From: lustre-devel-bounces@clusterfs.com
[mailto:lustre-devel-bounces@clusterfs.com] On Behalf Of Felix, Evan J
Sent: 03 January 2007 21:09
To: lustre-devel@clusterfs.com; Unitt, Chris
Subject: [Lustre-devel] Lustre 1.4.2 problem
Guys, we are seeing a problem on one of our larger clusters that
is still running 1.4.2 from HP. We can reliably reproduce an EIO problem
with 32 clients; we tried with 8, but it did not happen.
Our config:
1K IA64 nodes, Quadrics 4 interconnect.
1 MDS, 32 OSSs, 64 OSTs in active-active mode.
We see this on one or more clients (dmesg):
LustreError: Connection to service pair12_Xdg1 via nid 0:953 was lost; in progress operations using this service will wait for recovery to complete.
Lustre: 11646:0:(import.c:136:ptlrpc_set_import_discon()) OSC_m568_pair12_Xdg1_MNT_client: connection lost to pair12_Xdg1_UUID@NID_953_UUID
LustreError: 11646:0:(ldlm_request.c:69:ldlm_expired_completion_wait()) ### lock timed out, entering recovery for pair12_Xdg1_UUID@NID_953_UUID ns: OSC_m568_pair12_Xdg1_MNT_client lock: e000000144d2fe80/0xbf806ddd959ef0c3 lrc: 4/1,0 mode: --/PR res: 16436509/0 rrc: 3 type: EXT [45088768->47185919] (req 45088768->47185919) flags: 0 remote: 0xbb88faef1d525d0e expref: -99 pid: 11646
Lustre: 1218:0:(import.c:308:import_select_connection()) OSC_m568_pair12_Xdg1_MNT_client: Using connection NID_953_UUID
LustreError: This client was evicted by pair12_Xdg1; in progress operations using this service will be reattempted.
LustreError: 11662:0:(ldlm_resource.c:358:ldlm_namespace_cleanup()) Namespace OSC_m568_pair12_Xdg1_MNT_client resource refcount 2 after lock cleanup
Lustre: Connection restored to service pair12_Xdg1 using nid 0:953.
LustreError: 11646:0:(lov_request.c:166:lov_update_enqueue_set()) error: enqueue objid 0x5cb456c subobj 0xfacd1d on OST idx 12: rc = -5
Lustre: 11662:0:(import.c:687:ptlrpc_import_recovery_state_machine()) OSC_m568_pair12_Xdg1_MNT_client: connection restored to pair12_Xdg1_UUID@NID_953_UUID
And one message on the OST:
LustreError: 1557:0:(ldlm_lockd.c:198:waiting_locks_callback()) ### lock callback timer expired: evicting client 5df1a_lov_7312679867@NET_0x238_UUID nid 0:568 ns: filter-pair12_Xdg1_UUID lock: e000000122b96380/0xbb88faef1d4ed1d1 lrc: 1/0,0 mode: PW/PW res: 16436509/0 rrc: 60 type: EXT [44040192->45391871] (req 44040192->44048383) flags: 20 remote: 0xbf806ddd9542580e expref: 13 pid: 1550
The user application, as I understand it:
It has two phases: first it generates a large sparse file; then,
after all the nodes have filled in the portions of the file they are
responsible for, they 'compress' the file by moving the chunks that are
spread out everywhere into a new non-sparse file (in parallel). It is
during this second phase that we are seeing the EIO errors. The
application has been modified to retry the write I/O when the buffer
write fails, and normally it succeeds the second time.
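For reference, the retry is roughly of this shape (a minimal sketch, not
the actual application code; the function name, the use of pwrite(), and
the single-retry policy are my assumptions):

    /* Retry a positional write once if it fails with EIO, which is what
     * we see right after the client has been evicted and its lock dropped. */
    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    static ssize_t pwrite_retry(int fd, const void *buf, size_t len, off_t off)
    {
            ssize_t rc = pwrite(fd, buf, len, off);
            if (rc < 0 && errno == EIO)
                    rc = pwrite(fd, buf, len, off); /* second attempt normally succeeds */
            return rc;
    }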
Any thoughts on why this would happen?
It seems that because the client gets evicted, its lock is
invalid and an EIO is returned to the user, but it then just re-acquires a
lock for the succeeding I/O. Also, I don't have an explanation for why the
client gets evicted. There is a high I/O load during this time, but
it only seems to be using about 2/3 of the available peak I/O.
Evan