Evan,
Have you logged this with HP support? On later versions of SFS we've
got some extra Lustre fixes for client-eviction bugs, plus recommendations
on increasing the timeout, which may avoid the client eviction.
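In the meantime, the usual workaround is to raise the obd timeout on both
clients and servers, e.g. (assuming your 1.4.x build exposes it under
/proc/sys/lustre/timeout, and 300s is just an illustrative value -- please
double-check the path and the recommended value with HP support first):

    cat /proc/sys/lustre/timeout        # current value, in seconds
    echo 300 > /proc/sys/lustre/timeout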
thanks
-therese
-----Original Message-----
From: lustre-devel-bounces@clusterfs.com
[mailto:lustre-devel-bounces@clusterfs.com] On Behalf Of Felix, Evan J
Sent: 03 January 2007 21:09
To: lustre-devel@clusterfs.com; Unitt, Chris
Subject: [Lustre-devel] Lustre 1.4.2 problem
Guys, we are seeing a problem on one of our larger clusters that
is still running 1.4.2 from HP. We can reliably reproduce an EIO problem
with 32 clients; we tried with 8, but it did not happen.
Our config:
1K IA64 nodes, Quadrics 4 interconnect.
1 MDS, 32 OSSs, 64 OSTs in active-active mode.
We see this on one or more clients (dmesg):
LustreError: Connection to service pair12_Xdg1 via nid 0:953 was lost; in progress operations using this service will wait for recovery to complete.
Lustre: 11646:0:(import.c:136:ptlrpc_set_import_discon()) OSC_m568_pair12_Xdg1_MNT_client: connection lost to pair12_Xdg1_UUID@NID_953_UUID
LustreError: 11646:0:(ldlm_request.c:69:ldlm_expired_completion_wait()) ### lock timed out, entering recovery for pair12_Xdg1_UUID@NID_953_UUID ns: OSC_m568_pair12_Xdg1_MNT_client lock: e000000144d2fe80/0xbf806ddd959ef0c3 lrc: 4/1,0 mode: --/PR res: 16436509/0 rrc: 3 type: EXT [45088768->47185919] (req 45088768->47185919) flags: 0 remote: 0xbb88faef1d525d0e expref: -99 pid: 11646
Lustre: 1218:0:(import.c:308:import_select_connection()) OSC_m568_pair12_Xdg1_MNT_client: Using connection NID_953_UUID
LustreError: This client was evicted by pair12_Xdg1; in progress operations using this service will be reattempted.
LustreError: 11662:0:(ldlm_resource.c:358:ldlm_namespace_cleanup()) Namespace OSC_m568_pair12_Xdg1_MNT_client resource refcount 2 after lock cleanup
Lustre: Connection restored to service pair12_Xdg1 using nid 0:953.
LustreError: 11646:0:(lov_request.c:166:lov_update_enqueue_set()) error: enqueue objid 0x5cb456c subobj 0xfacd1d on OST idx 12: rc = -5
Lustre: 11662:0:(import.c:687:ptlrpc_import_recovery_state_machine()) OSC_m568_pair12_Xdg1_MNT_client: connection restored to pair12_Xdg1_UUID@NID_953_UUID
And one message on the OST:
LustreError: 1557:0:(ldlm_lockd.c:198:waiting_locks_callback()) ### lock callback timer expired: evicting client 5df1a_lov_7312679867@NET_0x238_UUID nid 0:568 ns: filter-pair12_Xdg1_UUID lock: e000000122b96380/0xbb88faef1d4ed1d1 lrc: 1/0,0 mode: PW/PW res: 16436509/0 rrc: 60 type: EXT [44040192->45391871] (req 44040192->44048383) flags: 20 remote: 0xbf806ddd9542580e expref: 13 pid: 1550
The user application, as I understand it:
It has two phases: first it generates a large sparse file; then,
after all the nodes have filled in the portions of the file they are
responsible for, they 'compress' the file by moving the chunks that are
spread out everywhere into a new non-sparse file (in parallel). It is
during this second phase that we are seeing the EIO errors. The
application has been modified to retry the write I/O when the buffer
write fails, and normally it succeeds the second time.
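For reference, the retry is roughly of this shape (a minimal sketch, not
the actual application code; the function name, the use of pwrite(), and
the single-retry policy are my assumptions):

    /* Retry a positional write once if it fails with EIO, which is what
     * we see right after the client has been evicted and its lock dropped. */
    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    static ssize_t pwrite_retry(int fd, const void *buf, size_t len, off_t off)
    {
            ssize_t rc = pwrite(fd, buf, len, off);
            if (rc < 0 && errno == EIO)
                    rc = pwrite(fd, buf, len, off); /* second attempt normally succeeds */
            return rc;
    }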
Any thoughts on why this would happen?
It seems that because the client gets evicted, its lock is
invalid and an EIO is returned to the user, but it then just re-acquires a
lock for the succeeding I/O. Also, I don't have an explanation for why the
client gets evicted. There is a high I/O load during this time, but
it only seems to be using about 2/3 of the available peak I/O.
Evan