Kaizaad,
As this is an SFS system, have you raised this with HP support? You
appear to be on SFS V2.1-0 with no patches. The description appears
to match an ldlm problem in this release, which is resolved by a Lustre
patch supplied in SFS V2.1-1.
regards,
Therese
(HP SFS support)
-----Original Message-----
From: lustre-discuss-bounces@clusterfs.com
[mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of Kaizaad
Bilimorya
Sent: 07 February 2007 17:00
To: lustre-discuss@clusterfs.com
Subject: [Lustre-discuss] lock_page & wait_on_page_bit
Hello,
I am not sure how to debug this problem further. Here is the
situation:
Feb 6 15:58:00 nar2 kernel: LustreError: 2939:0:(client.c:442:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -107 req@00000100b5cd1a00 x3573956/t0 o400->nar-sfs-ost103_UUID@NID_3712500356_UUID:28 lens 64/64 ref 1 fl Rpc:RN/0/0 rc 0/-107
Feb 6 15:58:00 nar2 kernel: LustreError: Connection to service nar-sfs-ost103 via nid 0:3712500356 was lost; in progress operations using this service will wait for recovery to complete.
Feb 6 15:58:00 nar2 kernel: Lustre: 2939:0:(import.c:139:ptlrpc_set_import_discon()) OSC_nar2_nar-sfs-ost103_MNT_client_gm: connection lost to nar-sfs-ost103_UUID@NID_3712500356_UUID
Feb 6 15:58:00 nar2 kernel: Lustre: 2939:0:(import.c:288:import_select_connection()) OSC_nar2_nar-sfs-ost103_MNT_client_gm: Using connection NID_3712497273_UUID
Feb 6 15:58:00 nar2 kernel: Lustre: 2939:0:(import.c:288:import_select_connection()) skipped 2 similar messages (ending 354162.872 seconds ago)
Feb 6 15:58:11 nar2 kernel: LustreError: This client was evicted by nar-sfs-ost103; in progress operations using this service will be reattempted.
Feb 6 15:58:11 nar2 kernel: LustreError: 5652:0:(ldlm_resource.c:361:ldlm_namespace_cleanup()) Namespace OSC_nar2_nar-sfs-ost103_MNT_client_gm resource refcount 4 after lock cleanup
Feb 6 15:58:11 nar2 kernel: LustreError: 5646:0:(llite_mmap.c:208:ll_tree_unlock()) couldn't unlock -5
Feb 6 15:58:11 nar2 kernel: Lustre: Connection restored to service nar-sfs-ost103 using nid 0:3712500356.
Feb 6 15:58:11 nar2 kernel: Lustre: 5652:0:(import.c:692:ptlrpc_import_recovery_state_machine()) OSC_nar2_nar-sfs-ost103_MNT_client_gm: connection restored to nar-sfs-ost103_UUID@NID_3712500356_UUID
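(These entries come straight from the client syslog; a minimal way to pull
out just the eviction/reconnection events, assuming the default
/var/log/messages location:)
[root@nar2 ~]# grep -E 'evicted|Connection (to service|restored)' /var/log/messages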
[root@nar2 ~]# ps -lfu dgerbasi
F S UID       PID  PPID  C PRI NI ADDR SZ    WCHAN  STIME TTY  TIME     CMD
5 S dgerbasi 5642  5639  0  76  0 -    8211  -      Feb06 ?    00:01:21 slurmd: [164600.1]
0 D dgerbasi 5643  5642  0  76  0 -    6473  lock_p Feb06 ?    00:00:00 /nar_sfs/dgerbasi/programs/siesta/water/./siesta
0 D dgerbasi 5644  5642  0  76  0 -    6473  wait_o Feb06 ?    00:00:00 /nar_sfs/dgerbasi/programs/siesta/water/./siesta
0 D dgerbasi 5645  5642  0  76  0 -    6473  lock_p Feb06 ?    00:00:00 /nar_sfs/dgerbasi/programs/siesta/water/./siesta
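(ps truncates the WCHAN column; the full wait channels behind "lock_p" and
"wait_o" are lock_page and wait_on_page_bit. A quick sketch of getting the
untruncated names, using the PIDs from the listing above:)
[root@nar2 ~]# ps -o pid,stat,wchan:30 -p 5643,5644,5645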
There is a file that seems to be "locked" on the problem node:
[root@nar2 water]# ls -al O.POT.CONF
-rw-r--r-- 1 dgerbasi twoo 62275 Feb 6 15:58 O.POT.CONF
[root@nar2 water]# stat O.POT.CONF
  File: `O.POT.CONF'
  Size: 62275          Blocks: 128        IO Block: 2097152  regular file
Device: f908b518h/-116869864d  Inode: 1441916  Links: 1
Access: (0644/-rw-r--r--)  Uid: (130408/dgerbasi)  Gid: (130023/twoo)
Access: 2007-02-06 15:57:05.507257800 -0500
Modify: 2007-02-06 15:58:11.711026503 -0500
Change: 2007-02-06 15:58:11.711026503 -0500
[root@nar2 water]# file O.POT.CONF
...hangs
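(The hung file(1) command drops into the same uninterruptible sleep; a
sketch of confirming that from another shell, assuming it is the newest
process named "file" on the node:)
[root@nar2 ~]# cat /proc/$(pgrep -n file)/wchan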
While on another node, the file is accessible, but it is an older version:
[root@nar320 water]# ls -al O.POT.CONF
-rw-r--r-- 1 dgerbasi twoo 62275 Feb 6 15:57 O.POT.CONF
[root@nar320 water]# stat O.POT.CONF
  File: `O.POT.CONF'
  Size: 62275          Blocks: 128        IO Block: 2097152  regular file
Device: f908b518h/-116869864d  Inode: 1441916  Links: 1
Access: (0644/-rw-r--r--)  Uid: (130408/dgerbasi)  Gid: (130023/twoo)
Access: 2007-02-07 10:50:21.299914088 -0500
Modify: 2007-02-06 15:57:05.000000000 -0500
Change: 2007-02-06 15:57:05.000000000 -0500
[root@nar320 water]# file O.POT.CONF
O.POT.CONF: ASCII text
So it seems the problem node (nar2) lost connectivity to the ost103
service, was evicted, and then reconnected ("lfs check servers" now shows
it as active). It is now waiting on some kind of page lock to be released
before it can flush the file to disk ("lfs getstripe O.POT.CONF" confirms
the file is on ost103).
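(For reference, the two checks mentioned above; exact output varies with
the Lustre version, so it is omitted here:)
[root@nar2 ~]# lfs check servers
[root@nar2 water]# lfs getstripe O.POT.CONF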
Any advice or pointers in the right direction would be appreciated. We see
this pop up every now and then on our nodes with a variety of user codes.
thanks
-k
This is an HP SFS system (based on Lustre 1.4.2).
client version:
[root@nar320 water]# cat /proc/fs/lustre/version
1.4.2-20051219152732-CHANGED-.bld.PER_RC3_2.xc.src.lustre........obj.x86_64.kernel-2.6.9.linux-2.6.9-2.6.9-22.7hp.XCsmp
server version:
V2.1-0 (build nlH8hp, 2005-12-22) (filesystem 1.4.2)
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss