Kaizaad,

As this is an SFS system, have you raised this with HP support? You appear to be on SFS V2.1-0 with no patches. The description appears to match an ldlm problem in this release which is resolved by a Lustre patch supplied in SFS V2.1-1.

regards,
Therese (HP SFS support)

-----Original Message-----
From: lustre-discuss-bounces@clusterfs.com [mailto:lustre-discuss-bounces@clusterfs.com] On Behalf Of Kaizaad Bilimorya
Sent: 07 February 2007 17:00
To: lustre-discuss@clusterfs.com
Subject: [Lustre-discuss] lock_page & wait_on_page_bit

Hello,

I am not sure how to further debug this problem. Here is the situation.

Feb 6 15:58:00 nar2 kernel: LustreError: 2939:0:(client.c:442:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -107 req@00000100b5cd1a00 x3573956/t0 o400->nar-sfs-ost103_UUID@NID_3712500356_UUID:28 lens 64/64 ref 1 fl Rpc:RN/0/0 rc 0/-107
Feb 6 15:58:00 nar2 kernel: LustreError: Connection to service nar-sfs-ost103 via nid 0:3712500356 was lost; in progress operations using this service will wait for recovery to complete.
Feb 6 15:58:00 nar2 kernel: Lustre: 2939:0:(import.c:139:ptlrpc_set_import_discon()) OSC_nar2_nar-sfs-ost103_MNT_client_gm: connection lost to nar-sfs-ost103_UUID@NID_3712500356_UUID
Feb 6 15:58:00 nar2 kernel: Lustre: 2939:0:(import.c:288:import_select_connection()) OSC_nar2_nar-sfs-ost103_MNT_client_gm: Using connection NID_3712497273_UUID
Feb 6 15:58:00 nar2 kernel: Lustre: 2939:0:(import.c:288:import_select_connection()) skipped 2 similar messages (ending 354162.872 seconds ago)
Feb 6 15:58:11 nar2 kernel: LustreError: This client was evicted by nar-sfs-ost103; in progress operations using this service will be reattempted.
Feb 6 15:58:11 nar2 kernel: LustreError: 5652:0:(ldlm_resource.c:361:ldlm_namespace_cleanup()) Namespace OSC_nar2_nar-sfs-ost103_MNT_client_gm resource refcount 4 after lock cleanup
Feb 6 15:58:11 nar2 kernel: LustreError: 5646:0:(llite_mmap.c:208:ll_tree_unlock()) couldn't unlock -5
Feb 6 15:58:11 nar2 kernel: Lustre: Connection restored to service nar-sfs-ost103 using nid 0:3712500356.
Feb 6 15:58:11 nar2 kernel: Lustre: 5652:0:(import.c:692:ptlrpc_import_recovery_state_machine()) OSC_nar2_nar-sfs-ost103_MNT_client_gm: connection restored to nar-sfs-ost103_UUID@NID_3712500356_UUID

[root@nar2 ~]# ps -lfu dgerbasi
F S UID      PID  PPID  C PRI NI ADDR SZ   WCHAN  STIME TTY      TIME CMD
5 S dgerbasi 5642 5639  0  76  0 -    8211 -      Feb06 ?    00:01:21 slurmd: [164600.1]
0 D dgerbasi 5643 5642  0  76  0 -    6473 lock_p Feb06 ?    00:00:00 /nar_sfs/dgerbasi/programs/siesta/water/./siesta
0 D dgerbasi 5644 5642  0  76  0 -    6473 wait_o Feb06 ?    00:00:00 /nar_sfs/dgerbasi/programs/siesta/water/./siesta
0 D dgerbasi 5645 5642  0  76  0 -    6473 lock_p Feb06 ?    00:00:00 /nar_sfs/dgerbasi/programs/siesta/water/./siesta
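One way to confirm exactly where those D-state siesta processes are blocked, since ps truncates the WCHAN column to "lock_p"/"wait_o", is to read the wait channel from /proc and dump the blocked-task stacks with sysrq-t. Roughly, assuming sysrq is available on the node (the PID below is just the first stuck process from the listing above):

[root@nar2 ~]# cat /proc/5643/wchan              # full symbol the task is sleeping in, e.g. lock_page
[root@nar2 ~]# echo 1 > /proc/sys/kernel/sysrq   # make sure sysrq is enabled
[root@nar2 ~]# echo t > /proc/sysrq-trigger      # dump all task states and stacks to the kernel log
[root@nar2 ~]# dmesg | grep -A 20 siesta         # pull out the siesta stack traces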
There is a file that seems to be "locked" on the problem node:

[root@nar2 water]# ls -al O.POT.CONF
-rw-r--r--  1 dgerbasi twoo 62275 Feb 6 15:58 O.POT.CONF

[root@nar2 water]# stat O.POT.CONF
  File: `O.POT.CONF'
  Size: 62275        Blocks: 128        IO Block: 2097152   regular file
Device: f908b518h/-116869864d   Inode: 1441916   Links: 1
Access: (0644/-rw-r--r--)  Uid: (130408/dgerbasi)   Gid: (130023/    twoo)
Access: 2007-02-06 15:57:05.507257800 -0500
Modify: 2007-02-06 15:58:11.711026503 -0500
Change: 2007-02-06 15:58:11.711026503 -0500

[root@nar2 water]# file O.POT.CONF
...hangs

while on another node, it is fine but an older version:

[root@nar320 water]# ls -al O.POT.CONF
-rw-r--r--  1 dgerbasi twoo 62275 Feb 6 15:57 O.POT.CONF

[root@nar320 water]# stat O.POT.CONF
  File: `O.POT.CONF'
  Size: 62275        Blocks: 128        IO Block: 2097152   regular file
Device: f908b518h/-116869864d   Inode: 1441916   Links: 1
Access: (0644/-rw-r--r--)  Uid: (130408/dgerbasi)   Gid: (130023/    twoo)
Access: 2007-02-07 10:50:21.299914088 -0500
Modify: 2007-02-06 15:57:05.000000000 -0500
Change: 2007-02-06 15:57:05.000000000 -0500

[root@nar320 water]# file O.POT.CONF
O.POT.CONF: ASCII text

So it seems the problem node (nar2) lost connectivity to the ost103 service, was evicted, and then reconnected ("lfs check servers" now shows active). It is now waiting on some type of page lock to be released before it can write the file out to disk ("lfs getstripe O.POT.CONF" confirmed the file is on ost103).

Any advice or pointers in the right direction would be appreciated. We see this pop up every now and then on our nodes with a variety of user codes.

thanks
-k

This is an HP SFS system (based on Lustre 1.4.2)

client version:
[root@nar320 water]# cat /proc/fs/lustre/version
1.4.2-20051219152732-CHANGED-.bld.PER_RC3_2.xc.src.lustre........obj.x86_64.kernel-2.6.9.linux-2.6.9-2.6.9-22.7hp.XCsmp

server version:
V2.1-0 (build nlH8hp, 2005-12-22) (filesystem 1.4.2)

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@clusterfs.com
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss