Sorry if this is a repeat, but I never saw it come across the list last week, and the archive does not contain messages for the current month.

Summary: after upgrade from 1.2.6 to 1.4.6, system hangs have made our Lustre file system almost unusable
Version: 1.4.6
Platform: i386
OS/Version: Novell SUSE Linux Enterprise Server 9 w/ SP2

First question: is anyone out there running Lustre on SLES9? It seems like most people are using a Red Hat variant.

Second plea: if you have any advice on how to diagnose the cause of system hangs with Lustre and Linux, I would appreciate it, even if you haven't had problems like the ones I am currently experiencing.

On to the problem description.

Our first Lustre volumes were set up with Novell SLES9 SP2 and the Lustre 1.2.6 components included on the SLES9 CD (lustre-lite-1.2.6.suse.2-0.3.i586.rpm). We were running with clients and servers on the same machines and things were working reasonably well, but recovery timeouts from down machines and a rarely encountered deadlock inspired me to upgrade to the latest release RPMs compiled for SLES9, 1.4.6.1.

So far 1.4.6.1 and 1.4.6 have been very unstable for us. Out of our cluster of 150 machines, we have been experiencing 5 to 7 system hangs per day. I have the LKCD tools installed, but as none of the machines have keyboards, I have not been able to get any system core dumps. Before a machine hangs, Ganglia often shows it running at a high load average (6 to 20 or so).

It is hard for me to pinpoint what may be causing the hang, because we are using a configuration with the Lustre client and Lustre server running on the same machines. We have 5 Lustre volumes, each with 30 IBM e326 dual Opteron 270 servers. 29 of the machines are OSTs and the 30th one is the MDS. They all use a second internal SATA drive for the OST/MDS.

When there is no Lustre traffic (but the filesystems are mounted), the systems do not hang.
Once we allow our computation programs to start using the Lustre file system, it can take minutes or hours for a system hang to occur. Rebooting gets the filesystem functional again until a different node fails. Very seldom does the same node fail twice in a row.

The only errors in the syslog seem to come from other clients that cannot access the hung OST. The hanging system itself logs no errors until it is rebooted and recovery starts.

Lustre components installed include:

# rpm -qa | grep lustre
kernel-bigsmp-2.6.5-7.244_lustre.1.4.6
lustre-1.4.6-2.6.5_7.244_lustre.1.4.6bigsmp
lustre-modules-1.4.6-2.6.5_7.244_lustre.1.4.6bigsmp
lustre-debuginfo-1.4.6-2.6.5_7.244_lustre.1.4.6bigsmp

or

# rpm -qa | grep lustre
lustre-modules-1.4.6.1-2.6.5_7.252_lustre.1.4.6.1bigsmp
lustre-1.4.6.1-2.6.5_7.252_lustre.1.4.6.1bigsmp
kernel-bigsmp-2.6.5-7.252_lustre.1.4.6.1
lustre-debuginfo-1.4.6.1-2.6.5_7.252_lustre.1.4.6.1bigsmp

I am also still using the same config.xml that I used for Lustre 1.2.6, and I am mounting the file systems with

lconf --node client /etc/lustre/config.xml

Here is a copy of the config.xml file for one Lustre volume:

<?xml version=''1.0'' encoding=''UTF-8''?> <lustre version=''2003070801''> <ldlm name=''ldlm'' uuid=''ldlm_UUID''/> <node uuid=''btbal3031_UUID'' name=''btbal3031''> <profile_ref uuidref=''PROFILE_btbal3031_UUID''/> <network uuid=''NET_btbal3031_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3031_tcp''> <nid>btbal3031</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3031_UUID'' name=''PROFILE_btbal3031''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3031_tcp_UUID''/> <mdsdev_ref uuidref=''D_btbal3031-mds1_btbal3031_UUID''/> </profile> <node uuid=''btbal3002_UUID'' name=''btbal3002''> <profile_ref uuidref=''PROFILE_btbal3002_UUID''/> <network uuid=''NET_btbal3002_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3002_tcp''> <nid>btbal3002</nid> <clusterid>0</clusterid> <port>988</port>
</network> </node> <profile uuid=''PROFILE_btbal3002_UUID'' name=''PROFILE_btbal3002''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3002_tcp_UUID''/> <osd_ref uuidref=''D_btbal3002-ost1_btbal3002_UUID''/> </profile> <node uuid=''btbal3003_UUID'' name=''btbal3003''> <profile_ref uuidref=''PROFILE_btbal3003_UUID''/> <network uuid=''NET_btbal3003_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3003_tcp''> <nid>btbal3003</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3003_UUID'' name=''PROFILE_btbal3003''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3003_tcp_UUID''/> <osd_ref uuidref=''D_btbal3003-ost1_btbal3003_UUID''/> </profile> <node uuid=''btbal3004_UUID'' name=''btbal3004''> <profile_ref uuidref=''PROFILE_btbal3004_UUID''/> <network uuid=''NET_btbal3004_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3004_tcp''> <nid>btbal3004</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3004_UUID'' name=''PROFILE_btbal3004''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3004_tcp_UUID''/> <osd_ref uuidref=''D_btbal3004-ost1_btbal3004_UUID''/> </profile> <node uuid=''btbal3005_UUID'' name=''btbal3005''> <profile_ref uuidref=''PROFILE_btbal3005_UUID''/> <network uuid=''NET_btbal3005_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3005_tcp''> <nid>btbal3005</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3005_UUID'' name=''PROFILE_btbal3005''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3005_tcp_UUID''/> <osd_ref uuidref=''D_btbal3005-ost1_btbal3005_UUID''/> </profile> <node uuid=''btbal3006_UUID'' name=''btbal3006''> <profile_ref uuidref=''PROFILE_btbal3006_UUID''/> <network uuid=''NET_btbal3006_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3006_tcp''> <nid>btbal3006</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile 
uuid=''PROFILE_btbal3006_UUID'' name=''PROFILE_btbal3006''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3006_tcp_UUID''/> <osd_ref uuidref=''D_btbal3006-ost1_btbal3006_UUID''/> </profile> <node uuid=''btbal3007_UUID'' name=''btbal3007''> <profile_ref uuidref=''PROFILE_btbal3007_UUID''/> <network uuid=''NET_btbal3007_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3007_tcp''> <nid>btbal3007</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3007_UUID'' name=''PROFILE_btbal3007''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3007_tcp_UUID''/> <osd_ref uuidref=''D_btbal3007-ost1_btbal3007_UUID''/> </profile> <node uuid=''btbal3008_UUID'' name=''btbal3008''> <profile_ref uuidref=''PROFILE_btbal3008_UUID''/> <network uuid=''NET_btbal3008_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3008_tcp''> <nid>btbal3008</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3008_UUID'' name=''PROFILE_btbal3008''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3008_tcp_UUID''/> <osd_ref uuidref=''D_btbal3008-ost1_btbal3008_UUID''/> </profile> <node uuid=''btbal3009_UUID'' name=''btbal3009''> <profile_ref uuidref=''PROFILE_btbal3009_UUID''/> <network uuid=''NET_btbal3009_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3009_tcp''> <nid>btbal3009</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3009_UUID'' name=''PROFILE_btbal3009''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3009_tcp_UUID''/> <osd_ref uuidref=''D_btbal3009-ost1_btbal3009_UUID''/> </profile> <node uuid=''btbal3010_UUID'' name=''btbal3010''> <profile_ref uuidref=''PROFILE_btbal3010_UUID''/> <network uuid=''NET_btbal3010_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3010_tcp''> <nid>btbal3010</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3010_UUID'' 
name=''PROFILE_btbal3010''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3010_tcp_UUID''/> <osd_ref uuidref=''D_btbal3010-ost1_btbal3010_UUID''/> </profile> <node uuid=''btbal3011_UUID'' name=''btbal3011''> <profile_ref uuidref=''PROFILE_btbal3011_UUID''/> <network uuid=''NET_btbal3011_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3011_tcp''> <nid>btbal3011</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3011_UUID'' name=''PROFILE_btbal3011''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3011_tcp_UUID''/> <osd_ref uuidref=''D_btbal3011-ost1_btbal3011_UUID''/> </profile> <node uuid=''btbal3012_UUID'' name=''btbal3012''> <profile_ref uuidref=''PROFILE_btbal3012_UUID''/> <network uuid=''NET_btbal3012_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3012_tcp''> <nid>btbal3012</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3012_UUID'' name=''PROFILE_btbal3012''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3012_tcp_UUID''/> <osd_ref uuidref=''D_btbal3012-ost1_btbal3012_UUID''/> </profile> <node uuid=''btbal3013_UUID'' name=''btbal3013''> <profile_ref uuidref=''PROFILE_btbal3013_UUID''/> <network uuid=''NET_btbal3013_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3013_tcp''> <nid>btbal3013</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3013_UUID'' name=''PROFILE_btbal3013''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3013_tcp_UUID''/> <osd_ref uuidref=''D_btbal3013-ost1_btbal3013_UUID''/> </profile> <node uuid=''btbal3014_UUID'' name=''btbal3014''> <profile_ref uuidref=''PROFILE_btbal3014_UUID''/> <network uuid=''NET_btbal3014_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3014_tcp''> <nid>btbal3014</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3014_UUID'' name=''PROFILE_btbal3014''> <ldlm_ref 
uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3014_tcp_UUID''/> <osd_ref uuidref=''D_btbal3014-ost1_btbal3014_UUID''/> </profile> <node uuid=''btbal3015_UUID'' name=''btbal3015''> <profile_ref uuidref=''PROFILE_btbal3015_UUID''/> <network uuid=''NET_btbal3015_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3015_tcp''> <nid>btbal3015</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3015_UUID'' name=''PROFILE_btbal3015''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3015_tcp_UUID''/> <osd_ref uuidref=''D_btbal3015-ost1_btbal3015_UUID''/> </profile> <node uuid=''btbal3016_UUID'' name=''btbal3016''> <profile_ref uuidref=''PROFILE_btbal3016_UUID''/> <network uuid=''NET_btbal3016_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3016_tcp''> <nid>btbal3016</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3016_UUID'' name=''PROFILE_btbal3016''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3016_tcp_UUID''/> <osd_ref uuidref=''D_btbal3016-ost1_btbal3016_UUID''/> </profile> <node uuid=''btbal3017_UUID'' name=''btbal3017''> <profile_ref uuidref=''PROFILE_btbal3017_UUID''/> <network uuid=''NET_btbal3017_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3017_tcp''> <nid>btbal3017</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3017_UUID'' name=''PROFILE_btbal3017''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3017_tcp_UUID''/> <osd_ref uuidref=''D_btbal3017-ost1_btbal3017_UUID''/> </profile> <node uuid=''btbal3018_UUID'' name=''btbal3018''> <profile_ref uuidref=''PROFILE_btbal3018_UUID''/> <network uuid=''NET_btbal3018_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3018_tcp''> <nid>btbal3018</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3018_UUID'' name=''PROFILE_btbal3018''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref 
uuidref=''NET_btbal3018_tcp_UUID''/> <osd_ref uuidref=''D_btbal3018-ost1_btbal3018_UUID''/> </profile> <node uuid=''btbal3019_UUID'' name=''btbal3019''> <profile_ref uuidref=''PROFILE_btbal3019_UUID''/> <network uuid=''NET_btbal3019_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3019_tcp''> <nid>btbal3019</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3019_UUID'' name=''PROFILE_btbal3019''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3019_tcp_UUID''/> <osd_ref uuidref=''D_btbal3019-ost1_btbal3019_UUID''/> </profile> <node uuid=''btbal3020_UUID'' name=''btbal3020''> <profile_ref uuidref=''PROFILE_btbal3020_UUID''/> <network uuid=''NET_btbal3020_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3020_tcp''> <nid>btbal3020</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3020_UUID'' name=''PROFILE_btbal3020''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3020_tcp_UUID''/> <osd_ref uuidref=''D_btbal3020-ost1_btbal3020_UUID''/> </profile> <node uuid=''btbal3021_UUID'' name=''btbal3021''> <profile_ref uuidref=''PROFILE_btbal3021_UUID''/> <network uuid=''NET_btbal3021_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3021_tcp''> <nid>btbal3021</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3021_UUID'' name=''PROFILE_btbal3021''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3021_tcp_UUID''/> <osd_ref uuidref=''D_btbal3021-ost1_btbal3021_UUID''/> </profile> <node uuid=''btbal3022_UUID'' name=''btbal3022''> <profile_ref uuidref=''PROFILE_btbal3022_UUID''/> <network uuid=''NET_btbal3022_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3022_tcp''> <nid>btbal3022</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3022_UUID'' name=''PROFILE_btbal3022''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3022_tcp_UUID''/> <osd_ref 
uuidref=''D_btbal3022-ost1_btbal3022_UUID''/> </profile> <node uuid=''btbal3023_UUID'' name=''btbal3023''> <profile_ref uuidref=''PROFILE_btbal3023_UUID''/> <network uuid=''NET_btbal3023_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3023_tcp''> <nid>btbal3023</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3023_UUID'' name=''PROFILE_btbal3023''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3023_tcp_UUID''/> <osd_ref uuidref=''D_btbal3023-ost1_btbal3023_UUID''/> </profile> <node uuid=''btbal3024_UUID'' name=''btbal3024''> <profile_ref uuidref=''PROFILE_btbal3024_UUID''/> <network uuid=''NET_btbal3024_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3024_tcp''> <nid>btbal3024</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3024_UUID'' name=''PROFILE_btbal3024''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3024_tcp_UUID''/> <osd_ref uuidref=''D_btbal3024-ost1_btbal3024_UUID''/> </profile> <node uuid=''btbal3025_UUID'' name=''btbal3025''> <profile_ref uuidref=''PROFILE_btbal3025_UUID''/> <network uuid=''NET_btbal3025_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3025_tcp''> <nid>btbal3025</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3025_UUID'' name=''PROFILE_btbal3025''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3025_tcp_UUID''/> <osd_ref uuidref=''D_btbal3025-ost1_btbal3025_UUID''/> </profile> <node uuid=''btbal3026_UUID'' name=''btbal3026''> <profile_ref uuidref=''PROFILE_btbal3026_UUID''/> <network uuid=''NET_btbal3026_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3026_tcp''> <nid>btbal3026</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3026_UUID'' name=''PROFILE_btbal3026''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3026_tcp_UUID''/> <osd_ref uuidref=''D_btbal3026-ost1_btbal3026_UUID''/> 
</profile> <node uuid=''btbal3027_UUID'' name=''btbal3027''> <profile_ref uuidref=''PROFILE_btbal3027_UUID''/> <network uuid=''NET_btbal3027_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3027_tcp''> <nid>btbal3027</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3027_UUID'' name=''PROFILE_btbal3027''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3027_tcp_UUID''/> <osd_ref uuidref=''D_btbal3027-ost1_btbal3027_UUID''/> </profile> <node uuid=''btbal3028_UUID'' name=''btbal3028''> <profile_ref uuidref=''PROFILE_btbal3028_UUID''/> <network uuid=''NET_btbal3028_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3028_tcp''> <nid>btbal3028</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3028_UUID'' name=''PROFILE_btbal3028''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3028_tcp_UUID''/> <osd_ref uuidref=''D_btbal3028-ost1_btbal3028_UUID''/> </profile> <node uuid=''btbal3029_UUID'' name=''btbal3029''> <profile_ref uuidref=''PROFILE_btbal3029_UUID''/> <network uuid=''NET_btbal3029_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3029_tcp''> <nid>btbal3029</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3029_UUID'' name=''PROFILE_btbal3029''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3029_tcp_UUID''/> <osd_ref uuidref=''D_btbal3029-ost1_btbal3029_UUID''/> </profile> <node uuid=''btbal3030_UUID'' name=''btbal3030''> <profile_ref uuidref=''PROFILE_btbal3030_UUID''/> <network uuid=''NET_btbal3030_tcp_UUID'' nettype=''tcp'' name=''NET_btbal3030_tcp''> <nid>btbal3030</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_btbal3030_UUID'' name=''PROFILE_btbal3030''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_btbal3030_tcp_UUID''/> <osd_ref uuidref=''D_btbal3030-ost1_btbal3030_UUID''/> </profile> <node uuid=''client_UUID'' name=''client''> 
<profile_ref uuidref=''PROFILE_client_UUID''/> <network uuid=''NET_client_tcp_UUID'' nettype=''tcp'' name=''NET_client_tcp''> <nid>*</nid> <clusterid>0</clusterid> <port>988</port> </network> </node> <profile uuid=''PROFILE_client_UUID'' name=''PROFILE_client''> <ldlm_ref uuidref=''ldlm_UUID''/> <network_ref uuidref=''NET_client_tcp_UUID''/> <mountpoint_ref uuidref=''MNT_client_UUID''/> </profile> <mds uuid=''btbal3031-mds1_UUID'' name=''btbal3031-mds1''> <active_ref uuidref=''D_btbal3031-mds1_btbal3031_UUID''/> <lovconfig_ref uuidref=''LVCFG_lov01_UUID''/> <filesystem_ref uuidref=''FS_fsname_UUID''/> </mds> <mdsdev uuid=''D_btbal3031-mds1_btbal3031_UUID'' name=''MDD_btbal3031-mds1_btbal3031''> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> <node_ref uuidref=''btbal3031_UUID''/> <target_ref uuidref=''btbal3031-mds1_UUID''/> </mdsdev> <lov stripesize=''16777216'' stripecount=''1'' stripepattern=''0'' uuid=''lov01_UUID'' name=''lov01''> <mds_ref uuidref=''btbal3031-mds1_UUID''/> <obd_ref uuidref=''btbal3002-ost1_UUID''/> <obd_ref uuidref=''btbal3003-ost1_UUID''/> <obd_ref uuidref=''btbal3004-ost1_UUID''/> <obd_ref uuidref=''btbal3005-ost1_UUID''/> <obd_ref uuidref=''btbal3006-ost1_UUID''/> <obd_ref uuidref=''btbal3007-ost1_UUID''/> <obd_ref uuidref=''btbal3008-ost1_UUID''/> <obd_ref uuidref=''btbal3009-ost1_UUID''/> <obd_ref uuidref=''btbal3010-ost1_UUID''/> <obd_ref uuidref=''btbal3011-ost1_UUID''/> <obd_ref uuidref=''btbal3012-ost1_UUID''/> <obd_ref uuidref=''btbal3013-ost1_UUID''/> <obd_ref uuidref=''btbal3014-ost1_UUID''/> <obd_ref uuidref=''btbal3015-ost1_UUID''/> <obd_ref uuidref=''btbal3016-ost1_UUID''/> <obd_ref uuidref=''btbal3017-ost1_UUID''/> <obd_ref uuidref=''btbal3018-ost1_UUID''/> <obd_ref uuidref=''btbal3019-ost1_UUID''/> <obd_ref uuidref=''btbal3020-ost1_UUID''/> <obd_ref uuidref=''btbal3021-ost1_UUID''/> <obd_ref 
uuidref=''btbal3022-ost1_UUID''/> <obd_ref uuidref=''btbal3023-ost1_UUID''/> <obd_ref uuidref=''btbal3024-ost1_UUID''/> <obd_ref uuidref=''btbal3025-ost1_UUID''/> <obd_ref uuidref=''btbal3026-ost1_UUID''/> <obd_ref uuidref=''btbal3027-ost1_UUID''/> <obd_ref uuidref=''btbal3028-ost1_UUID''/> <obd_ref uuidref=''btbal3029-ost1_UUID''/> <obd_ref uuidref=''btbal3030-ost1_UUID''/> </lov> <lovconfig uuid=''LVCFG_lov01_UUID'' name=''LVCFG_lov01''> <lov_ref uuidref=''lov01_UUID''/> </lovconfig> <ost uuid=''btbal3002-ost1_UUID'' name=''btbal3002-ost1''> <active_ref uuidref=''D_btbal3002-ost1_btbal3002_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3002-ost1_btbal3002_UUID'' name=''OSD_btbal3002-ost1_btbal3002''> <target_ref uuidref=''btbal3002-ost1_UUID''/> <node_ref uuidref=''btbal3002_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3003-ost1_UUID'' name=''btbal3003-ost1''> <active_ref uuidref=''D_btbal3003-ost1_btbal3003_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3003-ost1_btbal3003_UUID'' name=''OSD_btbal3003-ost1_btbal3003''> <target_ref uuidref=''btbal3003-ost1_UUID''/> <node_ref uuidref=''btbal3003_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3004-ost1_UUID'' name=''btbal3004-ost1''> <active_ref uuidref=''D_btbal3004-ost1_btbal3004_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3004-ost1_btbal3004_UUID'' name=''OSD_btbal3004-ost1_btbal3004''> <target_ref uuidref=''btbal3004-ost1_UUID''/> <node_ref uuidref=''btbal3004_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3005-ost1_UUID'' name=''btbal3005-ost1''> 
<active_ref uuidref=''D_btbal3005-ost1_btbal3005_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3005-ost1_btbal3005_UUID'' name=''OSD_btbal3005-ost1_btbal3005''> <target_ref uuidref=''btbal3005-ost1_UUID''/> <node_ref uuidref=''btbal3005_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3006-ost1_UUID'' name=''btbal3006-ost1''> <active_ref uuidref=''D_btbal3006-ost1_btbal3006_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3006-ost1_btbal3006_UUID'' name=''OSD_btbal3006-ost1_btbal3006''> <target_ref uuidref=''btbal3006-ost1_UUID''/> <node_ref uuidref=''btbal3006_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3007-ost1_UUID'' name=''btbal3007-ost1''> <active_ref uuidref=''D_btbal3007-ost1_btbal3007_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3007-ost1_btbal3007_UUID'' name=''OSD_btbal3007-ost1_btbal3007''> <target_ref uuidref=''btbal3007-ost1_UUID''/> <node_ref uuidref=''btbal3007_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3008-ost1_UUID'' name=''btbal3008-ost1''> <active_ref uuidref=''D_btbal3008-ost1_btbal3008_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3008-ost1_btbal3008_UUID'' name=''OSD_btbal3008-ost1_btbal3008''> <target_ref uuidref=''btbal3008-ost1_UUID''/> <node_ref uuidref=''btbal3008_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3009-ost1_UUID'' name=''btbal3009-ost1''> <active_ref uuidref=''D_btbal3009-ost1_btbal3009_UUID''/> </ost> <osd 
osdtype=''obdfilter'' uuid=''D_btbal3009-ost1_btbal3009_UUID'' name=''OSD_btbal3009-ost1_btbal3009''> <target_ref uuidref=''btbal3009-ost1_UUID''/> <node_ref uuidref=''btbal3009_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3010-ost1_UUID'' name=''btbal3010-ost1''> <active_ref uuidref=''D_btbal3010-ost1_btbal3010_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3010-ost1_btbal3010_UUID'' name=''OSD_btbal3010-ost1_btbal3010''> <target_ref uuidref=''btbal3010-ost1_UUID''/> <node_ref uuidref=''btbal3010_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3011-ost1_UUID'' name=''btbal3011-ost1''> <active_ref uuidref=''D_btbal3011-ost1_btbal3011_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3011-ost1_btbal3011_UUID'' name=''OSD_btbal3011-ost1_btbal3011''> <target_ref uuidref=''btbal3011-ost1_UUID''/> <node_ref uuidref=''btbal3011_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3012-ost1_UUID'' name=''btbal3012-ost1''> <active_ref uuidref=''D_btbal3012-ost1_btbal3012_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3012-ost1_btbal3012_UUID'' name=''OSD_btbal3012-ost1_btbal3012''> <target_ref uuidref=''btbal3012-ost1_UUID''/> <node_ref uuidref=''btbal3012_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3013-ost1_UUID'' name=''btbal3013-ost1''> <active_ref uuidref=''D_btbal3013-ost1_btbal3013_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3013-ost1_btbal3013_UUID'' 
name=''OSD_btbal3013-ost1_btbal3013''> <target_ref uuidref=''btbal3013-ost1_UUID''/> <node_ref uuidref=''btbal3013_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3014-ost1_UUID'' name=''btbal3014-ost1''> <active_ref uuidref=''D_btbal3014-ost1_btbal3014_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3014-ost1_btbal3014_UUID'' name=''OSD_btbal3014-ost1_btbal3014''> <target_ref uuidref=''btbal3014-ost1_UUID''/> <node_ref uuidref=''btbal3014_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3015-ost1_UUID'' name=''btbal3015-ost1''> <active_ref uuidref=''D_btbal3015-ost1_btbal3015_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3015-ost1_btbal3015_UUID'' name=''OSD_btbal3015-ost1_btbal3015''> <target_ref uuidref=''btbal3015-ost1_UUID''/> <node_ref uuidref=''btbal3015_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3016-ost1_UUID'' name=''btbal3016-ost1''> <active_ref uuidref=''D_btbal3016-ost1_btbal3016_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3016-ost1_btbal3016_UUID'' name=''OSD_btbal3016-ost1_btbal3016''> <target_ref uuidref=''btbal3016-ost1_UUID''/> <node_ref uuidref=''btbal3016_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3017-ost1_UUID'' name=''btbal3017-ost1''> <active_ref uuidref=''D_btbal3017-ost1_btbal3017_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3017-ost1_btbal3017_UUID'' name=''OSD_btbal3017-ost1_btbal3017''> <target_ref 
uuidref=''btbal3017-ost1_UUID''/> <node_ref uuidref=''btbal3017_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3018-ost1_UUID'' name=''btbal3018-ost1''> <active_ref uuidref=''D_btbal3018-ost1_btbal3018_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3018-ost1_btbal3018_UUID'' name=''OSD_btbal3018-ost1_btbal3018''> <target_ref uuidref=''btbal3018-ost1_UUID''/> <node_ref uuidref=''btbal3018_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3019-ost1_UUID'' name=''btbal3019-ost1''> <active_ref uuidref=''D_btbal3019-ost1_btbal3019_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3019-ost1_btbal3019_UUID'' name=''OSD_btbal3019-ost1_btbal3019''> <target_ref uuidref=''btbal3019-ost1_UUID''/> <node_ref uuidref=''btbal3019_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3020-ost1_UUID'' name=''btbal3020-ost1''> <active_ref uuidref=''D_btbal3020-ost1_btbal3020_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3020-ost1_btbal3020_UUID'' name=''OSD_btbal3020-ost1_btbal3020''> <target_ref uuidref=''btbal3020-ost1_UUID''/> <node_ref uuidref=''btbal3020_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3021-ost1_UUID'' name=''btbal3021-ost1''> <active_ref uuidref=''D_btbal3021-ost1_btbal3021_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3021-ost1_btbal3021_UUID'' name=''OSD_btbal3021-ost1_btbal3021''> <target_ref uuidref=''btbal3021-ost1_UUID''/> <node_ref uuidref=''btbal3021_UUID''/> 
<fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3022-ost1_UUID'' name=''btbal3022-ost1''> <active_ref uuidref=''D_btbal3022-ost1_btbal3022_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3022-ost1_btbal3022_UUID'' name=''OSD_btbal3022-ost1_btbal3022''> <target_ref uuidref=''btbal3022-ost1_UUID''/> <node_ref uuidref=''btbal3022_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3023-ost1_UUID'' name=''btbal3023-ost1''> <active_ref uuidref=''D_btbal3023-ost1_btbal3023_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3023-ost1_btbal3023_UUID'' name=''OSD_btbal3023-ost1_btbal3023''> <target_ref uuidref=''btbal3023-ost1_UUID''/> <node_ref uuidref=''btbal3023_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3024-ost1_UUID'' name=''btbal3024-ost1''> <active_ref uuidref=''D_btbal3024-ost1_btbal3024_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3024-ost1_btbal3024_UUID'' name=''OSD_btbal3024-ost1_btbal3024''> <target_ref uuidref=''btbal3024-ost1_UUID''/> <node_ref uuidref=''btbal3024_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3025-ost1_UUID'' name=''btbal3025-ost1''> <active_ref uuidref=''D_btbal3025-ost1_btbal3025_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3025-ost1_btbal3025_UUID'' name=''OSD_btbal3025-ost1_btbal3025''> <target_ref uuidref=''btbal3025-ost1_UUID''/> <node_ref uuidref=''btbal3025_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> 
<autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3026-ost1_UUID'' name=''btbal3026-ost1''> <active_ref uuidref=''D_btbal3026-ost1_btbal3026_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3026-ost1_btbal3026_UUID'' name=''OSD_btbal3026-ost1_btbal3026''> <target_ref uuidref=''btbal3026-ost1_UUID''/> <node_ref uuidref=''btbal3026_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3027-ost1_UUID'' name=''btbal3027-ost1''> <active_ref uuidref=''D_btbal3027-ost1_btbal3027_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3027-ost1_btbal3027_UUID'' name=''OSD_btbal3027-ost1_btbal3027''> <target_ref uuidref=''btbal3027-ost1_UUID''/> <node_ref uuidref=''btbal3027_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3028-ost1_UUID'' name=''btbal3028-ost1''> <active_ref uuidref=''D_btbal3028-ost1_btbal3028_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3028-ost1_btbal3028_UUID'' name=''OSD_btbal3028-ost1_btbal3028''> <target_ref uuidref=''btbal3028-ost1_UUID''/> <node_ref uuidref=''btbal3028_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <ost uuid=''btbal3029-ost1_UUID'' name=''btbal3029-ost1''> <active_ref uuidref=''D_btbal3029-ost1_btbal3029_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3029-ost1_btbal3029_UUID'' name=''OSD_btbal3029-ost1_btbal3029''> <target_ref uuidref=''btbal3029-ost1_UUID''/> <node_ref uuidref=''btbal3029_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> 
<inodesize>0</inodesize> </osd> <ost uuid=''btbal3030-ost1_UUID'' name=''btbal3030-ost1''> <active_ref uuidref=''D_btbal3030-ost1_btbal3030_UUID''/> </ost> <osd osdtype=''obdfilter'' uuid=''D_btbal3030-ost1_btbal3030_UUID'' name=''OSD_btbal3030-ost1_btbal3030''> <target_ref uuidref=''btbal3030-ost1_UUID''/> <node_ref uuidref=''btbal3030_UUID''/> <fstype>ldiskfs</fstype> <devpath>/dev/sdb1</devpath> <autoformat>no</autoformat> <devsize>0</devsize> <journalsize>0</journalsize> <inodesize>0</inodesize> </osd> <filesystem uuid=''FS_fsname_UUID'' name=''FS_fsname''> <mds_ref uuidref=''btbal3031-mds1_UUID''/> <obd_ref uuidref=''lov01_UUID''/> </filesystem> <mountpoint uuid=''MNT_client_UUID'' name=''MNT_client''> <filesystem_ref uuidref=''FS_fsname_UUID''/> <path>/hostname/tick_cache</path> </mountpoint> </lustre> -- Allen Todd allen.todd@sig.com Susquehanna International Group, LLP 610.617.2738
On Jun 12, 2006 14:23 -0400, Allen Todd wrote:
> We were running with clients and servers on the same machines and things
> were working reasonably well, but recovery timeouts from down machines,
> and a rarely encountered deadlock inspired me to upgrade to the latest
> release rpm's compiled for SLES9 - 1.4.6.1.

You are aware that client-on-OST is an unsupported configuration. That said, we are interested in figuring out what we can here, because others are also interested in this kind of setup.

> So far 1.4.6.1 and 1.4.6 have been very unstable for us. Out of our
> cluster of 150 machines, we have been experiencing 5 to 7 system hangs
> per day. I have LKCD tools installed, but as none of the machines have
> keyboards, I have not been able to get any system core dumps. Before
> hanging, ganglia often indicates that the machine is experiencing high
> load averages (6 to 20 or so).

You need to hook up a serial console to these machines in order to get stack traces during a hang. One possibility is to connect the serial port of machine N to machine N+1, with "N" running "minicom" (or "conman", if you can convince it to run with this setup). Then you can at least get console logs and sysrq-t from half of your machines.

> Errors in the syslog only seem to be from other clients not able to access the
> OST that has hung, with no errors from the system hanging until it is rebooted
> and recovery started.

It is likely that you are hitting the scenario that causes us not to support this configuration. Namely, under high load and memory pressure (the client is trying to free memory by writing out dirty pages), the OST needs to do allocations for those writes, but the OST allocations are waiting on memory to be freed, which in turn is waiting on the OST write to finish: deadlock.

If you are able to reproduce this with a test load of some sort (e.g. IOR or bonnie with large files), you could test this theory by using files that are explicitly striped (via lfs setstripe) to a DIFFERENT OST than the client is running on. This should avoid the memory deadlock on the local client.

If you can gather stack traces on one of the failing nodes, it would be useful to see where in the client write path it is doing allocations (these can be preallocated) and where in the server it is doing allocations (these may or may not be fixable, depending on whether they are in Lustre code).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
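
The suggestion above, striping test files to an OST other than the one hosted on the client node, could be scripted roughly as follows. This is only a sketch under stated assumptions: the helper names, the example file name, and the way the local OST index is supplied are hypothetical, and the positional `lfs setstripe <file> <stripe-size> <start-ost> <stripe-count>` form is assumed from 1.4-era Lustre; check `lfs help setstripe` on the installed version before relying on it. The script only builds the command line and does not run it.

```python
from typing import List


def pick_remote_ost(local_index: int, ost_count: int) -> int:
    """Return an OST index that differs from the one hosted on this node."""
    if ost_count < 2:
        raise ValueError("need at least two OSTs to avoid the local one")
    if not 0 <= local_index < ost_count:
        raise ValueError("local_index out of range")
    # Simple choice: the next OST, wrapping around at the end.
    return (local_index + 1) % ost_count


def setstripe_argv(path: str, local_index: int, ost_count: int) -> List[str]:
    """Build an lfs setstripe command that places `path` on a remote OST.

    ASSUMPTION: positional syntax of 1.4-era lfs:
        lfs setstripe <file> <stripe-size> <start-ost> <stripe-count>
    A stripe-size of 0 is assumed to mean "filesystem default"; a
    stripe-count of 1 keeps the whole file on the chosen OST.
    """
    start_ost = pick_remote_ost(local_index, ost_count)
    return ["lfs", "setstripe", path, "0", str(start_ost), "1"]


if __name__ == "__main__":
    # Hypothetical example: this node hosts OST index 4 out of 29 OSTs.
    argv = setstripe_argv("/hostname/tick_cache/ior_testfile", 4, 29)
    print(" ".join(argv))
```

The file would be created this way before the benchmark opens it, so that the client's dirty pages are written to a remote OST rather than to the OST sharing its memory. If the hangs stop under this layout, that would support the memory-deadlock theory.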