Hi,

A few days ago one of my users started a large number of MATLAB jobs, flooding all processors on our 40-node, 2-CPU cluster (4-core systems). More details and log files can be found in the attachment.

As a result I observed strange Lustre behavior. ptlrpcd used 100% of one CPU, and the second CPU was completely occupied by pwd, a child of the MATLAB process invoked by the user. I/O on Lustre was partly possible, but df reported access denied. A recovery with the MDT started after lustre.timeout=300 but did not complete. I had to reboot all nodes that showed this behavior.

The OST showed the message:

Mar 14 16:47:05 cn46 kernel: LustreError: 138-a: lustre-OST0003: A client on nid 10.128.15.2@tcp was evicted due to a lock glimpse callback to 10.128.15.2@tcp timed out: rc -110

The client kernels reported soft lockups on all available cores.

Does anyone have an idea how to prevent such behavior? Thanks for your help.

Regards
w.d.

--------------------------------------------------------------------------------
W.Dilling                            Tel.: (49) 7071/29-70206
Universitaet Tuebingen               Fax.: (49) 7071/29-5912
Zentrum fuer Datenverarbeitung       mail: dilling at zdv.uni-tuebingen.de
Waechterstrasse 76
72074 Tuebingen

-------------- next part --------------
A non-text attachment was scrubbed...
Name: lustre_error_14.03.2008.tar
Type: application/x-tar
Size: 71680 bytes
Url : http://lists.lustre.org/pipermail/lustre-discuss/attachments/20080321/ff21f16f/attachment-0002.tar

On Mar 21, 2008 19:15 +0100, Dilling wrote:
> some days ago one of my users started a lot of matlab jobs flooding all
> processors on our 40 nodes 2CPU cluster (4Core System). More details and
> log files can be found in the appendix. As a result of this I observed a
> strange behavior of lustre. Ptlrpcd used 100% of one CPU, the second CPU
> was completely occupied by pwd. Pwd was a child of the matlab process
> invoked by the user. I/O on lustre was partly possible but df reported
> access denied. A recovery with the mdt started after lustre.timeout=300 but
> did not complete. I had to reboot all nodes which showed this behavior. The
> ost showed the message:
> Mar 14 16:47:05 cn46 kernel: LustreError: 138-a: lustre-OST0003: A client
> on nid 10.128.15.2@tcp was evicted due to a lock glimpse callback to
> 10.128.15.2@tcp timed out: rc -110
> The client kernels reported soft lockup on all available cores.
> Does anyone have an idea how to prevent such behavior. Thanks for your help.

You missed an important detail right at the beginning of your woes:

    ll_sai_entry_set()) ASSERTION(entry->se_stat == SA_ENTRY_UNSTATED) failed

This is a bug in the "statahead" code. Statahead is a new feature that detects applications doing "readdir + sequential stat" operations on a directory and starts multiple concurrent metadata RPCs in order to hide the network latency of the serialized "stat" operations.

This is known bug 15175 in our bugzilla and is being worked on. You can disable statahead on the clients until this is resolved:

    echo 0 > /proc/fs/lustre/llite/*/statahead_count

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
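
A note on applying the workaround from the reply: on a client with more than one Lustre mount, the wildcard in the redirect target matches several files, and shells such as bash reject that as an ambiguous redirect. A minimal sketch of a loop that writes to each file in turn follows. The function name disable_statahead and its optional directory argument are made up for illustration; only the /proc path comes from the reply, and the file name may differ between Lustre versions.

```shell
#!/bin/sh
# Sketch only: write 0 to every statahead_count file under the llite directory.
# The default directory is the path quoted in the reply; the optional first
# argument is a hypothetical convenience so the loop can be tried on a copy.
disable_statahead() {
    dir="${1:-/proc/fs/lustre/llite}"
    found=0
    for f in "$dir"/*/statahead_count; do
        # If the glob matched nothing, $f holds the unexpanded pattern; skip it.
        [ -e "$f" ] || continue
        echo 0 > "$f" || return 1
        found=$((found + 1))
    done
    echo "statahead disabled on $found mount(s)"
}
```

Run as root on each client, calling disable_statahead with no argument uses the /proc path from the reply and prints how many mounts were changed.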