Hello clusterfs,

I found the timeout bug in my testing environment! This took me some weeks of experiments, but now it is reproducible. The bug is reproducible in my environment both with the RPMs provided by clusterfs (RHEL, with lustre 1.2.7) and with my own build of kernel 2.4.24 + lustre 1.3.3.

The problem I had was a timeout reported as "LustreError", which leads to "evicting client" errors on the server side and, on a lustre client, to a "ptlrpc_expire" message and the execution of the upcall script. I reported this earlier:

https://lists.clusterfs.com/pipermail/lustre-discuss/2004-October/000487.html
https://lists.clusterfs.com/pipermail/lustre-discuss/2004-October/000473.html

This bug usually appears after 40 minutes of stress testing, which made it hard to find. Then, after some calls to the upcall script, the machine gets Input/Output errors in the lustre mount.

The solution: disable zerocopy!

    echo "8192" >/proc/sys/socknal/zero_copy

The bug is reproducible in the sense that with zero_copy disabled I have no errors even after days of stress testing, and with zero_copy enabled I get the timeout bug.

Is it possible that there is an error in the zero_copy socket?

regards,

Martin

-- 
============================================
Martin Vogt
ITWM - Fraunhofer Institut fuer Techno- und Wirtschaftsmathematik
Europaallee 10
D-67657 Kaiserslautern, Germany
Tel. +49 (0) 631/303-1806, Fax -1811
=============================================