Hello clusterfs,
I found the timeout bug in my testing environment!
This took me some weeks of experiments, but now it's reproducible!
The bug is reproducible in my environment both with the RPMs provided
by clusterfs (RHEL, with Lustre 1.2.7) and with my self-built kernel
2.4.24 + Lustre 1.3.3.
The problem I had was a timeout reported as "LustreError", which leads
to "evicting client" errors on the server side and, on a Lustre client,
to a "ptlrpc_expire" message and the execution of the upcall script.
I reported this earlier:
https://lists.clusterfs.com/pipermail/lustre-discuss/2004-October/000487.html
https://lists.clusterfs.com/pipermail/lustre-discuss/2004-October/000473.html
This bug usually appears after about 40 minutes of stress testing,
which made it hard to find. Then, after some calls to the upcall
script, the machine gets Input/Output errors on the Lustre mount.
The solution: disable zerocopy!
echo "8192" >/proc/sys/socknal/zero_copy
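To make sure the setting was really applied before starting a long test
run, the command above can be wrapped so that it reads the value back.
This is only a minimal sketch: the helper name set_tunable is
hypothetical and not part of Lustre, and it assumes nothing beyond the
proc file and the value already shown above.

```shell
#!/bin/sh
# Hypothetical helper (not a Lustre tool): write a value to a tunable
# file and print the old and new contents so the change can be checked.
set_tunable() {
    file=$1
    value=$2
    # Refuse early if the file is missing or not writable (e.g. not root).
    if [ ! -w "$file" ]; then
        echo "cannot write $file" >&2
        return 1
    fi
    old=$(cat "$file")
    echo "$value" > "$file" || return 1
    new=$(cat "$file")
    echo "$file: $old -> $new"
}

# Usage on a Lustre node, as root, with the path and value from above:
#   set_tunable /proc/sys/socknal/zero_copy 8192
```

Printing the old value also makes it easy to restore the previous
setting afterwards when comparing runs with and without zero_copy.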
The bug is reproducible in the sense that without zero_copy I have no
errors even after days of stress testing, and with zero_copy I get the
timeout bug.
Is it possible that there is an error in the zero_copy socket?
regards,
Martin
--
============================================ Martin Vogt
ITWM - Fraunhofer Institut fuer
Techno- und Wirtschaftsmathematik
Europaallee 10
D-67657 Kaiserslautern, Germany
Tel. +49 (0) 631/303-1806, Fax -1811
=============================================