thr3ads.net - Lustre discuss - [Lustre-discuss] timeout bug found! [May 2006]

Martin Vogt

2006-May-19 07:36 UTC

[Lustre-discuss] timeout bug found!

Hello clusterfs,

I found the timeout bug in my testing environment!
It took me some weeks of experiments, but now it's reproducible!

The bug is reproducible in my environment both with the rpms provided by
clusterfs (RHEL, with lustre 1.2.7)
and with my own kernel build (2.4.24 + lustre 1.3.3).

The problem I had was a timeout reported as "LustreError", which leads to
"evicting client" errors on the server side and, on a lustre client,
to a "ptlrpc_expire" message and the execution of the upcall script.

I reported this earlier:

https://lists.clusterfs.com/pipermail/lustre-discuss/2004-October/000487.html
https://lists.clusterfs.com/pipermail/lustre-discuss/2004-October/000473.html

This bug usually appears after about 40 minutes of stress testing, which made
it hard to find. Then, after some calls to the upcall script,
the machine gets Input/Output errors on the lustre mount.

The solution: disable zerocopy!
echo "8192" >/proc/sys/socknal/zero_copy
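
As a minimal sketch, the same setting could be applied through a small shell helper that also prints the previous value, so the node can be switched back later for comparison. The path and the value 8192 are taken from the command above; the function name `set_zero_copy` and its argument order are hypothetical, and what the value means to socknal is not covered here.

```shell
#!/bin/sh
# Hypothetical helper (not part of lustre): write a new value to the
# socknal zero_copy tunable and print the value it had before.
# Usage: set_zero_copy [new_value] [tunable_path]
set_zero_copy() {
    new_value="${1:-8192}"                        # value from the post
    tunable="${2:-/proc/sys/socknal/zero_copy}"   # path from the post

    # The /proc entry only exists when the socknal module is loaded,
    # and writing to it normally requires root.
    if [ ! -w "$tunable" ]; then
        echo "cannot write $tunable (module not loaded, or not root?)" >&2
        return 1
    fi

    old_value=$(cat "$tunable")
    echo "$new_value" > "$tunable"
    echo "$old_value"                             # report the previous setting
}
```

Calling `set_zero_copy 8192` as root on a client would then behave like the echo above, and passing the printed old value back in would restore the earlier setting.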


The bug is reproducible in the sense that without zero_copy I have no
errors even after days of stress testing, and with zero_copy I have
the timeout bug.

Is it possible that there is an error in the zero_copy socket?

regards,

Martin



-- 
=============================================
 Martin Vogt
 ITWM - Fraunhofer Institut fuer
        Techno- und Wirtschaftsmathematik
 Europaallee 10
 D-67657 Kaiserslautern, Germany
 Tel. +49 (0) 631/303-1806, Fax -1811
=============================================
