Michael Kluge
2012-Feb-03 13:34 UTC
[Lustre-discuss] 1.8 client loses contact to 1.6 router
Hi list,

we have a 1.6.7 fs running which still works nicely. One node exports this FS (via 10GE) to another cluster that has some 1.8.5 patchless clients. At some point (randomly, I think) these clients mark the router as down (lctl show_route). It is always a different client, and usually a few clients each week do this. Although we configured the clients to ping the router again from time to time, the route never comes back. On these clients I can still "ping" the IP of the router, but "lctl ping" gives me an Input/Output error.

If I do something like:

    lctl --net o2ib set_route 172.30.128.241@tcp1 down
    sleep 45
    lctl --net o2ib del_route 172.30.128.241@tcp1
    sleep 45
    lctl --net o2ib add_route 172.30.128.241@tcp1
    sleep 45
    lctl --net o2ib set_route 172.30.128.241@tcp1 up

the route comes back. Sometimes the client works again, but sometimes the clients report an "unexpected aliveness of peer .." and need a reboot.

I looked around and could not find a note on whether 1.8 clients and 1.6 routers will work together as expected. Does anyone have experience with this kind of setup, or an idea for further debugging?

Regards, Michael

modprobe.d/luste.conf on the 1.8.5 clients
-----------------------------------------8<------------------------------
options lnet networks=tcp1(eth0)
options lnet routes="o2ib 172.30.128.241@tcp1;"
options lnet dead_router_check_interval=60 router_ping_timeout=30
-----------------------------------------8<------------------------------

-- 
Dr.-Ing. Michael Kluge
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone: (+49) 351 463-34217
Fax: (+49) 351 463-37773
e-mail: michael.kluge at tu-dresden.de
WWW: http://www.tu-dresden.de/zih
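
For anyone who wants to repeat the route reset from the post on several clients, the same lctl sequence can be wrapped in a small script. This is only a sketch: the lctl commands, NID, and 45-second pauses are taken from the post, while the variable names (ROUTE_NET, ROUTER_NID, PAUSE, DRY_RUN) and the helper functions are made up for illustration. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch: cycle an LNet route with the lctl sequence quoted in the post.
# ROUTE_NET, ROUTER_NID, PAUSE and DRY_RUN are hypothetical names, not lctl options.
ROUTE_NET="${ROUTE_NET:-o2ib}"                   # remote network reached via the router
ROUTER_NID="${ROUTER_NID:-172.30.128.241@tcp1}"  # router NID as seen from the client
PAUSE="${PAUSE:-45}"                             # seconds to wait between steps
DRY_RUN="${DRY_RUN:-1}"                          # 1 = print commands only, 0 = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Mark the route down, delete it, re-add it, and mark it up again.
reset_route() {
    run lctl --net "$ROUTE_NET" set_route "$ROUTER_NID" down
    run sleep "$PAUSE"
    run lctl --net "$ROUTE_NET" del_route "$ROUTER_NID"
    run sleep "$PAUSE"
    run lctl --net "$ROUTE_NET" add_route "$ROUTER_NID"
    run sleep "$PAUSE"
    run lctl --net "$ROUTE_NET" set_route "$ROUTER_NID" up
}

reset_route
```

Running it as-is prints the seven commands; setting DRY_RUN=0 (as root on an affected client) would execute them. It does not address the "unexpected aliveness of peer" case, which in the post still required a reboot.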