Jeff Darcy
2008-Feb-11 14:15 UTC
[Lustre-discuss] Problems with NAT and/or version upgrade
In our lab, we have several 108-node and 972-node systems, and we're using a small (four-node Opteron 2.6.16-27-0.9_lustre-1.6.0.1smp) Lustre server setup to provide some shared space for applications and such. Because there are so many client subnets involved and we don't want the server routing tables to get out of control, we use NAT for most of these. I did notice that the server seems to know the "private" internal addresses of the clients as part of their NIDs, but it seemed to work. Lately, though, especially since we upgraded clients from Linux 2.6.15 to 2.6.18 and from Lustre 1.5.99beta to 1.6.3, we've been seeing some instability on the servers. The MDS has crashed repeatedly, requiring manual cleanup of last_recv to recover. In about the same time period, clients have also started complaining about "Cannot send after transport endpoint shutdown (143)" and generally acting flaky. Has anybody else seen anything similar, either as part of a version upgrade or related to using NAT?