Lundgren, Andrew
2008-May-07 17:40 UTC
[Lustre-discuss] clients getting disconnected from MGS and OSS servers
We seem to be having a problem with Lustre 1.6.4.3 and clients getting disconnected. We currently have a situation where a box that just does maintenance work on the cluster (du, stats, and other work) has some directories it cannot enter. (The shell just hangs and doesn't time out.)

lfs check servers shows that all of the servers are OK:

  % lfs check servers
  content-MDT0000-mdc-ffff810210b0fc00 active.
  content-OST0000-osc-ffff810210b0fc00 active.
  content-OST0001-osc-ffff810210b0fc00 active.
  content-OST0002-osc-ffff810210b0fc00 active.
  content-OST0003-osc-ffff810210b0fc00 active.
  content-OST0004-osc-ffff810210b0fc00 active.
  content-OST0005-osc-ffff810210b0fc00 active.
  content-OST0006-osc-ffff810210b0fc00 active.
  content-OST0007-osc-ffff810210b0fc00 active.

I enabled rpctrace in the debug logs, and am now seeing this:

  00000100:00080000:2:1210181389.481562:0:4282:0:(pinger.c:139:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or recovery disabled: 0/1)
  00000100:00080000:2:1210181414.476881:0:4282:0:(pinger.c:139:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or recovery disabled: 0/1)
  00000100:00080000:2:1210181439.471197:0:4282:0:(pinger.c:139:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or recovery disabled: 0/1)

I can reboot the machine and it will come back. The other clients connected to this cluster are not experiencing this problem.

Is anyone else seeing these issues? Thoughts?

Thanks!

--
Andrew
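For anyone who wants to look at how often the pinger message repeats, here is a minimal sketch that pulls the timestamps out of debug log lines like the ones above and prints the gap between consecutive "not pinging" entries. The field layout (eight colon-separated fields, with the fourth being seconds.microseconds) is an assumption read off the three sample lines, and the names parse_line and ping_intervals are made up for this illustration; they are not part of any Lustre tool.

```python
from decimal import Decimal, InvalidOperation

# The three debug lines quoted in the message, used as sample input.
SAMPLE = """\
00000100:00080000:2:1210181389.481562:0:4282:0:(pinger.c:139:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or recovery disabled: 0/1)
00000100:00080000:2:1210181414.476881:0:4282:0:(pinger.c:139:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or recovery disabled: 0/1)
00000100:00080000:2:1210181439.471197:0:4282:0:(pinger.c:139:ptlrpc_pinger_main()) not pinging MGS (in recovery: FULL or recovery disabled: 0/1)
"""


def parse_line(line):
    """Split one debug line into fields; return None if it does not match.

    Assumed layout (inferred from the sample lines, not from documentation):
    seven colon-separated header fields followed by the message text.
    """
    parts = line.strip().split(":", 7)
    if len(parts) != 8:
        return None
    try:
        # Decimal keeps the microsecond digits exact.
        stamp = Decimal(parts[3])
    except InvalidOperation:
        return None
    return {"timestamp": stamp, "pid": parts[5], "message": parts[7]}


def ping_intervals(text):
    """Return the gaps, in seconds, between consecutive 'not pinging' lines."""
    stamps = []
    for line in text.splitlines():
        record = parse_line(line)
        if record and "not pinging" in record["message"]:
            stamps.append(record["timestamp"])
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    for gap in ping_intervals(SAMPLE):
        print(gap)
```

On the three lines above, the gaps come out at roughly 25 seconds each, so the message is repeating at a steady interval rather than in bursts.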