Hello,
We've got a problem here that we hope someone can help us with.
We have a few 1.8.5 OSS nodes which seem to get locked up Lustre-wise
for our tcp clients from time to time. This is a recent phenomenon; we
are not sure, but we think it may be related to a particular workload.
Our o2ib clients don't seem to have any trouble.
'lfs df' shows "Resource temporarily unavailable" for all OSTs on the
affected OSS on all tcp clients when this happens. When we look on the
OSS itself, we see socknal_sd and ll_ost_io processes consuming cycles:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 10954 root 16 0 0 0 0 R 66.6 0.0 515:48.58 socknal_sd02
> 11370 root 16 0 0 0 0 R 64.9 0.0 241:21.83 ll_ost_io_91
> 10959 root 19 0 0 0 0 R 49.7 0.0 111:53.27 socknal_sd07
There are plenty of cycles free on each core of the OSS, though. We do
see that plenty of Lustre logs were dumped as well, after service
threads were inactive for 20 minutes. I haven't been able to learn
much from 'lctl debug_file' yet.
Further, we can see from 'netstat -t' that the Recv-Q count is
increasing on the client connections, never decreasing. The Send-Q
count is zero for all but two clients, where it seems to be a constant
non-zero value (a few to several hundred K).
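A minimal sketch of how that Recv-Q growth could be tracked between two
snapshots, assuming Linux net-tools 'netstat' output with the columns
Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State. The
function names are hypothetical and not part of any Lustre tooling, and
'-n' is added here only to skip name resolution:

```python
import subprocess
import time


def parse_netstat(text):
    """Map (local, foreign) address pairs to (recv_q, send_q) for TCP lines."""
    queues = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed layout: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) < 5 or not fields[0].startswith("tcp"):
            continue
        try:
            recv_q, send_q = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # skip anything that does not carry numeric queue sizes
        queues[(fields[3], fields[4])] = (recv_q, send_q)
    return queues


def growing_recv_q(before, after):
    """Return connections present in both snapshots whose Recv-Q increased."""
    grown = []
    for conn, (recv_after, send_after) in after.items():
        if conn in before and recv_after > before[conn][0]:
            grown.append((conn, before[conn][0], recv_after, send_after))
    return sorted(grown)


def sample(interval=30):
    """Take two 'netstat -tn' snapshots 'interval' seconds apart and compare."""
    def snap():
        out = subprocess.run(["netstat", "-tn"], capture_output=True,
                             text=True, check=True).stdout
        return parse_netstat(out)

    first = snap()
    time.sleep(interval)
    return growing_recv_q(first, snap())
```

Calling sample() on the OSS would list each connection whose Recv-Q
grew over the interval, together with its current Send-Q.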
Anyway, it seems like the socknal and/or ll_ost_io_91 processes above
are just stuck doing nothing productive. Syslog messages aren't telling
me why. Has anyone seen anything like this?
We know that after rebooting the OSS our tcp clients will start working
again.
Thanks,
Craig Prescott
UF HPC Center