thr3ads.net - Lustre discuss - [Lustre-discuss] stuck OSS node [Aug 2011]

Craig Prescott

2011-Aug-04 19:16 UTC

[Lustre-discuss] stuck OSS node

Hello,

We've got a problem here that we hope someone can help us with. We have a
few 1.8.5 OSS nodes which seem to get locked up Lustre-wise for our tcp
clients from time to time.  This is a recent phenomenon - we are not
sure, but we think it may be related to a particular workload.  Our o2ib
clients don't seem to have any trouble.

'lfs df' shows "Resource temporarily unavailable" for all OSTs on the
affected OSS on all tcp clients when this happens.  When we look on the
OSS itself we see socknal_sd and ll_ost_io processes consuming cycles:
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> 10954 root      16   0     0    0    0 R 66.6  0.0 515:48.58 socknal_sd02
> 11370 root      16   0     0    0    0 R 64.9  0.0 241:21.83 ll_ost_io_91
> 10959 root      19   0     0    0    0 R 49.7  0.0 111:53.27 socknal_sd07
There are plenty of cycles free on each core of the OSS, though.  We do
see that plenty of Lustre logs were dumped as well, after service
threads were inactive for 20 minutes.  I haven't been able to learn much
from 'lctl debug_file' yet.

Further, we can see from 'netstat -t' that the Recv-Q count is
increasing on the client connections - never decreasing.  Send-Q count
is zero for all but two clients, where it seems to be a constant non-zero
value (a few to several hundred K).
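
A minimal sketch of how the Recv-Q observation above could be checked automatically, assuming two captures of 'netstat -t' output taken some time apart. The function names and the host names in the usage example are hypothetical; only the usual column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State) is relied on.

```python
def parse_recv_q(netstat_text):
    """Map (local, foreign) address pairs to Recv-Q for the TCP rows of `netstat -t` output."""
    queues = {}
    for line in netstat_text.splitlines():
        fields = line.split()
        # Data rows look like: tcp  Recv-Q  Send-Q  Local  Foreign  State
        if len(fields) < 5 or not fields[0].startswith("tcp"):
            continue  # skips the banner and the column header
        if not fields[1].isdigit():
            continue
        queues[(fields[3], fields[4])] = int(fields[1])
    return queues


def growing_recv_q(before_text, after_text):
    """Return the connections whose Recv-Q is larger in the second capture."""
    before = parse_recv_q(before_text)
    after = parse_recv_q(after_text)
    return sorted(
        conn for conn, queued in after.items()
        if conn in before and queued > before[conn]
    )


if __name__ == "__main__":
    # Hypothetical captures; real ones would come from running `netstat -t` twice.
    first = "tcp 100 0 oss1:988 client1:1023 ESTABLISHED\n"
    second = "tcp 500 0 oss1:988 client1:1023 ESTABLISHED\n"
    print(growing_recv_q(first, second))
```

Connections that show up in every comparison over several minutes would match the "never decreasing" pattern described above.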

Anyway, it seems like the socknal and/or ll_ost_io_91 processes above
are just stuck doing nothing productive.  Syslog messages aren't telling
me why.  Has anyone seen anything like this?

We know that after rebooting the OSS our tcp clients will start working 
again.

Thanks,
Craig Prescott
UF HPC Center

Adrian Ulrich

2011-Aug-05 09:01 UTC

[Lustre-discuss] stuck OSS node

Hi Craig,
> Has anyone seen anything like this?
Yes: we had a similar problem a couple of times:


First, try to umount all OSTs on the affected OSS.

Some OSTs will (most likely) fail to umount (umount gets stuck due to the
ll_ost_io_?? thread).
Note the 'broken' OSTs and kill the OSS (echo b > /proc/sysrq-trigger)
after the 'good' OSTs have finished umounting.

Afterwards do a simple 'e2fsck -f -p' on the bad OSTs - it should
complain about corrupted directories and other nice things. If it
doesn't, upgrade to the latest fsck from Whamcloud.
(We had a corruption a few months ago that was unfixable/not detected with the
1.8.4-sun e2fsprogs)
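
A rough sketch of the bookkeeping for the steps above, assuming /proc/mounts is read after the umount attempts and before the OSS is rebooted. The helper names, mount points and devices are hypothetical. The sketch only builds the 'e2fsck -f -p' command lines for OSTs that are still mounted; it does not run anything, since the check belongs after the reboot, on unmounted devices.

```python
def lustre_mounts(proc_mounts_text):
    """Map mount point -> device for the entries of type 'lustre' in /proc/mounts text."""
    mounts = {}
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "lustre":
            mounts[fields[1]] = fields[0]
    return mounts


def plan_fsck(proc_mounts_text, ost_mount_points):
    """Build e2fsck command lines for OSTs that are still mounted, i.e. failed to umount."""
    mounts = lustre_mounts(proc_mounts_text)
    return [
        ["e2fsck", "-f", "-p", mounts[mount_point]]
        for mount_point in ost_mount_points
        if mount_point in mounts
    ]


if __name__ == "__main__":
    # Hypothetical state after the umount attempts: ost1 is still mounted.
    remaining = "/dev/sdc /mnt/ost1 lustre rw 0 0\n"
    for command in plan_fsck(remaining, ["/mnt/ost0", "/mnt/ost1"]):
        print(" ".join(command))
```

Recording the device names before the reboot matters because the stuck OSTs will no longer appear in /proc/mounts afterwards.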


> This is a recent phenomena - we are not
> sure, but we think it may be related to a particular workload.  Our o2ib
> clients don't seem to have any trouble.
I don't think that this issue is related to the network: it's
probably just 'bad luck' that only the tcp clients hit the
corrupted directories.



Regards,
 Adrian
