Thomas Roth
2009-Nov-03 18:22 UTC
[Lustre-discuss] OSS extremely slow in response, ll_ost load high
Hi all,

in our 1.6.7.2 / Debian / kernel 2.6.22 cluster, two servers with 2 and 3
OSTs have become very slow to respond, in the sense that commands like
"lfs df" hang for ca. 30 s when they reach these OSTs in the list. Some of
our clients do not have this problem, some have these contact(?) problems
with the one server, some with the other, and it is time dependent: I have
run "lfs df" without problem five times, and only on the sixth run would
it halt.

What really distinguishes these lame OSS machines from all others is that
each has one ll_ost_123 thread that occupies one CPU core entirely. Since
our servers have 8 cores and 8 GB RAM each, I didn't think this would
actually impede Lustre operations. Btw, I have
"options ost oss_num_threads=256" in the modprobe configuration on these
servers. There is no entry in the clients' logs connected with this
behavior.

One of the said OSS had 2 of its 3 OSTs attached somewhat later than the
first one. Hence the younger 2 appear later in a listing of OSTs such as
you would get out of "lfs df". None of the clients stall on these OSTs, so
I conclude that I am not dealing with network problems.

Now for the OSS logs, there are indeed 'new' error messages:

Nov 3 18:49:58 OSS kernel: Lustre:
13086:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs
for peer Client-IP@tcp detected: 4; the oldest (ffff81010fc15000) timed
out 0 secs ago

Nov 3 18:55:32 OSS kernel: LustreError:
13323:0:(events.c:66:request_out_callback()) @@@ type 4, status -5
req@ffff8102005cbc00 x155576/t0 o106->@NET_0x200000a0c4487_UUID:15/16
lens 232/296 e 0 to 1 dl 1257270939 ref 2 fl Rpc:/2/0 rc 0/0
Nov 3 18:55:32 OSS kernel: LustreError:
13323:0:(events.c:66:request_out_callback()) Skipped 68485395 previous
similar messages

Status -5 means Linux error code -5 = I/O error? Silent disk corruption?
Of course I don't have any other indications of hard disk failure. There
was a power outage, however. Only it was already one week ago, and we did
not see this behavior before today.

Is there anything I can do to get rid of these annoying ll_ost threads in
the running system? Of course I'm not sure they are the root of the
problem...

Regards,
Thomas
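P.S. In case it helps with an answer, my rough plan for narrowing this
down looks like the sketch below. It is only a sketch: the exact ps/sysrq
invocations and the header path may need adjusting on our Debian 2.6.22
nodes.

    # find the OST service thread that is eating a full core
    ps -eo pid,pcpu,comm --sort=-pcpu | grep ll_ost | head

    # dump kernel stack traces of all tasks into the kernel log,
    # to see where that thread is spinning (needs sysrq enabled)
    echo 1 > /proc/sys/kernel/sysrq
    echo t > /proc/sysrq-trigger
    dmesg | grep -A 20 ll_ost

    # status -5 is EIO ("Input/output error"); the kernel errno table
    # confirms it, if the kernel headers happen to be installed
    grep -w EIO /usr/include/asm-generic/errno-base.h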
Andreas Dilger
2009-Nov-03 22:37 UTC
[Lustre-discuss] OSS extremely slow in response, ll_ost load high
On 2009-11-03, at 11:22, Thomas Roth wrote:
> in our 1.6.7.2 / Debian / kernel 2.6.22 cluster, two servers with 2 and
> 3 OSTs have become very slow to respond, in the sense that commands like
> "lfs df" hang for ca. 30 s when they reach these OSTs in the list. Some
> of our clients do not have this problem, some have these contact(?)
> problems with the one server, some with the other, and it is time
> dependent: I have run "lfs df" without problem five times, and only on
> the sixth run would it halt.
>
> What really distinguishes these lame OSS machines from all others is
> that each has one ll_ost_123 thread that occupies one CPU core entirely.
> Since our servers have 8 cores and 8 GB RAM each, I didn't think this
> would actually impede Lustre operations.

This probably indicates an LBUG on that system that is causing the thread
to hang. You should check /var/log/messages to see what the root cause is.

> One of the said OSS had 2 of its 3 OSTs attached somewhat later than the
> first one. Hence the younger 2 appear later in a listing of OSTs such as
> you would get out of "lfs df". None of the clients stall on these OSTs,
> so I conclude that I am not dealing with network problems.
>
> Now for the OSS logs, there are indeed 'new' error messages:
>
> Nov 3 18:49:58 OSS kernel: Lustre:
> 13086:0:(socklnd_cb.c:2728:ksocknal_check_peer_timeouts()) Stale ZC_REQs
> for peer Client-IP@tcp detected: 4; the oldest (ffff81010fc15000) timed
> out 0 secs ago
>
> Nov 3 18:55:32 OSS kernel: LustreError:
> 13323:0:(events.c:66:request_out_callback()) @@@ type 4, status -5
> req@ffff8102005cbc00 x155576/t0 o106->@NET_0x200000a0c4487_UUID:15/16
> lens 232/296 e 0 to 1 dl 1257270939 ref 2 fl Rpc:/2/0 rc 0/0
> Nov 3 18:55:32 OSS kernel: LustreError:
> 13323:0:(events.c:66:request_out_callback()) Skipped 68485395 previous
> similar messages
>
> Status -5 means Linux error code -5 = I/O error? Silent disk corruption?
> Of course I don't have any other indications of hard disk failure. There
> was a power outage, however. Only it was already one week ago, and we
> did not see this behavior before today.

This is unlikely to mean disk filesystem corruption, but rather that there
was an error reading or writing over the network... The fact that there
were 68 million of these messages means something is quite wrong with that
node.

> Is there anything I can do to get rid of these annoying ll_ost threads
> in the running system? Of course I'm not sure they are the root of the
> problem...

Well, the ll_ost_* threads are the ones that are doing the actual work of
handling the RPCs, so you can't get rid of them. The modprobe.conf
oss_num_threads line is forcing the startup of 256 of those threads,
instead of letting Lustre start the appropriate number of threads based on
the system load.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
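P.S. A quick way to check both points, as a sketch only; the log and
modprobe file locations below are typical for a Debian setup and may
differ on your nodes:

    # look for an LBUG or the flood of LustreError lines on the affected OSS
    grep -iE 'LBUG|LustreError' /var/log/messages | tail -n 20

    # count the OST service threads that are actually running
    ps -e -o comm= | grep -c '^ll_ost'

    # to let Lustre size the thread pool itself, drop the forced value,
    # e.g. in /etc/modprobe.d/lustre (path is an assumption for your setup),
    # then reload the modules / remount the OSTs:
    #   options ost oss_num_threads=256    <- remove or comment out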