Jake Maul
2009-Jul-18 18:03 UTC
[Lustre-discuss] Lustre 1.8 refused reconnection and evictions
Greetings,

I have a new, still very small Lustre 1.8 cluster... currently only 3 clients and 1 OST. All servers are CentOS 5.3. The MDS and OST are both Lustre 1.8.0. One of the clients is 1.8.0 running the Lustre kernel from Sun; the other two are Lustre 1.8.0.1 running the stock RHEL/CentOS kernel (2.6.18-128.1.6.el5), which is newly supported by the 1.8.0.1 precompiled patchless kernel-module RPMs.

Under a fairly light load, we're starting to see lots and lots of the following errors, and extremely poor performance.

On the MDS:

kernel: Lustre: 16140:0:(ldlm_lib.c:815:target_handle_connect()) storage-MDT0000: refuse reconnection from 17df8648-2c26-03c8-a0e0-d427bb234f95@209.188.6.6@tcp to 0xffff8100bd4aa000; still busy with 2 active RPCs

LustreError: 138-a: storage-MDT0000: A client on nid x.x.x.x@tcp was evicted due to a lock blocking callback to x.x.x.x@tcp timed out: rc -107

On the OST:

kernel: Lustre: 25764:0:(ldlm_lib.c:815:target_handle_connect()) storage-OST0000: refuse reconnection from b9281ac1-ae7d-a4a4-ad9a-c123f8a3b639@x.x.x.y@tcp to 0xffff81002139e000; still busy with 10 active RPCs

kernel: Lustre: 25804:0:(ldlm_lib.c:815:target_handle_connect()) storage-OST0000: refuse reconnection from 17df8648-2c26-03c8-a0e0-d427bb234f95@x.x.x.x@tcp to 0xffff810105a72000; still busy with 12 active RPCs

LustreError: 138-a: storage-OST0000: A client on nid x.x.x.x@tcp was evicted due to a lock glimpse callback to x.x.x.x@tcp timed out: rc -107

Another common problem is clients not reconnecting during recovery. Instead, recovery often just sits and waits out the full 5 minutes, even with this tiny number of clients (4 total... the MDS sometimes mounts the data store itself).

Any ideas on what we can tune to get this back on track?

Many thanks,
Jake
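[For anyone chasing similar evictions: a useful first check is whether every node is advertising the NID you expect, and whether the servers can actually route back to it. A minimal sketch using the standard lctl utility; the client NID below is a placeholder, not taken from this thread:

    # on each client and server: show the NID(s) this node advertises on LNET
    lctl list_nids

    # from the MDS/OSS: verify two-way LNET connectivity to a client's NID
    lctl ping x.x.x.x@tcp

If list_nids on a client shows an address the servers cannot reach (as turns out to be the case in the follow-up below), the servers' blocking and glimpse callbacks to that client time out and it gets evicted with rc -107.]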
Jake Maul
2009-Jul-20 19:27 UTC
[Lustre-discuss] Lustre 1.8 refused reconnection and evictions
I believe I may have sorted this out.

The 3 frontend clients have 2 interfaces, eth0 and eth1. They also have an extra (frontend) IP bound to lo:1 for IPVS-based load balancing (this works by changing some ARP-related settings, if you've never worked with IPVS in gatewaying / direct routing mode). For some reason, all 3 of them had chosen that IP as their NID. They could still communicate with the backends via eth1, but the backends couldn't (always) reach them. It worked for a few days before becoming a significant problem.

I think I've fixed this by adding this line to /etc/modprobe.conf:

options lnet networks=tcp0(eth1)

I had foolishly skipped over this part of the documentation, as it's in the "More Complicated Configurations" section, whereas the "Simple TCP Network" section omits specifying allowable interfaces. I thought I *had* a pretty simple network, at least from that point of view. :)

Since doing that, it's been running for approximately an hour with no refused connections or evictions. I'll update this thread again if the problem reappears...

Thanks,
Jake

On Sat, Jul 18, 2009 at 11:03 AM, Jake Maul <jakemaul@gmail.com> wrote:
> [snip: original message quoted in full above]
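[For completeness, a sketch of applying and verifying this fix. The unload/reload sequence and the mount point are assumptions, not steps from the thread; on a live client you would unmount Lustre before reloading modules:

    # /etc/modprobe.conf: restrict LNET to eth1 so the IPVS address on lo:1
    # can never be selected as this node's NID
    options lnet networks=tcp0(eth1)

    # make the option take effect (placeholder paths/steps; adjust to your setup)
    umount /mnt/lustre      # hypothetical mount point
    lustre_rmmod            # unload all Lustre/LNET modules
    modprobe lnet           # reload LNET with the new option
    lctl network up         # bring the network up under the new restriction
    lctl list_nids          # should now show eth1's address@tcp

Pinning LNET to a specific interface with networks= is the usual way to avoid this on multi-homed hosts; by default LNET picks an interface itself, which is exactly what went wrong here.]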