Simon Kelley
2009-Feb-10 10:17 UTC
[Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.
We're seeing problems which seem to be lock-related and result in data loss. This is a fairly low-probability event: it's happening on a ~1000-core compute farm which is heavily loaded, and the frequency is of the order of tens of failures an hour.

The clients are running kernel 2.6.22 and Lustre 1.6.6 from Sun. The servers are HP SFS, running Lustre 1.4.11. The mixed versions are deliberate: we're in the process of adding servers running 1.6.6, but none are currently in use; all servers are 1.4.11. 1.4.11 clients don't see problems, and there are some old 1.4.11 clients left still using these filesystems at the same time as the new 1.6.6 ones.

On the server (OST), an event typically looks like this:

2009/02/09 14:05:10 kern i sf2-sfs33: LustreError: 0:0:(ldlm_lockd.c:227:waiting_locks_callback()) ### lock callback timer expired: evicting client 4cdd29e6-8b9e-7e81-5d72-2a055f0e2a38@NET_0x20000ac1f6eee_UUID nid 172.31.110.238@tcp ns: filter-sf2-sfs-ost495_UUID lock: 0000010048ae0100/0x5fd2cf129ee3e5bb lrc: 1/0,0 mode: PW/PW res: 1542868/0 rrc: 4 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 20 remote: 0xc8fbbbbcb3ab0d8d expref: 386 pid: 16626
2009/02/09 14:05:30 kern i sf2-sfs33: LustreError: 16577:0:(ldlm_lib.c:1364:target_send_reply_msg()) @@@ processing error (-107) req@0000010073bbce00 x2138818/t0 o400-><?>@<?>:-1 lens 64/0 ref 0 fl Interpret:/0/0 rc -107/0

and on the client at the same time:

Feb 9 14:05:30 sf-2-3-10 kernel: LustreError: 11-0: an error occurred while communicating with 172.31.96.96@tcp. The obd_ping operation failed with -107
Feb 9 14:05:30 sf-2-3-10 kernel: LustreError: Skipped 12 previous similar messages
Feb 9 14:05:30 sf-2-3-10 kernel: Lustre: OSC_sf2-sfs2.internal.sanger.ac.uk_sf2-sfs-ost495_MNT_client_tcp-ffff81021f897800: Connection to service sf2-sfs-ost495 via nid 172.31.96.96@tcp was lost; in progress operations using this service will wait for recovery to complete.
Feb 9 14:05:30 sf-2-3-10 kernel: Lustre: Skipped 4 previous similar messages
Feb 9 14:05:30 sf-2-3-10 kernel: LustreError: 167-0: This client was evicted by sf2-sfs-ost495; in progress operations using this service will fail.

It gets worse: the same client, a few minutes later, saw this on another OST in the same filesystem:

2009/02/09 14:08:30 kern i sf2-sfs34: LustreError: 0:0:(ldlm_lockd.c:227:waiting_locks_callback()) ### lock callback timer expired: evicting client 4cdd29e6-8b9e-7e81-5d72-2a055f0e2a38@NET_0x20000ac1f6eee_UUID nid 172.31.110.238@tcp ns: filter-sf2-sfs-ost496_UUID lock: 000001006b8bfa40/0x900da6275f064529 lrc: 1/0,0 mode: PW/PW res: 1542328/0 rrc: 2 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 20 remote: 0xc8fbbbbcb3ab1478 expref: 172 pid: 16089
2009/02/09 14:08:30 kern i sf2-sfs33: LustreError: 0:0:(ldlm_lockd.c:227:waiting_locks_callback()) ### lock callback timer expired: evicting client f3275d1e-1360-e139-e71c-44b283d7ec3c@NET_0x20000ac1f665d_UUID nid 172.31.102.93@tcp ns: filter-sf2-sfs-ost495_UUID lock: 000001005a810040/0x5fd2cf129ee3e9f1 lrc: 1/0,0 mode: PW/PW res: 1542868/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 20 remote: 0xa899b6e20a06a83e expref: 316 pid: 16454
2009/02/09 14:08:30 kern i sf2-sfs33: LustreError: 16335:0:(ldlm_lockd.c:606:ldlm_server_completion_ast()) ### enqueue wait took 399491082us from 1234188111 ns: filter-sf2-sfs-ost495_UUID lock: 0000010064a638c0/0x5fd2cf129ee3ea68 lrc: 2/0,0 mode: PW/PW res: 1542868/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 20 remote: 0xc8423e042b195e36 expref: 501 pid: 16493
2009/02/09 14:08:32 storage w sf2-sfs1 : Warning: event log (/var/hpls/evlog/eventlog) size is 77760824
2009/02/09 14:08:50 kern i sf2-sfs34: LustreError: 16253:0:(ldlm_lib.c:1364:target_send_reply_msg()) @@@ processing error (-107) req@00000100b3ea4a00 x2150055/t0 o400-><?>@<?>:-1 lens 64/0 ref 0 fl Interpret:/0/0 rc -107/0
2009/02/09 14:09:02 kern i sf2-sfs33: LustreError: 16566:0:(ldlm_lib.c:1364:target_send_reply_msg()) @@@ processing error (-107) req@000001006b7b3200 x5180156/t0 o400-><?>@<?>:-1 lens 64/0 ref 0 fl Interpret:/0/0 rc -107/0

whilst the client suffered an assertion failure and LBUG:

Feb 9 14:08:50 sf-2-3-10 kernel: LustreError: 11-0: an error occurred while communicating with 172.31.96.97@tcp. The obd_ping operation failed with -107
Feb 9 14:08:50 sf-2-3-10 kernel: Lustre: OSC_sf2-sfs2.internal.sanger.ac.uk_sf2-sfs-ost496_MNT_client_tcp-ffff81021f897800: Connection to service sf2-sfs-ost496 via nid 172.31.96.97@tcp was lost; in progress operations using this service will wait for recovery to complete.
Feb 9 14:08:50 sf-2-3-10 kernel: LustreError: 167-0: This client was evicted by sf2-sfs-ost496; in progress operations using this service will fail.
Feb 9 14:08:50 sf-2-3-10 kernel: LustreError: 29564:0:(import.c:241:ptlrpc_invalidate_import()) sf2-sfs-ost495_UUID: rc = -110 waiting for callback (1 != 0)
Feb 9 14:08:50 sf-2-3-10 kernel: LustreError: 29564:0:(import.c:252:ptlrpc_invalidate_import()) ASSERTION(atomic_read(&imp->imp_inflight) == 0) failed
Feb 9 14:08:50 sf-2-3-10 kernel: Lustre: OSC_sf2-sfs2.internal.sanger.ac.uk_sf2-sfs-ost496_MNT_client_tcp-ffff81021f897800: Connection restored to service sf2-sfs-ost496 using nid 172.31.96.97@tcp.
Feb 9 14:08:50 sf-2-3-10 kernel: Lustre: Skipped 4 previous similar messages
Feb 9 14:08:50 sf-2-3-10 kernel: LustreError: 29564:0:(tracefile.c:450:libcfs_assertion_failed()) LBUG
Feb 9 14:08:50 sf-2-3-10 kernel: Lustre: 29564:0:(linux-debug.c:185:libcfs_debug_dumpstack()) showing stack for process 29564
Feb 9 14:08:50 sf-2-3-10 kernel: ll_imp_inval R running task 0 29564 2 (L-TLB)
Feb 9 14:08:50 sf-2-3-10 kernel: ffffffff804b1d28 ffffffff804b1d28 ffffffff804b1d28 0000000000000000
Feb 9 14:08:50 sf-2-3-10 kernel: ffffffff8856cf5f ffff8101a231fcc0 0000003000000030 ffff8101b8079e90
Feb 9 14:08:50 sf-2-3-10 kernel: ffff8101b8079d70 ffffffff8025d587 0000000000000286 ffffffff8025d5a9
Feb 9 14:08:50 sf-2-3-10 kernel: Call Trace:
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff8020b2f0>] show_trace+0x34/0x47
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff8020b3fc>] _show_stack+0xe2/0xf1
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff882f7b7c>] :libcfs:lbug_with_loc+0x6c/0xd0
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff882ffbc4>] :libcfs:libcfs_assertion_failed+0x54/0x60
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff88451785>] :ptlrpc:ptlrpc_invalidate_import+0x4d5/0x5c0
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff8022c500>] default_wake_function+0x0/0xe
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff80292e8c>] dput+0x1c/0x10b
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff88453ce9>] :ptlrpc:ptlrpc_invalidate_import_thread+0x49/0x2e0
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff8020aba8>] child_rip+0xa/0x12
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff88453ca0>] :ptlrpc:ptlrpc_invalidate_import_thread+0x0/0x2e0
Feb 9 14:08:50 sf-2-3-10 kernel: [<ffffffff8020ab9e>] child_rip+0x0/0x12

Userspace file operations terminate with an I/O error when these events happen. The failed operations seem to be writes to files which are long-open (spool files taking job output).

We are also seeing some userspace file operations fail with the error "No locks available". These don't generate any logging on the client, so I don't have exact timing. It's possible that they are associated with further "### lock callback timer expired" server logs.

OSTs have the Lustre timeout set to 200 seconds. System load is not excessive and no greater than has been supported for more than a year with 1.4 clients.

I'd be very happy to have any clues on fixing this...

Cheers,
Simon.
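(For context, the 200-second timeout referred to above is the global Lustre RPC/obd timeout, which on servers and clients of this vintage is normally exposed under /proc. The sketch below is illustrative only, not a record of the exact commands used on this system.)

# Inspect the current Lustre timeout, in seconds:
cat /proc/sys/lustre/timeout

# Set it to 200 seconds, as configured on the OSTs described above:
echo 200 > /proc/sys/lustre/timeout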
Oleg Drokin
2009-Feb-10 14:21 UTC
[Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.
Hello!

On Feb 10, 2009, at 5:17 AM, Simon Kelley wrote:

> Feb 9 14:05:30 sf-2-3-10 kernel: LustreError: 11-0: an error occurred while communicating with 172.31.96.96@tcp. The obd_ping operation failed with -107
> Feb 9 14:05:30 sf-2-3-10 kernel: LustreError: Skipped 12 previous similar messages
> Feb 9 14:05:30 sf-2-3-10 kernel: Lustre: OSC_sf2-sfs2.internal.sanger.ac.uk_sf2-sfs-ost495_MNT_client_tcp-ffff81021f897800: Connection to service sf2-sfs-ost495 via nid 172.31.96.96@tcp was lost; in progress operations using this service will wait for recovery to complete.
> Feb 9 14:05:30 sf-2-3-10 kernel: Lustre: Skipped 4 previous similar messages
> Feb 9 14:05:30 sf-2-3-10 kernel: LustreError: 167-0: This client was evicted by sf2-sfs-ost495; in progress operations using this service will fail.

What would be useful here is if you can enable dlm tracing (echo +dlm_trace >/proc/sys/lnet/debug) on some of those 1.6 nodes (also, if you are running with no debug enabled at all, enable the rpc_trace and info levels as well) and also enable the "dump on eviction" feature (echo 1 >/proc/sys/lustre/dump_on_eviction). Then, when the next eviction happens, some useful debug data will be dumped on the client, which you can attach to a bugzilla bug along with the server-side eviction message (processed with the "lctl dl" command first).

> We are also seeing some userspace file operations fail with the error "No locks available". These don't generate any logging on the client, so I don't have exact timing. It's possible that they are associated with further "### lock callback timer expired" server logs.

This error code typically means an application is attempting to do some I/O and Lustre no longer has a lock for the I/O area for some reason (it is normally obtained once the read or write path is entered), and that could be related to evictions too (locks are revoked at eviction time).

Bye,
Oleg
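(For anyone wanting to apply the settings Oleg describes across a set of clients, they boil down to roughly the shell sketch below. The flag names are taken verbatim from his message; on some releases the debug mask reports them without underscores (dlmtrace, rpctrace), so check what the client actually accepts. This is an illustrative sketch, not the exact procedure used on this system.)

# Widen the client debug mask to include DLM tracing; if no debug is
# currently enabled, add RPC tracing and info-level messages as well.
echo +dlm_trace > /proc/sys/lnet/debug
echo +rpc_trace > /proc/sys/lnet/debug
echo +info > /proc/sys/lnet/debug

# Ask the client to dump its debug buffer automatically when it is evicted.
echo 1 > /proc/sys/lustre/dump_on_eviction

# Confirm the resulting debug mask.
cat /proc/sys/lnet/debug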
Simon Kelley
2009-Feb-10 17:11 UTC
[Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.
Oleg Drokin wrote:

> What would be useful here is if you can enable dlm tracing (echo +dlm_trace >/proc/sys/lnet/debug) on some of those 1.6 nodes (also, if you are running with no debug enabled at all, enable the rpc_trace and info levels as well) and also enable the "dump on eviction" feature (echo 1 >/proc/sys/lustre/dump_on_eviction). Then, when the next eviction happens, some useful debug data will be dumped on the client, which you can attach to a bugzilla bug along with the server-side eviction message (processed with the "lctl dl" command first).

OK, will do. The main problem is reproducing the error: our users have unreasonably insisted that we run their jobs using known-good 1.4 clients, and even if I grab their code to run on isolated test nodes, _most_ runs are fine.

>> We are also seeing some userspace file operations fail with the error "No locks available". These don't generate any logging on the client, so I don't have exact timing. It's possible that they are associated with further "### lock callback timer expired" server logs.
>
> This error code typically means an application is attempting to do some I/O and Lustre no longer has a lock for the I/O area for some reason (it is normally obtained once the read or write path is entered), and that could be related to evictions too (locks are revoked at eviction time).

I should have mentioned that we are also seeing many errors of the form "LustreError: 19842:0:(ldlm_lockd.c:1078:ldlm_handle_cancel()) received cancel for unknown lock cookie". Checking back, these would seem to pre-date the introduction of 1.6 clients, and even after we upgraded clients I can see them associated with both 1.4 and 1.6 clients. They may indicate something else relevant about the filesystems or workload.

Cheers,
Simon.
Oleg Drokin
2009-Feb-10 17:18 UTC
[Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.
Hello!

On Feb 10, 2009, at 12:11 PM, Simon Kelley wrote:

>>> We are also seeing some userspace file operations fail with the error "No locks available". These don't generate any logging on the client, so I don't have exact timing. It's possible that they are associated with further "### lock callback timer expired" server logs.
>>
>> This error code typically means an application is attempting to do some I/O and Lustre no longer has a lock for the I/O area for some reason (it is normally obtained once the read or write path is entered), and that could be related to evictions too (locks are revoked at eviction time).
>
> I should have mentioned that we are also seeing many errors of the form "LustreError: 19842:0:(ldlm_lockd.c:1078:ldlm_handle_cancel()) received cancel for unknown lock cookie". Checking back, these would seem to pre-date the introduction of 1.6 clients, and even after we upgraded clients I can see them associated with both 1.4 and 1.6 clients. They may indicate something else relevant about the filesystems or workload.

Hm, that means clients hold some locks that the server does not believe they have, which is pretty strange. Or it just does not recognize the lock released by the client and later releases the client. If you have a complete kernel log of the event, that might be useful to see the sequence of events.

I assume you do not have a flaky network and your clients do not reconnect to the servers all the time, with messages in the logs like 'changed handle from X to Y; copying, but this may foreshadow disaster', which would be a different bug, one also no longer present in 1.6.6.

Bye,
Oleg
Simon Kelley
2009-Feb-10 17:46 UTC
[Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.
Oleg Drokin wrote:

> Hello!
>
> On Feb 10, 2009, at 12:11 PM, Simon Kelley wrote:
>>>> We are also seeing some userspace file operations fail with the error "No locks available". These don't generate any logging on the client, so I don't have exact timing. It's possible that they are associated with further "### lock callback timer expired" server logs.
>>> This error code typically means an application is attempting to do some I/O and Lustre no longer has a lock for the I/O area for some reason (it is normally obtained once the read or write path is entered), and that could be related to evictions too (locks are revoked at eviction time).
>> I should have mentioned that we are also seeing many errors of the form "LustreError: 19842:0:(ldlm_lockd.c:1078:ldlm_handle_cancel()) received cancel for unknown lock cookie". Checking back, these would seem to pre-date the introduction of 1.6 clients, and even after we upgraded clients I can see them associated with both 1.4 and 1.6 clients. They may indicate something else relevant about the filesystems or workload.
>
> Hm, that means clients hold some locks that the server does not believe they have, which is pretty strange. Or it just does not recognize the lock released by the client and later releases the client. If you have a complete kernel log of the event, that might be useful to see the sequence of events.

If, by "the complete event", you mean the "received cancel for unknown cookie", there's not much more to tell. Grepping through the last month's server logs shows that there are bursts of typically between 3 and 7 messages, at the same time and from the same client. After a gap, the same thing but from a different client. The number can be as low as one, and up to ten. They look to be related to client workload, at a guess. Picking a few events and looking at the client logfile for the same time gives absolutely nothing at all.

> I assume you do not have a flaky network and your clients do not reconnect to the servers all the time, with messages in the logs like 'changed handle from X to Y; copying, but this may foreshadow disaster', which would be a different bug, one also no longer present in 1.6.6.

None of that applies.

Simon.
Oleg Drokin
2009-Feb-10 17:48 UTC
[Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.
Hello!

On Feb 10, 2009, at 12:46 PM, Simon Kelley wrote:

> If, by "the complete event", you mean the "received cancel for unknown cookie", there's not much more to tell. Grepping through the last month's server logs shows that there are bursts of typically between 3 and 7 messages, at the same time and from the same client. After a gap, the same thing but from a different client. The number can be as low as one, and up to ten. They look to be related to client workload, at a guess.

Ok, so you do not see a pattern of this unknown cookie message being followed by an eviction within some time like 100 seconds? That's what my question was about.

Bye,
Oleg
Simon Kelley
2009-Feb-11 09:45 UTC
[Lustre-discuss] lock timeouts and OST evictions on 1.4 server - 1.6 client system.
Oleg Drokin wrote:

> Hello!
>
> On Feb 10, 2009, at 12:46 PM, Simon Kelley wrote:
>> If, by "the complete event", you mean the "received cancel for unknown cookie", there's not much more to tell. Grepping through the last month's server logs shows that there are bursts of typically between 3 and 7 messages, at the same time and from the same client. After a gap, the same thing but from a different client. The number can be as low as one, and up to ten. They look to be related to client workload, at a guess.
>
> Ok, so you do not see a pattern of this unknown cookie message being followed by an eviction within some time like 100 seconds? That's what my question was about.
>
> Bye,
> Oleg

No, there are plenty of examples of the unknown cookie message with no eviction or other problem. It's possible that there is a pattern where there's a run of "unknown lock cookie" messages for a particular client and then, a couple of minutes later, "lock callback timer expired: evicting client" messages for a _different_ client, but the signal is not strong.

It does look like the clusters of "unknown lock cookie" may be related to striping. If a file striped across several OSTs experiences the problem, then there is a cluster of the messages all referencing the same client node, one from each OST.

I'm working on reproducing the problem in a controlled way and getting the information you asked for.

Cheers,
Simon.
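(A rough way to look for the pattern discussed above is to pull both message types out of the server logs in time order and compare the bursts of "unknown lock cookie" messages with any eviction that follows a few minutes later. The sketch below assumes a plain-text server log at a hypothetical path, and that the first two fields of each line are the date and time; substitute the real location and format.)

# Hypothetical path to a consolidated, plain-text server event log.
LOG=/var/log/lustre-server-events.log

# Both message types, in the order they were logged, for eyeballing the
# "burst of unknown cookies, then an eviction a couple of minutes later"
# pattern:
grep -E 'received cancel for unknown lock cookie|lock callback timer expired: evicting client' "$LOG"

# Size of each burst of unknown-cookie messages, grouped by timestamp:
grep 'received cancel for unknown lock cookie' "$LOG" | awk '{print $1, $2}' | uniq -c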