I'm trying to determine the issue causing the log entries below on our backup Lustre filesystem. It happens once or twice an hour during an rsync from another 20TB Lustre filesystem. I don't see any errors on the 20TB filesystem. I've read that it's not a good idea to run the MDT/MGS/OST all on the same server, so maybe that is the reason for the errors, but I'd like to understand it better.

The setup is:

* 20TB Lustre fs on CentOS 5.2 64-bit + Lustre 1.6.5. Completely separate nodes; it functions without errors.
* 8TB backup Lustre fs on CentOS 5.2 64-bit + Lustre 1.6.6. MDT/MGS/OST all on a single server with 4GB memory. I mount the 20TB Lustre fs on this machine and also run the rsync on it.

Nov 13 00:42:42 mds kernel: Lustre: Request x148006922 sent from mybackup-OST0000 to NID 0@lo 7s ago has timed out (limit 6s).
Nov 13 00:42:42 mds kernel: LustreError: 138-a: mybackup-OST0000: A client on nid 0@lo was evicted due to a lock completion callback to 0@lo timed out: rc -107
Nov 13 00:42:42 mds kernel: LustreError: 2263:0:(ldlm_lib.c:1619:target_send_reply_msg()) @@@ processing error (-107) req@eaffe400 x148007203/t0 o4-><?>@<?>:0/0 lens 384/0 e 0 to 0 dl 1226565862 ref 1 fl Interpret:/0/0 rc -107/0
Nov 13 00:42:42 mds kernel: LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_write operation failed with -107
Nov 13 00:42:42 mds kernel: Lustre: mybackup-OST0000-osc-f27eac00: Connection to service mybackup-OST0000 via nid 0@lo was lost; in progress operations using this service will wait for recovery to complete.
Nov 13 00:42:42 mds kernel: LustreError: 167-0: This client was evicted by mybackup-OST0000; in progress operations using this service will fail.
Nov 13 00:42:42 mds kernel: LustreError: 2163:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -5 from cancel RPC: canceling anyway
Nov 13 00:42:42 mds kernel: LustreError: 2163:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -5
Nov 13 00:42:42 mds kernel: LustreError: 2104:0:(client.c:722:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@f22d2a00 x148007205/t0 o4->mybackup-OST0000_UUID@192.168.10.14@tcp:6/4 lens 384/480 e 0 to 100 dl 0 ref 2 fl Rpc:/0/0 rc 0/0
Nov 13 00:42:42 mds kernel: LustreError: 2104:0:(client.c:722:ptlrpc_import_delay_req()) Skipped 8 previous similar messages
Nov 13 00:42:42 mds kernel: Lustre: mybackup-OST0000-osc-f27eac00: Connection restored to service mybackup-OST0000 using nid 0@lo.

Thanks for any advice.
On Nov 13, 2008 01:12 -0800, daledude wrote:
> * 8TB backup Lustre fs on CentOS 5.2 64-bit + Lustre 1.6.6. MDT/MGS/OST
>   all on a single server with 4GB memory. I mount the 20TB Lustre fs on
>   this machine and also run the rsync on it.

This is documented as an unsupported configuration, mainly due to the risk of a client thread flushing dirty data under memory pressure waiting for an OST thread trying to write that data to disk, while that OST thread in turn needs to allocate memory to complete the write...

> Nov 13 00:42:42 mds kernel: Lustre: Request x148006922 sent from
> mybackup-OST0000 to NID 0@lo 7s ago has timed out (limit 6s).
> Nov 13 00:42:42 mds kernel: LustreError: 11-0: an error occurred while
> communicating with 0@lo. The ost_write operation failed with -107
> Nov 13 00:42:42 mds kernel: Lustre: mybackup-OST0000-osc-f27eac00:
> Connection to service mybackup-OST0000 via nid 0@lo was lost; in
> progress operations using this service will wait for recovery to
> complete.

There appears to be a timeout communicating from the local machine to itself (0@lo). That can't possibly be due to "network" problems, because the Lustre "loopback" network is simply a memcpy. It is possible that, with the machine overloaded, some threads are just taking too long to be scheduled because of the many server threads on the system.

You could try increasing the /proc/sys/lustre/ldlm_timeout value to see if this helps. You could also try limiting the number of server threads running on the system by putting options in /etc/modprobe.conf:

    options mds mds_num_threads=32
    options ost oss_num_threads=32

That will reduce the contention on the node and allow the threads to run more frequently.

Another, better, solution is to mount the 8TB filesystem on one of the nodes that is also mounting the 20TB filesystem and run rsync there.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
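For anyone applying the suggestions above on the combined MDS/MGS/OSS node, a minimal sketch of the steps might look like the following. The timeout value of 40 seconds is only an illustrative assumption, not a value recommended in the thread; pick something appropriate for your load.

    # Check and raise the lock callback timeout (value in seconds).
    # 40 is an assumed example value, not a recommendation from the thread.
    cat /proc/sys/lustre/ldlm_timeout
    echo 40 > /proc/sys/lustre/ldlm_timeout

    # Cap the server thread counts as suggested; append to /etc/modprobe.conf
    # so the settings apply the next time the Lustre modules are loaded.
    cat >> /etc/modprobe.conf <<'EOF'
    options mds mds_num_threads=32
    options ost oss_num_threads=32
    EOF

The ldlm_timeout change takes effect immediately; the module options only apply after the Lustre modules are reloaded (or the node is rebooted).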