thr3ads.net - Lustre discuss - [Lustre-discuss] client timeout (input/output error) [Jul 2006]

Mustafa A. Hashmi
2006-Jul-10 10:27 UTC
[Lustre-discuss] client timeout (input/output error)

All:

We are testing a 4-node configuration: 2 MDS nodes (one as failover)
and 2 OSS nodes (2 OSTs with DRBD replication both ways and
active/active failover). We connected 3 clients and started writing
files from all of them into the same directory. The files are written
by a small script whose purpose is to completely fill the OSTs in
question.

The test is still running because the files being written are only 1k
in size (we realize this isn't optimal, or what Lustre is designed for).
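
For readers who want to reproduce this kind of load: the original script
is not included in this message, so the following is only a minimal
sketch of what such a fill test might look like. The function name, the
file naming scheme and the max_files parameter are assumptions made for
illustration; only the 1k file size and the "write until full" goal come
from the description above.

```python
import errno
import os


def fill_with_small_files(directory, file_size=1024, max_files=None):
    """Write fixed-size files into `directory` until a write fails or
    `max_files` files have been written.

    Returns (files_written, errno_name), where errno_name is None if the
    loop stopped because max_files was reached.
    """
    payload = b"x" * file_size
    written = 0
    while max_files is None or written < max_files:
        # Hypothetical naming scheme: the PID keeps clients on the same
        # host from colliding in the shared directory.
        path = os.path.join(directory, f"fill_{os.getpid()}_{written:08d}")
        try:
            with open(path, "wb") as f:
                f.write(payload)
                f.flush()
                os.fsync(f.fileno())  # push the data out before counting it
        except OSError as exc:
            # ENOSPC is the expected end of a fill test; EIO is the
            # failure reported in this thread.
            return written, errno.errorcode.get(exc.errno, str(exc.errno))
        written += 1
    return written, None
```

Calling fill_with_small_files("/mnt/lustre/testdir") with max_files left
unset (the path is a placeholder) would keep writing until the first
error and report which errno stopped it.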

The problem:
--
On any client node that is running the script, write operations
anywhere in the Lustre file system (from that same client) fail with
an input/output error. The following is logged on the client side:

-- Client side log --
LustreError: 25832:0:(ldlm_resource.c:365:ldlm_namespace_cleanup())
Namespace OSC_node50_ost2_MNT_client-d6f7ba00 resource refcount 1
after lock cleanup; forcing cleanup.
LustreError: 25832:0:(ldlm_resource.c:365:ldlm_namespace_cleanup())
previously skipped 2816 similar messages
LustreError: 25832:0:(file.c:462:ll_pgcache_remove_extent()) writepage
of page c11b2ec0 failed: -5
LustreError: 25832:0:(file.c:462:ll_pgcache_remove_extent())
previously skipped 2661 similar messages
Lustre: OSC_node50_ost2_MNT_client-d6f7ba00: Connection restored to
service ost2 using nid 192.168.0.41@tcp.
LustreError: 10841:0:(lov_request.c:172:lov_update_enqueue_set())
error: enqueue objid 0xd949a6 subobj 0xa00b8 on OST idx 1: rc = -5
LustreError: Connection to service n50mds via nid 192.168.0.50@tcp was
lost; in progress operations using this service will wait for recovery
to complete.
LustreError: This client was evicted by n50mds; in progress operations
using this service will fail.
LustreError: 29725:0:(ldlm_resource.c:365:ldlm_namespace_cleanup())
Namespace MDC_node50_n50mds_MNT_client-d6f7ba00 resource refcount 1
after lock cleanup; forcing cleanup.
LustreError: 29725:0:(ldlm_resource.c:365:ldlm_namespace_cleanup())
previously skipped 2788 similar messages
Lustre: MDC_node50_n50mds_MNT_client-d6f7ba00: Connection restored to
service n50mds using nid 192.168.0.50@tcp.

-- end client log --
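
A note on the "-5" values in the log above: Linux kernel code reports
errors as negative errno values, and errno 5 is EIO, which is the
"Input/output error" the client applications see. The sketch below shows
one way to decode such codes from a log line. The decode_rc helper and
its regular expression are illustrative assumptions, not part of any
Lustre tool, and they only cover the two message shapes seen in this log
("rc = -5" and "failed: -5").

```python
import errno
import os
import re


def decode_rc(log_line):
    """Return (errno_name, description) for a negative return code found
    in a log line, or None if the line contains no such code."""
    match = re.search(r"(?:rc = |failed: )(-\d+)", log_line)
    if match is None:
        return None
    code = -int(match.group(1))  # kernel logs -errno; flip the sign
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


if __name__ == "__main__":
    line = "error: enqueue objid 0xd949a6 subobj 0xa00b8 on OST idx 1: rc = -5"
    print(decode_rc(line))
```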

-- mds log --

LustreError: 25832:0:(ldlm_resource.c:365:ldlm_namespace_cleanup())
previously skipped 2816 similar messages
LustreError: 25832:0:(file.c:462:ll_pgcache_remove_extent()) writepage
of page c11b2ec0 failed: -5
LustreError: 25832:0:(file.c:462:ll_pgcache_remove_extent())
previously skipped 2661 similar messages
Lustre: OSC_node50_ost2_MNT_client-d6f7ba00: Connection restored to
service ost2 using nid 192.168.0.41@tcp.
LustreError: 10841:0:(lov_request.c:172:lov_update_enqueue_set())
error: enqueue objid 0xd949a6 subobj 0xa00b8 on OST idx 1: rc = -5
LustreError: Connection to service n50mds via nid 192.168.0.50@tcp was
lost; in progress operations using this service will wait for recovery
to complete.
LustreError: This client was evicted by n50mds; in progress operations
using this service will fail.
LustreError: 29725:0:(ldlm_resource.c:365:ldlm_namespace_cleanup())
Namespace MDC_node50_n50mds_MNT_client-d6f7ba00 resource refcount 1
after lock cleanup; forcing cleanup.
LustreError: 29725:0:(ldlm_resource.c:365:ldlm_namespace_cleanup())
previously skipped 2788 similar messages
Lustre: MDC_node50_n50mds_MNT_client-d6f7ba00: Connection restored to
service n50mds using nid 192.168.0.50@tcp.


-- end mds log --

A 4th client on our side can still open connections successfully for
write. However, I am sure that running the script there and then
trying a second write operation would fail, as on the other 3 clients.

Could someone please explain what is going on here?

Please note: Request ID: 527 @
https://buffalo.lustre.org:8443/completed_requests.pl describes
similar behavior.

--
Final note: the MDS and OSS nodes run version
1.4.6.4-19700101050000-PRISTINE-.usr.src.linux-2.6.12.6.

The clients are:
1.4.6.1-19700101050000-PRISTINE-.usr.src.linux-2.6.12.6

Regards,
-- 
Mustafa A. Hashmi
mahashmi@gmail.com
mh@stderr.net