Mustafa A. Hashmi
2006-Jul-10 10:27 UTC
[Lustre-discuss] client timeout (input/output error)
All:

We are testing a 4-node configuration: 2 MDS nodes (one failover) and 2 OSS nodes (2 OSTs with DRBD replication both ways and active/active failover). We connected 3 clients and started writing files from all of them into the same directory. The files are written by a small script whose purpose is to completely fill the OSTs in question. The test is still running, because the files being written are only 1k in size (we realize this isn't optimal or what Lustre is designed for).

The problem:
--
On any client node that is executing the script in question, write operations anywhere in the Lustre file system (from the same client) fail with an input/output error. The following is logged on the client side:

-- Client side log --
LustreError: 25832:0:(ldlm_resource.c:365:ldlm_namespace_cleanup()) Namespace OSC_node50_ost2_MNT_client-d6f7ba00 resource refcount 1 after lock cleanup; forcing cleanup.
LustreError: 25832:0:(ldlm_resource.c:365:ldlm_namespace_cleanup()) previously skipped 2816 similar messages
LustreError: 25832:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c11b2ec0 failed: -5
LustreError: 25832:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 2661 similar messages
Lustre: OSC_node50_ost2_MNT_client-d6f7ba00: Connection restored to service ost2 using nid 192.168.0.41@tcp.
LustreError: 10841:0:(lov_request.c:172:lov_update_enqueue_set()) error: enqueue objid 0xd949a6 subobj 0xa00b8 on OST idx 1: rc = -5
LustreError: Connection to service n50mds via nid 192.168.0.50@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: This client was evicted by n50mds; in progress operations using this service will fail.
LustreError: 29725:0:(ldlm_resource.c:365:ldlm_namespace_cleanup()) Namespace MDC_node50_n50mds_MNT_client-d6f7ba00 resource refcount 1 after lock cleanup; forcing cleanup.
LustreError: 29725:0:(ldlm_resource.c:365:ldlm_namespace_cleanup()) previously skipped 2788 similar messages
Lustre: MDC_node50_n50mds_MNT_client-d6f7ba00: Connection restored to service n50mds using nid 192.168.0.50@tcp.
-- end client log --

-- mds log --
LustreError: 25832:0:(ldlm_resource.c:365:ldlm_namespace_cleanup()) previously skipped 2816 similar messages
LustreError: 25832:0:(file.c:462:ll_pgcache_remove_extent()) writepage of page c11b2ec0 failed: -5
LustreError: 25832:0:(file.c:462:ll_pgcache_remove_extent()) previously skipped 2661 similar messages
Lustre: OSC_node50_ost2_MNT_client-d6f7ba00: Connection restored to service ost2 using nid 192.168.0.41@tcp.
LustreError: 10841:0:(lov_request.c:172:lov_update_enqueue_set()) error: enqueue objid 0xd949a6 subobj 0xa00b8 on OST idx 1: rc = -5
LustreError: Connection to service n50mds via nid 192.168.0.50@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: This client was evicted by n50mds; in progress operations using this service will fail.
LustreError: 29725:0:(ldlm_resource.c:365:ldlm_namespace_cleanup()) Namespace MDC_node50_n50mds_MNT_client-d6f7ba00 resource refcount 1 after lock cleanup; forcing cleanup.
LustreError: 29725:0:(ldlm_resource.c:365:ldlm_namespace_cleanup()) previously skipped 2788 similar messages
Lustre: MDC_node50_n50mds_MNT_client-d6f7ba00: Connection restored to service n50mds using nid 192.168.0.50@tcp.
-- end mds log --

A 4th client on our side can still open connections successfully for write. However, I am sure that executing the script there and then trying a second write operation would fail, as it does on the other 3 clients.

Could someone please explain what is going on here?

Please note: Request ID: 527 @ https://buffalo.lustre.org:8443/completed_requests.pl describes similar behavior.

--
Final note:
MDS and OSS version: 1.4.6.4-19700101050000-PRISTINE-.usr.src.linux-2.6.12.6
Client version: 1.4.6.1-19700101050000-PRISTINE-.usr.src.linux-2.6.12.6

Regards,
--
Mustafa A. Hashmi
mahashmi@gmail.com
mh@stderr.net
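
The write script itself was not included in the post. For anyone trying to reproduce the load, a minimal sketch of a small-file fill loop of the kind described might look like the following. Everything in it is an assumption rather than the original script: the function name fill_with_small_files, the file naming scheme, the max_files limit, and the choice to stop at the first failed write.

```python
import os
import socket


def fill_with_small_files(target_dir, file_size=1024, max_files=None):
    """Write fixed-size files into target_dir until a write fails.

    Hypothetical helper, not the script from the post. Returns a tuple
    (files_written, errno_or_None); errno is set when an OSError stopped
    the loop (for example ENOSPC once the target is full, or EIO).
    """
    payload = b"x" * file_size
    host = socket.gethostname()
    pid = os.getpid()
    written = 0
    while max_files is None or written < max_files:
        # Host name and PID keep names unique when several clients
        # write into the same directory at once.
        name = "%s-%d-%08d" % (host, pid, written)
        path = os.path.join(target_dir, name)
        try:
            # Errors raised when the file is closed are also caught here.
            with open(path, "wb") as f:
                f.write(payload)
        except OSError as exc:
            return written, exc.errno
        written += 1
    return written, None


# Example call (the mount point is a placeholder, not taken from the post):
#   count, err = fill_with_small_files("/mnt/lustre/testdir")
```

With max_files left at None the loop runs until a write fails. The "rc = -5" and "failed: -5" values in the logs above are the negated errno EIO, the same input/output error the clients see, so a run that hits this problem would be expected to return errno.EIO as the second element.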