We''re having an OST issue stemming from a storage array problem we had the other day and was hoping maybe someone could help us out... The scenario we''re seeing is as follows; 41 OST''s mount perfectly fine. OST_41 mounts and recovery appears to work however it appears that no new objects can be created on the OST and after a while the MDS marks it inactive. (*Note - the device in the OS is named OST_41, multipathd has it as dm-46 and in Lustre it is known as 0029 - It''s a long story....) Let me know if anyone needs the full dmesg logs - I''ve also run e2fsck on the OST prior to mounting it and we''re seeing the same problem. I should note that 90% of the clients appear to be able to read from the OST just fine - but writes are not working. The OSS is serving up numerous other OSTs without issue. Ex: lfs setstripe --count 1 --index 41 filename And the file gets created on OST-0... but something like lfs check servers reports all OSTs are active. Help would be much appreciated. --Mike Relevent MDS Log; Lustre: MDS scratch-MDT0000: scratch-OST001f_UUID now active, resetting orphans Lustre: 8405:0:(quota_master.c:1718:mds_quota_recovery()) Only 41/42 OSTs are active, abort quota recovery Lustre: 8405:0:(quota_master.c:1718:mds_quota_recovery()) Skipped 1 previous similar message Lustre: scratch-OST0021-osc: Connection restored to service scratch-OST0021 using nid 10.0.1.235 at o2ib. Lustre: Skipped 1 previous similar message Lustre: MDS scratch-MDT0000: scratch-OST0021_UUID now active, resetting orphans Lustre: Skipped 1 previous similar message LustreError: 15966:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 1 of 2 llog-records failed: -2 LustreError: 15966:0:(llog_server.c:466:llog_origin_handle_cancel()) Skipped 1 previous similar message Lustre: 8406:0:(import.c:517:import_select_connection()) scratch-OST0029-osc: tried all connections, increasing latency to 16s Lustre: 8406:0:(import.c:517:import_select_connection()) Skipped 218 previous similar messages LustreError: 15971:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 1 of 2 llog-records failed: -2 LustreError: 15959:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 1 of 1 llog-records failed: -2 LustreError: 15966:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 1 of 2 llog-records failed: -2 Lustre: 8405:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1409470549792684 sent from scratch-OST0029-osc to NID 10.0.1.236 at o2ib 0s ago has failed due to network error (23s prior to deadline). req at ffff810335e53000 x1409470549792684/t0 o8->scratch-OST0029_UUID at 10.0.1.236@o2ib:28/4 lens 368/584 e 0 to 1 dl 1344610732 ref 1 fl Rpc:N/0/0 rc 0/0 Lustre: 8405:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 196 previous similar messages LustreError: 15970:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 1 of 2 llog-records failed: -2 LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x1a78fe63 sub-object on OST idx 12/4: rc = -107 LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) Skipped 18 previous similar messages Lustre: 8406:0:(import.c:517:import_select_connection()) scratch-OST0029-osc: tried all connections, increasing latency to 25s Lustre: 8406:0:(import.c:517:import_select_connection()) Skipped 11 previous similar messages Lustre: 8405:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1409470549796118 sent from scratch-OST0029-osc to NID 10.0.1.236 at o2ib 0s ago has failed due to network error (30s prior to deadline). req at ffff810349dfbc00 x1409470549796118/t0 o8->scratch-OST0029_UUID at 10.0.1.236@o2ib:28/4 lens 368/584 e 0 to 1 dl 1344611365 ref 1 fl Rpc:N/0/0 rc 0/0 Lustre: 8405:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 10 previous similar messages Lustre: 8406:0:(import.c:517:import_select_connection()) scratch-OST0029-osc: tried all connections, increasing latency to 10s Lustre: 8406:0:(import.c:517:import_select_connection()) Skipped 14 previous similar messages Lustre: 8405:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1409470549799083 sent from scratch-OST0029-osc to NID 10.0.1.236 at o2ib 0s ago has failed due to network error (17s prior to deadline). req at ffff810333fb0c00 x1409470549799083/t0 o8->scratch-OST0029_UUID at 10.0.1.236@o2ib:28/4 lens 368/584 e 0 to 1 dl 1344611974 ref 1 fl Rpc:N/0/0 rc 0/0 Lustre: 8405:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 15 previous similar messages Lustre: 8405:0:(quota_master.c:1718:mds_quota_recovery()) Only 41/42 OSTs are active, abort quota recovery Lustre: 8405:0:(quota_master.c:1718:mds_quota_recovery()) Skipped 12 previous similar messages Lustre: scratch-OST0029-osc: Connection restored to service scratch-OST0029 using nid 10.0.1.235 at o2ib. Lustre: Skipped 12 previous similar messages Lustre: MDS scratch-MDT0000: scratch-OST0029_UUID now active, resetting orphans Lustre: Skipped 12 previous similar messages LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0xe09537f sub-object on OST idx 41/4: rc = -107 LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0xe09538d sub-object on OST idx 41/4: rc = -107 LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) Skipped 1 previous similar message LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x11a04ba3 sub-object on OST idx 41/4: rc = -107 LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0xe09538e sub-object on OST idx 41/4: rc = -107 LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x11a81070 sub-object on OST idx 41/4: rc = -107 LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) Skipped 2 previous similar messages LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x15790270 sub-object on OST idx 41/1: rc = -107 LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) Skipped 9 previous similar messages LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) error creating fid 0x11b82a3b sub-object on OST idx 41/4: rc = -107 LustreError: 8404:0:(lov_request.c:694:lov_update_create_set()) Skipped 5 previous similar messages Lustre: 7455:0:(client.c:1487:ptlrpc_expire_one_request()) @@@ Request x1409470549800467 sent from scratch-OST0029-osc to NID 10.0.1.235 at o2ib 1103s ago has timed out (1103s prior to deadline). req at ffff81039f930c00 x1409470549800467/t0 o5->scratch-OST0029_UUID at 10.0.1.235@o2ib:28/4 lens 400/592 e 4 to 1 dl 1344613415 ref 1 fl Rpc:N/0/0 rc 0/0 Lustre: 7455:0:(client.c:1487:ptlrpc_expire_one_request()) Skipped 1 previous similar message Lustre: scratch-OST0029-osc: Connection to service scratch-OST0029 via nid 10.0.1.235 at o2ib was lost; in progress operations using this service will wait for recovery to complete. LustreError: 7455:0:(osc_create.c:593:osc_create()) scratch-OST0029-osc: oscc recovery failed: -110 LustreError: 7455:0:(lov_obd.c:1150:lov_clear_orphans()) error in orphan recovery on OST idx 41/42: rc = -110 LustreError: 7455:0:(mds_lov.c:1057:__mds_lov_synchronize()) scratch-OST0029_UUID failed at mds_lov_clear_orphans: -110 LustreError: 7455:0:(mds_lov.c:1066:__mds_lov_synchronize()) scratch-OST0029_UUID sync failed -110, deactivating LustreError: 11-0: an error occurred while communicating with 10.0.1.235 at o2ib. The ost_connect operation failed with -16 LustreError: Skipped 3 previous similar messages LustreError: 11-0: an error occurred while communicating with 10.0.1.235 at o2ib. The ost_connect operation failed with -16 LustreError: Skipped 2 previous similar messages LustreError: 11-0: an error occurred while communicating with 10.0.1.235 at o2ib. The ost_connect operation failed with -16 LustreError: Skipped 5 previous similar messages LustreError: 11-0: an error occurred while communicating with 10.0.1.235 at o2ib. The ost_connect operation failed with -16 LustreError: Skipped 10 previous similar messages LustreError: 11-0: an error occurred while communicating with 10.0.1.235 at o2ib. The ost_connect operation failed with -16 LustreError: Skipped 22 previous similar messages LustreError: 11-0: an error occurred while communicating with 10.0.1.235 at o2ib. The ost_connect operation failed with -16 LustreError: Skipped 42 previous similar messages OST Log LDISKFS-fs (dm-46): mounted filesystem with ordered data mode JBD: barrier-based sync failed on dm-46-8 - disabling barriers LDISKFS-fs (dm-46): barriers disabled LDISKFS-fs (dm-46): mounted filesystem with ordered data mode Lustre: 9381:0:(filter.c:1001:filter_init_server_data()) RECOVERY: service scratch-OST0029, 388 recoverable clients, 0 delayed clients, last_rcvd 85899346284 Lustre: 9381:0:(filter.c:1001:filter_init_server_data()) Skipped 1 previous similar message Lustre: scratch-OST0029: Now serving scratch-OST0029 on /dev/mapper/ost_scratch_41 with recovery enabled Lustre: Skipped 1 previous similar message Lustre: scratch-OST0029: Will be in recovery for at least 5:00, or until 388 clients reconnect Lustre: Skipped 1 previous similar message Lustre: 31046:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) scratch-OST0029: 387 recoverable clients remain Lustre: 31046:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) Skipped 26 previous similar messages LustreError: 30900:0:(ldlm_lib.c:944:target_handle_connect()) scratch-OST0029: denying connection for new client 10.2.1.112 at o2ib2 (aa6ab4db-167b-53ac-8147-5823b6ce20de): 234 clients in recovery for 280s Lustre: 30897:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) scratch-OST0029: 114 recoverable clients remain Lustre: 30897:0:(ldlm_lib.c:1815:target_queue_last_replay_reply()) Skipped 272 previous similar messages <A BUNCH OF DENIED CONNECTIONS BECAUSE OF RECOVERY> Lustre: scratch-OST0029: Recovery period over after 5:00, of 388 clients 385 recovered and 3 were evicted. Lustre: Skipped 2 previous similar messages Lustre: scratch-OST0029: sending delayed replies to recovered clients Lustre: Skipped 2 previous similar messages Lustre: scratch-OST0029: received MDS connection from 10.0.1.240 at o2ib Lustre: scratch-OST0029: sending delayed replies to recovered clients Lustre: Skipped 2 previous similar messages Lustre: scratch-OST0029: received MDS connection from 10.0.1.240 at o2ib Lustre: Skipped 2 previous similar messages LustreError: 30865:0:(filter.c:3427:filter_precreate()) scratch-OST0029: Serious error: objid 28055313 already exists; is this filesystem corrupt? LustreError: 30865:0:(filter.c:3428:filter_precreate()) LBUG Pid: 30865, comm: ll_ost_40 Call Trace: [<ffffffff8854f6a1>] libcfs_debug_dumpstack+0x51/0x60 [libcfs] [<ffffffff8854fbda>] lbug_with_loc+0x7a/0xd0 [libcfs] [<ffffffff889eb606>] filter_precreate+0xef6/0x19b0 [obdfilter] [<ffffffff885a33a6>] lnet_ni_send+0x96/0xe0 [lnet] [<ffffffff8000d5a8>] dput+0x2c/0x113 [<ffffffff889f52fb>] filter_create+0x11ab/0x17c0 [obdfilter] [<ffffffff885a9230>] LNetPut+0x730/0x840 [lnet] [<ffffffff886b6323>] ptl_send_buf+0x3f3/0x5b0 [ptlrpc] [<ffffffff886bbf54>] lustre_msg_add_version+0x34/0x110 [ptlrpc] [<ffffffff886b6ac8>] ptlrpc_send_reply+0x5e8/0x600 [ptlrpc] [<ffffffff886bed69>] lustre_pack_reply+0x29/0xb0 [ptlrpc] [<ffffffff889ad761>] ost_handle+0x1271/0x55b0 [ost] [<ffffffff8008dff2>] dequeue_task+0x18/0x37 [<ffffffff886ca6d9>] ptlrpc_server_handle_request+0x989/0xe00 [ptlrpc] [<ffffffff886cae35>] ptlrpc_wait_event+0x2e5/0x310 [ptlrpc] [<ffffffff8008cc1e>] __wake_up_common+0x3e/0x68 [<ffffffff886cbdc6>] ptlrpc_main+0xf66/0x1120 [ptlrpc] [<ffffffff8005dfb1>] child_rip+0xa/0x11 [<ffffffff886cae60>] ptlrpc_main+0x0/0x1120 [ptlrpc] [<ffffffff8005dfa7>] child_rip+0x0/0x11 LustreError: dumping log to /tmp/lustre-log.1344612315.30865 Lustre: Service thread pid 30865 was inactive for 200.00s. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes: Pid: 30865, comm: ll_ost_40 Lustre: 30941:0:(ldlm_lib.c:874:target_handle_connect()) scratch-OST0029: refuse reconnection from 41b8aeb5-2c37-88e7-d228-7a59ee9cac74 at 10.2.1.130@o2ib2 to 0xffff81023028c600; still busy with 3 active RPCs