We had a RAID array barf this morning, resulting in some OST corruption which appeared to be successfully repaired with a combination of fsck and ll_recover_lost_found_objs. The OSTs mounted OK, but the MDS can't seem to recover its connection to two of the OSTs, as we are seeing a continuing stream of the following in the MDS syslog.

Apr 28 11:37:54 crnmds kernel: Lustre: 31983:0:(recover.c:67:ptlrpc_initiate_recovery()) crn-OST0013_UUID: starting recovery
Apr 28 11:37:54 crnmds kernel: Lustre: 31983:0:(import.c:608:ptlrpc_connect_import()) ffff810117426000 crn-OST0013_UUID: changing import state from DISCONN to CONNECTING
Apr 28 11:37:54 crnmds kernel: Lustre: 31983:0:(import.c:470:import_select_connection()) crn-OST0013-osc: connect to NID 10.13.24.92@o2ib last attempt 22689204132
Apr 28 11:37:54 crnmds kernel: Lustre: 31983:0:(import.c:544:import_select_connection()) crn-OST0013-osc: import ffff810117426000 using connection 10.13.24.92@o2ib/10.13.24.92@o2ib
Apr 28 11:37:54 crnmds kernel: Lustre: 31982:0:(import.c:1091:ptlrpc_connect_interpret()) ffff810117426000 crn-OST0013_UUID: changing import state from CONNECTING to DISCONN
Apr 28 11:37:54 crnmds kernel: Lustre: 31982:0:(import.c:1137:ptlrpc_connect_interpret()) recovery of crn-OST0013_UUID on 10.13.24.92@o2ib failed (-16)
Apr 28 11:37:54 crnmds kernel: Lustre: 31982:0:(import.c:1091:ptlrpc_connect_interpret()) ffff81012e50d000 crn-OST0007_UUID: changing import state from CONNECTING to DISCONN
Apr 28 11:37:54 crnmds kernel: Lustre: 31982:0:(import.c:1137:ptlrpc_connect_interpret()) recovery of crn-OST0007_UUID on 10.13.24.91@o2ib failed (-16)

It seems that we never see an 'oscc recovery finished' message on crnmds for OST0007 or OST0013. We have not seen this problem before, so we are trying to figure out how to get the MDT reconnected to these two OSTs. Has anyone else been through this before?

Thanks,

Charlie Taylor
UF HPC Center
On Thu, Apr 28, 2011 at 12:47:02PM -0400, Charles Taylor wrote:
> Apr 28 11:37:54 crnmds kernel: Lustre: 31982:0:(import.c:1137:ptlrpc_connect_interpret()) recovery of crn-OST0013_UUID on 10.13.24.92@o2ib failed (-16)
> Apr 28 11:37:54 crnmds kernel: Lustre: 31982:0:(import.c:1137:ptlrpc_connect_interpret()) recovery of crn-OST0007_UUID on 10.13.24.91@o2ib failed (-16)

Both OST0007 & OST0013 return EBUSY. Any messages or watchdogs in the OSS logs (i.e. 10.13.24.9{1,2}@o2ib)?

Johann

--
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com
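
For anyone reading this thread later: the (-16) in the quoted lines is a negated Linux errno value, and 16 is EBUSY, which is how the reply above reads it. Below is a minimal sketch, not a Lustre tool, that pulls the failing targets out of syslog lines shaped like the ones quoted in this thread and decodes the error number. The function name `failed_recoveries` and the regular expression are illustrative assumptions, and the pattern is only known to match the two message formats shown here (with the NID written either as `10.13.24.92@o2ib` or as the archive's `10.13.24.92 at o2ib`).

```python
import errno
import re

# Assumed pattern, derived only from the "recovery of ... failed (-16)"
# lines quoted in this thread; other Lustre versions may log differently.
FAILED_RE = re.compile(r"recovery of (\S+) on (.+?) failed \((-?\d+)\)")


def failed_recoveries(lines):
    """Return sorted, de-duplicated (target, nid, errno_name) tuples."""
    found = set()
    for line in lines:
        match = FAILED_RE.search(line)
        if match is None:
            continue
        target, nid, code = match.group(1), match.group(2), int(match.group(3))
        # Mail archives often rewrite "@" as " at "; undo that for the NID.
        nid = nid.replace(" at ", "@")
        # The kernel logs negative errno values, so look up the magnitude.
        name = errno.errorcode.get(abs(code), "unknown")
        found.add((target, nid, name))
    return sorted(found)


sample = [
    "Apr 28 11:37:54 crnmds kernel: Lustre: 31982:0:(import.c:1137:"
    "ptlrpc_connect_interpret()) recovery of crn-OST0013_UUID on "
    "10.13.24.92@o2ib failed (-16)",
    "Apr 28 11:37:54 crnmds kernel: Lustre: 31982:0:(import.c:1137:"
    "ptlrpc_connect_interpret()) recovery of crn-OST0007_UUID on "
    "10.13.24.91 at o2ib failed (-16)",
    "Apr 28 11:37:54 crnmds kernel: Lustre: 31983:0:(recover.c:67:"
    "ptlrpc_initiate_recovery()) crn-OST0013_UUID: starting recovery",
]

for target, nid, name in failed_recoveries(sample):
    print(target, nid, name)
```

Feeding it the MDS syslog gives one row per target and NID, which makes it easy to see which OSS nodes to check for the messages or watchdogs asked about above.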