In doing some testing with our new hardware I did the following:

I rebooted the active MDS server; it failed over to the second one as expected. While this was happening, a client was reset.

When the MDS came up on the new server via heartbeat, it went into recovery as expected. The MDS has now been in recovery for 1.5 hours. I don't think this is normal.

What would cause this? I know that having a client go down (the reset above) while the MDS is down, but before recovery, will cause recovery to time out, but 1.5 hours is an unacceptable time to wait for the file system to come back.

This is a stock 1.6.5.1 install.

cat recovery_status
status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 117

Did I somehow wedge the file system?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985
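For anyone who wants to watch this state from a script, here is a minimal sketch that parses the recovery_status text shown in the post above into a dictionary. The field names come straight from that output; the function names and the "looks stalled" rule (still RECOVERING with time_remaining at 0) are assumptions made for illustration, not documented Lustre behaviour. The text is passed in as a string, because the post does not give the full path of the recovery_status file.

```python
def parse_recovery_status(text):
    """Parse 'key: value' lines into a dict of strings (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def looks_stalled(fields):
    """Assumed heuristic: still RECOVERING but the countdown reads 0."""
    return (fields.get("status") == "RECOVERING"
            and fields.get("time_remaining") == "0")


# Sample copied from the post above.
SAMPLE = """status: RECOVERING
recovery_start: 0
time_remaining: 0
connected_clients: 0/1
completed_clients: 0/1
replayed_requests: 0/??
queued_requests: 0
next_transno: 117
"""

if __name__ == "__main__":
    fields = parse_recovery_status(SAMPLE)
    print(fields["status"], fields["connected_clients"], looks_stalled(fields))
```

Values are kept as strings on purpose, since entries such as 0/1 and 0/?? are not plain integers.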
Brian J. Murrell
2008-Aug-07 16:23 UTC
[Lustre-discuss] slow recovery when MDS failed over
On Thu, 2008-08-07 at 12:06 -0400, Brock Palen wrote:
> In doing some testing with our new hardware I did the following:
>
> I rebooted the active MDS server, it failed over to the second one as
> expected. While this was happening a client was reset.
>
> When the MDS came up on the new server by heartbeat it went into
> recovery as expected. The MDS now has been in recovery for 1.5
> hours. I don't think this is normal.
>
> What would cause this? I know by having a client go down (the reset
> above) while the MDS is down but before recovery will cause recovery
> to time out but 1.5 hours is unacceptable time to wait for the file
> system to come back.
>
> This is a stock 1.6.5.1 install.

Hrm. Can you provide the syslog from the backup MDS from the time it was mounted until present?

b.
On Aug 07, 2008 12:06 -0400, Brock Palen wrote:
> When the MDS came up on the new server by heartbeat it went into
> recovery as expected. The MDS now has been in recovery for 1.5
> hours. I don't think this is normal.
>
> What would cause this? I know by having a client go down (the reset
> above) while the MDS is down but before recovery will cause recovery
> to time out but 1.5 hours is unacceptable time to wait for the file
> system to come back.

The recovery should time out in about 5 minutes if the clients do not reply. Something is definitely wrong.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
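To put the two numbers in this thread side by side, here is a small sketch that compares the observed recovery time with the roughly 5 minute window mentioned above. The 300 second constant is just that "about 5 minutes" figure; the function name and the slack factor are assumptions chosen for illustration, not Lustre tunables.

```python
# "About 5 minutes", taken from the reply above; not an exact Lustre setting.
EXPECTED_TIMEOUT_S = 5 * 60


def recovery_overdue(elapsed_s, expected_s=EXPECTED_TIMEOUT_S, slack=2.0):
    """True once recovery has run longer than slack * the expected window.

    slack is a hypothetical safety margin so a slightly slow recovery
    is not flagged immediately.
    """
    return elapsed_s > slack * expected_s


if __name__ == "__main__":
    elapsed = 1.5 * 3600  # the 1.5 hours reported in the original post
    print(recovery_overdue(elapsed))     # True: 5400 s exceeds 600 s
    print(elapsed / EXPECTED_TIMEOUT_S)  # 18.0 times the expected window
```

A check like this only says that something is wrong, not why; the syslog from the MDS is still what you would need in order to diagnose it.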
Something appeared to be messed up. We rebuilt the filesystem and now we can't reproduce the problem. Thanks for looking into it.

I am doing some failover testing right now; see my other emails. Now that I have the MGS seen as two hosts, failover is quite snappy for a known failover, i.e. a reboot on the active MDS, where heartbeat does what it should. Recovery from yanking power (ipmitool chassis power reset) takes a little longer but is still quite fast.

I am much happier with Lustre failover than I was a few days ago. My own personal growing pains. Thanks again for looking into this.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp at umich.edu
(734)936-1985

On Aug 18, 2008, at 11:02 PM, Andreas Dilger wrote:
> On Aug 07, 2008 12:06 -0400, Brock Palen wrote:
>> When the MDS came up on the new server by heartbeat it went into
>> recovery as expected. The MDS now has been in recovery for 1.5
>> hours. I don't think this is normal.
>>
>> What would cause this? I know by having a client go down (the reset
>> above) while the MDS is down but before recovery will cause recovery
>> to time out but 1.5 hours is unacceptable time to wait for the file
>> system to come back.
>
> The recovery should time out in about 5 minutes if the clients do not
> reply. Something is definitely wrong.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.