Hi all,

I am evaluating Lustre with DRBD failover, and I am seeing about 2 minutes of OSS failover time to switch to the secondary node. Has anyone made a similar observation (so that we can conclude this is to be expected), or are there parameters I should tune to reduce that time?

I have a simple setup: the MDS and OSS0 are hosted on server1, and OSS1 is hosted on server2. OSS0 and OSS1 are the primary nodes for OST0 and OST1, respectively, and each OST is replicated to the other machine using DRBD (protocol C). The two OSTs are about 73GB each. I am running Lustre 1.6 + DRBD 8 + Heartbeat v2 (but using the v1 configuration).

From the HA logs, it looks like Heartbeat noticed the node was down within 10 seconds (which is consistent with the deadtime of 6 seconds). Where does the secondary node spend the remaining 100-110 seconds? There was a post (http://groups.google.com/group/lustre-discuss-list/msg/bbbeac047df678ca?dmode=source) attributing MDS failover time to fsck. Could that also be the cause of my problem?

Thanks,

-Tao
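P.S. To try to narrow down where the time goes, I could run the takeover steps by hand on the surviving node and time each of them. This is only a rough sketch; the DRBD resource name, device, and mount point below are placeholders for whatever the real configuration uses:

    # On the surviving node, with Heartbeat stopped, roughly reproduce the
    # takeover and time each step (ost0, /dev/drbd0, /mnt/ost0 are placeholders).
    time drbdadm primary ost0                   # promote the DRBD replica
    time mount -t lustre /dev/drbd0 /mnt/ost0   # start the OST service

Note that the Lustre mount returns before client recovery finishes, so if both steps come back quickly, the remaining time is presumably Lustre recovery rather than DRBD or Heartbeat.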
On Tue, 2009-07-14 at 17:54 +0200, tao.a.wu at nokia.com wrote:
> Hi, all,
>
> I am evaluating Lustre with DRBD failover, and experiencing about 2
> minutes in OSS failover time to switch to the secondary node.

What is this 2 minutes including? Just the time for the second OSS to mount the disk and start recovery, or is it 2 minutes to detect the primary failure and have the secondary complete recovery so that the clients are fully functional again? If the latter, then you are doing quite well.

Recovery is not an instantaneous process. Much work needs to be done to ensure coherency between what is on the disk of the failed-over OST and what the clients think is on disk. Getting to this state requires that all clients synchronize with the OST, and getting/waiting for many clients to do this can, currently, take many minutes, as each client has to first notice the primary is dead and then sync up with the failover. Some clients might not even be available to sync, in which case you have to wait for a timeout.

So if you are talking about 2 minutes from failure to full recovery, you are not likely going to put much of a dent in this. Lustre 1.8 has adaptive timeouts enabled and that should help in optimal situations, but it will still take time to do a full recovery.

b.
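P.S. To see how much of that window is spent waiting for clients, the recovery status on the failover OSS can be watched directly. This is a sketch based on the 1.6-era /proc layout as I remember it; exact field names may differ between minor versions:

    # Run on the OSS that took over the OST; recovery ends when the known
    # clients have reconnected and replayed, or when the window expires.
    watch -n 5 'cat /proc/fs/lustre/obdfilter/*/recovery_status'
    # Fields of interest typically include: status (RECOVERING/COMPLETE),
    # connected_clients, completed_clients, time_remaining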
tao.a.wu at nokia.com wrote:
> I am evaluating Lustre with DRBD failover, and experiencing about 2
> minutes in OSS failover time to switch to the secondary node.
> [...]
> From HA logs, it looks that Heartbeat noticed a node is down within 10
> seconds. Where does the secondary node spend the remaining 100-110 seconds?

As Brian mentioned, Lustre servers go through a recovery process. You need to examine the system logs on the OSS - if Lustre is in recovery, there will be messages in the logs explaining this.

cliffw
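P.S. Something like the following on the failover OSS will show whether Lustre was in recovery and roughly when it completed. The grep patterns are approximate, not exact console message formats, which vary by version:

    # On the OSS that took over the OST, after the failover:
    grep -i 'recovery' /var/log/messages | tail -n 20
    dmesg | grep -iE 'lustre.*(recover|complete)' | tail -n 20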
Yes, it is the latter... Thanks for the info.

A related but different question: Lustre 2.0 will have replication. Under 2.0 (with replication), what would happen if the primary node goes down? Would the backup node be able to take over the load in a shorter period of time? Or is the replication feature for something else?

Thanks,

-Tao

-----Original Message-----
From: lustre-discuss-bounces at lists.lustre.org [mailto:lustre-discuss-bounces at lists.lustre.org] On Behalf Of ext Brian J. Murrell
Sent: Tuesday, July 14, 2009 12:10 PM
To: lustre-discuss at lists.lustre.org
Subject: Re: [Lustre-discuss] Lustre DRBD failover time

> What is this 2 minutes including? Just the time for the second OSS to
> mount the disk and start recovery or is it 2 minutes to detect the
> primary failure and have the secondary complete recovery so that the
> clients are fully functional again? If the latter, then you are doing
> quite well.
> [...]
> Some clients might not even be available to sync, in which case you
> have to wait for a timeout.
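On the timeout point: as far as I understand, in 1.6 both the time clients take to notice the dead primary and the length of the recovery window are tied to the single static obd timeout, so that is the main knob available before 1.8's adaptive timeouts. A rough sketch of checking and setting it; the filesystem name "testfs" and the value 60 are placeholders rather than recommendations, and setting the timeout too low risks spurious evictions on a loaded cluster:

    # Check the current obd timeout on a server or client (1.6-era /proc path):
    cat /proc/sys/lustre/timeout
    # Set it filesystem-wide from the node running the MGS
    # ("testfs" and 60 are example values only):
    lctl conf_param testfs.sys.timeout=60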
On Jul 14, 2009 21:05 +0200, tao.a.wu at nokia.com wrote:
> A related but different question, Lustre 2.0 will have replication.
> Under 2.0 (with replication), what would happen if the primary node
> goes down? Would the backup node be able to take over the load in
> shorter period of time? Or is the replication feature for something else?

The "replication" feature has nothing to do with what you are thinking.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.