Robert,

Comments inline

> -----Original Message-----
> From: Robert.Read at Sun.COM [mailto:Robert.Read at Sun.COM] On Behalf Of Robert Read
> Sent: 11 December 2008 12:25 AM
> To: Eric Barton
> Subject: imperative recovery
>
> Earlier today you suggested that the server could ping the clients
> after it restarts. Assuming the server had the nids, how would that
> actually work? Clients don't have any services (or even an acceptor
> for the socklnd case), so how would a server initiate communication
> with the client? We could add a new kind of RPC that doesn't
> require a ptlrpc connection (much like connect itself doesn't
> require a connection), but it seems at least with socklnd there is
> no way to send that message.

Indeed - this overturns the precedent that Lustre servers don't send unsolicited RPCs to clients. This is a nod towards network security, so that client firewalls can trivially block incoming connection requests. But this precedent is only assured at the Lustre RPC level - with redundantly routed networks, connections can be established in either direction at the LND level. An RPC reply will most probably follow a different path back through the network to the request sender and establish new LND connections as required. This is fine for kernel LNDs, which both create and accept connections - but userspace LNDs typically don't run acceptors, so userspace LNET specifically establishes connections to all known routers on startup to avoid this issue.

Ignoring this precedent for now - one could argue that when a rebooting server sees info about a client in the on-disk export, it could have some expectation that the client is waiting for recovery. Some way of alerting the client that now is a good time to try to reconnect therefore seems reasonable.

However, I think there is a wider issue to consider first.

Q. Why can't clients reconnect as soon as the server restarts?
A. Because they may not know yet that the server died.

Q. Why don't clients know that the server died?
A. Because server death is not detected until RPCs time out.

Q. Why is the RPC timeout so long?
A. Because server death and congestion are easily confused.

This seems to me to get at some fundamental issues about recovery handling that not even adaptive timeouts has solved for us...

1. Server failover/recovery should complete in 10s of seconds, not minutes or hours.

   . Clients must detect server death promptly - much faster than normal RPC latency on a congested cluster.

   . Servers must detect client death/absence promptly to ensure recovery isn't blocked too long by a client crash.

   . To prevent unrelated traffic from being blocked unduly, communications associated with a failed client or server must be removed from the network promptly, as if the failing node were still responsive.

2. Peer failure must be detected with reasonable accuracy in the presence of server congestion, LNET router congestion, and LNET router failure.

   . Router failure can cause large numbers of RPCs to fail or time out.

   . Mis-diagnosing server death is inefficient, but the client can reconnect harmlessly.

   . Mis-diagnosing client death can cause lost updates when the server evicts the client.

> Other options I've thought of to explore this idea:
>
> - MGS notifies clients (somehow) after a server has restarted.
>
> - A new tcp socket (possibly in userspace) that can receive
> administrative messages like this (messages can be sent from the
> server, from master admin node, etc). Perhaps related to new lproc
> replacement?
> Updates could be sent from servers themselves or from
> "god" appliance that was keeping track of server nodes.
>
> - Use "pdsh lctl" to notify all clients a failover has occurred.
> Ugly, but it would allow us to test the basic idea quickly. (All we
> need is a new lctl command and changes in the ptlrpc client bits to
> support external initiation of recovery to a specific node, which
> we'll need anyway.)
>
> robert

I'm totally in favour of supporting additional notification methods that can increase diagnostic accuracy or speed recovery. However...

1. We can't rely purely on external notifications. We need a portable baseline capability that works well with existing network infrastructure.

2. I'm extremely nervous of relying on notifications via 3rd parties unless the whole Lustre communications model is changed to accommodate them. Network failures can be observed quite differently from different nodes, so I'd like to stick with methods that use the same paths as regular communications.

I think some elements of the solution include...

0. Change the point-to-point LNET peer health model from one that times out individual messages to one that aggressively removes messages blocked on a failing peer. This has already been demonstrated to work successfully to flush congested routers when a server dies (bug 16186).

1. Health-related communications must not be affected by congested "normal" communications. The obvious solution is to provide an additional virtual LNET just for this traffic - i.e. implement message priority - but this poses further questions...

   a. How much will this complicate the LNET/LND implementation - e.g. do _all_ connection-based LNDs have to double up their connections to ensure orthogonality, or complicate existing credit protocols to account for priority messaging?

   b. Are 2 priority levels enough - maybe lock conflict resolution could/should benefit?

   c. What effect does this have on security/resilience to attack?

2. Aggregate health-related communications between peers to minimize the number of health messages in the system. Also ensure health-related communications only occur when knowledge of peer health is actually required - e.g. a client with no locks on a given server doesn't have to be responsive.

The implementation of these features is fundamental to scalability. They determine the level of background health "noise" and its effect on "real" traffic at a given client and server count, given a required failure detection latency and limits (or lack thereof) on how much state on how many servers each client can cache.

    Cheers,
              Eric
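A rough sketch of the "pdsh lctl" quick test quoted above, only to make the shape of the idea concrete. The target_nid parameter used here is the per-import tunable proposed later in this thread (it does not exist at this point), and the client list and nid are placeholders:

    # Push the failover nid for OST0004 to every client in parallel.
    # target_nid is a proposed, not existing, parameter; client[001-512]
    # and 192.168.1.12@tcp are placeholders.
    pdsh -w client[001-512] \
        'lctl set_param osc.lustre-OST0004-osc-*.target_nid=192.168.1.12@tcp'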
Eric Barton wrote:
>
>> Other options I've thought of to explore this idea:
>>
>> - MGS notifies clients (somehow) after a server has restarted.
>>

This seems like a no-brainer easy win today, and doesn't depend on any advanced features like message priority. The only scalability issue would seem to be the broadcast of the message to all clients, but this is no different than the current broadcast mechanism the MGS employs to update client configs. The message from the MGS would be taken as a suggestion: "Why don't y'all time out all your current RPCs since I noticed OST0004 restarted. Oh, and use failover nid #2." Current replay/recovery need not be touched.
Nathaniel Rutman wrote:
> Eric Barton wrote:
>>> Other options I've thought of to explore this idea:
>>>
>>> - MGS notifies clients (somehow) after a server has restarted.
>>>
> This seems like a no-brainer easy win today, and doesn't depend on any
> advanced features like message priority. The only scalability issue
> would seem to be the broadcast of the message to all clients, but this
> is no different than the current broadcast mechanism the MGS employs to
> update client configs. The message from the MGS would be taken as a
> suggestion: "Why don't y'all time out all your current RPCs since I
> noticed OST0004 restarted. Oh, and use failover nid #2." Current
> replay/recovery need not be touched.

This would be a great enhancement for OSS failover or reboot; it is really the only way we'll get to recovery times under ~2.5 x obd_timeout. Adaptive Timeouts really aren't buying us much here, as at scale and under load we are seeing the timeouts approach the usual static obd_timeout of 300s. It only takes one client with a higher timeout to push the recovery time out.

I do think this will miss a significant case: combo MGS+MDS. A majority of our customers are deploying with this configuration. Perhaps exposing this mechanism on the clients via a /proc file would be enough - that way a failover framework could manually trigger the timeout and/or nid switching.

Nic
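To put Nic's figure in context, using only the numbers quoted above as a rough illustration: with the static obd_timeout of 300s, ~2.5 x obd_timeout works out to roughly 750s, i.e. a recovery window of about 12-13 minutes, versus the "10s of seconds" target Eric set out at the start of the thread.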
On Jan 9, 2009, at 07:27 , Nicholas Henke wrote:
> Nathaniel Rutman wrote:
>> Eric Barton wrote:
>>>> Other options I've thought of to explore this idea:
>>>>
>>>> - MGS notifies clients (somehow) after a server has restarted.
>>>>
>> This seems like a no-brainer easy win today, and doesn't depend on any
>> advanced features like message priority. The only scalability issue
>> would seem to be the broadcast of the message to all clients, but this
>> is no different than the current broadcast mechanism the MGS employs to
>> update client configs. The message from the MGS would be taken as a
>> suggestion: "Why don't y'all time out all your current RPCs since I
>> noticed OST0004 restarted. Oh, and use failover nid #2." Current
>> replay/recovery need not be touched.
>
> This would be a great enhancement for OSS failover or reboot; it is really the
> only way we'll get to recovery times under ~2.5 x obd_timeout. Adaptive Timeouts
> really aren't buying us much here, as at scale and under load we are seeing the
> timeouts approach the usual static obd_timeout of 300s. It only takes one client
> with a higher timeout to push the recovery time out.
>
> I do think this will miss a significant case: combo MGS+MDS. A majority of our
> customers are deploying with this configuration. Perhaps exposing this mechanism
> on the clients via a /proc file would be enough - that way a failover framework
> could manually trigger the timeout and/or nid switching.

Yes, exactly what I was thinking. Exposing this feature via proc (or lctl) on the clients is the first step. It has minimal impact, requires no changes to the server, and should integrate well with existing failover frameworks. We also need to get the server to end recovery sooner (without waiting for all the stale exports), but VBR should help with that.

robert

> Nic
Robert Read wrote:
> On Jan 9, 2009, at 07:27 , Nicholas Henke wrote:
>> I do think this will miss a significant case: combo MGS+MDS. A majority of our
>> customers are deploying with this configuration. Perhaps exposing this mechanism
>> on the clients via a /proc file would be enough - that way a failover framework
>> could manually trigger the timeout and/or nid switching.
>
> Yes, exactly what I was thinking. Exposing this feature via proc (or
> lctl) on the clients is the first step. It has minimal impact,
> requires no changes to the server, and should integrate well with
> existing failover frameworks. We also need to get the server to end
> recovery sooner (without waiting for all the stale exports), but VBR
> should help with that.
>
> robert

FWIW: we'd prefer /proc. We don't ship lctl on our computes for memory (initramfs) usage reasons. Being in /proc makes it easy for someone to use the functionality from another kernel module as well; we can just call the .read or .write functions directly.

Nic
On Jan 09, 2009 09:04 -0800, Robert Read wrote:
> On Jan 9, 2009, at 07:27 , Nicholas Henke wrote:
>> This would be a great enhancement for OSS failover or reboot; it is
>> really the only way we'll get to recovery times under ~2.5 x obd_timeout.
>>
>> I do think this will miss a significant case: combo MGS+MDS. A
>> majority of our customers are deploying with this configuration.
>> Perhaps exposing this mechanism on the clients via a /proc file
>> would be enough - that way a failover framework
>> could manually trigger the timeout and/or nid switching.
>
> Yes, exactly what I was thinking. Exposing this feature via proc (or
> lctl) on the clients is the first step. It has minimal impact,
> requires no changes to the server, and should integrate well with
> existing failover frameworks. We also need to get the server to end
> recovery sooner (without waiting for all the stale exports), but VBR
> should help with that.

Hey, wouldn't (essentially) "lctl --device $foo recover" do the trick today?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
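For readers who have not used it, a minimal sketch of the existing command Andreas refers to. The device name and index here are placeholders; as Robert notes in his reply below, this only asks the import to reconnect, it does not let you specify which nid to connect to:

    # List configured devices and note the index of the OSC import in
    # question (the device name pattern is illustrative):
    lctl dl | grep lustre-OST0004-osc
    # Ask that import to drop its current connection and reconnect now:
    lctl --device <devno> recover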
On Jan 9, 2009, at 4:50 PM, Andreas Dilger wrote:
> On Jan 09, 2009 09:04 -0800, Robert Read wrote:
>> On Jan 9, 2009, at 07:27 , Nicholas Henke wrote:
>>> This would be a great enhancement for OSS failover or reboot; it is
>>> really the only way we'll get to recovery times under ~2.5 x
>>> obd_timeout.
>>>
>>> I do think this will miss a significant case: combo MGS+MDS. A
>>> majority of our customers are deploying with this configuration.
>>> Perhaps exposing this mechanism on the clients via a /proc file
>>> would be enough - that way a failover framework
>>> could manually trigger the timeout and/or nid switching.
>>
>> Yes, exactly what I was thinking. Exposing this feature via proc (or
>> lctl) on the clients is the first step. It has minimal impact,
>> requires no changes to the server, and should integrate well with
>> existing failover frameworks. We also need to get the server to end
>> recovery sooner (without waiting for all the stale exports), but VBR
>> should help with that.
>
> Hey, wouldn't (essentially) "lctl --device $foo recover" do the trick
> today?

The main difference is we need to specify the nid to connect to. Also, since lctl isn't always available, we should do this with a /proc file (and set_param), so something like this:

  echo $new_ost_nid > /proc/fs/lustre/osc/OSC_FOO_01/target_nid

or

  lctl set_param osc.osc_FOO_01.target_nid $new_ost_nid

robert

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
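To make the proposed interface a little more concrete, a sketch of how a failover framework might drive it across the relevant OSC devices on a client. Everything here is hypothetical or a placeholder - target_nid is only being proposed in this thread, and the device directory pattern and nid are made up for illustration:

    #!/bin/sh
    # Sketch: run on each client (e.g. via pdsh) after lustre-OST0004
    # fails over.  target_nid is the file proposed above, not an existing
    # tunable; the directory pattern and nid are placeholders.
    NEW_NID=192.168.1.12@tcp
    for osc in /proc/fs/lustre/osc/*OST0004*; do
        [ -w "$osc/target_nid" ] || continue
        echo "$NEW_NID" > "$osc/target_nid"
    done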