Robert,

Comments inline

> -----Original Message-----
> From: Robert.Read at Sun.COM [mailto:Robert.Read at Sun.COM] On Behalf Of Robert Read
> Sent: 11 December 2008 12:25 AM
> To: Eric Barton
> Subject: imperative recovery
>
> Earlier today you suggested that the server could ping the clients
> after it restarts. Assuming the server had the nids, how would that
> actually work? Clients don't have any services (or even an acceptor
> for the socklnd case), so how would a server initiate communication
> with the client? We could add a new kind of RPC that doesn't
> require a ptlrpc connection (much like connect itself doesn't
> require a connection), but it seems at least with socklnd there is
> no way to send that message.

Indeed - this overturns the precedent that Lustre servers don't send unsolicited RPCs to clients. This is a nod towards network security, so that client firewalls can trivially block incoming connection requests. But this precedent is only assured at the Lustre RPC level - with redundantly routed networks, connections can be established in either direction at the LND level. An RPC reply will most probably follow a different path back through the network to the request sender and establish new LND connections as required. This is fine for kernel LNDs, which both create and accept connections - but userspace LNDs typically don't run acceptors, so userspace LNET specifically establishes connections to all known routers on startup to avoid this issue.

Ignoring this precedent for now - one could argue that when a rebooting server sees info about a client in the on-disk export, it could have some expectation that the client is waiting for recovery. Some way of alerting the client that now is a good time to try to reconnect therefore seems reasonable.

However, I think there is a wider issue to consider first.

Q. Why can't clients reconnect as soon as the server restarts?
A. Because they may not know yet that the server died.

Q. Why don't clients know that the server died?
A. Because server death is not detected until RPCs time out.

Q. Why is the RPC timeout so long?
A. Because server death and congestion are easily confused.

This seems to me to get at some fundamental issues about recovery handling that not even adaptive timeouts has solved for us...

1. Server failover/recovery should complete in 10s of seconds, not minutes or hours.

   . Clients must detect server death promptly - much faster than normal RPC latency on a congested cluster.

   . Servers must detect client death/absence promptly to ensure recovery isn't blocked too long by a client crash.

   . To prevent unrelated traffic from being blocked unduly, communications associated with a failed client or server must be removed from the network promptly, as if the failing node were still responsive.

2. Peer failure must be detected with reasonable accuracy in the presence of server congestion, LNET router congestion, and LNET router failure.

   . Router failure can cause large numbers of RPCs to fail or time out.

   . Mis-diagnosing server death is inefficient, but the client can reconnect harmlessly.

   . Mis-diagnosing client death can cause lost updates when the server evicts the client.

> Other options I've thought of to explore this idea:
>
> - MGS notifies clients (somehow) after a server has restarted.
>
> - A new tcp socket (possibly in userspace) that can receive
> administrative messages like this (messages can be sent from the
> server, from master admin node, etc). Perhaps related to new lproc
> replacement?
> Updates could be sent from servers themselves or from
> "god" appliance that was keeping track of server nodes.
>
> - Use "pdsh lctl" to notify all clients a failover has occurred.
> Ugly, but it would allow us to test the basic idea quickly. (All we
> need is a new lctl command and changes in the ptlrpc client bits to
> support external initiation of recovery to a specific node, which
> we'll need anyway.)
>
> robert

I'm totally in favour of supporting additional notification methods that can increase diagnostic accuracy or speed recovery. However...

1. We can't rely purely on external notifications. We need a portable baseline capability that works well with existing network infrastructure.

2. I'm extremely nervous of relying on notifications via 3rd parties unless the whole Lustre communications model is changed to accommodate them. Network failures can be observed quite differently from different nodes, so I'd like to stick with methods that use the same paths as regular communications.

I think some elements of the solution include...

0. Change the point-to-point LNET peer health model from one that times out individual messages to one that aggressively removes messages blocked on a failing peer. This has already been demonstrated to work successfully to flush congested routers when a server dies (bug 16186).

1. Health-related communications must not be affected by congested "normal" communications. The obvious solution is to provide an additional virtual LNET just for this traffic - i.e. implement message priority - but this poses further questions...

   a. How much will this complicate the LNET/LND implementation - e.g. do _all_ connection-based LNDs have to double up their connections to ensure orthogonality, or complicate existing credit protocols to account for priority messaging?

   b. Are 2 priority levels enough - maybe lock conflict resolution could/should benefit?

   c. What effect does this have on security/resilience to attack?

2. Aggregate health-related communications between peers to minimize the number of health messages in the system. Also ensure health-related communications only occur when knowledge of peer health is actually required - e.g. a client with no locks on a given server doesn't have to be responsive.

The implementation of these features is fundamental to scalability. They determine the level of background health "noise" and its effect on "real" traffic at a given client and server count, given a required failure detection latency and limits (or lack thereof) on how much state on how many servers each client can cache.

    Cheers,
              Eric
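A rough sketch of the "pdsh lctl" quick test quoted above, only to make the shape of the idea concrete. The target_nid parameter used here is the per-import tunable proposed later in this thread (it does not exist at this point), and the client list and nid are placeholders:

    # Push the failover nid for OST0004 to every client in parallel.
    # target_nid is a proposed, not existing, parameter; client[001-512]
    # and 192.168.1.12@tcp are placeholders.
    pdsh -w client[001-512] \
        'lctl set_param osc.lustre-OST0004-osc-*.target_nid=192.168.1.12@tcp'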
Eric Barton wrote:
>
>> Other options I've thought of to explore this idea:
>>
>> - MGS notifies clients (somehow) after a server has restarted.
>>

This seems like a no-brainer easy win today, and doesn't depend on any advanced features like message priority. The only scalability issue would seem to be the broadcast of the message to all clients, but this is no different than the current broadcast mechanism the MGS employs to update client configs. The message from the MGS would be taken as a suggestion: "Why don't y'all time out all your current RPCs since I noticed OST0004 restarted. Oh, and use failover nid #2." Current replay/recovery need not be touched.
Nathaniel Rutman wrote:
> Eric Barton wrote:
>>> Other options I've thought of to explore this idea:
>>>
>>> - MGS notifies clients (somehow) after a server has restarted.
>>>
> This seems like a no-brainer easy win today, and doesn't depend on any
> advanced features like message priority. The only scalability issue
> would seem to be the broadcast of the message to all clients, but this
> is no different than the current broadcast mechanism the MGS employs to
> update client configs. The message from the MGS would be taken as a
> suggestion: "Why don't y'all time out all your current RPCs since I
> noticed OST0004 restarted. Oh, and use failover nid #2." Current
> replay/recovery need not be touched.

This would be a great enhancement for OSS failover or reboot; it is really the only way we'll get to recovery times under ~2.5 x obd_timeout. Adaptive Timeouts really aren't buying us much here, as at scale and under load we are seeing the timeouts approach the usual static obd_timeout of 300s. It only takes one client with a higher timeout to push the recovery time out.

I do think this will miss a significant case: combo MGS+MDS. A majority of our customers are deploying with this configuration. Perhaps exposing this mechanism on the clients via a /proc file would be enough - that way a failover framework could manually trigger the timeout and/or nid switching.

Nic
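To put Nic's figure in context, using only the numbers quoted above as a rough illustration: with the static obd_timeout of 300s, ~2.5 x obd_timeout works out to roughly 750s, i.e. a recovery window of about 12-13 minutes, versus the "10s of seconds" target Eric set out at the start of the thread.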
On Jan 9, 2009, at 07:27 , Nicholas Henke wrote:
> Nathaniel Rutman wrote:
>> Eric Barton wrote:
>>>> Other options I've thought of to explore this idea:
>>>>
>>>> - MGS notifies clients (somehow) after a server has restarted.
>>>>
>> This seems like a no-brainer easy win today, and doesn't depend on any
>> advanced features like message priority. The only scalability issue
>> would seem to be the broadcast of the message to all clients, but this
>> is no different than the current broadcast mechanism the MGS employs to
>> update client configs. The message from the MGS would be taken as a
>> suggestion: "Why don't y'all time out all your current RPCs since I
>> noticed OST0004 restarted. Oh, and use failover nid #2." Current
>> replay/recovery need not be touched.
>
> This would be a great enhancement for OSS failover or reboot; it is really the
> only way we'll get to recovery times under ~2.5 x obd_timeout. Adaptive Timeouts
> really aren't buying us much here, as at scale and under load we are seeing the
> timeouts approach the usual static obd_timeout of 300s. It only takes one client
> with a higher timeout to push the recovery time out.
>
> I do think this will miss a significant case: combo MGS+MDS. A majority of our
> customers are deploying with this configuration. Perhaps exposing this mechanism
> on the clients via a /proc file would be enough - that way a failover framework
> could manually trigger the timeout and/or nid switching.

Yes, exactly what I was thinking. Exposing this feature via proc (or lctl) on the clients is the first step. It has minimal impact, requires no changes to the server, and should integrate well with existing failover frameworks. We also need to get the server to end recovery sooner (without waiting for all the stale exports), but VBR should help with that.

robert

> Nic
Robert Read wrote:
> On Jan 9, 2009, at 07:27 , Nicholas Henke wrote:
>> I do think this will miss a significant case: combo MGS+MDS. A majority of our
>> customers are deploying with this configuration. Perhaps exposing this mechanism
>> on the clients via a /proc file would be enough - that way a failover framework
>> could manually trigger the timeout and/or nid switching.
>
> Yes, exactly what I was thinking. Exposing this feature via proc (or
> lctl) on the clients is the first step. It has minimal impact,
> requires no changes to the server, and should integrate well with
> existing failover frameworks. We also need to get the server to end
> recovery sooner (without waiting for all the stale exports), but VBR
> should help with that.
>
> robert

FWIW: we'd prefer /proc. We don't ship lctl on our computes for memory (initramfs) usage reasons. Being in /proc makes it easy for someone to use the functionality from another kernel module as well; we can just call the .read or .write functions directly.

Nic
On Jan 09, 2009 09:04 -0800, Robert Read wrote:
> On Jan 9, 2009, at 07:27 , Nicholas Henke wrote:
>> This would be a great enhancement for OSS failover or reboot; it is
>> really the only way we'll get to recovery times under ~2.5 x obd_timeout.
>>
>> I do think this will miss a significant case: combo MGS+MDS. A
>> majority of our customers are deploying with this configuration.
>> Perhaps exposing this mechanism on the clients via a /proc file
>> would be enough - that way a failover framework
>> could manually trigger the timeout and/or nid switching.
>
> Yes, exactly what I was thinking. Exposing this feature via proc (or
> lctl) on the clients is the first step. It has minimal impact,
> requires no changes to the server, and should integrate well with
> existing failover frameworks. We also need to get the server to end
> recovery sooner (without waiting for all the stale exports), but VBR
> should help with that.

Hey, wouldn't (essentially) "lctl --device $foo recover" do the trick today?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
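For readers who have not used it, a minimal sketch of the existing command Andreas refers to. The device name and index here are placeholders; as Robert notes in his reply below, this only asks the import to reconnect, it does not let you specify which nid to connect to:

    # List configured devices and note the index of the OSC import in
    # question (the device name pattern is illustrative):
    lctl dl | grep lustre-OST0004-osc
    # Ask that import to drop its current connection and reconnect now:
    lctl --device <devno> recover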
On Jan 9, 2009, at 4:50 PM, Andreas Dilger wrote:
> On Jan 09, 2009 09:04 -0800, Robert Read wrote:
>> On Jan 9, 2009, at 07:27 , Nicholas Henke wrote:
>>> This would be a great enhancement for OSS failover or reboot; it is
>>> really the only way we'll get to recovery times under ~2.5 x
>>> obd_timeout.
>>>
>>> I do think this will miss a significant case: combo MGS+MDS. A
>>> majority of our customers are deploying with this configuration.
>>> Perhaps exposing this mechanism on the clients via a /proc file
>>> would be enough - that way a failover framework
>>> could manually trigger the timeout and/or nid switching.
>>
>> Yes, exactly what I was thinking. Exposing this feature via proc (or
>> lctl) on the clients is the first step. It has minimal impact,
>> requires no changes to the server, and should integrate well with
>> existing failover frameworks. We also need to get the server to end
>> recovery sooner (without waiting for all the stale exports), but VBR
>> should help with that.
>
> Hey, wouldn't (essentially) "lctl --device $foo recover" do the trick
> today?

The main difference is we need to specify the nid to connect to. Also, since lctl isn't always available, we should do this with a /proc file (and set_param), so something like this:

  echo $new_ost_nid > /proc/fs/lustre/osc/OSC_FOO_01/target_nid

or

  lctl set_param osc.osc_FOO_01.target_nid $new_ost_nid

robert

> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
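To make the proposed interface a little more concrete, a sketch of how a failover framework might drive it across the relevant OSC devices on a client. Everything here is hypothetical or a placeholder - target_nid is only being proposed in this thread, and the device directory pattern and nid are made up for illustration:

    #!/bin/sh
    # Sketch: run on each client (e.g. via pdsh) after lustre-OST0004
    # fails over.  target_nid is the file proposed above, not an existing
    # tunable; the directory pattern and nid are placeholders.
    NEW_NID=192.168.1.12@tcp
    for osc in /proc/fs/lustre/osc/*OST0004*; do
        [ -w "$osc/target_nid" ] || continue
        echo "$NEW_NID" > "$osc/target_nid"
    done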