Looking at the error handling logic that I'll need: the characteristics of the transport are such that once I send short messages there's no way to call them back, so the best I can do there is wait for them all to complete (which might involve a failure code) and then clean them up. For the dma ops, there isn't a good (i.e., free of race conditions) way to tell what the instantaneous state of the transfer is, so the best I can do is invalidate the state of the various buffer descriptors and wait for them to trickle out, presumably with errors. That's true for both tx and rx ops.

The one bit of good news is that we're pretty sure that, in the case of a node failure or something similar, we can arrange to know within a small bounded time (seconds) that all pending traffic is done. In the case of a link failure (which stops clocking the dma engine for that link) we can ensure that nothing is happening there, which means it's safe to reset the engine, which in turn means it's safe to yank out the pending buffer descriptors without the engine scribbling all over the memory later.

Given all that, I think my error-recovery plan is approximately as follows:

For short messages, ignore them; once they're gone, they're gone. Ignore those which are received from peers which aren't in "good" state, except of course for hellos.

For rdma transmits, if I get a notification that the peer is down, invalidate the descriptors which correspond to the buffers, and wait for them all to return. If they come back "ok", presumably that means I lost the race, and they get finalized as normal. All the ones that come back with errors get finalized with some kind (what kind?) of error. Is it safe to assume that in that case lnet and friends will deal with whatever kind of recovery is necessary? For instance, if an alternate OSS comes on line, do they DTRT about replaying any pending messages to the new peer?
For rdma receives, if I get a notification that the peer is down, I do the same sort of thing: invalidate the descriptors. In addition, if it's a link failure, I must do the under-the-hood work to guarantee that the engine is stopped, then signal completion on the relevant buffers by hand. After that I think the rest of the logic about finalizing applies.

Does that sound roughly right? Anything else I should be taking into account?
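The recovery plan above can be sketched as a small state machine. This is a toy illustration under stated assumptions, not the real LND's code: `rdma_desc`, `invalidate_peer_descs`, and `complete_desc` are hypothetical names, and the point is only the race described above (a descriptor invalidated after a peer-down notification may still complete "ok" if we lost the race, in which case it is finalized normally).

```c
#include <errno.h>

/* Hypothetical descriptor states for the sketch. */
enum desc_state { DESC_ACTIVE, DESC_INVALID, DESC_DONE };

struct rdma_desc {
    enum desc_state state;
    int hw_status;      /* completion code reported by the dma engine */
    int final_status;   /* status we would hand to lnet_finalize()    */
};

/* Peer went down: flag every pending descriptor, but don't yank it;
 * each one trickles out through the normal completion path later. */
static void invalidate_peer_descs(struct rdma_desc *d, int n)
{
    for (int i = 0; i < n; i++)
        if (d[i].state == DESC_ACTIVE)
            d[i].state = DESC_INVALID;
}

/* Completion trickles out: "ok" means we lost the race and finalize
 * normally; anything else is finalized with a non-zero status. */
static void complete_desc(struct rdma_desc *d)
{
    d->final_status = (d->hw_status == 0) ? 0 : -EIO;
    d->state = DESC_DONE;
    /* ...the real LND would now call lnet_finalize(ni, msg,
     * d->final_status), in thread context, with no locks held. */
}
```

The engine reset on link failure is what makes the "wait for them to trickle out" step safe: once the engine is known stopped, completing the stuck descriptors by hand cannot race with a late DMA write.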
John,

> Given all that, I think my error-recovery plan is approx as follows:
>
> For short-messages, ignore them, once they're gone, they're gone.
> Ignore those which are received from peers which aren't in "good"
> state, except of course for hellos.

Fine.

> For rdma transmits, if I get a notification that the peer is down,
> invalidate the descriptors which correspond to the buffers, and wait
> for them all to return. If they come back "ok", presumably that
> means I lost the race, and they get finalized as normal. All the
> ones that come back with errors get finalized with some kind (what
> kind?) of error.

Pass any non-zero completion status to flag an error - e.g.
lnet_finalize(ni, msg, -EIO).

> Is it safe to assume that in that case lnet and friends will deal
> with whatever kind of recovery is necessary, for instance if an
> alternate OSS comes on line, do they DTRT about replaying any
> pending messages to the new peer?

Yes - that's not LNET's or the LND's concern at all.

> For rdma receives, if I get a notification that the peer is down, I
> do the same sort of thing; invalidate the descriptors. In addition,
> if it's a link failure, I must do the under-the-hood stuff to
> guarantee that the engine is stopped, then by hand signal completion
> on the relevant buffers. After that I think the rest of the logic
> about finalizing applies.

Sounds fine.

> Does that sound roughly right? Anything else I should be taking
> into account?

The guiding principles for completion are...

1. If you return success from lnd_send or lnd_recv, you must call
   lnet_finalize() within finite time.

2. You may only call lnet_finalize() when there is no longer any
   chance that the underlying network can touch (read or write) the
   payload buffer.

3. The completion status on sends isn't critical. Lustre only really
   needs to know that sending is over; knowing whether the send was
   good or not is really just icing on the cake (e.g. so that it
   doesn't have to wait for a full timeout for an RPC reply if
   sending the request failed).

4. The completion status on receives is completely critical. You may
   only return success if the sink buffer has been filled correctly.

Cheers,
Eric
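Rules 3 and 4 above can be condensed into a pair of status-mapping helpers. This is a hedged sketch: the helper names, parameter names, and the choice of -EIO are mine for illustration, not part of the LNET API; the source only requires "any non-zero completion status" on errors.

```c
#include <errno.h>

/* Rule 3: send status is non-critical icing.  Report an error if we
 * happen to know about one, otherwise success. */
static int send_completion_status(int hw_status)
{
    return hw_status == 0 ? 0 : -EIO;
}

/* Rule 4: receive status is completely critical.  Success is only
 * allowed when the sink buffer is known to have been filled
 * correctly - right length and a clean hardware completion. */
static int recv_completion_status(int hw_status, int nob_filled,
                                  int nob_expected)
{
    if (hw_status != 0 || nob_filled != nob_expected)
        return -EIO;
    return 0;
}
```

The asymmetry is the whole point: a falsely "successful" send only costs Lustre an RPC timeout, while a falsely "successful" receive hands corrupt data upward.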
On Dec 11, 2006, at 6:06 AM, Eric Barton wrote:

> John,
>
>> Does that sound roughly right? Anything else I should be taking
>> into account?
>
> The guiding principles for completion are...
>
> 1. If you return success from lnd_send or lnd_recv, you must call
>    lnet_finalize() within finite time.
>
> 2. You may only call lnet_finalize() when there is no longer any
>    chance that the underlying network can touch (read or write) the
>    payload buffer.
>
> 3. The completion status on sends isn't critical. Lustre only really
>    needs to know that sending is over; knowing whether the send was
>    good or not is really just icing on the cake (e.g. so that it
>    doesn't have to wait for a full timeout for an RPC reply if
>    sending the request failed).
>
> 4. The completion status on receives is completely critical. You may
>    only return success if the sink buffer has been filled correctly.
>
> Cheers,
> Eric

Two other comments:

1) Do not hold any locks when calling any lnet_ functions.

2) Make sure you are _completely_ done with your buffer before
calling lnet_finalize(). I ran into a race condition where I called
lnet_finalize() then placed the rx or tx descriptor on my idle
queue. :-)
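Scott's second comment is an ordering bug, and the fix is just to reverse the two steps. A toy trace of the safe ordering, with entirely hypothetical names (`recycle_desc`, `fake_lnet_finalize`); the real code would of course be returning a descriptor to a shared idle list:

```c
/* Trace the ordering fix for Scott's race: finish ALL of the LND's
 * own bookkeeping on the rx/tx descriptor - including returning it
 * to the idle queue - before calling lnet_finalize(), because
 * lnet_finalize() can call straight back into the LND and race with
 * anything still pending after it. */
enum { EV_NONE, EV_RECYCLED, EV_FINALIZED };

static int trace[2];
static int ntrace;

static void recycle_desc(void)       { trace[ntrace++] = EV_RECYCLED; }
static void fake_lnet_finalize(void) { trace[ntrace++] = EV_FINALIZED; }

/* Completion path in the order Scott's comment requires. */
static void complete_rx(void)
{
    recycle_desc();        /* completely done with our descriptor... */
    fake_lnet_finalize();  /* ...and only then tell LNET             */
}
```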
From: "Eric Barton" <eeb@bartonsoftware.com>
Date: Mon, 11 Dec 2006 11:06:23 -0000

[...]

> The guiding principles for completion are...
>
> 1. If you return success from lnd_send or lnd_recv, you must call
>    lnet_finalize() within finite time.

Right, I got that part.

> 2. You may only call lnet_finalize() when there is no longer any
>    chance that the underlying network can touch (read or write) the
>    payload buffer.

Yes. Not surprisingly, that's the trickiest part. But it's all stuff
that we control, so it can be done.

> 3. The completion status on sends isn't critical. Lustre only really
>    needs to know that sending is over; knowing whether the send was
>    good or not is really just icing on the cake (e.g. so that it
>    doesn't have to wait for a full timeout for an RPC reply if
>    sending the request failed).

Ok.

> 4. The completion status on receives is completely critical. You may
>    only return success if the sink buffer has been filled correctly.

Of course.

From: Scott Atchley <atchley@myri.com>
Date: Mon, 11 Dec 2006 07:21:38 -0500

> Two other comments:
>
> 1) Do not hold any locks when calling any lnet_ functions.

Yikes. Yes. I'm pretty sure I wasn't, but good to keep in mind.

Does that really mean no locks at all, or no locks that could turn
into recursive lock attempts due to lnet calling back in? Are the
lnet things (which get called into by lnd) all non-blocking?

> 2) Make sure you are _completely_ done with your buffer before
> calling lnet_finalize(). I ran into a race condition where I called
> lnet_finalize() then placed the rx or tx descriptor on my idle
> queue. :-)

Yes, that would probably be a Bad Thing (tm).

Thanks...
> From: Scott Atchley <atchley@myri.com>
> Date: Mon, 11 Dec 2006 07:21:38 -0500
>
> Two other comments:
>
> 1) Do not hold any locks when calling any lnet_ functions.

Indeed. And I forgot to mention that you have to be in thread
context too!

> Yikes. Yes. I'm pretty sure I wasn't, but good to keep in mind.
>
> Does that really mean no locks at all, or no locks that could turn
> into recursive lock attempts due to lnet calling back in? Are the
> lnet things (which get called into by lnd) all non-blocking?

LNET will not block when you call lnet_finalize() unless it is to
allocate memory. But it can call you back (e.g. to send an ACK when
you finalize a PUT). All in all, it's best not to be holding any
locks at all.

Cheers,
Eric

---------------------------------------------------
|Eric Barton      Barton Software                  |
|9 York Gardens   Tel:    +44 (117) 330 1575       |
|Clifton          Mobile: +44 (7909) 680 356       |
|Bristol BS8 4LL  Fax:    call first               |
|United Kingdom   E-Mail: eeb@bartonsoftware.com   |
---------------------------------------------------
From: "Eric Barton" <eeb@bartonsoftware.com>
Date: Mon, 11 Dec 2006 15:07:16 -0000

> > Two other comments:
> >
> > 1) Do not hold any locks when calling any lnet_ functions.
>
> Indeed. And I forgot to mention that you have to be in thread
> context too!

Yes, I'd already figured that one out. I'm really doing very little
in interrupt context; I foisted all that off on the service thread.

> > Yikes. Yes. I'm pretty sure I wasn't, but good to keep in mind.
> >
> > Does that really mean no locks at all, or no locks that could turn
> > into recursive lock attempts due to lnet calling back in? Are the
> > lnet things (which get called into by lnd) all non-blocking?
>
> LNET will not block when you call lnet_finalize() unless it is to
> allocate memory. But it can call you back (e.g. to send an ACK when
> you finalize a PUT).

Sure, leading to the obvious class of deadlocks. I just wondered
whether there was something I hadn't thought of which might be
causing lnet to block.

> All in all, it's best not to be holding any locks at all.

Of course. I don't actually believe there's any reason for me to be
holding any locks at that time. I'm trying to make sure everything
is reentrant all over the place, because that's the only way it's
going to run at speed on a 6-cpu smp.
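One common way to satisfy both constraints at once (thread context, no locks held, while lnet_finalize() may re-enter the LND) is the batching pattern the service-thread approach suggests: unlink finished descriptors from the shared lists under the driver lock, drop the lock, then finalize the batch. A sketch with made-up names, using a plain flag to stand in for the spinlock so the deadlock hazard is visible:

```c
#include <assert.h>

/* Drain completions in thread context: collect under the (simulated)
 * driver lock, drop it, then call lnet_finalize() with no locks
 * held, since LNET may call back into the LND (e.g. to send an ACK
 * when a PUT is finalized).  All names here are illustrative. */
static int lock_held;   /* stands in for spin_lock(&dev->lock) */

static void fake_lnet_finalize(int status)
{
    (void)status;
    assert(!lock_held);  /* holding our lock here invites deadlock */
}

static void service_thread_pass(const int *done, int n)
{
    int batch[8];
    int m = 0;

    lock_held = 1;                     /* spin_lock(&dev->lock)   */
    for (int i = 0; i < n && m < 8; i++)
        batch[m++] = done[i];          /* unlink from shared lists */
    lock_held = 0;                     /* spin_unlock(&dev->lock) */

    for (int i = 0; i < m; i++)
        fake_lnet_finalize(batch[i]);  /* lock-free, thread context */
}
```

This also keeps the lock hold time short, which matters for the reentrancy-everywhere goal on a 6-cpu smp.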