thr3ads.net - Lustre discuss - [Lustre-discuss] Aborting Communications. [Jun 2006]

Eric Barton
2006-Jun-18 11:18 UTC
[Lustre-discuss] Aborting Communications.

Hi!

I'm posting a comment about ptlrpc_abort_bulk() that I just sent to a
CFS customer in response to a bug report because I think it illustrates
the general principle behind communications teardown on error.  The
specific example here is a server thread calling ptlrpc_abort_bulk() on
an XT3 when it times out a client.  And the problem is that this seems
to be taking far too long...


When a service thread thinks its client is not responsive (because a
timeout has expired), it calls ptlrpc_abort_bulk() to disengage its bulk
buffer from the network since it cannot proceed until it can guarantee
that the network isn't going to overwrite this buffer.

It calls LNetMDUnlink() to initiate the process of detaching the bulk
buffer from the network and then waits for the "final" completion event
for the bulk buffer: the one that has the "unlinked" flag set, meaning
that the buffer is now inaccessible to the network.  Note that normal
completion could still occur at this time, so LNetMDUnlink() is just a
way of saying "I'm bored now; please give me my final completion event
ASAP".
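
To make the "final event" rule concrete, here is a minimal sketch in
Python rather than the real C code.  BulkEvent and drain_until_unlinked
are hypothetical names invented for illustration; they are not the LNET
API.  The only point is that the buffer becomes safe when an event
carrying the "unlinked" flag arrives, whether that event is a normal
completion or the result of the unlink request.

```python
from dataclasses import dataclass


@dataclass
class BulkEvent:
    # Hypothetical stand-in for a completion event on the bulk buffer.
    kind: str       # e.g. "put_end" (normal completion) or "unlink"
    unlinked: bool  # True only on the last event the buffer will ever get


def drain_until_unlinked(events):
    """Consume events in order and stop at the first one flagged unlinked.

    Returns the events consumed (including the final one), or None if the
    stream ended without a final event, in which case the buffer must
    still be treated as visible to the network.
    """
    seen = []
    for ev in events:
        seen.append(ev)
        if ev.unlinked:
            return seen
    return None
```

A normal completion that races with the unlink request is handled the
same way as an explicit unlink event: only the flag matters.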

The service thread MUST wait FOREVER for this final completion event,
because without it, there is no guarantee that the network can't
overwrite the bulk buffer.  So the service thread actually sits in a
loop and prints a warning message every 300 seconds, because LNET really
should be more responsive than that, and waiting so long is an
indication of deeper problems.  BTW, the lustre watchdog code, which
was implemented later, also happens to catch this case :).
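
The shape of that wait loop can be sketched roughly as follows.  This
is a Python illustration, not the ptlrpc code: wait_for_final_event and
its parameters are made-up names, and a threading.Event stands in for
"the final completion event has been delivered".  The loop never gives
up; it only complains once per interval.

```python
import threading


def wait_for_final_event(unlinked, warn_interval=300.0, warn=print):
    """Block until `unlinked` is set, warning once per `warn_interval`.

    There is deliberately no overall timeout: returning early would let
    the caller reuse a buffer the network might still write to.
    Returns the total time spent in timed-out waits, in seconds.
    """
    waited = 0.0
    # Event.wait() returns False when the timeout expires without the
    # event being set, so each False means another full interval passed.
    while not unlinked.wait(timeout=warn_interval):
        waited += warn_interval
        warn(f"still waiting for final completion event after {waited:.0f}s")
    return waited
```

In this sketch the event handler would call unlinked.set() when it sees
the event with the "unlinked" flag; the 300-second default mirrors the
warning interval mentioned above.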

So why might the gap between the LNetMDUnlink() and the final completion
event be so long?

Either it's a bug in LNET / the LND involved (ptllnd in this particular
example), or the underlying network (in this case cray portals) is not
honouring its requirement to complete all operations in (a reasonably)
finite time.

The LND keeps a reference on the MD while it has the MD mapped for RDMA
and LNET has to wait until this reference is released before it can
issue the final completion event.  However, the LND starts a timeout
running when it maps a buffer for RDMA.  If this timeout expires before
the RDMA completes, the LND does whatever it needs to do (e.g. unmap,
disconnect) to prohibit further network access to the MD.  When that has
been accomplished, it drops its reference on the MD, and this in turn
allows LNET to deliver the final completion event.
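
As a toy model of that reference/timeout interplay (Python, with
invented names; the real LNET and LND data structures look nothing like
this), the final event is only generated once an unlink has been
requested AND the LND has dropped its reference.  The 50-second default
is the ptllnd default; other LNDs may differ.

```python
class MemoryDescriptor:
    """Hypothetical MD: tracks LND references and the unlink request."""

    def __init__(self):
        self.lnd_refs = 0
        self.unlink_requested = False
        self.events = []

    def lnd_map_for_rdma(self):
        # The LND takes a reference while the buffer is mapped for RDMA.
        self.lnd_refs += 1

    def lnd_release(self):
        # Called on RDMA completion, or after a timeout once access is cut.
        self.lnd_refs -= 1
        self._maybe_finalize()

    def unlink(self):
        # Only records the request; it cannot force the LND to let go.
        self.unlink_requested = True
        self._maybe_finalize()

    def _maybe_finalize(self):
        done = "unlinked" in self.events
        if self.unlink_requested and self.lnd_refs == 0 and not done:
            self.events.append("unlinked")


def lnd_check_timeout(md, mapped_at, now, timeout=50.0):
    """If the RDMA has been pending for `timeout` seconds, drop the ref.

    A real LND would first unmap or disconnect to prohibit further
    network access; this sketch goes straight to the release.
    """
    if md.lnd_refs > 0 and now - mapped_at >= timeout:
        md.lnd_release()
        return True
    return False
```

With this model, calling unlink() while the LND still holds its
reference produces no event at all until lnd_check_timeout() fires.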

In the case of the ptllnd, "mapping for RDMA" means posting a cray
portals ME/MD for the bulk buffer, and "prohibiting further network
access" means calling PtlMDUnlink() and waiting for cray portals to post
a completion event (just like LNET actually :).

So I'd guess that either the ptllnd isn't timing out (its timeout
defaults to 50 seconds) and calling PtlMDUnlink(), cray portals isn't
delivering the final completion event, or ptllnd is missing it.

Note that _nothing_ happens until the ptllnd times out (and there should
be a console message when it does).

Also (FYI) we're considering making LNetMDUnlink() more 'aggressive' and
calling down into the LND to ask it to lay off, rather than waiting for
the LND's timeout; but that's just a thought at the moment.