> I think this may boil down to a question of tastefulness.
>
> In the case where A does a GET at B, the nature of our transport
> dictates that B must figure out what he's going to do to satisfy the
> GET, get it staged up, then send something back to A saying "here's
> the thing you're receiving from". Both the get request and the
> response describing where A should pull from will go over the
> short-message channel.

Alternatively A can send over the RDMA descriptor for its sink buffer
and B can just launch the transfer after it has called lnet_parse() and
had its recv() callback called to "receive" the GET. The elan and
openib(gen1)/cisco LNDs do this.

Where it gets a bit yeuchy is that the last arg of lnet_parse() is 1
(TRUE) in this case, meaning that the LND wants to do RDMA using
whatever other info (e.g. RDMA descriptors) it just received, so lnet
should call its recv() callback passing the matched buffers (which the
recv() will actually send from!!!). If it passed 0 (FALSE), the recv()
callback would occur, but just to allow the LND to repost its buffer -
it could be some time later that the LND's send() callback would be
invoked with the matched buffers, and it hugely complicates the LND
code to make it hang on to the receive buffer (including the peer sink
buffer RDMA descriptor) until then.

> Looking at existing LNDs that do similar things, it looks like the
> response message is often handled more "under the hood". But our
> transport is asymmetrical in that way, ie when a receiver starts a
> pull, the signalling is handled by the hardware, but when a
> transmitter wants to push, he must send a short message to tell the
> receiver to pull.
>
> So the question is, is it a "Bad Thing" (tm) to just invent a new
> opcode analogous to LND_GET_REQUEST, and let it percolate through
> the same flow of control as everything else? Or do you think I'm
> better off having basically two layers of message, one for these out
> of band ones, and the other for the usual short-message stuff?

I used RDMA READ on the first RDMA capable networks, but stopped using
it for iiblnd (and other LNDs just followed suit) because I was worried
about the additional resources required in the underlying network to
support it. If RDMA READ is a natural choice on Scicortex, it's a
no-brainer to eliminate yet another network latency.

However, note the restrictions on when you can use it to implement
GET/REPLY. When LNET routers are involved, you cannot optimize the GET
in this way - the LNET GET request must make it all the way to the
destination before the source buffers can be matched, and the LNET
REPLY is forwarded back just like PUTs.

-- 
Cheers,
Eric

---------------------------------------------------
|Eric Barton        Barton Software               |
|9 York Gardens     Tel:    +44 (117) 330 1575    |
|Clifton            Mobile: +44 (7909) 680 356    |
|Bristol BS8 4LL    Fax:    call first            |
|United Kingdom     E-Mail: eeb@bartonsoftware.com|
---------------------------------------------------
From: "Eric Barton" <eeb@bartonsoftware.com>
Date: Mon, 4 Dec 2006 16:11:25 -0000

> > I think this may boil down to a question of tastefulness.
> >
> > In the case where A does a GET at B, the nature of our transport
> > dictates that B must figure out what he's going to do to satisfy
> > the GET, get it staged up, then send something back to A saying
> > "here's the thing you're receiving from". Both the get request and
> > the response describing where A should pull from will go over the
> > short-message channel.
>
> Alternatively A can send over the RDMA descriptor for its sink buffer
> and B can just launch the transfer after it has called lnet_parse()
> and had its recv() callback called to "receive" the GET. The elan and
> openib(gen1)/cisco LNDs do this.

Yeah, I see that, but our transport doesn't work that way. I need to
have already initialized the transmit side before I can kick the
receive side into gear. After that it all happens in hardware, but the
cost is that I'm not allowed to start the receive side first.

> Where it gets a bit yeuchy is that the last arg of lnet_parse() is 1
> (TRUE) in this case, meaning that the LND wants to do RDMA using
> whatever other info (e.g. RDMA descriptors) it just received, so lnet
> should call its recv() callback passing the matched buffers (which
> the recv() will actually send from!!!). If it passed 0 (FALSE), the
> recv() callback would occur, but just to allow the LND to repost its
> buffer - it could be some time later that the LND's send() callback
> would be invoked with the matched buffers, and it hugely complicates
> the LND code to make it hang on to the receive buffer (including the
> peer sink buffer RDMA descriptor) until then.

Ouch. You're making my brain hurt.

> > Looking at existing LNDs that do similar things, it looks like the
> > response message is often handled more "under the hood". But our
> > transport is asymmetrical in that way, ie when a receiver starts a
> > pull, the signalling is handled by the hardware, but when a
> > transmitter wants to push, he must send a short message to tell the
> > receiver to pull.
> >
> > So the question is, is it a "Bad Thing" (tm) to just invent a new
> > opcode analogous to LND_GET_REQUEST, and let it percolate through
> > the same flow of control as everything else? Or do you think I'm
> > better off having basically two layers of message, one for these
> > out of band ones, and the other for the usual short-message stuff?
>
> I used RDMA READ on the first RDMA capable networks, but stopped
> using it for iiblnd (and other LNDs just followed suit) because I was
> worried about the additional resources required in the underlying
> network to support it. If RDMA READ is a natural choice on Scicortex,
> it's a no-brainer to eliminate yet another network latency.

So it sounds like the answer is that I *should* implement a different
layer of messaging, where the first dispatch is whether or not it's a
low-level dma control message vs an lnd message that needs to go
through the usual mechanism. That way the control messages (of which I
believe "I've set up my transmit, now you start your receive" is the
only one) will be handled with a minimum of overhead.

> However, note the restrictions on when you can use it to implement
> GET/REPLY. When LNET routers are involved, you cannot optimize the
> GET in this way - the LNET GET request must make it all the way to
> the destination before the source buffers can be matched, and the
> LNET REPLY is forwarded back just like PUTs.

Yes, understood. There are a number of things that get more complicated
when I start to deal with routing, but I believe I can push them all
off until rev 2. For starters, the only thing that will use this LND is
the case where the cluster is directly connected (via FC or whatever)
to the storage array, and therefore all I care about is the simple
case, no routing. For external cases, I expect we'll just be using
socklnd. I'm trying to leave breadcrumbs in the code for that future
date.

There are other issues that will need to be addressed before I can
really build a proper router anyhow. For instance, can I interface
IB's dma stuff directly to our proprietary technology to make bits
stream all the way through from end to end? If the answer is yes, then
the world is a very different place than if things have to land in
memory and be re-launched by software along the way. It might turn out
that the right architecture is simply (hah!) to implement an IB-like
semantic layer, then use that to extend something like openib
throughout the system. Until we can think through those issues, I'm not
going to tube too hard on the routing question.
John,

> Yeah, I see that, but our transport doesn't work that way. I need
> to have already initialized the transmit side before I can kick the
> receive side into gear. After that it all happens in hardware, but
> the cost is that I'm not allowed to start the receive side first.
>
> So it sounds like the answer is that I *should* implement a
> different layer of messaging, where the first dispatch is whether or
> not it's a low-level dma control message vs an lnd message that
> needs to go through the usual mechanism. That way the control
> messages (of which I believe "I've set up my transmit, now you start
> your receive" is the only one) will be handled with a minimum of
> overhead.

Do you have a special low-latency way of handling messages that the
short-message channel can't use? If not, I don't see what this will
really buy you.

If the "natural" RDMA on scicortex is RDMA READ, then LNET PUT would go
something like....

1. Lustre on node A calls LNetPut(), which calls sc_send() to send an
   LNET PUT header + payload. sc_send() sets up the source buffers
   ready for transmission and sends an SC_PUT_REQ which includes the
   LNET header and the RDMA descriptor for the source buffers.

2. The LND on node B receives the SC_PUT_REQ and passes the LNET
   header to lnet_parse(). When sc_recv() is called with the matched
   buffer iovs, it initiates the RDMA to fetch the PUT payload.

3. Both sides receive notification somehow when the RDMA completes and
   call lnet_finalize().

...and an "optimized" LNET GET might go something like...

1. Lustre on node A calls LNetGet(), which calls sc_send(). This sends
   an SC_GET_REQ which includes the LNET header.

2. The LND on node B receives the SC_GET_REQ and passes the LNET
   header to lnet_parse(). When sc_recv() is called with the matched
   buffer iovs, it sets up these buffers for transmitting and sends
   back an SC_GET_ACK containing the RDMA descriptor for them.

3. The LND on node A receives the SC_GET_ACK and initiates the RDMA to
   fetch the GET payload.

4. Both sides receive notification somehow when the RDMA completes and
   call lnet_finalize().

...but actually, this is more code for no reward. You could just as
easily send the GET immediately and handle RDMA in the LNET REPLY
message, i.e....

1. Lustre on node A calls LNetGet(), which calls sc_send(). This sends
   an SC_IMMEDIATE message with the LNET header and 0 payload.

2. The LND on node B receives the SC_IMMEDIATE message and passes it
   to lnet_parse(). sc_recv() is called back with NULL payload buffers
   in response to the LNET GET just parsed (just because LNET always
   calls the recv() callback if lnet_parse() returns success). LNET on
   node B also calls sc_send() to send an LNET REPLY header + payload.
   sc_send() sets up the source buffers ready for transmission and
   sends an SC_PUT_REQ which includes the LNET header and the RDMA
   descriptor for the source buffers.

3. The LND on node A receives the SC_PUT_REQ and passes the LNET
   header to lnet_parse(). When sc_recv() is called with the matched
   buffer iovs, it initiates the RDMA to fetch the REPLY payload.

4. Both sides receive notification somehow when the RDMA completes and
   call lnet_finalize().

-- 
Cheers,
Eric
From: "Eric Barton" <eeb@bartonsoftware.com>
Date: Mon, 4 Dec 2006 17:57:05 -0000

> John,
>
> > Yeah, I see that, but our transport doesn't work that way. I need
> > to have already initialized the transmit side before I can kick the
> > receive side into gear. After that it all happens in hardware, but
> > the cost is that I'm not allowed to start the receive side first.
> >
> > So it sounds like the answer is that I *should* implement a
> > different layer of messaging, where the first dispatch is whether
> > or not it's a low-level dma control message vs an lnd message that
> > needs to go through the usual mechanism. That way the control
> > messages (of which I believe "I've set up my transmit, now you
> > start your receive" is the only one) will be handled with a
> > minimum of overhead.
>
> Do you have a special low-latency way of handling messages that the
> short-message channel can't use?

The hardware (actually it's microcode), totally below the direct
control of software, does the signalling necessary to start the
companion transmit on a peer node when a receive is started. I believe
that's what you're asking about? There is nothing analogous for going
the other direction. IOW, initiating the actual transfer of a bag of
bits can only be done from the receive side, after the transmit side is
ready for it.

What this means in practice is that it's cheaper to initiate a transfer
from the transmit side, because the transmitter can set it up, then
send a short message to the receiver, which then starts the receive
operation. To initiate a transfer from the receive side, the receiver
has to basically ask the transmitter to set up his side, get something
back that says that's done and passes the transmit descriptor, then
start up the receive.

> If not, I don't see what this will really buy you.
>
> If the "natural" RDMA on scicortex is RDMA READ, then LNET PUT would
> go something like....
>
> 1. Lustre on node A calls LNetPut(), which calls sc_send() to send
>    an LNET PUT header + payload. sc_send() sets up the source
>    buffers ready for transmission and sends an SC_PUT_REQ which
>    includes the LNET header and the RDMA descriptor for the source
>    buffers.
>
> 2. The LND on node B receives the SC_PUT_REQ and passes the LNET
>    header to lnet_parse(). When sc_recv() is called with the matched
>    buffer iovs, it initiates the RDMA to fetch the PUT payload.
>
> 3. Both sides receive notification somehow when the RDMA completes
>    and call lnet_finalize().

yes...

> ...and an "optimized" LNET GET might go something like...
>
> 1. Lustre on node A calls LNetGet(), which calls sc_send(). This
>    sends an SC_GET_REQ which includes the LNET header.
>
> 2. The LND on node B receives the SC_GET_REQ and passes the LNET
>    header to lnet_parse(). When sc_recv() is called with the matched
>    buffer iovs, it sets up these buffers for transmitting and sends
>    back an SC_GET_ACK containing the RDMA descriptor for them.
>
> 3. The LND on node A receives the SC_GET_ACK and initiates the RDMA
>    to fetch the GET payload.
>
> 4. Both sides receive notification somehow when the RDMA completes
>    and call lnet_finalize().

yes. It was step 2, what you're calling get-ack, which I was originally
asking about, and supposing ought to happen "under the hood", ie
without benefit of going through lnet_parse etc.

> ...but actually, this is more code for no reward. You could just as
> easily send the GET immediately and handle RDMA in the LNET REPLY
> message, i.e....
>
> 1. Lustre on node A calls LNetGet(), which calls sc_send(). This
>    sends an SC_IMMEDIATE message with the LNET header and 0 payload.
>
> 2. The LND on node B receives the SC_IMMEDIATE message and passes it
>    to lnet_parse(). sc_recv() is called back with NULL payload
>    buffers in response to the LNET GET just parsed (just because
>    LNET always calls the recv() callback if lnet_parse() returns
>    success). LNET on node B also calls sc_send() to send an LNET
>    REPLY header + payload. sc_send() sets up the source buffers
>    ready for transmission and sends an SC_PUT_REQ which includes the
>    LNET header and the RDMA descriptor for the source buffers.
>
> 3. The LND on node A receives the SC_PUT_REQ and passes the LNET
>    header to lnet_parse(). When sc_recv() is called with the matched
>    buffer iovs, it initiates the RDMA to fetch the REPLY payload.
>
> 4. Both sides receive notification somehow when the RDMA completes
>    and call lnet_finalize().

Ok. I think I see. You're relying on B's LNET to call into lnd_recv
with the header (which is interesting) and zero payload (which is not).
B then does what he would have done if he was the initiator in the
first place, namely issue a PUT and let the rest of the machinery do
its thing.

I actually thought about doing that a bit, but wasn't sure it was safe
for B to be doing a PUT involving the lnet_hdr_t on which A was already
doing a GET. If the answer is that that is safe, I think my problem's
solved.
> I actually thought about doing that a bit, but wasn't sure it
> was safe for B to be doing a PUT involving the lnet_hdr_t on
> which A was already doing a GET.
>
> If the answer is that that is safe, I think my problem's solved.

Yes, it's safe. All the LNET headers that the LND is asked to
send/receive will always be in separate buffers, so there is no risk of
them getting trampled on - e.g. the REPLY header created in response to
a GET isn't the just-received GET header - it's a brand new one in its
own buffer.

And all lnet_parse() does (apart from checking header format) is find
matching buffers for incoming headers. GET and PUT headers contain
match bits and a portal number for this. REPLY and ACK messages contain
an abstract handle (lnet_handle_wire_t). This handle is included in the
outgoing GET or PUT and copied into the REPLY or optional ACK that is
sent in response, so that when they arrive, LNET can find the MD
associated with the original LNetGet() or LNetPut().

Cheers,
Eric
From: "Eric Barton" <eeb@bartonsoftware.com>
Date: Mon, 4 Dec 2006 18:59:39 -0000

> > I actually thought about doing that a bit, but wasn't sure it
> > was safe for B to be doing a PUT involving the lnet_hdr_t on
> > which A was already doing a GET.
> >
> > If the answer is that that is safe, I think my problem's solved.
>
> Yes, it's safe. All the LNET headers that the LND is asked to
> send/receive will always be in separate buffers, so there is no risk
> of them getting trampled on - e.g. the REPLY header created in
> response to a GET isn't the just-received GET header - it's a brand
> new one in its own buffer.
>
> And all lnet_parse() does (apart from checking header format) is
> find matching buffers for incoming headers. GET and PUT headers
> contain match bits and a portal number for this. REPLY and ACK
> messages contain an abstract handle (lnet_handle_wire_t). This
> handle is included in the outgoing GET or PUT and copied into the
> REPLY or optional ACK that is sent in response, so that when they
> arrive, LNET can find the MD associated with the original LNetGet()
> or LNetPut().

Ok, spiff. I like that scheme better than the path I was going down.
Thanks!