John,

> Eric, correct me if I misstate anything, but my understanding is that
> the client always requests a transfer and the server handles the PUT
> (for READ) and GET (for WRITE).
>
> Yikes, now I'm even more confused. Given nodes A and B. A has 27 MB he
> wants to send to B. B isn't expecting it (yet). Are you saying that A
> starts by sending a GET to B? I had been thinking that was what a PUT
> was for. I thought A sent a PUT, B figured out where it wanted to
> receive the 27MB, B sent back a GET (which caused the 27MB to flow
> across) and then optionally B sent back something saying that the
> (original) PUT had completed.

At the lustre level, ALL communications are RPCs....

0. Server posts match-all ME/MDs for RPC request buffers to the service
   request portal.
1. Client posts match-unique ME/MDs for the RPC reply buffer and any bulk
   data buffers associated with the RPC.
2. Client PUTs the RPC request to the server's request portal. The request
   message includes the unique matchbits for the reply and bulk data
   buffers.
3. Server PUTs/GETs the client's bulk data (if any) using the unique
   matchbits passed in the RPC request.
4. Server PUTs the RPC reply to the client's RPC reply buffer using the
   unique matchbits passed in the RPC request.
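
To make the ordering concrete, here is a rough sketch of steps 0-4 for a
write, as a toy Python model rather than real LNet code. The Node class, the
portal numbers and the dictionary-based request are all assumptions made up
for illustration. In terms of the A/B example, A is the client, and the 27MB
only moves in step 3, when the server GETs it using the matchbits A sent in
its request.

```python
from dataclasses import dataclass, field
from itertools import count

# Toy model only: every name here is hypothetical, none of it is the LNet API.
REQUEST_PORTAL, REPLY_PORTAL, BULK_PORTAL = 0, 1, 2  # assumed portal numbers
_matchbits = count(1)  # source of unique matchbits


@dataclass
class Node:
    name: str
    queues: dict = field(default_factory=dict)  # portal -> queued requests (match-all)
    unique: dict = field(default_factory=dict)  # (portal, matchbits) -> buffer

    def post_match_all(self, portal):
        self.queues.setdefault(portal, [])

    def post_unique(self, portal, contents=None):
        bits = next(_matchbits)
        self.unique[(portal, bits)] = contents
        return bits

    def handle_put(self, portal, matchbits, payload):
        if (portal, matchbits) in self.unique:
            self.unique[(portal, matchbits)] = payload  # lands in the named buffer
        else:
            self.queues[portal].append(payload)         # falls through to the queue

    def handle_get(self, portal, matchbits):
        return self.unique[(portal, matchbits)]


def write_rpc(client, server, data):
    # 1. Client posts match-unique buffers for the reply and the bulk data.
    reply_bits = client.post_unique(REPLY_PORTAL)
    bulk_bits = client.post_unique(BULK_PORTAL, data)
    # 2. Client PUTs the request, carrying both sets of matchbits.
    request = {"op": "write", "reply_bits": reply_bits, "bulk_bits": bulk_bits}
    server.handle_put(REQUEST_PORTAL, 0, request)
    # 3. Server takes the request off its queue and GETs the bulk data.
    req = server.queues[REQUEST_PORTAL].pop(0)
    received = client.handle_get(BULK_PORTAL, req["bulk_bits"])
    # 4. Server PUTs the reply into the client's reply buffer.
    client.handle_put(REPLY_PORTAL, req["reply_bits"],
                      {"status": "ok", "bytes": len(received)})
    return client.unique[(REPLY_PORTAL, reply_bits)], received


server = Node("server")
server.post_match_all(REQUEST_PORTAL)  # 0. Server posts request buffers.
client = Node("client")
reply, received = write_rpc(client, server, b"x" * 27)
print(reply, len(received))
```

A read would look the same except that in step 3 the server PUTs into the
client's bulk buffer instead of GETting from it.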

The portal, sender NID+PID and matchbits in an incoming PUT/GET determine
which buffers should receive/source the payload. Match-any ME/MDs act like
a message queue and lustre uses these for RPC requests just like
conventional message passing. The matchbits of match-unique ME/MDs are
like a safely exportable RDMA handle and lustre uses these for bulk
transfers and RPC replies.
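
The matching rule can be sketched in a few lines of Python. This is only an
illustration: MatchEntry, find_buffer, the ANY wildcard and the example NIDs
are hypothetical names, and the ignore-bits mask is modelled on Portals-style
matching, which I am assuming is close enough to show the idea. A match-any
entry ignores every matchbit and accepts any sender, while a match-unique
entry requires the exact matchbits and sender.

```python
from dataclasses import dataclass
from typing import Optional

ANY = None               # hypothetical wildcard for NID / PID
MASK64 = (1 << 64) - 1   # matchbits treated as a 64-bit value


@dataclass
class MatchEntry:
    portal: int
    nid: Optional[str]   # expected sender NID, or ANY
    pid: Optional[int]   # expected sender PID, or ANY
    match_bits: int
    ignore_bits: int     # bits set here are not compared
    buffer: str          # stand-in for the attached MD

    def matches(self, portal, nid, pid, match_bits):
        if portal != self.portal:
            return False
        if self.nid is not ANY and self.nid != nid:
            return False
        if self.pid is not ANY and self.pid != pid:
            return False
        # Compare only the bits that are not ignored.
        differing = (match_bits ^ self.match_bits) & ~self.ignore_bits & MASK64
        return differing == 0


def find_buffer(entries, portal, nid, pid, match_bits):
    # The first matching entry wins; None means nothing was posted for this message.
    for entry in entries:
        if entry.matches(portal, nid, pid, match_bits):
            return entry.buffer
    return None


entries = [
    # Match-any: behaves like a message queue for RPC requests.
    MatchEntry(0, ANY, ANY, 0, MASK64, "request queue"),
    # Match-unique: only this sender with these exact matchbits gets in.
    MatchEntry(1, "server-1", 1234, 0xBEEF, 0, "reply buffer"),
]
```

With these entries, any sender hitting portal 0 lands in the request queue,
whereas find_buffer(entries, 1, "server-1", 1234, 0xBEEF) is the only call
that reaches the reply buffer.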

Whether or not to use RDMA on any individual communication is completely
up to the LND. In fact LNDs don't have a clue which part of a lustre RPC a
particular PUT/GET belongs to and therefore can't base their decision on
that.

All RDMA capable LNDs currently look at the PUT/GET payload size to
determine whether to use RDMA. If it is small they transfer it by message
passing and it's copied to the final destination at the sink (i.e. PUT
destination or GET source). Otherwise the payload is transferred by RDMA.

This minimises the number of messages and avoids RDMA setup overhead for
small messages to reduce latency at the expense of doing the copy. Large
messages take the additional latency of the RDMA setup but eliminate the
copy overhead and therefore maximise bandwidth.
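
The size test described above amounts to something like the following
sketch. The function name and the 4096-byte cutoff are made up for
illustration; each LND picks its own limit, and I am not quoting any real
one here.

```python
# Hypothetical cutoff in bytes; real LNDs choose their own value.
RDMA_THRESHOLD = 4096


def choose_transport(payload_len, threshold=RDMA_THRESHOLD):
    """Pick a transfer method from the payload size alone.

    The decision sees only the size, never which part of an RPC
    (request, bulk or reply) the PUT/GET belongs to.
    """
    if payload_len <= threshold:
        return "message"  # sent inline, copied to its final destination at the sink
    return "rdma"         # pays the RDMA setup cost, avoids the copy


for size in (128, 4096, 27 * 1024 * 1024):
    print(size, choose_transport(size))
```

So a small RPC request or reply would normally go by message passing, and a
27MB bulk transfer would go by RDMA, purely because of its size.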

Note, however, that lustre services don't just run on the OSS or MDS -
every lustre client runs lock notification services which lustre servers
use to ask their clients to release locks when there is lock contention.
So an RDMA capable LND will most probably use RDMA to transfer file and
directory contents between lustre clients and servers, but this isn't a
hard and fast rule.
--
Cheers,
Eric
---------------------------------------------------
|Eric Barton Barton Software |
|9 York Gardens Tel: +44 (117) 330 1575 |
|Clifton Mobile: +44 (7909) 680 356 |
|Bristol BS8 4LL Fax: call first |
|United Kingdom E-Mail: eeb@bartonsoftware.com|
---------------------------------------------------