Hi, thanks for the previous help. I have some questions about Lustre RPC and the sequence of events that occur during a large concurrent write() involving many processes and a large data size per process. I understand there is a mechanism of flow control by credits, but I'm a little unclear on how it works in general after reading the "networking & io protocol" white paper.

Is it true that a write() RPC transfers data in chunks of at least 1MB and at most (max_pages_per_rpc*page_size) bytes, where page_size=2^16? Can I use these bounds to estimate the number of RPCs issued per MB of data to write?

About how many concurrent incoming write() RPCs per OSS service thread can a single server handle before it stops responding to incoming RPCs?

What happens to an RPC when the server is too busy to handle it? Is it even issued by the client? Does the client have to poll and/or resend the RPC? Does the process of polling for flow control credits add significant network/server congestion?

Is it likely that a large number of RPCs/flow control credit requests will induce enough network congestion that clients' RPCs time out? How does the client handle such a timeout?

Burlen
On 2010-08-17, at 14:15, burlen wrote:
> I have some questions about Lustre RPC and the sequence of events that
> occur during a large concurrent write() involving many processes and a
> large data size per process. I understand there is a mechanism of flow
> control by credits, but I'm a little unclear on how it works in general
> after reading the "networking & io protocol" white paper.

There are different levels of flow control. There is one at the LNET level, which keeps low-level messages from overwhelming the server and avoids stalling small/reply messages at the back of a deep queue of requests.

> Is it true that a write() RPC transfers data in chunks of at least 1MB
> and at most (max_pages_per_rpc*page_size) bytes, where page_size=2^16?
> Can I use these bounds to estimate the number of RPCs issued per MB of
> data to write?

Currently, 1MB is the largest bulk IO size, and is the typical size used by clients for all IO.

> About how many concurrent incoming write() RPCs per OSS service thread
> can a single server handle before it stops responding to incoming RPCs?

The server can handle tens of thousands of write _requests_, but note that since Lustre has always been designed as an RDMA-capable protocol, the request is relatively small (a few hundred bytes) and does not contain any of the data.

When one of the server threads is ready to process a read/write request it will get or put the data from/to the buffers that the client has already prepared. The number of currently active IO requests is exactly the number of active service threads (up to 512 by default).

> What happens to an RPC when the server is too busy to handle it? Is it
> even issued by the client? Does the client have to poll and/or resend
> the RPC? Does the process of polling for flow control credits add
> significant network/server congestion?

The clients limit the number of concurrent RPC requests, by default to 8 per OST. The LNET-level message credits will also limit the number of in-flight messages in case there is e.g. an LNET router between the client and server.

The client will almost never time out a request, as it is informed how long requests are currently taking to process and will wait patiently for its earlier requests to finish processing. If the client is going to time out a request (based on an earlier request timeout that is about to be exceeded), the server will inform it to continue waiting and give it a new processing time estimate (unless of course the server is non-functional or so overwhelmed that it can't even do that).

> Is it likely that a large number of RPCs/flow control credit requests
> will induce enough network congestion that clients' RPCs time out?
> How does the client handle such a timeout?

Since the flow control credits are bounded, and are returned to the peer as earlier requests complete, there is no additional traffic due to this. However, considering that HPC clusters are distributed denial-of-service engines, it is always possible to overwhelm the server under some conditions. In case of a client RPC timeout (hundreds of seconds under load) the client will resend the request and/or try to contact the backup server until one responds.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
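[For illustration, a rough sketch of the RPC-count estimate implied above, taking the 1MB bulk size at face value. The data size is an arbitrary example and the lctl path assumes a 1.8-style client; this is a back-of-the-envelope aid, not an exact accounting.]

  # Sketch only: rough bulk-RPC count for a streaming write, assuming ~1MB
  # per bulk RPC as described above. Numbers are illustrative.
  data_mb=1024                       # e.g. one client writing 1GB
  bulk_mb=1                          # typical bulk IO size per write RPC
  in_flight=$(lctl get_param -n osc.*.max_rpcs_in_flight | head -1)
  echo "approx bulk write RPCs: $((data_mb / bulk_mb))"
  echo "at most ${in_flight} of them in flight per OST at any time"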
Andreas Dilger wrote:
> Currently, 1MB is the largest bulk IO size, and is the typical size used by clients for all IO.
>
> When one of the server threads is ready to process a read/write request it will get or put the data from/to the buffers that the client has already prepared. The number of currently active IO requests is exactly the number of active service threads (up to 512 by default).

Thank you for your help. Is my understanding correct?

A single RPC request will initiate an RDMA transfer of at most "max_pages_per_rpc" pages, where the page unit is the Lustre page size of 65536 bytes. Each RDMA transfer is executed in 1MB chunks. On a given client, if there are more than "max_pages_per_rpc" pages of data available to transfer, multiple RPCs are issued and multiple RDMAs are initiated.

Would it be correct to say: the purpose of the "max_pages_per_rpc" parameter is to enable the servers to even out the individual progress of concurrent clients with a lot of data to move, and to more fairly apportion the available bandwidth amongst concurrently writing clients?
On 2010-08-22, at 11:58, burlen wrote:
> Andreas Dilger wrote:
>> Currently, 1MB is the largest bulk IO size, and is the typical size used by clients for all IO.
>
> Is my understanding correct?
>
> A single RPC request will initiate an RDMA transfer of at most "max_pages_per_rpc" pages, where the page unit is the Lustre page size of 65536 bytes. Each RDMA transfer is executed in 1MB chunks. On a given client, if there are more than "max_pages_per_rpc" pages of data available to transfer, multiple RPCs are issued and multiple RDMAs are initiated.

No, max_pages_per_rpc is scaled down proportionately for systems with a large PAGE_SIZE. This is because the node doesn't know what the PAGE_SIZE of the peer is.

There is a patch in bugzilla that does what you propose - submit larger IO request RPCs, and do multiple 1MB RDMA transfers per request. However, this showed a performance _loss_ in some cases (in particular shared-file IO), and the reason for this regression was never diagnosed.

> Would it be correct to say: the purpose of the "max_pages_per_rpc" parameter is to enable the servers to even out the individual progress of concurrent clients with a lot of data to move, and to more fairly apportion the available bandwidth amongst concurrently writing clients?

Yes, partly. The more important factor is max_rpcs_in_flight, which limits the number of requests that a client can submit to each server at one time.

There was a research paper on a dynamic max_rpcs_in_flight that showed performance improvements when few clients are active, and we'd like to include that code into Lustre when it is ready.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
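[For illustration, a small sketch showing that the per-RPC bulk size works out to roughly 1MB regardless of client PAGE_SIZE, assuming 1.8-style defaults (256 pages on a 4KB-page client, 16 pages on a 64KB-page client); the /proc path and example numbers are assumptions.]

  # Sketch: bulk RPC size = client PAGE_SIZE * max_pages_per_rpc, which the
  # defaults keep at about 1MB on both small- and large-page systems.
  page_size=$(getconf PAGESIZE)
  pages=$(cat /proc/fs/lustre/osc/*/max_pages_per_rpc | head -1)
  echo "PAGE_SIZE=${page_size}  max_pages_per_rpc=${pages}"
  echo "bulk RPC size = $((page_size * pages)) bytes"
  # e.g. 4096 * 256 = 1048576, or 65536 * 16 = 1048576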
> > A single RPC request will initiate an RDMA transfer of at most
> > "max_pages_per_rpc" pages, where the page unit is the Lustre page size of
> > 65536 bytes. Each RDMA transfer is executed in 1MB chunks. On a given
> > client, if there are more than "max_pages_per_rpc" pages of data available
> > to transfer, multiple RPCs are issued and multiple RDMAs are initiated.
>
> No, max_pages_per_rpc is scaled down proportionately for systems with a
> large PAGE_SIZE. This is because the node doesn't know what the PAGE_SIZE
> of the peer is.
>
> There is a patch in bugzilla that does what you propose - submit larger IO
> request RPCs, and do multiple 1MB RDMA transfers per request. However, this
> showed a performance _loss_ in some cases (in particular shared-file IO),
> and the reason for this regression was never diagnosed.

The larger RPCs from bug 16900 offered a significant performance improvement when working over the WAN. Our use case involves a few clients who need fast access rather than 100s or 1000s. The included PDF shows iozone performance over the WAN in 10 ms RTT increments up to 200 ms for a single Lustre client and a small Lustre setup (1 MDS, 2 OSS, 6 OSTs). This test was with an SDR InfiniBand WAN connection using Obsidian Longbows to simulate delay. I'm not 100% sure the value used is correct for concurrent_sends.

So even though this isn't geared towards most Lustre users, I think the larger RPCs are pretty useful. Plenty of people at LUG2010 mentioned using Lustre over the WAN in some way.

> > Would it be correct to say: the purpose of the "max_pages_per_rpc"
> > parameter is to enable the servers to even out the individual progress of
> > concurrent clients with a lot of data to move, and to more fairly
> > apportion the available bandwidth amongst concurrently writing clients?
>
> Yes, partly. The more important factor is max_rpcs_in_flight, which limits
> the number of requests that a client can submit to each server at one time.
>
> There was a research paper on a dynamic max_rpcs_in_flight that showed
> performance improvements when few clients are active, and we'd like to
> include that code into Lustre when it is ready.

Was there a patch available of this?

> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.

[Attachment: lustre_perf_with_large_rpcs.pdf - http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100823/19c3304b/attachment-0001.pdf]
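[For reference, a sketch of the kind of single-client sequential iozone run described above. The record size, file size and mount point are illustrative assumptions, not the exact parameters used in the test.]

  # Sketch: sequential write (-i 0) and read (-i 1) from a single client,
  # 1MB records over a 4GB file, including fsync in the timing (-e).
  # Path and sizes are illustrative assumptions.
  iozone -i 0 -i 1 -r 1m -s 4g -e -f /mnt/lustre/iozone.tmp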
Hi, thanks for all the help.

Andreas Dilger wrote:
> When one of the server threads is ready to process a read/write request it will get or put the data from/to the buffers that the client has already prepared. The number of currently active IO requests is exactly the number of active service threads (up to 512 by default).

To be sure I understand this, is it correct that each OST has its own pool of service threads? So the system-wide number of service threads is bounded by oss_max_threads*num_osts?

Thanks again,
Burlen
On 2010-09-24, at 18:20, burlen wrote:
> Andreas Dilger wrote:
>> When one of the server threads is ready to process a read/write request it will get or put the data from/to the buffers that the client has already prepared. The number of currently active IO requests is exactly the number of active service threads (up to 512 by default).
>
> To be sure I understand this, is it correct that each OST has its own pool of service threads? So the system-wide number of service threads is bounded by oss_max_threads*num_osts?

Actually, the current oss_max_threads tunable is for the whole OSS (as the name implies). It would be quite useful, in fact, to have a tunable like oss_ost_threads (or similar) that does what you suggest.

If you have some time and C coding skills, implementing this would be fairly easy (see lustre/ost/ost_handler.c and the oss_num_threads tunable handling therein, along with .../lproc_ost.c).

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
On 2010-09-24, at 19:10, Andreas Dilger wrote:
> On 2010-09-24, at 18:20, burlen wrote:
>> To be sure I understand this, is it correct that each OST has its own pool of service threads? So the system-wide number of service threads is bounded by oss_max_threads*num_osts?
>
> Actually, the current oss_max_threads tunable is for the whole OSS (as the name implies). It would be quite useful, in fact, to have a tunable like oss_ost_threads (or similar) that does what you suggest.
>
> If you have some time and C coding skills, implementing this would be fairly easy (see lustre/ost/ost_handler.c and the oss_num_threads tunable handling therein, along with .../lproc_ost.c).

In fact, I just noticed that there is such a tunable in a new patch already in bug 22516. Sorry for the confusion.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
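[For illustration, a minimal sketch of how the OSS-wide thread pool can be inspected and sized. It assumes the 1.8 /proc layout (the threads_max file is an assumption alongside the threads_started file quoted later in this thread) and uses the ost module's oss_num_threads option mentioned above; the value shown is illustrative.]

  # Sketch, run on an OSS: inspect the ost_io service thread pool, which is
  # per OSS, not per OST.
  cat /proc/fs/lustre/ost/OSS/ost_io/threads_started
  cat /proc/fs/lustre/ost/OSS/ost_io/threads_max

  # The pool size can also be fixed at module load time, e.g. in
  # /etc/modprobe.conf (value illustrative):
  #   options ost oss_num_threads=512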
On 09/24/2010 06:36 PM, Andreas Dilger wrote:
> On 2010-09-24, at 19:10, Andreas Dilger wrote:
>> On 2010-09-24, at 18:20, burlen wrote:
>>> To be sure I understand this, is it correct that each OST has its own pool of service threads? So the system-wide number of service threads is bounded by oss_max_threads*num_osts?
>> Actually, the current oss_max_threads tunable is for the whole OSS (as the name implies).

Again, many thanks for your help.

With respect to an upper bound on the number of RPCs and RDMAs in flight system-wide, does the situation change much on the Cray XT5 with Lustre 1.8 and OSSs directly connected to the 3D torus? I am asking after having seen the XT3 section in the manual; I'm not sure whether it applies to the XT5 and, if it does, how this might influence the above tunables.
Hello,

The Lustre Operations Manual only covers configuration and tuning for XT3 running Catamount. However, the tunables you are concerned about relate more to what kind and how much storage you have than to XT-specific tunings.

Thanks,
-Cory

On 09/27/2010 05:59 PM, burlen wrote:
> With respect to an upper bound on the number of RPCs and RDMAs in flight
> system-wide, does the situation change much on the Cray XT5 with Lustre
> 1.8 and OSSs directly connected to the 3D torus? I am asking after
> having seen the XT3 section in the manual; I'm not sure whether it applies
> to the XT5 and, if it does, how this might influence the above tunables.
Hello!

I guess I am a little bit late to the party, but I was just reading comments in bug 16900 and have this question I really need to ask.

On Aug 23, 2010, at 10:58 PM, Jeremy Filizetti wrote:
> The larger RPCs from bug 16900 offered a significant performance improvement when working over the WAN. Our use case involves a few clients who need fast access rather than 100s or 1000s. The included PDF shows iozone performance over the WAN in 10 ms RTT increments up to 200 ms for a single Lustre client and a small Lustre setup (1 MDS, 2 OSS, 6 OSTs). This test was with an SDR InfiniBand WAN connection using Obsidian Longbows to simulate delay. I'm not 100% sure the value used is correct for concurrent_sends.
>
> So even though this isn't geared towards most Lustre users, I think the larger RPCs are pretty useful. Plenty of people at LUG2010 mentioned using Lustre over the WAN in some way.

So are you sure you got your benefit from the larger RPC size as opposed to just having 4x more data on the wire? There is another way to increase the amount of data on the wire without large RPCs: you can increase the number of RPCs in flight to OSTs from the current default of 8 to, say, 32 (/proc/fs/lustre/osc/*/max_rpcs_in_flight).

I really wonder how the results would compare to the 4M RPC results if you still have the capability to test it.

Thanks.

Bye,
Oleg
In the attachment I created, which Andreas posted at https://bugzilla.lustre.org/attachment.cgi?id=31423, graphs 1 and 2 are both using a larger-than-default max_rpcs_in_flight. I believe the data without the patch from bug 16900 used max_rpcs_in_flight=42, and the data with the patch from 16900 used max_rpcs_in_flight=32. So the short answer is we are already increasing max_rpcs_in_flight for all of that data (which is needed for good performance at higher latencies).

My understanding of the real benefit of the larger RPC patch is that we are not having to pay 12 round-trip times to read 4 MB (four 1 MB bulk RPCs at 3 RTTs each); instead I think we have 3. Although I've never traced through to see that this is actually what is happening, from what I read about the patch it sends 4 memory descriptors with a single bulk request.

What isn't quite clear to me is why Lustre takes 3 RTTs for a read and 2 for a write. I think I understand the write having to communicate once with the server, because preallocating buffers for all clients would possibly be a waste of resources. But for reading it seems logical (from the RDMA standpoint) that the memory buffer could be pre-registered and sent to the server, and the server would respond back with the contents for that buffer, which would be 1 RTT.

I don't have everything set up right now in our test environment, but with a little effort I could set up a similar test if you're wondering about something specific.

Jeremy

> So are you sure you got your benefit from the larger RPC size as opposed to
> just having 4x more data on the wire? There is another way to increase the
> amount of data on the wire without large RPCs: you can increase the number
> of RPCs in flight to OSTs from the current default of 8 to, say, 32
> (/proc/fs/lustre/osc/*/max_rpcs_in_flight).
>
> I really wonder how the results would compare to the 4M RPC results if you
> still have the capability to test it.
>
> Thanks.
>
> Bye,
> Oleg
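[To put rough numbers on why the amount of data in flight dominates at WAN latencies, a back-of-the-envelope sketch; the RTT and tunable values are illustrative assumptions, and server processing time is ignored.]

  # Sketch: per-OST streaming throughput is bounded by (data in flight)/RTT.
  rtt_ms=100
  rpc_mb=1            # 1MB bulk RPCs (4 with the bug 16900 patch)
  rpcs_in_flight=32
  echo "scale=1; $rpc_mb * $rpcs_in_flight / ($rtt_ms / 1000)" | bc
  # => ~320 MB/s with 32x1MB in flight at 100ms RTT; larger RPCs or more
  #    RPCs in flight raise the bound proportionally.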
Hello!

On Dec 22, 2010, at 12:43 AM, Jeremy Filizetti wrote:
> In the attachment I created, which Andreas posted at https://bugzilla.lustre.org/attachment.cgi?id=31423, graphs 1 and 2 are both using a larger-than-default max_rpcs_in_flight. I believe the data without the patch from bug 16900 used max_rpcs_in_flight=42, and the data with the patch from 16900 used max_rpcs_in_flight=32. So the short answer is we are already increasing max_rpcs_in_flight for all of that data (which is needed for good performance at higher latencies).

Ah! This should have been noted somewhere. Well, it's still unfair then! ;)

You see, each OSC can cache up to 32MB of dirty data by default (the max_dirty_mb osc setting in /proc). So when you have 4M RPCs, you actually use only 8 RPCs to transfer your entire allotment of dirty pages, whereas you use 32 for 1M RPCs (so setting max_rpcs_in_flight any higher has no effect unless you also bump max_dirty_mb). Of course this only affects write RPCs, not reads.

> My understanding of the real benefit of the larger RPC patch is that we are not having to pay 12 round-trip times to read 4 MB (four 1 MB bulk RPCs at 3 RTTs each); instead I think we have 3. Although I've never traced through to see that this is actually what is happening, from what I read about the patch it sends 4 memory descriptors with a single bulk request.

Well, I don't think this should matter anyhow. Since we send the RPCs asynchronously, in parallel, the latency of the bulk descriptor get does not add up. Given that, the results you've got should have been much closer together, too. I wonder what other factors played a role here?

I see you only had a single client, so it's not like you were able to overwhelm the number of OSS threads running. Even in the case of 6 OSTs per OSS, assuming all 42 RPCs were in flight, that's still only 252 RPCs. Did you make sure that that's the number of threads you had running, by any chance?

How does your RTT delay get introduced for the test? Could it be that if there are more messages on the wire at the same time, they are delayed more (aside from the obvious bandwidth-induced delay - e.g. bottlenecking a single message at a time with a mandatory delay, or something like this)?

> What isn't quite clear to me is why Lustre takes 3 RTTs for a read and 2 for a write. I think I understand the write having to communicate once with the server, because preallocating buffers for all clients would possibly be a waste of resources. But for reading it seems logical (from the RDMA standpoint) that the memory buffer could be pre-registered and sent to the server, and the server would respond back with the contents for that buffer, which would be 1 RTT.

Probably the difference is the one of GET vs PUT semantics in LNET; there are going to be at least 2 RTTs in any case. One RTT is the "header" RPC that tells the OST "hey, I am doing this operation here that involves bulk IO, it has this many pages and the descriptor is so and so", then the server does another RTT to actually fetch/push the data (and that might actually be worse than one RTT for one of the GET/PUT cases, I guess).

> I don't have everything set up right now in our test environment, but with a little effort I could set up a similar test if you're wondering about something specific.

Would be interesting to confirm the number of RPCs actually being processed on the server at any one time, I think.

Did you try direct IO too? Some older versions of Lustre used to send all outstanding directio RPCs in parallel, so if you did your IO as just a single direct IO write, the latency of that write should be around a couple of RTTs. I think we still do this even in 1.8.5, so it would make an interesting comparison.

Bye,
Oleg
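[A minimal sketch of such a direct IO comparison, assuming GNU dd and a hypothetical /mnt/lustre mount point; sizes are illustrative.]

  # Sketch: single large O_DIRECT write and read, to compare latency against
  # buffered IO. oflag=direct/iflag=direct are standard GNU dd options.
  dd if=/dev/zero of=/mnt/lustre/dio.tmp bs=32M count=1 oflag=direct
  dd if=/mnt/lustre/dio.tmp of=/dev/null bs=32M count=1 iflag=direct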
Oleg Drokin wrote:
>> I don't have everything set up right now in our test environment, but with a little effort I could set up a similar test if you're wondering about something specific.
>
> Would be interesting to confirm the number of RPCs actually being processed on the server at any one time, I think.
>
> Did you try direct IO too? Some older versions of Lustre used to send all outstanding directio RPCs in parallel, so if you did your IO as just a single direct IO write, the latency of that write should be around a couple of RTTs. I think we still do this even in 1.8.5, so it would make an interesting comparison.

You probably want to look at the brw_stats for a single client NID to do that. Unless you are running 1.8.5, check out the patch in bug 23827 (and maybe 23826).

Kevin
Hello!

On Dec 22, 2010, at 1:43 AM, Kevin Van Maren wrote:
>> Would be interesting to confirm the number of RPCs actually being processed on the server at any one time, I think.
>> Did you try direct IO too? Some older versions of Lustre used to send all outstanding directio RPCs in parallel, so if you did your IO as just a single direct IO write, the latency of that write should be around a couple of RTTs. I think we still do this even in 1.8.5, so it would make an interesting comparison.
>
> You probably want to look at the brw_stats for a single client NID to do that. Unless you are running 1.8.5, check out the patch in bug 23827 (and maybe 23826).

Actually, there was only a single client in the original test, so we can check the aggregated brw_stats (we just need to zero them out before the test run).

Bye,
Oleg
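[A sketch of how that check might look on the OSS, assuming the 1.8 obdfilter /proc layout and assuming the usual convention that writing to a Lustre stats file clears it.]

  # Sketch, run on each OSS: reset, run the test, then inspect the bulk IO
  # histograms ("pages per bulk r/w" shows the actual RPC sizes seen).
  for f in /proc/fs/lustre/obdfilter/*/brw_stats; do echo 0 > "$f"; done
  # ... run the IO test from the client ...
  cat /proc/fs/lustre/obdfilter/*/brw_stats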
On Wed, Dec 22, 2010 at 1:32 AM, Oleg Drokin <green at whamcloud.com> wrote:
> Ah! This should have been noted somewhere.

It was in some brief associated with the data, but unfortunately I can't seem to find that.

> Well, it's still unfair then! ;)

Yeah, it wasn't quite balanced; I was actually trying to test our current Lustre setup against what could be done with some extra parameters and the patch from 16900. I'm pretty sure max_dirty_mb was set to 128 for both tests. One thing that isn't clear to me is why the Lustre manual recommends max_dirty_mb = max_rpcs_in_flight x 4. It seems to me that having them equal, or max_dirty_mb slightly larger to handle any off-by-one errors, should be sufficient?

> You see, each OSC can cache up to 32MB of dirty data by default
> (the max_dirty_mb osc setting in /proc). So when you have 4M RPCs, you
> actually use only 8 RPCs to transfer your entire allotment of dirty pages,
> whereas you use 32 for 1M RPCs (so setting max_rpcs_in_flight any higher
> has no effect unless you also bump max_dirty_mb). Of course this only
> affects write RPCs, not reads.
>
> Well, I don't think this should matter anyhow. Since we send the RPCs
> asynchronously, in parallel, the latency of the bulk descriptor get does
> not add up.

It does make a difference before readahead has kicked in. There is a big difference in how things start off, even though for a sequential load they all peak at the same value. For instance, if I was just accessing a portion of a file, or seeking around and reading a few MBs instead of reading it all sequentially, this has a big impact, IIRC.

> Given that, the results you've got should have been much closer together,
> too. I wonder what other factors played a role here?

I was pretty interested in that as well. It seems like the larger RPC is hiding some other issue beyond just the theoretical amount of data on the wire.

> I see you only had a single client, so it's not like you were able to
> overwhelm the number of OSS threads running. Even in the case of 6 OSTs per
> OSS, assuming all 42 RPCs were in flight, that's still only 252 RPCs. Did
> you make sure that that's the number of threads you had running, by any
> chance?

I did not check, but these were decent systems (24 GB RAM, 2 quad-core Nehalems with HT), so I assume they had sufficient threads. And there were only 6 OSTs total, so 126 RPCs per OSS.

> How does your RTT delay get introduced for the test? Could it be that if
> there are more messages on the wire at the same time, they are delayed
> more (aside from the obvious bandwidth-induced delay - e.g. bottlenecking a
> single message at a time with a mandatory delay, or something like this)?

The RTT delay is handled by the Obsidian Longbow and was set symmetrically to half the RTT on each end (I think they delay receive traffic only, not transmit). The data at 110+ ms seems to have a larger decay than the rest, so I'm not sure if something was happening before that, but the rest of the data seems consistent with what I've seen using real distance rather than simulated delay, although I only have a few points to compare it to and not the whole range of 0-200 ms.

> Probably the difference is the one of GET vs PUT semantics in LNET; there
> are going to be at least 2 RTTs in any case. One RTT is the "header" RPC
> that tells the OST "hey, I am doing this operation here that involves bulk
> IO, it has this many pages and the descriptor is so and so", then the
> server does another RTT to actually fetch/push the data (and that might
> actually be worse than one RTT for one of the GET/PUT cases, I guess).

I need to look at this more, but it seems to me that the read case should still be capable of completing in 1 RTT, because the server can send the response as soon as it gets the request, since all the MD info should be included with the request?

> Would be interesting to confirm the number of RPCs actually being processed
> on the server at any one time, I think.
> Did you try direct IO too? Some older versions of Lustre used to send all
> outstanding directio RPCs in parallel, so if you did your IO as just a
> single direct IO write, the latency of that write should be around a couple
> of RTTs. I think we still do this even in 1.8.5, so it would make an
> interesting comparison.

I didn't try any direct IO, but I certainly could.

> Bye,
> Oleg
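[Regarding the max_dirty_mb versus max_rpcs_in_flight question above, a minimal sketch of the comparison, assuming the 1.8 /proc layout and a nominal 1MB RPC size; values are illustrative, and the closing comment is one reading of the manual's 4x recommendation rather than a definitive explanation.]

  # Sketch, run on a client: compare each OSC's dirty-data cap with the
  # amount of data its write RPCs can keep in flight.
  rpc_mb=1
  for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do
      osc=$(dirname $f)
      dirty=$(cat $osc/max_dirty_mb)
      flight=$(cat $f)
      echo "$(basename $osc): max_dirty_mb=${dirty}  in-flight write data=$((flight * rpc_mb))MB"
  done
  # If max_dirty_mb is smaller than max_rpcs_in_flight * RPC size, writes
  # cannot keep all the allowed RPC slots busy; the extra headroom lets new
  # dirty pages accumulate while earlier RPCs are still on the wire.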
Hello!

On Dec 22, 2010, at 8:25 AM, Jeremy Filizetti wrote:
>> Well, I don't think this should matter anyhow. Since we send the RPCs asynchronously, in parallel, the latency of the bulk descriptor get does not add up.
> It does make a difference before readahead has kicked in. There is a big difference in how things start off, even though for a sequential load they all peak at the same value. For instance, if I was just accessing a portion of a file, or seeking around and reading a few MBs instead of reading it all sequentially, this has a big impact, IIRC.

Right, I can see where large RPCs would help in the case of reads while readahead picks up the pace. But the same thing should not happen with writes, since the client is the sole originator of the writes.

>> How does your RTT delay get introduced for the test? Could it be that if there are more messages on the wire at the same time, they are delayed more (aside from the obvious bandwidth-induced delay - e.g. bottlenecking a single message at a time with a mandatory delay, or something like this)?
> The RTT delay is handled by the Obsidian Longbow and was set symmetrically to half the RTT on each end (I think they delay receive traffic only, not transmit). The data at 110+ ms seems to have a larger decay than the rest, so I'm not sure if something was happening before that, but the rest of the data seems consistent with what I've seen using real distance rather than simulated delay, although I only have a few points to compare it to and not the whole range of 0-200 ms.

I think it would also be interesting to run a similar test over a genuinely high-latency link.

>> Probably the difference is the one of GET vs PUT semantics in LNET; there are going to be at least 2 RTTs in any case. One RTT is the "header" RPC that tells the OST "hey, I am doing this operation here that involves bulk IO, it has this many pages and the descriptor is so and so", then the server does another RTT to actually fetch/push the data (and that might actually be worse than one RTT for one of the GET/PUT cases, I guess).
> I need to look at this more, but it seems to me that the read case should still be capable of completing in 1 RTT, because the server can send the response as soon as it gets the request, since all the MD info should be included with the request?

Well, while theoretically that might be the case, with Lustre as it is right now a bulk RPC is a two-phase process: one RTT transmits the "metadata" of sorts that describes the IO in one direction and returns the IO status back, and the other RTT actually transfers the data over the wire in one of the directions.

>> Would be interesting to confirm the number of RPCs actually being processed on the server at any one time, I think.
>> Did you try direct IO too? Some older versions of Lustre used to send all outstanding directio RPCs in parallel, so if you did your IO as just a single direct IO write, the latency of that write should be around a couple of RTTs. I think we still do this even in 1.8.5, so it would make an interesting comparison.
> I didn't try any direct IO, but I certainly could.

Thanks. Please keep us informed if you see anything else interesting.

Bye,
Oleg
On Dec 22, 2010, at 05:51, Oleg Drokin wrote:
> Hello!

Hi all,

> I guess I am a little bit late to the party, but I was just reading comments in bug 16900 and have this question I really need to ask.
>
> On Aug 23, 2010, at 10:58 PM, Jeremy Filizetti wrote:
>> The larger RPCs from bug 16900 offered a significant performance improvement when working over the WAN. Our use case involves a few clients who need fast access rather than 100s or 1000s. The included PDF shows iozone performance over the WAN in 10 ms RTT increments up to 200 ms for a single Lustre client and a small Lustre setup (1 MDS, 2 OSS, 6 OSTs). This test was with an SDR InfiniBand WAN connection using Obsidian Longbows to simulate delay. I'm not 100% sure the value used is correct for concurrent_sends.
>>
>> So even though this isn't geared towards most Lustre users, I think the larger RPCs are pretty useful. Plenty of people at LUG2010 mentioned using Lustre over the WAN in some way.
>
> So are you sure you got your benefit from the larger RPC size as opposed to just having 4x more data on the wire? There is another way to increase the amount of data on the wire without large RPCs: you can increase the number of RPCs in flight to OSTs from the current default of 8 to, say, 32 (/proc/fs/lustre/osc/*/max_rpcs_in_flight).
>
> I really wonder how the results would compare to the 4M RPC results if you still have the capability to test it.

I agree with Oleg that this is the better approach, also from another point of view. While Lustre tries to form full 1M or 4M (whatever) IO RPCs, this is not always possible. One such case is IO to many small files: there is just no way to pack pages that belong to multiple files into one IO RPC. This causes lots of small IO that will definitely under-load the network.

While tuning max_rpcs_in_flight you may want to check whether the network has become the bottleneck. This can be done by watching "threads_started" for the IO service on the server, which is the number of threads currently in use for handling RPCs for that service. If it stops growing as you increase max_rpcs_in_flight, your network is becoming the bottleneck. Example:

cat /proc/fs/lustre/ost/OSS/ost_io/threads_started

Thanks.

--
umka
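[A trivial way to watch that counter while ramping max_rpcs_in_flight on the client; the path is the one quoted above, and the one-second interval is arbitrary.]

  # Sketch, run on the OSS: sample the number of started ost_io threads once
  # a second while the client-side tuning change is being exercised.
  while true; do
      date +%T
      cat /proc/fs/lustre/ost/OSS/ost_io/threads_started
      sleep 1
  done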
> I agree with Oleg that this is the better approach, also from another point
> of view. While Lustre tries to form full 1M or 4M (whatever) IO RPCs, this
> is not always possible. One such case is IO to many small files: there is
> just no way to pack pages that belong to multiple files into one IO RPC.
> This causes lots of small IO that will definitely under-load the network.

I'm really targeting sequential single-client access, where these RPCs will be filled.

> While tuning max_rpcs_in_flight you may want to check whether the network
> has become the bottleneck. This can be done by watching "threads_started"
> for the IO service on the server, which is the number of threads currently
> in use for handling RPCs for that service. If it stops growing as you
> increase max_rpcs_in_flight, your network is becoming the bottleneck.

It is just a single client connecting to 2 OSSs. I did check this, though; the test I just ran had 128 threads on each OSS.

The latest data incorporates the patch from bug 16900, but with max_pages_per_rpc modified to make either a 1 MB or a 4 MB RPC. I didn't see a huge difference this time around, and the test was slightly more balanced with respect to the parameters used. I have some tests running now with no patch at all instead of just a limited max_pages_per_rpc, but AFAIK those should be equivalent.

You can find the two new attachments at:
https://bugzilla.lustre.org/attachment.cgi?id=32618
and
https://bugzilla.lustre.org/attachment.cgi?id=32619

Jeremy