Hi, thanks for the previous help. I have some questions about Lustre RPC and the sequence of events that occur during a large concurrent write() involving many processes and a large data size per process. I understand there is a mechanism of flow control by credits, but I'm a little unclear on how it works in general after reading the "networking & io protocol" white paper.

Is it true that a write() RPC transfers data in chunks of at least 1MB and at most (max_pages_per_rpc*page_size) bytes, where page_size=2^16? Can I use these bounds to estimate the number of RPCs issued per MB of data to write?

About how many concurrent incoming write() RPCs per OSS service thread can a single server handle before it stops responding to incoming RPCs?

What happens to an RPC when the server is too busy to handle it? Is it even issued by the client? Does the client have to poll and/or resend the RPC? Does the process of polling for flow control credits add significant network/server congestion?

Is it likely that a large number of RPCs/flow control credit requests will induce enough network congestion that clients' RPCs time out? How does the client handle such a timeout?

Burlen
On 2010-08-17, at 14:15, burlen wrote:
> I have some questions about Lustre RPC and the sequence of events that
> occur during a large concurrent write() involving many processes and a
> large data size per process. I understand there is a mechanism of flow
> control by credits, but I'm a little unclear on how it works in general
> after reading the "networking & io protocol" white paper.

There are different levels of flow control. There is one at the LNET level, which keeps low-level messages from overwhelming the server and avoids stalling small/reply messages at the back of a deep queue of requests.

> Is it true that a write() RPC transfers data in chunks of at least 1MB
> and at most (max_pages_per_rpc*page_size) bytes, where page_size=2^16?
> Can I use these bounds to estimate the number of RPCs issued per MB of
> data to write?

Currently, 1MB is the largest bulk IO size, and is the typical size used by clients for all IO.

> About how many concurrent incoming write() RPCs per OSS service thread
> can a single server handle before it stops responding to incoming RPCs?

The server can handle tens of thousands of write _requests_, but note that since Lustre has always been designed as an RDMA-capable protocol, the request is relatively small (a few hundred bytes) and does not contain any of the data.

When one of the server threads is ready to process a read/write request it will get or put the data from/to the buffers that the client has already prepared. The number of currently active IO requests is exactly the number of active service threads (up to 512 by default).

> What happens to an RPC when the server is too busy to handle it? Is it
> even issued by the client? Does the client have to poll and/or resend
> the RPC? Does the process of polling for flow control credits add
> significant network/server congestion?

The clients limit the number of concurrent RPC requests, by default to 8 per OST. The LNET-level message credits will also limit the number of in-flight messages in case there is e.g. an LNET router between the client and server.

The client will almost never time out a request, as it is informed how long requests are currently taking to process and will wait patiently for its earlier requests to finish processing. If the client is going to time out a request (based on an earlier request timeout that is about to be exceeded), the server will inform it to continue waiting and give it a new processing time estimate (unless of course the server is non-functional or so overwhelmed that it can't even do that).

> Is it likely that a large number of RPCs/flow control credit requests
> will induce enough network congestion that clients' RPCs time out?
> How does the client handle such a timeout?

Since the flow control credits are bounded, and are returned to the peer as earlier requests complete, there is no additional traffic due to this. However, considering that HPC clusters are distributed denial-of-service engines, it is always possible to overwhelm the server under some conditions. In case of a client RPC timeout (hundreds of seconds under load) the client will resend the request and/or try to contact the backup server until one responds.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
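[For illustration, a rough sketch of the RPC-count estimate implied above, taking the 1MB bulk size at face value. The data size is an arbitrary example and the lctl path assumes a 1.8-style client; this is a back-of-the-envelope aid, not an exact accounting.]

  # Sketch only: rough bulk-RPC count for a streaming write, assuming ~1MB
  # per bulk RPC as described above. Numbers are illustrative.
  data_mb=1024                       # e.g. one client writing 1GB
  bulk_mb=1                          # typical bulk IO size per write RPC
  in_flight=$(lctl get_param -n osc.*.max_rpcs_in_flight | head -1)
  echo "approx bulk write RPCs: $((data_mb / bulk_mb))"
  echo "at most ${in_flight} of them in flight per OST at any time"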
Andreas Dilger wrote:
> Currently, 1MB is the largest bulk IO size, and is the typical size used by clients for all IO.
>
> When one of the server threads is ready to process a read/write request it will get or put the data from/to the buffers that the client has already prepared. The number of currently active IO requests is exactly the number of active service threads (up to 512 by default).

Thank you for your help. Is my understanding correct?

A single RPC request will initiate an RDMA transfer of at most "max_pages_per_rpc" pages, where the page unit is the Lustre page size of 65536 bytes. Each RDMA transfer is executed in 1MB chunks. On a given client, if there are more than "max_pages_per_rpc" pages of data available to transfer, multiple RPCs are issued and multiple RDMAs are initiated.

Would it be correct to say: the purpose of the "max_pages_per_rpc" parameter is to enable the servers to even out the individual progress of concurrent clients with a lot of data to move, and to more fairly apportion the available bandwidth amongst concurrently writing clients?
On 2010-08-22, at 11:58, burlen wrote:
> Andreas Dilger wrote:
>> Currently, 1MB is the largest bulk IO size, and is the typical size used by clients for all IO.
>
> Is my understanding correct?
>
> A single RPC request will initiate an RDMA transfer of at most "max_pages_per_rpc" pages, where the page unit is the Lustre page size of 65536 bytes. Each RDMA transfer is executed in 1MB chunks. On a given client, if there are more than "max_pages_per_rpc" pages of data available to transfer, multiple RPCs are issued and multiple RDMAs are initiated.

No, max_pages_per_rpc is scaled down proportionately for systems with a large PAGE_SIZE. This is because the node doesn't know what the PAGE_SIZE of the peer is.

There is a patch in bugzilla that does what you propose - submit larger IO request RPCs, and do multiple 1MB RDMA transfers per request. However, this showed a performance _loss_ in some cases (in particular shared-file IO), and the reason for this regression was never diagnosed.

> Would it be correct to say: the purpose of the "max_pages_per_rpc" parameter is to enable the servers to even out the individual progress of concurrent clients with a lot of data to move, and to more fairly apportion the available bandwidth amongst concurrently writing clients?

Yes, partly. The more important factor is max_rpcs_in_flight, which limits the number of requests that a client can submit to each server at one time.

There was a research paper on a dynamic max_rpcs_in_flight that showed performance improvements when few clients are active, and we'd like to include that code into Lustre when it is ready.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
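[For illustration, a small sketch showing that the per-RPC bulk size works out to roughly 1MB regardless of client PAGE_SIZE, assuming 1.8-style defaults (256 pages on a 4KB-page client, 16 pages on a 64KB-page client); the /proc path and example numbers are assumptions.]

  # Sketch: bulk RPC size = client PAGE_SIZE * max_pages_per_rpc, which the
  # defaults keep at about 1MB on both small- and large-page systems.
  page_size=$(getconf PAGESIZE)
  pages=$(cat /proc/fs/lustre/osc/*/max_pages_per_rpc | head -1)
  echo "PAGE_SIZE=${page_size}  max_pages_per_rpc=${pages}"
  echo "bulk RPC size = $((page_size * pages)) bytes"
  # e.g. 4096 * 256 = 1048576, or 65536 * 16 = 1048576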
> > A single RPC request will initiate an RDMA transfer of at most
> > "max_pages_per_rpc" pages, where the page unit is the Lustre page size of
> > 65536 bytes. Each RDMA transfer is executed in 1MB chunks. On a given
> > client, if there are more than "max_pages_per_rpc" pages of data available
> > to transfer, multiple RPCs are issued and multiple RDMAs are initiated.
>
> No, max_pages_per_rpc is scaled down proportionately for systems with a
> large PAGE_SIZE. This is because the node doesn't know what the PAGE_SIZE
> of the peer is.
>
> There is a patch in bugzilla that does what you propose - submit larger IO
> request RPCs, and do multiple 1MB RDMA transfers per request. However, this
> showed a performance _loss_ in some cases (in particular shared-file IO),
> and the reason for this regression was never diagnosed.

The larger RPCs from bug 16900 offered a significant performance improvement when working over the WAN. Our use case involves a few clients who need fast access rather than 100s or 1000s. The included PDF shows iozone performance over the WAN in 10 ms RTT increments up to 200 ms for a single Lustre client and a small Lustre setup (1 MDS, 2 OSS, 6 OSTs). This test was with an SDR InfiniBand WAN connection using Obsidian Longbows to simulate delay. I'm not 100% sure the value used is correct for concurrent_sends.

So even though this isn't geared towards most Lustre users, I think the larger RPCs are pretty useful. Plenty of people at LUG2010 mentioned using Lustre over the WAN in some way.

> > Would it be correct to say: the purpose of the "max_pages_per_rpc"
> > parameter is to enable the servers to even out the individual progress of
> > concurrent clients with a lot of data to move, and to more fairly
> > apportion the available bandwidth amongst concurrently writing clients?
>
> Yes, partly. The more important factor is max_rpcs_in_flight, which limits
> the number of requests that a client can submit to each server at one time.
>
> There was a research paper on a dynamic max_rpcs_in_flight that showed
> performance improvements when few clients are active, and we'd like to
> include that code into Lustre when it is ready.

Was there a patch available of this?

> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.

[Attachment: lustre_perf_with_large_rpcs.pdf - http://lists.lustre.org/pipermail/lustre-discuss/attachments/20100823/19c3304b/attachment-0001.pdf]
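[For reference, a sketch of the kind of single-client sequential iozone run described above. The record size, file size and mount point are illustrative assumptions, not the exact parameters used in the test.]

  # Sketch: sequential write (-i 0) and read (-i 1) from a single client,
  # 1MB records over a 4GB file, including fsync in the timing (-e).
  # Path and sizes are illustrative assumptions.
  iozone -i 0 -i 1 -r 1m -s 4g -e -f /mnt/lustre/iozone.tmp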
Hi, thanks for all the help.

Andreas Dilger wrote:
> When one of the server threads is ready to process a read/write request it will get or put the data from/to the buffers that the client has already prepared. The number of currently active IO requests is exactly the number of active service threads (up to 512 by default).

To be sure I understand this, is it correct that each OST has its own pool of service threads? So the system-wide number of service threads is bounded by oss_max_threads*num_osts?

Thanks again,
Burlen
On 2010-09-24, at 18:20, burlen wrote:
> Andreas Dilger wrote:
>> When one of the server threads is ready to process a read/write request it will get or put the data from/to the buffers that the client has already prepared. The number of currently active IO requests is exactly the number of active service threads (up to 512 by default).
>
> To be sure I understand this, is it correct that each OST has its own pool of service threads? So the system-wide number of service threads is bounded by oss_max_threads*num_osts?

Actually, the current oss_max_threads tunable is for the whole OSS (as the name implies). It would be quite useful, in fact, to have a tunable like oss_ost_threads (or similar) that does what you suggest.

If you have some time and C coding skills, implementing this would be fairly easy (see lustre/ost/ost_handler.c and the oss_num_threads tunable handling therein, along with .../lproc_ost.c).

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
On 2010-09-24, at 19:10, Andreas Dilger wrote:
> On 2010-09-24, at 18:20, burlen wrote:
>> To be sure I understand this, is it correct that each OST has its own pool of service threads? So the system-wide number of service threads is bounded by oss_max_threads*num_osts?
>
> Actually, the current oss_max_threads tunable is for the whole OSS (as the name implies). It would be quite useful, in fact, to have a tunable like oss_ost_threads (or similar) that does what you suggest.
>
> If you have some time and C coding skills, implementing this would be fairly easy (see lustre/ost/ost_handler.c and the oss_num_threads tunable handling therein, along with .../lproc_ost.c).

In fact, I just noticed that there is such a tunable in a new patch already in bug 22516. Sorry for the confusion.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
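[For illustration, a minimal sketch of how the OSS-wide thread pool can be inspected and sized. It assumes the 1.8 /proc layout (the threads_max file is an assumption alongside the threads_started file quoted later in this thread) and uses the ost module's oss_num_threads option mentioned above; the value shown is illustrative.]

  # Sketch, run on an OSS: inspect the ost_io service thread pool, which is
  # per OSS, not per OST.
  cat /proc/fs/lustre/ost/OSS/ost_io/threads_started
  cat /proc/fs/lustre/ost/OSS/ost_io/threads_max

  # The pool size can also be fixed at module load time, e.g. in
  # /etc/modprobe.conf (value illustrative):
  #   options ost oss_num_threads=512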
On 09/24/2010 06:36 PM, Andreas Dilger wrote:
> On 2010-09-24, at 19:10, Andreas Dilger wrote:
>> On 2010-09-24, at 18:20, burlen wrote:
>>> To be sure I understand this, is it correct that each OST has its own pool of service threads? So the system-wide number of service threads is bounded by oss_max_threads*num_osts?
>> Actually, the current oss_max_threads tunable is for the whole OSS (as the name implies).

Again, many thanks for your help.

With respect to an upper bound on the number of RPCs and RDMAs in flight system-wide, does the situation change much on the Cray XT5 with Lustre 1.8 and OSSs directly connected to the 3D torus? I am asking after having seen the XT3 section in the manual; I'm not sure whether it applies to the XT5 and, if it does, how this might influence the above tunables.
Hello,

The Lustre Operations Manual only covers configuration and tuning for XT3 running Catamount. However, the tunables you are concerned about relate more to what kind and how much storage you have than to XT-specific tunings.

Thanks,
-Cory

On 09/27/2010 05:59 PM, burlen wrote:
> With respect to an upper bound on the number of RPCs and RDMAs in flight
> system-wide, does the situation change much on the Cray XT5 with Lustre
> 1.8 and OSSs directly connected to the 3D torus? I am asking after
> having seen the XT3 section in the manual; I'm not sure whether it applies
> to the XT5 and, if it does, how this might influence the above tunables.
Hello!

I guess I am a little bit late to the party, but I was just reading comments in bug 16900 and have this question I really need to ask.

On Aug 23, 2010, at 10:58 PM, Jeremy Filizetti wrote:
> The larger RPCs from bug 16900 offered a significant performance improvement when working over the WAN. Our use case involves a few clients who need fast access rather than 100s or 1000s. The included PDF shows iozone performance over the WAN in 10 ms RTT increments up to 200 ms for a single Lustre client and a small Lustre setup (1 MDS, 2 OSS, 6 OSTs). This test was with an SDR InfiniBand WAN connection using Obsidian Longbows to simulate delay. I'm not 100% sure the value used is correct for concurrent_sends.
>
> So even though this isn't geared towards most Lustre users, I think the larger RPCs are pretty useful. Plenty of people at LUG2010 mentioned using Lustre over the WAN in some way.

So are you sure you got your benefit from the larger RPC size as opposed to just having 4x more data on the wire? There is another way to increase the amount of data on the wire without large RPCs: you can increase the number of RPCs in flight to OSTs from the current default of 8 to, say, 32 (/proc/fs/lustre/osc/*/max_rpcs_in_flight).

I really wonder how the results would compare to the 4M RPC results if you still have the capability to test it.

Thanks.

Bye,
Oleg
In the attachment I created, which Andreas posted at https://bugzilla.lustre.org/attachment.cgi?id=31423, graphs 1 and 2 are both using a larger-than-default max_rpcs_in_flight. I believe the data without the patch from bug 16900 used max_rpcs_in_flight=42, and the data with the patch from 16900 used max_rpcs_in_flight=32. So the short answer is we are already increasing max_rpcs_in_flight for all of that data (which is needed for good performance at higher latencies).

My understanding of the real benefit of the larger RPC patch is that we are not having to pay 12 round-trip times to read 4 MB (four 1 MB bulk RPCs at 3 RTTs each); instead I think we have 3. Although I've never traced through to see that this is actually what is happening, from what I read about the patch it sends 4 memory descriptors with a single bulk request.

What isn't quite clear to me is why Lustre takes 3 RTTs for a read and 2 for a write. I think I understand the write having to communicate once with the server, because preallocating buffers for all clients would possibly be a waste of resources. But for reading it seems logical (from the RDMA standpoint) that the memory buffer could be pre-registered and sent to the server, and the server would respond back with the contents for that buffer, which would be 1 RTT.

I don't have everything set up right now in our test environment, but with a little effort I could set up a similar test if you're wondering about something specific.

Jeremy

> So are you sure you got your benefit from the larger RPC size as opposed to
> just having 4x more data on the wire? There is another way to increase the
> amount of data on the wire without large RPCs: you can increase the number
> of RPCs in flight to OSTs from the current default of 8 to, say, 32
> (/proc/fs/lustre/osc/*/max_rpcs_in_flight).
>
> I really wonder how the results would compare to the 4M RPC results if you
> still have the capability to test it.
>
> Thanks.
>
> Bye,
> Oleg
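[To put rough numbers on why the amount of data in flight dominates at WAN latencies, a back-of-the-envelope sketch; the RTT and tunable values are illustrative assumptions, and server processing time is ignored.]

  # Sketch: per-OST streaming throughput is bounded by (data in flight)/RTT.
  rtt_ms=100
  rpc_mb=1            # 1MB bulk RPCs (4 with the bug 16900 patch)
  rpcs_in_flight=32
  echo "scale=1; $rpc_mb * $rpcs_in_flight / ($rtt_ms / 1000)" | bc
  # => ~320 MB/s with 32x1MB in flight at 100ms RTT; larger RPCs or more
  #    RPCs in flight raise the bound proportionally.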
Hello!

On Dec 22, 2010, at 12:43 AM, Jeremy Filizetti wrote:
> In the attachment I created, which Andreas posted at https://bugzilla.lustre.org/attachment.cgi?id=31423, graphs 1 and 2 are both using a larger-than-default max_rpcs_in_flight. I believe the data without the patch from bug 16900 used max_rpcs_in_flight=42, and the data with the patch from 16900 used max_rpcs_in_flight=32. So the short answer is we are already increasing max_rpcs_in_flight for all of that data (which is needed for good performance at higher latencies).

Ah! This should have been noted somewhere. Well, it's still unfair then! ;)

You see, each OSC can cache up to 32MB of dirty data by default (the max_dirty_mb osc setting in /proc). So when you have 4M RPCs, you actually use only 8 RPCs to transfer your entire allotment of dirty pages, whereas you use 32 for 1M RPCs (so setting max_rpcs_in_flight any higher has no effect unless you also bump max_dirty_mb). Of course this only affects write RPCs, not reads.

> My understanding of the real benefit of the larger RPC patch is that we are not having to pay 12 round-trip times to read 4 MB (four 1 MB bulk RPCs at 3 RTTs each); instead I think we have 3. Although I've never traced through to see that this is actually what is happening, from what I read about the patch it sends 4 memory descriptors with a single bulk request.

Well, I don't think this should matter anyhow. Since we send the RPCs asynchronously, in parallel, the latency of the bulk descriptor get does not add up. Given that, the results you've got should have been much closer together, too. I wonder what other factors played a role here?

I see you only had a single client, so it's not like you were able to overwhelm the number of OSS threads running. Even in the case of 6 OSTs per OSS, assuming all 42 RPCs were in flight, that's still only 252 RPCs. Did you make sure that that's the number of threads you had running, by any chance?

How does your RTT delay get introduced for the test? Could it be that if there are more messages on the wire at the same time, they are delayed more (aside from the obvious bandwidth-induced delay - e.g. bottlenecking a single message at a time with a mandatory delay, or something like this)?

> What isn't quite clear to me is why Lustre takes 3 RTTs for a read and 2 for a write. I think I understand the write having to communicate once with the server, because preallocating buffers for all clients would possibly be a waste of resources. But for reading it seems logical (from the RDMA standpoint) that the memory buffer could be pre-registered and sent to the server, and the server would respond back with the contents for that buffer, which would be 1 RTT.

Probably the difference is the one of GET vs PUT semantics in LNET; there are going to be at least 2 RTTs in any case. One RTT is the "header" RPC that tells the OST "hey, I am doing this operation here that involves bulk IO, it has this many pages and the descriptor is so and so", then the server does another RTT to actually fetch/push the data (and that might actually be worse than one RTT for one of the GET/PUT cases, I guess).

> I don't have everything set up right now in our test environment, but with a little effort I could set up a similar test if you're wondering about something specific.

Would be interesting to confirm the number of RPCs actually being processed on the server at any one time, I think.

Did you try direct IO too? Some older versions of Lustre used to send all outstanding directio RPCs in parallel, so if you did your IO as just a single direct IO write, the latency of that write should be around a couple of RTTs. I think we still do this even in 1.8.5, so it would make an interesting comparison.

Bye,
Oleg
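[A minimal sketch of such a direct IO comparison, assuming GNU dd and a hypothetical /mnt/lustre mount point; sizes are illustrative.]

  # Sketch: single large O_DIRECT write and read, to compare latency against
  # buffered IO. oflag=direct/iflag=direct are standard GNU dd options.
  dd if=/dev/zero of=/mnt/lustre/dio.tmp bs=32M count=1 oflag=direct
  dd if=/mnt/lustre/dio.tmp of=/dev/null bs=32M count=1 iflag=direct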
Oleg Drokin wrote:
>> I don't have everything set up right now in our test environment, but with a little effort I could set up a similar test if you're wondering about something specific.
>
> Would be interesting to confirm the number of RPCs actually being processed on the server at any one time, I think.
>
> Did you try direct IO too? Some older versions of Lustre used to send all outstanding directio RPCs in parallel, so if you did your IO as just a single direct IO write, the latency of that write should be around a couple of RTTs. I think we still do this even in 1.8.5, so it would make an interesting comparison.

You probably want to look at the brw_stats for a single client NID to do that. Unless you are running 1.8.5, check out the patch in bug 23827 (and maybe 23826).

Kevin
Hello!

On Dec 22, 2010, at 1:43 AM, Kevin Van Maren wrote:
>> Would be interesting to confirm the number of RPCs actually being processed on the server at any one time, I think.
>> Did you try direct IO too? Some older versions of Lustre used to send all outstanding directio RPCs in parallel, so if you did your IO as just a single direct IO write, the latency of that write should be around a couple of RTTs. I think we still do this even in 1.8.5, so it would make an interesting comparison.
>
> You probably want to look at the brw_stats for a single client NID to do that. Unless you are running 1.8.5, check out the patch in bug 23827 (and maybe 23826).

Actually, there was only a single client in the original test, so we can check the aggregated brw_stats (we just need to zero them out before the test run).

Bye,
Oleg
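[A sketch of how that check might look on the OSS, assuming the 1.8 obdfilter /proc layout and assuming the usual convention that writing to a Lustre stats file clears it.]

  # Sketch, run on each OSS: reset, run the test, then inspect the bulk IO
  # histograms ("pages per bulk r/w" shows the actual RPC sizes seen).
  for f in /proc/fs/lustre/obdfilter/*/brw_stats; do echo 0 > "$f"; done
  # ... run the IO test from the client ...
  cat /proc/fs/lustre/obdfilter/*/brw_stats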
On Wed, Dec 22, 2010 at 1:32 AM, Oleg Drokin <green at whamcloud.com> wrote:
> Ah! This should have been noted somewhere.

It was in some brief associated with the data, but unfortunately I can't seem to find that.

> Well, it's still unfair then! ;)

Yeah, it wasn't quite balanced; I was actually trying to test our current Lustre setup against what could be done with some extra parameters and the patch from 16900. I'm pretty sure max_dirty_mb was set to 128 for both tests. One thing that isn't clear to me is why the Lustre manual recommends max_dirty_mb = max_rpcs_in_flight x 4. It seems to me that having them equal, or max_dirty_mb slightly larger to handle any off-by-one errors, should be sufficient?

> You see, each OSC can cache up to 32MB of dirty data by default
> (the max_dirty_mb osc setting in /proc). So when you have 4M RPCs, you
> actually use only 8 RPCs to transfer your entire allotment of dirty pages,
> whereas you use 32 for 1M RPCs (so setting max_rpcs_in_flight any higher
> has no effect unless you also bump max_dirty_mb). Of course this only
> affects write RPCs, not reads.
>
> Well, I don't think this should matter anyhow. Since we send the RPCs
> asynchronously, in parallel, the latency of the bulk descriptor get does
> not add up.

It does make a difference before readahead has kicked in. There is a big difference in how things start off, even though for a sequential load they all peak at the same value. For instance, if I was just accessing a portion of a file, or seeking around and reading a few MBs instead of reading it all sequentially, this has a big impact, IIRC.

> Given that, the results you've got should have been much closer together,
> too. I wonder what other factors played a role here?

I was pretty interested in that as well. It seems like the larger RPC is hiding some other issue beyond just the theoretical amount of data on the wire.

> I see you only had a single client, so it's not like you were able to
> overwhelm the number of OSS threads running. Even in the case of 6 OSTs per
> OSS, assuming all 42 RPCs were in flight, that's still only 252 RPCs. Did
> you make sure that that's the number of threads you had running, by any
> chance?

I did not check, but these were decent systems (24 GB RAM, 2 quad-core Nehalems with HT), so I assume they had sufficient threads. And there were only 6 OSTs total, so 126 RPCs per OSS.

> How does your RTT delay get introduced for the test? Could it be that if
> there are more messages on the wire at the same time, they are delayed
> more (aside from the obvious bandwidth-induced delay - e.g. bottlenecking a
> single message at a time with a mandatory delay, or something like this)?

The RTT delay is handled by the Obsidian Longbow and was set symmetrically to half the RTT on each end (I think they delay receive traffic only, not transmit). The data at 110+ ms seems to have a larger decay than the rest, so I'm not sure if something was happening before that, but the rest of the data seems consistent with what I've seen using real distance rather than simulated delay, although I only have a few points to compare it to and not the whole range of 0-200 ms.

> Probably the difference is the one of GET vs PUT semantics in LNET; there
> are going to be at least 2 RTTs in any case. One RTT is the "header" RPC
> that tells the OST "hey, I am doing this operation here that involves bulk
> IO, it has this many pages and the descriptor is so and so", then the
> server does another RTT to actually fetch/push the data (and that might
> actually be worse than one RTT for one of the GET/PUT cases, I guess).

I need to look at this more, but it seems to me that the read case should still be capable of completing in 1 RTT, because the server can send the response as soon as it gets the request, since all the MD info should be included with the request?

> Would be interesting to confirm the number of RPCs actually being processed
> on the server at any one time, I think.
> Did you try direct IO too? Some older versions of Lustre used to send all
> outstanding directio RPCs in parallel, so if you did your IO as just a
> single direct IO write, the latency of that write should be around a couple
> of RTTs. I think we still do this even in 1.8.5, so it would make an
> interesting comparison.

I didn't try any direct IO, but I certainly could.

> Bye,
> Oleg
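[Regarding the max_dirty_mb versus max_rpcs_in_flight question above, a minimal sketch of the comparison, assuming the 1.8 /proc layout and a nominal 1MB RPC size; values are illustrative, and the closing comment is one reading of the manual's 4x recommendation rather than a definitive explanation.]

  # Sketch, run on a client: compare each OSC's dirty-data cap with the
  # amount of data its write RPCs can keep in flight.
  rpc_mb=1
  for f in /proc/fs/lustre/osc/*/max_rpcs_in_flight; do
      osc=$(dirname $f)
      dirty=$(cat $osc/max_dirty_mb)
      flight=$(cat $f)
      echo "$(basename $osc): max_dirty_mb=${dirty}  in-flight write data=$((flight * rpc_mb))MB"
  done
  # If max_dirty_mb is smaller than max_rpcs_in_flight * RPC size, writes
  # cannot keep all the allowed RPC slots busy; the extra headroom lets new
  # dirty pages accumulate while earlier RPCs are still on the wire.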
Hello!

On Dec 22, 2010, at 8:25 AM, Jeremy Filizetti wrote:
>> Well, I don't think this should matter anyhow. Since we send the RPCs asynchronously, in parallel, the latency of the bulk descriptor get does not add up.
> It does make a difference before readahead has kicked in. There is a big difference in how things start off, even though for a sequential load they all peak at the same value. For instance, if I was just accessing a portion of a file, or seeking around and reading a few MBs instead of reading it all sequentially, this has a big impact, IIRC.

Right, I can see where large RPCs would help in the case of reads while readahead picks up the pace. But the same thing should not happen with writes, since the client is the sole originator of the writes.

>> How does your RTT delay get introduced for the test? Could it be that if there are more messages on the wire at the same time, they are delayed more (aside from the obvious bandwidth-induced delay - e.g. bottlenecking a single message at a time with a mandatory delay, or something like this)?
> The RTT delay is handled by the Obsidian Longbow and was set symmetrically to half the RTT on each end (I think they delay receive traffic only, not transmit). The data at 110+ ms seems to have a larger decay than the rest, so I'm not sure if something was happening before that, but the rest of the data seems consistent with what I've seen using real distance rather than simulated delay, although I only have a few points to compare it to and not the whole range of 0-200 ms.

I think it would also be interesting to run a similar test over a genuinely high-latency link.

>> Probably the difference is the one of GET vs PUT semantics in LNET; there are going to be at least 2 RTTs in any case. One RTT is the "header" RPC that tells the OST "hey, I am doing this operation here that involves bulk IO, it has this many pages and the descriptor is so and so", then the server does another RTT to actually fetch/push the data (and that might actually be worse than one RTT for one of the GET/PUT cases, I guess).
> I need to look at this more, but it seems to me that the read case should still be capable of completing in 1 RTT, because the server can send the response as soon as it gets the request, since all the MD info should be included with the request?

Well, while theoretically that might be the case, with Lustre as it is right now a bulk RPC is a two-phase process: one RTT transmits the "metadata" of sorts that describes the IO in one direction and returns the IO status back, and the other RTT actually transfers the data over the wire in one of the directions.

>> Would be interesting to confirm the number of RPCs actually being processed on the server at any one time, I think.
>> Did you try direct IO too? Some older versions of Lustre used to send all outstanding directio RPCs in parallel, so if you did your IO as just a single direct IO write, the latency of that write should be around a couple of RTTs. I think we still do this even in 1.8.5, so it would make an interesting comparison.
> I didn't try any direct IO, but I certainly could.

Thanks. Please keep us informed if you see anything else interesting.

Bye,
Oleg
On Dec 22, 2010, at 05:51, Oleg Drokin wrote:
> Hello!

Hi all,

> I guess I am a little bit late to the party, but I was just reading comments in bug 16900 and have this question I really need to ask.
>
> On Aug 23, 2010, at 10:58 PM, Jeremy Filizetti wrote:
>> The larger RPCs from bug 16900 offered a significant performance improvement when working over the WAN. Our use case involves a few clients who need fast access rather than 100s or 1000s. The included PDF shows iozone performance over the WAN in 10 ms RTT increments up to 200 ms for a single Lustre client and a small Lustre setup (1 MDS, 2 OSS, 6 OSTs). This test was with an SDR InfiniBand WAN connection using Obsidian Longbows to simulate delay. I'm not 100% sure the value used is correct for concurrent_sends.
>>
>> So even though this isn't geared towards most Lustre users, I think the larger RPCs are pretty useful. Plenty of people at LUG2010 mentioned using Lustre over the WAN in some way.
>
> So are you sure you got your benefit from the larger RPC size as opposed to just having 4x more data on the wire? There is another way to increase the amount of data on the wire without large RPCs: you can increase the number of RPCs in flight to OSTs from the current default of 8 to, say, 32 (/proc/fs/lustre/osc/*/max_rpcs_in_flight).
>
> I really wonder how the results would compare to the 4M RPC results if you still have the capability to test it.

I agree with Oleg that this is the better approach, also from another point of view. While Lustre tries to form full 1M or 4M (whatever) IO RPCs, this is not always possible. One such case is IO to many small files: there is just no way to pack pages that belong to multiple files into one IO RPC. This causes lots of small IO that will definitely under-load the network.

While tuning max_rpcs_in_flight you may want to check whether the network has become the bottleneck. This can be done by watching "threads_started" for the IO service on the server, which is the number of threads currently in use for handling RPCs for that service. If it stops growing as you increase max_rpcs_in_flight, your network is becoming the bottleneck. Example:

cat /proc/fs/lustre/ost/OSS/ost_io/threads_started

Thanks.

--
umka
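[A trivial way to watch that counter while ramping max_rpcs_in_flight on the client; the path is the one quoted above, and the one-second interval is arbitrary.]

  # Sketch, run on the OSS: sample the number of started ost_io threads once
  # a second while the client-side tuning change is being exercised.
  while true; do
      date +%T
      cat /proc/fs/lustre/ost/OSS/ost_io/threads_started
      sleep 1
  done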
> I agree with Oleg that this is the better approach, also from another point
> of view. While Lustre tries to form full 1M or 4M (whatever) IO RPCs, this
> is not always possible. One such case is IO to many small files: there is
> just no way to pack pages that belong to multiple files into one IO RPC.
> This causes lots of small IO that will definitely under-load the network.

I'm really targeting sequential single-client access, where these RPCs will be filled.

> While tuning max_rpcs_in_flight you may want to check whether the network
> has become the bottleneck. This can be done by watching "threads_started"
> for the IO service on the server, which is the number of threads currently
> in use for handling RPCs for that service. If it stops growing as you
> increase max_rpcs_in_flight, your network is becoming the bottleneck.

It is just a single client connecting to 2 OSSs. I did check this, though; the test I just ran had 128 threads on each OSS.

The latest data incorporates the patch from bug 16900, but with max_pages_per_rpc modified to make either a 1 MB or a 4 MB RPC. I didn't see a huge difference this time around, and the test was slightly more balanced with respect to the parameters used. I have some tests running now with no patch at all instead of just a limited max_pages_per_rpc, but AFAIK those should be equivalent.

You can find the two new attachments at:
https://bugzilla.lustre.org/attachment.cgi?id=32618
and
https://bugzilla.lustre.org/attachment.cgi?id=32619

Jeremy