Hello,

I'm hoping to get a few ideas on how we could modify LST to make doing
performance testing easier. Right now we can use "lst stat" to get a
rough idea of performance, but the timers are pretty rough and the data
is a snapshot.

Any ideas? I've got cycles to do the coding, but not sure what would be
the best way to fit this into the existing LST framework.

BTW - the ability to dump CSV or some other text file with per-node and
per-group data would also be nice.

Nic
On Thu, Sep 24, 2009 at 03:33:18PM -0500, Nic Henke wrote:
> Hello,
>
> I'm hoping to get a few ideas on how we could modify LST to make doing
> performance testing easier. Right now we can use "lst stat" to get a
> rough idea of performance, but the timers are pretty rough and the data
> is a snapshot.
>
> Any ideas? I've got cycles to do the coding, but not sure what would
> be the best way to fit this into the existing LST framework.

There are some rough edges in the stat gathering code. First, the LST
console has no idea whether the tests have stopped, which is why the
'lst stat' command by default loops until a ^C. Test clients could
return a counter of active test batches - when it drops to 0, all tests
on the client must have completed - but servers are passive and have no
idea whether clients are done or not.

The throughput calculation could also be inaccurate. IIRC, the console
just takes a snapshot of the stat counters on test nodes at a fixed
interval (1 second by default) and calculates the throughput as the
change between successive counter snapshots divided by the interval.
But the interval at which the console sends 'get_stat' requests does
not equal the interval at which snapshots are taken on the test nodes -
the 'get_stat' requests can be delayed in transit when the network is
stressed (something LST was designed to do), and even worse they can be
reordered in the presence of routers. One possible solution would be to
include a timestamp in the 'get_stat' replies and calculate the
throughput as the diff in counters divided by the diff in timestamps.
Since the console only cares about changes in the timestamps, the test
nodes' clocks do not need to be in sync at all (but they do need to be
monotonic and of the same resolution).

The test servers concurrently post one passive buffer for each request,
so for each test request there's one LNetMDAttach and one unlink
operation, and both operations need to grab the one big LNET_LOCK;
therefore the server CPU could become a bottleneck before the network is
saturated. The solution is, instead of one request per buffer, to post
one big buffer that can accommodate multiple requests, amortizing the
per-buffer processing costs.

Refining these rough edges will likely involve protocol changes. LST is
not a production service, so strict backward compatibility is not
necessary. I think it'd suffice to do a protocol version check at the
time of the 'add_node' command and simply refuse to add a test node
whose protocol version differs from that of the console.

> BTW - the ability to dump CSV or some other text file with per-node and
> per-group data would also be nice.

That's a good idea, then users could do whatever they'd like to the
data.

Thanks,
Isaac
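[A minimal sketch of the timestamped 'get_stat' idea Isaac describes
above - the struct and function names are made up for illustration, not
the actual LST wire format. The division happens on the console
(userspace), and only ever compares timestamps from the same node:]

    #include <stdint.h>

    /* Hypothetical stat reply carrying a node-local monotonic clock
     * reading taken at the same instant as the counter snapshot. */
    struct lst_stat_reply {
            uint64_t sr_stamp_usec;  /* node-local monotonic time, usec */
            uint64_t sr_bytes;       /* cumulative bytes transferred */
    };

    /* Throughput between two snapshots of one node. Clocks across
     * nodes need not be synchronized - only the per-node diff is
     * used - but they must be monotonic and of the same resolution. */
    static double
    lst_throughput_mbs(const struct lst_stat_reply *prev,
                       const struct lst_stat_reply *cur)
    {
            uint64_t dt = cur->sr_stamp_usec - prev->sr_stamp_usec;

            if (dt == 0)
                    return 0.0;

            /* bytes per microsecond == MB/s (decimal MB) */
            return (double)(cur->sr_bytes - prev->sr_bytes) / (double)dt;
    }

[And a sketch of the amortized server buffer, in the shape of the LNet
MD API - the exact flag and field names should be checked against the
lnet headers in the tree, and BIG_BUFFER_SIZE/TEST_REQUEST_SIZE are
placeholders:]

    /* One large MD absorbs many small test requests, so the
     * LNetMDAttach/unlink (and LNET_LOCK) cost is paid once per
     * buffer instead of once per request. LNET_MD_MAX_SIZE lets the
     * MD accept successive PUTs of up to max_size bytes each. */
    lnet_md_t md = {
            .start     = buffer,             /* big preallocated buffer */
            .length    = BIG_BUFFER_SIZE,
            .threshold = LNET_MD_THRESH_INF, /* accept many operations */
            .max_size  = TEST_REQUEST_SIZE,  /* per-request cap */
            .options   = LNET_MD_OP_PUT | LNET_MD_MAX_SIZE,
            .eq_handle = test_eq,
    };

    rc = LNetMDAttach(meh, md, LNET_UNLINK, &mdh);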
Isaac Huang wrote:
> On Thu, Sep 24, 2009 at 03:33:18PM -0500, Nic Henke wrote:
>> Hello,
>>
>> I'm hoping to get a few ideas on how we could modify LST to make doing
>> performance testing easier. Right now we can use "lst stat" to get a
>> rough idea of performance, but the timers are pretty rough and the data
>> is a snapshot.
>>
>> Any ideas? I've got cycles to do the coding, but not sure what would
>> be the best way to fit this into the existing LST framework.
>
> There are some rough edges in the stat gathering code. First, the LST
> console has no idea whether the tests have stopped, which is why the
> 'lst stat' command by default loops until a ^C. Test clients could
> return a counter of active test batches - when it drops to 0, all tests
> on the client must have completed - but servers are passive and have no
> idea whether clients are done or not.

I think the timing of the start/stop of each of the tests is probably
the trickiest bit. To get really good end-to-end numbers, we'd need to
be able to accurately time each of the tests.

> The throughput calculation could also be inaccurate. IIRC, the console
> just takes a snapshot of the stat counters on test nodes at a fixed
> interval (1 second by default) and calculates the throughput as the
> change between successive counter snapshots divided by the interval.
> But the interval at which the console sends 'get_stat' requests does
> not equal the interval at which snapshots are taken on the test nodes -
> the 'get_stat' requests can be delayed in transit when the network is
> stressed (something LST was designed to do), and even worse they can be
> reordered in the presence of routers. One possible solution would be to
> include a timestamp in the 'get_stat' replies and calculate the
> throughput as the diff in counters divided by the diff in timestamps.
> Since the console only cares about changes in the timestamps, the test
> nodes' clocks do not need to be in sync at all (but they do need to be
> monotonic and of the same resolution).

I'm wondering if we couldn't add a new 'batch_stat' command. The idea
is that the client code would fill in the start/stop times for each
test, and then after the test is done 'batch_stat' would collect this
data. The collection would still be passive, and a new command should
minimize the protocol changes. The per-test data would allow us to get
accurate perf numbers and also provide some data on how parallel the
tests were, whether there are any unfairness issues, etc.

> The test servers concurrently post one passive buffer for each request,
> so for each test request there's one LNetMDAttach and one unlink
> operation, and both operations need to grab the one big LNET_LOCK;
> therefore the server CPU could become a bottleneck before the network is
> saturated. The solution is, instead of one request per buffer, to post
> one big buffer that can accommodate multiple requests, amortizing the
> per-buffer processing costs.

If we added timestamps to the data, the processing time & buffer sizing
would be less of an issue - it wouldn't factor into the accuracy of the
numbers we are gathering.

> Refining these rough edges will likely involve protocol changes. LST is
> not a production service, so strict backward compatibility is not
> necessary. I think it'd suffice to do a protocol version check at the
> time of the 'add_node' command and simply refuse to add a test node
> whose protocol version differs from that of the console.

OK.

>> BTW - the ability to dump CSV or some other text file with per-node and
>> per-group data would also be nice.
>
> That's a good idea, then users could do whatever they'd like to the
> data.

Nic
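[A sketch of the per-test record that the 'batch_stat' reply Nic
proposes might carry - hypothetical names, not an existing LST
structure:]

    #include <stdint.h>

    /* Filled in by the test client as each test instance starts and
     * finishes; collected passively by 'batch_stat' afterwards. */
    struct lst_test_stat {
            uint64_t ts_start_usec;  /* node-local time of first RPC */
            uint64_t ts_stop_usec;   /* node-local time of last reply */
            uint64_t ts_bytes;       /* payload bytes moved */
            uint32_t ts_errors;      /* RPCs failed or timed out */
            uint32_t ts_concur;      /* concurrency it ran at */
    };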
On Tue, 2009-09-29 at 11:51 -0500, Nic Henke wrote:
> I'm wondering if we couldn't add a new 'batch_stat' command. The idea
> is that the client code would fill in the start/stop times for each
> test, and then after the test is done 'batch_stat' would collect this
> data. The collection would still be passive, and a new command should
> minimize the protocol changes. The per-test data would allow us to get
> accurate perf numbers and also provide some data on how parallel the
> tests were, whether there are any unfairness issues, etc.

Along these lines, it would be nice if we could specify a run time for
each test rather than an amount of data to be transferred -- it makes
it easier to get aggregate bandwidth numbers, and often shows
imbalances nicely -- the node getting starved is the one that transfers
less data.

It may also make sense to add a 'delay' parameter that causes each test
to wait a specified amount of time from the 'go' signal. This allows
the signal to propagate without running into congestion from the test,
helping all of the clients start the test closer to simultaneously.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
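[A sketch of how these might surface in the lst CLI - the --time and
--delay options are hypothetical, not existing lst flags; the rest
follows the current add_test syntax:]

    # Run the bulk write test for 60 seconds rather than a fixed
    # --loop count, with each client waiting 5 seconds after the
    # 'go' signal before issuing its first RPC.
    lst add_test --batch bulkperf --time 60 --delay 5 \
        --concurrency 8 --from clients --to servers \
        brw write size=1M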
David Dillow wrote:
> On Tue, 2009-09-29 at 11:51 -0500, Nic Henke wrote:
>> I'm wondering if we couldn't add a new 'batch_stat' command. The idea
>> is that the client code would fill in the start/stop times for each
>> test, and then after the test is done 'batch_stat' would collect this
>> data. The collection would still be passive, and a new command should
>> minimize the protocol changes. The per-test data would allow us to get
>> accurate perf numbers and also provide some data on how parallel the
>> tests were, whether there are any unfairness issues, etc.
>
> Along these lines, it would be nice if we could specify a run time for
> each test rather than an amount of data to be transferred -- it makes
> it easier to get aggregate bandwidth numbers, and often shows
> imbalances nicely -- the node getting starved is the one that transfers
> less data.
>
> It may also make sense to add a 'delay' parameter that causes each test
> to wait a specified amount of time from the 'go' signal. This allows
> the signal to propagate without running into congestion from the test,
> helping all of the clients start the test closer to simultaneously.

Interesting - can you elaborate, perhaps in the form of a patch? :-) I
like both ideas, but not signing up to code them just yet.

Nic
Nic Henke wrote:
> Isaac Huang wrote:
>> On Thu, Sep 24, 2009 at 03:33:18PM -0500, Nic Henke wrote:
>
> I'm wondering if we couldn't add a new 'batch_stat' command. The idea
> is

So - for a current swag:

- I'm thinking of timing the 'test instance' run (sfw_test_instance_t).
This would get timing for each test run end-to-end, but be detailed
enough to show the different timing when --concurrency > 1 is used.
Doing per-RPC timing seemed too heavy, and I'm not sure that level of
resolution is really needed.

- the new command 'test_stat' would collect this data a la the current
'stat', but allow CSV dumping for more fine-grained analysis. I'd
probably dump some node & group info as well to make it easier to do
'group' stats.

Nic
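[A sketch of what a 'test_stat' CSV dump could look like - the columns
are hypothetical, just to make the idea concrete:]

    # group,node,test,concurrency,start_usec,stop_usec,bytes,errors
    clients,12345-192.168.0.10@o2ib,1,8,1254241802000000,1254241862000151,8589934592,0
    clients,12345-192.168.0.11@o2ib,1,8,1254241802000417,1254241862004203,8579948544,0

With per-instance start/stop times in the dump, per-node bandwidth,
skew between nodes, and fairness would all fall out of simple
post-processing.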
On Tue, 2009-09-29 at 14:02 -0400, Nic Henke wrote:
> David Dillow wrote:
> > Along these lines, it would be nice if we could specify a run time for
> > each test rather than an amount of data to be transferred -- it makes
> > it easier to get aggregate bandwidth numbers, and often shows
> > imbalances nicely -- the node getting starved is the one that transfers
> > less data.
> >
> > It may also make sense to add a 'delay' parameter that causes each test
> > to wait a specified amount of time from the 'go' signal. This allows
> > the signal to propagate without running into congestion from the test,
> > helping all of the clients start the test closer to simultaneously.
>
> Interesting - can you elaborate, perhaps in the form of a patch? :-) I
> like both ideas, but not signing up to code them just yet.

Maybe, but it is going to be well down the list -- I owe Oleg a large
amount of testing before I think of doing anything else. :)
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
What is it we really want to measure here? Steady-state throughput, or
elapsed time to run a specific test (i.e. including ramp-up/ramp-down)?

The intention behind the current stats command was to measure
steady-state throughput - i.e. once the test batch(es?) have been
started, a number of stat snapshots are taken until throughput settles.
That also probably allows tests to be run more quickly, since they can
be stopped as soon as the steady state has been observed.

Is that too hard to do with the current command set?

Cheers,
Eric
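[As an aside, a sketch of how "until throughput settles" could be
automated on the console - illustrative only, not existing lst code:]

    #include <math.h>

    /* Consider throughput settled once the last n samples all stay
     * within tol (e.g. 0.05 for 5%) of their mean. */
    static int
    throughput_settled(const double *samples, int n, double tol)
    {
            double mean = 0.0;
            int    i;

            if (n <= 1)
                    return 0;

            for (i = 0; i < n; i++)
                    mean += samples[i];
            mean /= n;

            for (i = 0; i < n; i++)
                    if (fabs(samples[i] - mean) > tol * mean)
                            return 0;

            return 1;
    }

[The console could then stop the batch once throughput_settled() holds
over, say, the last five one-second samples.]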
Eric Barton wrote:
> What is it we really want to measure here? Steady-state throughput, or
> elapsed time to run a specific test (i.e. including ramp-up/ramp-down)?

Both :-) The end-to-end performance is more interesting to me right
now. The timing data is more accurate, and we can run shorter tests
than with 'lst stat' to get an idea of burst performance. Having both
methods is desirable to me.

> The intention behind the current stats command was to measure
> steady-state throughput - i.e. once the test batch(es?) have been
> started, a number of stat snapshots are taken until throughput settles.
> That also probably allows tests to be run more quickly, since they can
> be stopped as soon as the steady state has been observed.
>
> Is that too hard to do with the current command set?

It could be made cleaner and output the data to .csv, etc - but one can
get a rough idea of steady-state performance. The data will be more
accurate once the timestamps for the data are sent over the wire
instead of computed locally on the lst console node.

Nic
On Tue, Sep 29, 2009 at 01:32:48PM -0400, David Dillow wrote:
> On Tue, 2009-09-29 at 11:51 -0500, Nic Henke wrote:
> > I'm wondering if we couldn't add a new 'batch_stat' command. The idea
> > is that the client code would fill in the start/stop times for each
> > test, and then after the test is done 'batch_stat' would collect this
> > data. The collection would still be passive, and a new command should
> > minimize the protocol changes. The per-test data would allow us to get
> > accurate perf numbers and also provide some data on how parallel the
> > tests were, whether there are any unfairness issues, etc.
>
> Along these lines, it would be nice if we could specify a run time for
> each test rather than an amount of data to be transferred -- it makes
> it easier to get aggregate bandwidth numbers, and often shows
> imbalances nicely -- the node getting starved is the one that transfers
> less data.

This would be a very useful feature. We're working on adding LST tests
to our automatic tests, where we met the problem that we can never tell
how long a test will run by looking at '--loop' and '--concurrency'.
LST already implements a timer mechanism with one-second resolution,
which should suffice for controlling test run time.

Isaac
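[A sketch of bounding a test run with that existing selftest timer -
this assumes the stt_timer interface in lnet/selftest/timer.h; the stop
hook named here is hypothetical:]

    /* Arm a one-shot timer when the test instance starts; when it
     * fires, stop issuing new RPCs and let in-flight ones drain. */
    static void
    sfw_test_run_timeout(void *data)
    {
            sfw_test_instance_t *tsi = data;

            sfw_stop_test_instance(tsi);  /* hypothetical stop hook */
    }

    stt_timer_t timer;

    timer.stt_data    = tsi;
    timer.stt_func    = sfw_test_run_timeout;
    timer.stt_expires = cfs_time_current_sec() + run_time_secs;
    stt_add_timer(&timer);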
On Tue, Sep 29, 2009 at 11:51:45AM -0500, Nic Henke wrote:
> ......
> > The test servers concurrently post one passive buffer for each request,
> > so for each test request there's one LNetMDAttach and one unlink
> > operation, and both operations need to grab the one big LNET_LOCK;
> > therefore the server CPU could become a bottleneck before the network is
> > saturated. The solution is, instead of one request per buffer, to post
> > one big buffer that can accommodate multiple requests, amortizing the
> > per-buffer processing costs.
>
> If we added timestamps to the data, the processing time & buffer sizing
> would be less of an issue - it wouldn't factor into the accuracy of the
> numbers we are gathering.

Probably not. The timestamps affect only the stat gathering RPCs, which
should be far outnumbered by the test RPCs (loop x concurrency x
test_client_count, for each test server).

Isaac
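[To make that ratio concrete, with made-up numbers: using Isaac's
formula, --loop 1000, --concurrency 8, and 64 test clients put on the
order of 1000 x 8 x 64 = 512,000 test RPCs through each test server,
while a one-minute run polled at the default 1-second interval costs
only about 60 stat RPCs per node - nearly four orders of magnitude
fewer.]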