nathan@clusterfs.com
2007-Aug-01  15:08 UTC
[Lustre-devel] [Bug 10969] Application summary profiling tool
Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=10969

(In reply to comment #67)
> The brw_stats looks enough for the time being. But please keep in mind that
> users do NOT have access to the servers and the brw_stats info stored on the
> OSTs will NOT be available to the apps perf tool directly.

Yes, I'm just trying to get a handle on whether we're collecting the right data in the first place. Collecting/presenting it is a different challenge.

> Anomalies were meant over all the clients and per client as well. It was
> suggested as an idea to keep track of a slow client or a slow server for the
> duration of an application. Also it can be a very powerful tool when combined
> with the timestamps (see below please).

With the current stats at any given moment we could compare e.g. the average ost_setattr execution time and note that OST5 is 10% slower than the average OST, or that client7 has the highest average write size on OST2. I think potentially one of the most difficult parts of this tool is deciding how to prune the data we present down to a comprehensible amount.

> Timestamped info means the ability to playback the I/O for the duration of an
> application. It does not need to be very fine grained (i.e. aggregate
> timestamped summary info for every X msecs/secs per each client/server should be
> sufficient).

e.g. something like:

11:02 client7 7MB w, 10MB r, 3004 RPCs, waited for 5 locks, 10 locks revoked

The more concrete we can make our examples, the better.

> Yes, we meant RPC request queues (e.g. time spent on queue, queue depth).

Ok. We already collect this information per server.

req_waittime 117364 samples [usec] 34 23445 21251101 7973894281
req_qdepth 117364 samples [reqs] 0 8 29906 30464

> Probably not ALL RPC related info. I am assuming "ALL" would be overwhelming to
> analyze and digest. Perhaps, we need to list the most striking ones. What do you
> suggest Nathan?

The slow outliers would probably be the most interesting.

server info:
- at 11:02 req 1002 type 42 from client7 took 102s to process
- at that time, the q depth was 5, the avg waittime was 10s, and the average req of that type took 6s

client info:
- req 1002 from process 7 "ior" opc=fsync
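
To make the req_waittime and req_qdepth lines quoted above a bit more concrete, here is a rough Python sketch of how a summary tool might turn one such line into an average and a spread. It assumes the four trailing numbers are min, max, sum and sum of squares, in that order, which is my reading of the stats output and not something stated in this thread. The names StatLine and parse_stat_line are made up for illustration.

```python
import math
from typing import NamedTuple


class StatLine(NamedTuple):
    """One counter line, e.g. 'req_waittime 117364 samples [usec] 34 ...'."""
    name: str
    count: int
    unit: str
    minimum: int
    maximum: int
    total: int        # assumed: sum of all samples
    sum_squares: int  # assumed: sum of squared samples

    @property
    def mean(self) -> float:
        return self.total / self.count if self.count else 0.0

    @property
    def stddev(self) -> float:
        # Population standard deviation from sum and sum of squares.
        if not self.count:
            return 0.0
        variance = self.sum_squares / self.count - self.mean ** 2
        return math.sqrt(max(variance, 0.0))


def parse_stat_line(line: str) -> StatLine:
    """Parse 'name count samples [unit] min max sum sumsq' (assumed layout)."""
    fields = line.split()
    name, count, _samples, unit = fields[:4]
    minimum, maximum, total, sum_squares = (int(f) for f in fields[4:8])
    return StatLine(name, int(count), unit.strip("[]"),
                    minimum, maximum, total, sum_squares)


# The two sample lines from this comment.
SAMPLE = """\
req_waittime 117364 samples [usec] 34 23445 21251101 7973894281
req_qdepth 117364 samples [reqs] 0 8 29906 30464
"""

for raw in SAMPLE.splitlines():
    stat = parse_stat_line(raw)
    print(f"{stat.name}: mean={stat.mean:.2f} {stat.unit}, "
          f"stddev={stat.stddev:.2f}, max={stat.maximum}")
```

Comparing the per-server means from something like this is one way to arrive at statements such as "OST5 is 10% slower than the average OST". Note that the aggregate counters only give the maximum, so reporting individual slow requests would still need per-request records.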