Chad Mynhier
2007-Nov-02 00:43 UTC
[dtrace-discuss] Proposal: standard deviation aggregating function
I've started working with Adam Leventhal and Jon Haslam on 6325485 ("A stdev() aggregator would be a nice adjunct to avg()"). Here's the proposal; comments are welcome.

Chad

SUMMARY

This fast-track enhances the DTrace utility to address an existing RFE[1] requesting an aggregating function to calculate standard deviation, similar to the current aggregating function for average. The new function is a committed interface; this case seeks patch release binding.

DETAILS

Overview

Currently, the DTrace utility includes an aggregating function to calculate the average of a set of numbers but does not provide one to calculate standard deviation. Because this would be useful for statistical analysis, we plan to introduce an aggregating function to calculate standard deviation.

We plan to use the following approximation to standard deviation:

    sqrt(average(x^2) - average(x)^2)

It is recognised that this is an imprecise approximation to standard deviation, but it is calculable as an aggregation, and it should be sufficient for most of the purposes to which DTrace is put. The approximation and its imprecision should be noted in documentation for DTrace.

The planned implementation involves storing three 64-bit values: the total count, the sum of x, and the sum of x^2. These values will be post-processed in user-land to present the standard deviation. (This is similar to the implementation of the avg() aggregating function, which stores the total count and the sum of x.)

Storing the sum of x^2 presents a very real possibility of integer overflow. We plan to store the sum of x^2 as a 128-bit value in two unsigned 64-bit integers. This will require implementing 128-bit addition and multiplication. It will also involve implementing an arbitrary-precision square root function in user-land to handle those cases in which a long double is insufficient.
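As a rough illustration of the post-processing step described above, the following C sketch derives a standard deviation from the three stored values (count, sum of x, sum of x^2) using the proposed formula. The names isqrt64 and stddev_from_sums are hypothetical and are not taken from the DTrace source. The sketch assumes the intermediate products fit in 64 bits, so it ignores the overflow concern the proposal raises, and it rounds the result down to an integer.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: floor of the square root of a 64-bit value,
 * computed one result bit at a time (no floating point needed).
 */
static uint64_t
isqrt64(uint64_t v)
{
	uint64_t res = 0;
	uint64_t bit = (uint64_t)1 << 62;	/* highest power of four */

	while (bit > v)
		bit >>= 2;

	while (bit != 0) {
		if (v >= res + bit) {
			v -= res + bit;
			res = (res >> 1) + bit;
		} else {
			res >>= 1;
		}
		bit >>= 2;
	}
	return (res);
}

/*
 * Hypothetical helper: sqrt(average(x^2) - average(x)^2), rearranged as
 * sqrt(n * sumsq - sum^2) / n so that only one division is needed.
 * Assumes n * sumsq and sum * sum do not overflow 64 bits.
 */
static uint64_t
stddev_from_sums(uint64_t n, uint64_t sum, uint64_t sumsq)
{
	assert(n > 0);
	return (isqrt64(n * sumsq - sum * sum) / n);
}
```

For the eight values 2, 4, 4, 4, 5, 5, 7, 9 the stored state would be n = 8, sum = 40, and sumsq = 232, and stddev_from_sums(8, 40, 232) evaluates to 2.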
EXAMPLE

This is an example D script demonstrating the use of the stddev() aggregating function:

    #pragma D option quiet

    syscall::exece:entry,
    syscall::exec:entry
    {
            self->ts = timestamp;
    }

    syscall::exece:return,
    syscall::exec:return
    / self->ts /
    {
            t = timestamp - self->ts;
            @foo[probefunc] = avg(t);
            @bar[probefunc] = stddev(t);
            @baz[probefunc] = quantize(t);
            self->ts = 0;
    }

    END
    {
            printf("AVERAGE:");
            printa(@foo);
            printf("\nSTDDEV:");
            printa(@bar);
            printf("\n");
            printa(@baz);
    }

With sample output as follows:

    # ./stddev.d
    ^C
    AVERAGE:
      exece                                                        567257

    STDDEV:
      exece                                                        158867

      exece
               value  ------------- Distribution ------------- count
              131072 |                                         0
              262144 |@@@@@@@@@@@@@@@@@@@                      128
              524288 |@@@@@@@@@@@@@@@@@@@@@                    144
             1048576 |                                         0

    #

REFERENCES

[1] A stdev() aggregator would be a nice adjunct to avg() (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6325485)
Chad Mynhier
2007-Nov-02 00:56 UTC
[dtrace-discuss] Proposal: standard deviation aggregating function
Dan Price pointed out the confusing wording in the "The planned implementation involves" paragraph. Here's a better version:

SUMMARY

This fast-track enhances the DTrace utility to address an existing RFE[1] requesting an aggregating function to calculate standard deviation, similar to the current aggregating function for average. The new function is a committed interface; this case seeks patch release binding.

DETAILS

Overview

Currently, the DTrace utility includes an aggregating function to calculate the average of a set of numbers but does not provide one to calculate standard deviation. Because this would be useful for statistical analysis, we plan to introduce an aggregating function to calculate standard deviation.

We plan to use the following approximation to standard deviation:

    sqrt(average(x^2) - average(x)^2)

It is recognised that this is an imprecise approximation to standard deviation, but it is calculable as an aggregation, and it should be sufficient for most of the purposes to which DTrace is put. The approximation and its imprecision should be noted in documentation for DTrace.

The planned implementation involves storing two 64-bit values and one 128-bit value: the total count, the sum of x, and the sum of x^2. These values will be post-processed in user-land to present the standard deviation. (This is similar to the implementation of the avg() aggregating function, which stores the total count and the sum of x.)

Note that storing the sum of x^2 in a single 64-bit value would present the possibility of integer overflow. We therefore plan to store the sum of x^2 as a 128-bit value in two unsigned 64-bit integers. This will require implementing 128-bit addition and multiplication. It will also involve implementing an arbitrary-precision square root function in user-land to handle those cases in which a long double is insufficient.
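To make the 128-bit requirement concrete, here is a minimal C sketch of accumulating x^2 into a 128-bit sum held as two unsigned 64-bit integers. The type u128_t and the functions mul_64x64, add_128, and accumulate_square are hypothetical names chosen for illustration; this is one common way to build the arithmetic from 32-bit partial products and is not necessarily how the final implementation does it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit unsigned value held as two 64-bit halves. */
typedef struct u128 {
	uint64_t lo;
	uint64_t hi;
} u128_t;

/* Full 64 x 64 -> 128-bit product built from four 32-bit partial products. */
static u128_t
mul_64x64(uint64_t a, uint64_t b)
{
	uint64_t a_lo = a & 0xffffffffULL, a_hi = a >> 32;
	uint64_t b_lo = b & 0xffffffffULL, b_hi = b >> 32;

	uint64_t ll = a_lo * b_lo;
	uint64_t lh = a_lo * b_hi;
	uint64_t hl = a_hi * b_lo;
	uint64_t hh = a_hi * b_hi;

	/* Sum of three values below 2^32 each, so it cannot overflow. */
	uint64_t mid = (ll >> 32) + (lh & 0xffffffffULL) +
	    (hl & 0xffffffffULL);

	u128_t r;
	r.lo = (ll & 0xffffffffULL) | (mid << 32);
	r.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
	return (r);
}

/* acc += v, propagating the carry out of the low half. */
static void
add_128(u128_t *acc, u128_t v)
{
	uint64_t lo = acc->lo + v.lo;

	assert(acc != 0);
	acc->hi += v.hi + (lo < acc->lo);	/* unsigned wrap means carry */
	acc->lo = lo;
}

/* Add x^2 to the running 128-bit sum of squares. */
static void
accumulate_square(u128_t *acc, uint64_t x)
{
	add_128(acc, mul_64x64(x, x));
}
```

A single value of 2^32 already squares to 2^64, which is one more than a 64-bit sum can hold, so the high half becomes nonzero after one such sample.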
EXAMPLE

This is an example D script demonstrating the use of the stddev() aggregating function:

    #pragma D option quiet

    syscall::exece:entry,
    syscall::exec:entry
    {
            self->ts = timestamp;
    }

    syscall::exece:return,
    syscall::exec:return
    / self->ts /
    {
            t = timestamp - self->ts;
            @foo[probefunc] = avg(t);
            @bar[probefunc] = stddev(t);
            @baz[probefunc] = quantize(t);
            self->ts = 0;
    }

    END
    {
            printf("AVERAGE:");
            printa(@foo);
            printf("\nSTDDEV:");
            printa(@bar);
            printf("\n");
            printa(@baz);
    }

With sample output as follows:

    # ./stddev.d
    ^C
    AVERAGE:
      exece                                                        567257

    STDDEV:
      exece                                                        158867

      exece
               value  ------------- Distribution ------------- count
              131072 |                                         0
              262144 |@@@@@@@@@@@@@@@@@@@                      128
              524288 |@@@@@@@@@@@@@@@@@@@@@                    144
             1048576 |                                         0

    #

REFERENCES

[1] A stdev() aggregator would be a nice adjunct to avg() (http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6325485)
Chad Mynhier
2007-Nov-02 13:51 UTC
[dtrace-discuss] Proposal: standard deviation aggregating function
On 11/1/07, Alexander Kolbasov <akolb at aldan.sfbay.sun.com> wrote:
> > SUMMARY
> >
> > This fast-track enhances the DTrace utility to address an
> > existing RFE[1] requesting an aggregating function to calculate
> > standard deviation, similar to the current aggregating function
> > for average.
>
> Is it going to treat the aggregated values as uint64_t or int64_t? See the
> discussion for max()/min()/avg() above.
>

My initial take on this is that, if we're going to eventually fix that for max()/min()/avg(), I should go ahead and implement this correctly from the start.

My only concern with this would be the possibility for confusion if stddev() is implemented this way before the fix for 6624541 goes in (e.g., someone sees a mean, standard deviation and distribution that just don't seem to match.) I don't know how likely that would be in practice, but the obvious solution is to get a fix for 6624541 in before this goes in. (I'd be happy to implement the fix for 6624541.)

Chad
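For readers following the uint64_t versus int64_t question, this small C sketch shows how the same stored bits can yield very different averages depending on which interpretation is used. The function names are hypothetical, and the sketch only illustrates the general signedness issue; it does not reproduce the actual behaviour of avg() or the details of 6624541.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: average with values treated as signed 64-bit integers. */
static int64_t
avg_as_signed(const int64_t *v, uint64_t n)
{
	int64_t sum = 0;

	assert(n > 0);
	for (uint64_t i = 0; i < n; i++)
		sum += v[i];
	return (sum / (int64_t)n);
}

/* Hypothetical: the same bits summed and divided as unsigned values. */
static uint64_t
avg_as_unsigned(const int64_t *v, uint64_t n)
{
	uint64_t sum = 0;

	assert(n > 0);
	for (uint64_t i = 0; i < n; i++)
		sum += (uint64_t)v[i];	/* negative values wrap to huge ones */
	return (sum / n);
}
```

With the two samples -3 and 1, the signed average is -1, while the unsigned interpretation gives 9223372036854775807, which is the kind of mismatch between a mean and a distribution mentioned above.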