Peter Lawrence
2007-Mar-27 03:13 UTC
[dtrace-discuss] standard-deviation and other aggravations
Dear Tracy, the "standard approximation" for standard-deviation is sqrt ( average (X**2) - average(X)**2 ) sqrt (the average of the squares, minus the square of the average) since "average" is an aggregate, so is "standard-deviation" by this approximation, and it would be handy to have. furthermore, the documentation for what is an aggregating-function seems to be all wet... what you probably mean to say is that the underlying function is associative and commutative, in otherwords order doesn''t matter. underlying ftn aggregation -------------- ----------- "+" "sum" "*" "product" "min" "min" "max" "max" "inc" "count" "avg" "avg" ( = sum / count, is a stretch, but most folks find it acceptable, and for the same reason standard-deviation as well) these functions are available in other programming languages and are often refered to as "reductions". your definition for aggregating-function actually dis-allows "avg" since avg( avg (1, 99), 2) = 26 but avg (1, 99, 2) = 34 by the exact same logic that it currently disallows "median". so, I believe the definition needs to be repaired, and making it consistent with common usage (reductions) seems to make sense. -Pete Lawrence.
Roch - PAE
2007-Mar-27 07:09 UTC
[dtrace-discuss] standard-deviation and other aggravations
Peter Lawrence writes: > Dear Tracy, > the "standard approximation" for standard-deviation is > > > sqrt ( average (X**2) - average(X)**2 ) > I note, this is an homage to John Backus, developer of Fortran, has died at the age of 82. http://www.theinquirer.net/default.aspx?article=38376 > sqrt (the average of the squares, minus the square of the average) > As you probably know this definition leads to large numerical imprecision. The better formula being sqtr (average((X-average(X))^2)) but this would not be trackable as an aggregation. The simple formula is probably ok for many usage. > > > since "average" is an aggregate, so is "standard-deviation" by this > approximation, and it would be handy to have. > > As I recall average was added later and it''s not (internally in Dtrace) a pure ggregation but rather a simple combination of the sum() and count() aggregations. > > furthermore, the documentation for what is an aggregating-function > seems to be all wet... > You''re right, "Dtrace aggregation" definition should be clarified. The way I would extend the definition is to also include some simple arithmetic combination of other aggregating functions. This allows avg and stddev. The median require post processing all recorded values and clearly falls out of the realms of possibilities. > what you probably mean to say is that the underlying function is > associative and commutative, in otherwords order doesn''t matter. > It''s a bit more than that. The way aggregation works is that data is aggregated in per CPU structures (which means no locking so is fast) and then the per CPU data is aggregated again. Thus the property requirement stated below. > underlying ftn aggregation > -------------- ----------- > "+" "sum" > "*" "product" > "min" "min" > "max" "max" > "inc" "count" > "avg" "avg" ( = sum / count, is a stretch, but most folks > find it acceptable, and for the same reason > standard-deviation as well) > > > these functions are available in other programming languages and > are often refered to as "reductions". > I guess the product() pure aggregation could be provided as well. > > your definition for aggregating-function actually dis-allows "avg" > > since avg( avg (1, 99), 2) = 26 > but avg (1, 99, 2) = 34 > > by the exact same logic that it currently disallows "median". > > > so, I believe the definition needs to be repaired, and making it > consistent with common usage (reductions) seems to make sense. > > > -Pete Lawrence. > > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org
Adam Leventhal
2007-Mar-27 16:56 UTC
[dtrace-discuss] standard-deviation and other aggravations
Thanks for the close reading. If you want to file a bug (doc/dtrace) and submit some alternate text, that would be a terrific contribution. Adam On Mon, Mar 26, 2007 at 08:13:27PM -0700, Peter Lawrence wrote:> Dear Tracy, > the "standard approximation" for standard-deviation is > > > sqrt ( average (X**2) - average(X)**2 ) > > sqrt (the average of the squares, minus the square of the average) > > > > since "average" is an aggregate, so is "standard-deviation" by this > approximation, and it would be handy to have. > > > > furthermore, the documentation for what is an aggregating-function > seems to be all wet... > > what you probably mean to say is that the underlying function is > associative and commutative, in otherwords order doesn''t matter. > > underlying ftn aggregation > -------------- ----------- > "+" "sum" > "*" "product" > "min" "min" > "max" "max" > "inc" "count" > "avg" "avg" ( = sum / count, is a stretch, but most folks > find it acceptable, and for the same reason > standard-deviation as well) > > > these functions are available in other programming languages and > are often refered to as "reductions". > > > your definition for aggregating-function actually dis-allows "avg" > > since avg( avg (1, 99), 2) = 26 > but avg (1, 99, 2) = 34 > > by the exact same logic that it currently disallows "median". > > > so, I believe the definition needs to be repaired, and making it > consistent with common usage (reductions) seems to make sense. > > > -Pete Lawrence. > > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org-- Adam Leventhal, Solaris Kernel Development http://blogs.sun.com/ahl
Peter Lawrence
2007-Mar-27 18:02 UTC
[dtrace-discuss] standard-deviation and other aggravations
Roch - PAE wrote On 03/27/07 00:09,:> Peter Lawrence writes: > > Dear Tracy, > > the "standard approximation" for standard-deviation is > > > > > > sqrt ( average (X**2) - average(X)**2 ) > > > > I note, this is an homage to > > John Backus, developer of Fortran, has died at the age of 82. > http://www.theinquirer.net/default.aspx?article=38376 > > > > sqrt (the average of the squares, minus the square of the average) > > > > As you probably know this definition leads to large numerical > imprecision. The better formula being > > sqtr (average((X-average(X))^2)) > > but this would not be trackable as an aggregation. > > The simple formula is probably ok for many usage. > > > > > > > since "average" is an aggregate, so is "standard-deviation" by this > > approximation, and it would be handy to have. > > > > > > As I recall average was added later and it''s not (internally > in Dtrace) a pure ggregation but rather a simple combination > of the sum() and count() aggregations. > > > > > furthermore, the documentation for what is an aggregating-function > > seems to be all wet... > > > > You''re right, "Dtrace aggregation" definition should be > clarified. The way I would extend the definition is to also > include some simple arithmetic combination of other > aggregating functions. This allows avg and stddev. The > median require post processing all recorded values and > clearly falls out of the realms of possibilities. > > > > what you probably mean to say is that the underlying function is > > associative and commutative, in otherwords order doesn''t matter. > > > > It''s a bit more than that. The way aggregation works is that > data is aggregated in per CPU structures (which means no > locking so is fast) and then the per CPU data is aggregated again. > Thus the property requirement stated below.it is precisly associativity and commutativity that allows you to compute different parts of the eventual result on different cpus and at different times. the only other real requirement besides associativity and commutativity is that the intermediate results have size order(1) rather than order(n) if the size of an intermediate result wasn''t bounded by a constant then the result of an aggregation could simply be the list/set of all items seen so far, and then a final "post processing" operation could be applied when actually printing the value. this is what the implementors are trying to avoid, they want the intermediate results of a aggregation to be fixed size. in otherwords although "set union" is mathematically an aggregating function it isn''t in the implementation space for dtrace. -Pete Lawrence.> > > > underlying ftn aggregation > > -------------- ----------- > > "+" "sum" > > "*" "product" > > "min" "min" > > "max" "max" > > "inc" "count" > > "avg" "avg" ( = sum / count, is a stretch, but most folks > > find it acceptable, and for the same reason > > standard-deviation as well) > > > > > > these functions are available in other programming languages and > > are often refered to as "reductions". > > > > I guess the product() pure aggregation could be provided as well.also "&" "all" "|" "any" but I don''t think anyone would ever use prod, all, or any, in dtrace, they are just examples of the existing precedent of reductions, ie. aggregations.> > > > > your definition for aggregating-function actually dis-allows "avg" > > > > since avg( avg (1, 99), 2) = 26 > > but avg (1, 99, 2) = 34 > > > > by the exact same logic that it currently disallows "median". > > > > > > so, I believe the definition needs to be repaired, and making it > > consistent with common usage (reductions) seems to make sense. > > > > > > -Pete Lawrence. > > > > _______________________________________________ > > dtrace-discuss mailing list > > dtrace-discuss at opensolaris.org >
Chip Bennett
2007-Mar-27 18:59 UTC
[dtrace-discuss] standard-deviation and other aggravations
Peter, Would you consider posting a reference to a web site that describes the definition of aggregation and reduction that you are using. I recognize the stuff about order(1), order(n), etc. computations from my "algorithms" class, but I''ve not seen your definition of aggregation and reduction. Or at least I don''t remember it from back in 19xx (Perhaps I didn''t take enough statistics.) I tried googling, to no avail. It''s interesting that if you just search for "aggregations", the primary hits are related to DTrace. Thanks, Chip
Jay Edwards
2007-Mar-27 19:12 UTC
[dtrace-discuss] standard-deviation and other aggravations
On Tue Mar 27 18:59:58 UTC 2007, Chip Bennett <cbennett at laurustech.com> wrote:> Would you consider posting a reference to a web site that describes the > definition of aggregation and reduction that you are using.Aggregations and/or reductions basically turn an array into a scalar (from a computational standpoint). More generally, an aggregation operator applies an operation to every element of a set and produces a single answer. Counting, sums, averages, min, max, and other similar functions are basic aggregating functions. Jay.
Chip Bennett
2007-Mar-27 19:29 UTC
[dtrace-discuss] standard-deviation and other aggravations
Thanks Jay. I guess I was getting bent out of shape looking for some technical definition of aggregation that went beyond that, but I guess you''re right: it''s that simple. So I understand Peter''s point about the DTrace manual''s definition of aggregation is actually the subset of aggregations that are commutative and associative, but Peter goes on to say that DTrace aggregations might be better referred to as reductions. Not to put to fine a point on it, but if aggregations and reductions are the same thing, how is that definition any better? Thanks, Chip Jay Edwards wrote:> On Tue Mar 27 18:59:58 UTC 2007, Chip Bennett <cbennett at laurustech.com> wrote: > >> Would you consider posting a reference to a web site that describes the >> definition of aggregation and reduction that you are using. >> > > Aggregations and/or reductions basically turn an array into a scalar (from a computational standpoint). More generally, an aggregation operator applies an operation to every element of a set and produces a single answer. > > Counting, sums, averages, min, max, and other similar functions are basic aggregating functions. > > Jay. >
Bryan Cantrill
2007-Mar-27 20:12 UTC
[dtrace-discuss] standard-deviation and other aggravations
On Tue, Mar 27, 2007 at 02:29:26PM -0500, Chip Bennett wrote:> Thanks Jay. I guess I was getting bent out of shape looking for some > technical definition of aggregation that went beyond that, but I guess > you''re right: it''s that simple. So I understand Peter''s point about the > DTrace manual''s definition of aggregation is actually the subset of > aggregations that are commutative and associative, but Peter goes on to > say that DTrace aggregations might be better referred to as reductions. > Not to put to fine a point on it, but if aggregations and reductions are > the same thing, how is that definition any better?Perhaps putting an even finer point on it: if thinking of aggregations in terms of some other construct from some other language helps one understand them, fine -- but the "aggregation" nomenclature was carefully and deliberately selected, and it is not about to be replaced. As long as we''re discussing nomenclature, the issue that Peter raised is not with the definition of aggregations but rather with the definition of an aggregating function. It would (or should) clear up Peter''s confusion if aggregating functions were further defined to yield only a finite set of scalars. I thought this was obvious from the description in (for example) our USENIX paper (and I would have thought that making this constraint explicit would border on pedantry), but apologies if I underestimated either the difficulty of the concept or the pedantry of the reader... - Bryan -------------------------------------------------------------------------- Bryan Cantrill, Solaris Kernel Development. http://blogs.sun.com/bmc
Peter Lawrence
2007-Mar-27 20:44 UTC
[dtrace-discuss] standard-deviation and other aggravations
Chip, try googling "array reduction", you should get lots of Fortran, HPC, and APL hits... and, like Bryan said, I don''t care what you call it, aggregation is fine with me, but the definition of aggregating function should match that of array reduction since that is what you''re really doing. If you look at the constraints imposed on reduction functions in parallel and HPC programming languages it is exactly the same constraints imposed by DTrace (and for the same reasons!) -Pete Lawrence. Bennett wrote On 03/27/07 11:59,:> Peter, > > Would you consider posting a reference to a web site that describes the > definition of aggregation and reduction that you are using. I recognize > the stuff about order(1), order(n), etc. computations from my > "algorithms" class, but I''ve not seen your definition of aggregation and > reduction. Or at least I don''t remember it from back in 19xx (Perhaps > I didn''t take enough statistics.) I tried googling, to no avail. It''s > interesting that if you just search for "aggregations", the primary hits > are related to DTrace. > > Thanks, > Chip > > _______________________________________________ > dtrace-discuss mailing list > dtrace-discuss at opensolaris.org
Steven Reynolds
2007-Apr-02 03:14 UTC
[dtrace-discuss] Re: standard-deviation and other aggravations
When I read the DTrace Guide about aggregating functions, I remembered my days of studying statistics, and thought of the idea of a "sufficient statistic". The sufficient statistic is enough to compute your estimate of a parameter; you can throw away all of the measurements and only retain the sufficient statistic with out any loss. I believe that the idea of the "aggregating function" is in statistical terms a parameter with a scalar valued "sufficient statistic". For example, the sum of the measurements is a sufficient statistic for the average. This message posted from opensolaris.org