thr3ads.net - R devel - [Rd] ?hist and $density explanation [May 2006]

François Pinard

2006-May-18 01:20 UTC

[Rd] ?hist and $density explanation

Hi, people.  Within ?hist (using R 2.3.0), one reads:

 density: values f^(x[i]), as estimated density values. If
          'all(diff(breaks) == 1)', they are the relative frequencies
          'counts/n' and in general satisfy sum[i; f^(x[i])
          (b[i+1]-b[i])] = 1, where b[i] = 'breaks[i]'.

I trip on this explanation each time I read it.  Some R guardians will 
be tempted to say that since R itself does not trip, I am necessarily 
the problem :-).  But yet, non-obstant and nevertheless, maybe these few 
lines of documentation could be improved.

The "f^(x[i])" bit is somewhat cryptic and not explained.  It suggests
that there are as many densities as possible "i" values, and since "i"
indexes "x", it indirectly suggests that length(density) == length(x),
which cannot be right.  The "sum[i; ...]" has to be taken up to the
number of cells, not the number of "x" values.  Because "x[i]" is a bit
meaningless in the above context, it would be better avoided.
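
To make the length mismatch concrete, here is a small sketch.  The data
and the break points are made up for illustration; only hist() and its
documented "breaks" and "density" components are assumed.

```r
# Illustrative data: many more observations than cells.
set.seed(1)
x <- runif(100, min = 0, max = 10)

# Five cells, delimited by six break points; no plot is drawn.
h <- hist(x, breaks = c(0, 2, 4, 6, 8, 10), plot = FALSE)

length(x)          # number of observations
length(h$breaks)   # number of break points
length(h$density)  # one density per cell, so length(h$breaks) - 1
```

So "i" in the documented sum has to range over cells, not over the
elements of "x".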

The "^" may mean that "x[i]" is an index of "f", some kind of TeX device
for shifting the notation.  It may also mean "hat", to suggest that the
density is an approximation.  But the approximation of what?  Of course,
I understand an unstated model by which "density" estimates the density
of some continuous distribution out of which the "x" values were
sampled, before the "hist()" function was called.  But "x" is not
necessarily a sample of a continuum, it may well be the population, and
the densities in the histogram may well be exact, and not an
approximation.  So it might be simpler to drop the "^" as well.

The concept of relative frequency is explained for the case of cells of
width 1 only, and not otherwise.  This concept is not reused elsewhere
in "?hist".  So it is not so useful, and we could use "d" instead of
"f".

Finally, writing "breaks[i+1]-breaks[i]" is simpler and clearer than 
introducing an intermediate "b[i]" device.  Let's drop it.
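
As a sketch of the special case the help page singles out, and of the
cell widths written directly from "breaks", consider the following.  The
data are invented for illustration, and the name "widths" is mine, not
something hist() returns.

```r
# Illustrative data sorted into four cells of width 1.
x <- c(0.5, 1.2, 1.7, 2.1, 2.4, 2.9, 3.3)
h <- hist(x, breaks = 0:4, plot = FALSE)

# Cell widths, i.e. breaks[i+1] - breaks[i] for each cell i.
widths <- diff(h$breaks)
all(widths == 1)

# With unit widths, the densities are just the relative frequencies.
all.equal(h$density, h$counts / length(x))
```

With any other widths the two quantities differ, which is why the
relative frequency remark only covers a special case.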


Let me suggest a simpler rewriting of these few lines, using humbler 
notation while being more precise.  Let's start with something like:

 density: For each cell i, density[i] is the proportion of all x[]
          which get sorted into that cell, divided by the cell width.
          So, the value of 'sum(density * diff(breaks))' is 1.

and improve on it.
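
The proposed wording can be checked against what hist() returns, here
with deliberately unequal cell widths.  This is only a sketch: the data
and break points are invented, and "proportion" and "width" are names
introduced for the illustration.

```r
# Illustrative data and unequal cells: (0,1], (1,3], (3,10].
x <- c(0.5, 1.2, 1.7, 2.1, 2.4, 2.9, 3.3, 6.0, 7.5, 9.9)
h <- hist(x, breaks = c(0, 1, 3, 10), plot = FALSE)

# Proportion of all x[] sorted into each cell, and each cell's width.
proportion <- h$counts / length(x)
width <- diff(h$breaks)

# density[i] is the proportion divided by the cell width ...
all.equal(h$density, proportion / width)

# ... so this sum is 1, up to floating point rounding.
sum(h$density * diff(h$breaks))
```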

-- 
François Pinard   http://pinard.progiciels-bpi.ca
