thr3ads.net - R help - [R] density of hist(freq = FALSE) inversely affected by data magnitude [Jan 2013]

J Toll

2013-Jan-22 22:48 UTC

[R] density of hist(freq = FALSE) inversely affected by data magnitude

Hi,

I have a couple of observations, a question or two, and perhaps a
suggestion related to the plotting of density on the y-axis within the
hist() function when freq=FALSE.  I was using the function and trying
to develop an intuitive understanding of what the density is telling
me.  After reading through this fairly helpful post:

http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis

I finally realized that in the case where freq = FALSE, the y-axis
isn't really telling me the density.  It's actually indicating the
density multiplied by the bin size.  I assume this is for the case
where the bins may be of non-regular size.

from hist.default:

dens <- counts/(n * diff(breaks))

So the count in each bin is divided by the total number of
observations (n) multiplied by the size of the bin.  The problem, as I
see it, is that the density ends up being scaled by the size of the
bins, which is inversely proportional to the magnitude of the data.
Therefore the magnitude of the data is directly affecting the density,
which seems problematic.

For example*:

set.seed(4444)
x <- runif(100)
y <- x / 1000

par(mfrow = c(2, 1))
hist(x, prob = TRUE)
hist(y, prob = TRUE)

From this example, you see that the density for the y histogram is
1000 times larger, simply because the y data is 1000 times smaller.
Again, that seems problematic.  It seems to me, that the density
should be unit-less, but here it's affected by the magnitude of the
data.
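
A quick way to see what the y-axis represents in the example above is to
check the bar areas rather than the bar heights. This is only a sketch,
assuming the default breaks; the names hx, hy, area_x and area_y are made up
for the illustration, and plot = FALSE is used so nothing is drawn.

```r
set.seed(4444)
x <- runif(100)
y <- x / 1000

# Compute the histograms without plotting them
hx <- hist(x, plot = FALSE)
hy <- hist(y, plot = FALSE)

# Area of each bar = density * bin width; the areas sum to 1 in both cases,
# even though the bar heights for y are far larger than those for x
area_x <- sum(hx$density * diff(hx$breaks))
area_y <- sum(hy$density * diff(hy$breaks))

print(c(area_x = area_x, area_y = area_y))
print(range(hx$density))
print(range(hy$density))
```

The heights change with the scale of the data, but the total area does not.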

So, my question is, why is density calculated this way?

For the case where all the bins are of the same size, I would think
density should simply be calculated as:

dens <- counts / n

Of course, that might be somewhat misleading for the case where the
bin sizes vary.  So then why not calculate density as:

dens <- counts / (n * diff(breaks) / min(diff(breaks)))

Dividing diff(breaks) by min(diff(breaks)) removes the scaling effect
of the magnitude of the data, and simply leaves the relative
difference in bin size.

For the case where all the bins are the same size, the calculation is
equivalent to dens <- counts / n

For all other cases, the density is scaled by the size of the bin, but
unaffected by the magnitude of the data.
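
The proposed rescaling can be compared with the current calculation in a
small sketch. It assumes the default (equal-width) breaks; dens_hist and
dens_prop are hypothetical names used only here, not objects inside
hist.default.

```r
set.seed(4444)
y <- runif(100) / 1000

h <- hist(y, plot = FALSE)
n <- sum(h$counts)
w <- diff(h$breaks)                       # bin widths

dens_hist <- h$counts / (n * w)           # the formula quoted from hist.default
dens_prop <- h$counts / (n * w / min(w))  # the rescaling proposed above

# With equal-width bins the proposed heights are just relative frequencies,
# so the heights (not the areas) sum to 1
print(sum(dens_prop))

# The current formula makes the areas sum to 1 instead
print(sum(dens_hist * w))
```

So with equal bins the proposal gives counts / n, which no longer depends on
the magnitude of the data but also no longer has total area 1.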

So, what am I misunderstanding?  Why is density calculated as it is,
and what does it mean?

Thanks,


James


*example from
http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis

William Dunlap

2013-Jan-22 23:33 UTC

The probability density function is not unitless - it is the derivative of the
[cumulative] probability distribution function so it has units
delta-probability-mass over delta-x.  It must integrate to 1 (over all
possible x).  hist(freq=FALSE,x) or hist(prob=TRUE,x) displays an estimate of
the density function and the following example shows how the scale matches
what you get from the presumed population density function.

> f
function (n, sd)
{
    x <- rnorm(n, sd = sd)
    hist(x, freq = FALSE) # estimated density
    s <- seq(min(x), max(x), len = 129)
    lines(s, dnorm(s, sd = sd), col = "red") # overlay expected density for this sample
}
> f(1e6, sd=1)
> f(100, sd=1)
> f(100, sd=0.0001)
> f(1e6, sd=0.0001)

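
The same comparison can be made numerically instead of by eye. The sketch
below is not part of f() above; peak_height is a hypothetical helper that
returns the tallest histogram bar next to the peak of the normal density for
the same sd, so the two can be compared at different data scales.

```r
set.seed(1)

# Hypothetical helper: tallest estimated-density bar vs. the true peak density
peak_height <- function(n, sd) {
  x <- rnorm(n, sd = sd)
  h <- hist(x, plot = FALSE)
  c(hist_max = max(h$density), dnorm_max = dnorm(0, sd = sd))
}

print(peak_height(1e5, sd = 1))
print(peak_height(1e5, sd = 0.0001))

# The true density itself scales as 1/sd: dnorm(0, sd = s) equals dnorm(0) / s
s <- 0.0001
print(c(dnorm(0, sd = s), dnorm(0) / s))
```

Shrinking the data by a factor of 10000 raises the density by the same
factor, in both the histogram estimate and dnorm.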
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

J Toll

2013-Jan-23 01:32 UTC

Bill,

Thank you.  I got it.  That can require a fair amount of work to
interpret the density, especially with odd or irregular bin sizes.

Thanks again,

James




William Dunlap

2013-Jan-23 16:51 UTC

I think it is a fair bit of work to interpret the freq=TRUE (prob=FALSE)
version of hist() when the bins have unequal sizes.  E.g.,
in the following the bins are sized so that each contains
an equal number of observations.  The resulting flat
frequency plot is hard for me to interpret.  The density plot
is easy.

  > x <- rnorm(1000, sd=50)
  > hist(x, breaks=quantile(x,(0:10)/10), prob=TRUE)
  > hist(x, breaks=quantile(x,(0:10)/10), prob=FALSE)
  Warning message:
  In plot.histogram(r, freq = freq1, col = col, border = border, angle = angle,  :
    the AREAS in the plot are wrong -- rather use freq=FALSE
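
To make the contrast concrete without looking at the plots, the counts and
densities for the quantile-based bins can be inspected directly. This is a
sketch with an arbitrary seed; the names h, w and mass are chosen only for
the illustration.

```r
set.seed(1)
x <- rnorm(1000, sd = 50)

# Same equal-count breaks as above, but without drawing anything
h <- hist(x, breaks = quantile(x, (0:10) / 10), plot = FALSE)
w <- diff(h$breaks)          # bin widths: narrow near 0, wide in the tails

print(h$counts)              # the counts say little about the shape
print(round(h$density, 5))   # density is higher where the bins are narrow

# Probability mass per bin = density * width; the masses sum to 1
mass <- h$density * w
print(sum(mass))
```

The counts are flat by construction, while the densities recover the
bell shape because each count is divided by its own bin width.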

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

