J Toll
2013-Jan-22 22:48 UTC
[R] density of hist(freq = FALSE) inversely affected by data magnitude
Hi,

I have a couple of observations, a question or two, and perhaps a suggestion related to the plotting of density on the y-axis within the hist() function when freq=FALSE. I was using the function and trying to develop an intuitive understanding of what the density is telling me. After reading through this fairly helpful post:

http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis

I finally realized that in the case where freq = FALSE, the y-axis isn't really telling me the density. It's actually indicating the density multiplied by the bin size. I assume this is for the case where the bins may be of non-regular size.

From hist.default:

    dens <- counts/(n * diff(breaks))

So the count in each bin is divided by the total number of observations (n) multiplied by the size of the bin. The problem, as I see it, is that the density ends up being scaled by the size of the bins, which is inversely proportional to the magnitude of the data. Therefore the magnitude of the data is directly affecting the density, which seems problematic.

For example*:

    set.seed(4444)
    x <- runif(100)
    y <- x / 1000

    par(mfrow = c(2, 1))
    hist(x, prob = TRUE)
    hist(y, prob = TRUE)

From this example, you see that the density for the y histogram is 1000 times larger, simply because the y data is 1000 times smaller. Again, that seems problematic. It seems to me that the density should be unit-less, but here it's affected by the magnitude of the data.

So, my question is, why is density calculated this way?

For the case where all the bins are of the same size, I would think density should simply be calculated as:

    dens <- counts / n

Of course, that might be somewhat misleading for the case where the bin sizes vary. So then why not calculate density as:

    dens <- counts / (n * diff(breaks) / min(diff(breaks)))

Dividing diff(breaks) by min(diff(breaks)) removes the scaling effect of the magnitude of the data, and simply leaves the relative difference in bin size.

For the case where all the bins are the same size, the calculation is equivalent to dens <- counts / n

For all other cases, the density is scaled by the size of the bin, but unaffected by the magnitude of the data.

So, what am I misunderstanding? Why is density calculated as it is, and what does it mean?

Thanks,

James

*example from http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis
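As a rough sketch of the comparison described in this post, the three candidate formulas can be computed side by side on the same example. The variable names below (hx, hy, dens_*, frac_*, prop_*) are made up for illustration and are not part of hist(); hist(plot = FALSE) is used only to obtain the counts and breaks without drawing anything.

```r
set.seed(4444)
x <- runif(100)
y <- x / 1000
n <- length(x)

# Compute the histograms without plotting
hx <- hist(x, plot = FALSE)
hy <- hist(y, plot = FALSE)

# 1. Formula quoted from hist.default
dens_x <- hx$counts / (n * diff(hx$breaks))
dens_y <- hy$counts / (n * diff(hy$breaks))

# 2. Plain relative frequency per bin
frac_x <- hx$counts / n
frac_y <- hy$counts / n

# 3. Proposed alternative: bin widths relative to the smallest bin
prop_x <- hx$counts / (n * diff(hx$breaks) / min(diff(hx$breaks)))
prop_y <- hy$counts / (n * diff(hy$breaks) / min(diff(hy$breaks)))

# Formula 1 reproduces the density component returned by hist()
print(all.equal(dens_x, hx$density))

# Formula 1 depends on the scale of the data ...
print(all.equal(dens_y, 1000 * dens_x))

# ... while formulas 2 and 3 do not, and with equal-width bins they coincide
print(all.equal(frac_y, frac_x))
print(all.equal(prop_x, frac_x))
```

Provided hist() chooses the same bin layout for both data sets, which it should here, only the first formula changes when the data are rescaled.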
William Dunlap
2013-Jan-22 23:33 UTC
[R] density of hist(freq = FALSE) inversely affected by data magnitude
The probability density function is not unitless: it is the derivative of the [cumulative] probability distribution function, so it has units of delta-probability-mass over delta-x. It must integrate to 1 (over all possible x). hist(freq=FALSE, x) or hist(prob=TRUE, x) displays an estimate of the density function, and the following example shows how the scale matches what you get from the presumed population density function.

    > f
    function (n, sd)
    {
        x <- rnorm(n, sd = sd)
        hist(x, freq = FALSE) # estimated density
        s <- seq(min(x), max(x), length.out = 129)
        lines(s, dnorm(s, sd = sd), col = "red") # overlay expected density for this sample
    }
    > f(1e6, sd=1)
    > f(100, sd=1)
    > f(100, sd=0.0001)
    > f(1e6, sd=0.0001)

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
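The "must integrate to 1" property can also be checked numerically rather than by eye. The following is a minimal sketch, assuming a helper named hist_area (a hypothetical name, not an R function) that sums height times width over the bars of a density histogram. It also compares the peak heights of the two population densities used above.

```r
# Hypothetical helper: total area of the bars of a density histogram
hist_area <- function(x) {
  h <- hist(x, plot = FALSE)
  sum(h$density * diff(h$breaks))
}

set.seed(1)
wide   <- rnorm(100, sd = 1)
narrow <- rnorm(100, sd = 0.0001)

# Both areas should be 1 up to rounding, whatever the scale of the data
area_wide   <- hist_area(wide)
area_narrow <- hist_area(narrow)
print(c(area_wide, area_narrow))

# Peak heights of the population densities: shrinking sd by a factor
# of 10000 raises the peak by the same factor, so the area is preserved
peak_wide   <- dnorm(0, sd = 1)
peak_narrow <- dnorm(0, sd = 0.0001)
print(peak_narrow / peak_wide)
```

The heights differ by four orders of magnitude because the x-axis is compressed by the same factor; the area under each curve is unchanged.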
J Toll
2013-Jan-23 01:32 UTC
[R] density of hist(freq = FALSE) inversely affected by data magnitude
Bill,

Thank you. I got it. Interpreting the density can require a fair amount of work, especially with odd or irregular bin sizes.

Thanks again,

James
William Dunlap
2013-Jan-23 16:51 UTC
[R] density of hist(freq = FALSE) inversely affected by data magnitude
I think it is a fair bit of work to interpret the freq=TRUE (prob=FALSE) version of hist() when the bins have unequal sizes. E.g., in the following the bins are sized so that each contains an equal number of observations. The resulting flat frequency plot is hard for me to interpret. The density plot is easy.

    > x <- rnorm(1000, sd=50)
    > hist(x, breaks=quantile(x,(0:10)/10), prob=TRUE)
    > hist(x, breaks=quantile(x,(0:10)/10), prob=FALSE)
    Warning message:
    In plot.histogram(r, freq = freq1, col = col, border = border, angle = angle,  :
      the AREAS in the plot are wrong -- rather use freq=FALSE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
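To see why the frequency version is flat while the density version is not, the numbers behind the two plots can be inspected directly. This is only a sketch: the seed and the names br, widths and share are chosen here for illustration, and hist(plot = FALSE) is used to get at the computed values.

```r
set.seed(1)
x  <- rnorm(1000, sd = 50)
br <- unname(quantile(x, (0:10) / 10))  # decile breaks, so bins have unequal widths
h  <- hist(x, breaks = br, plot = FALSE)

widths <- diff(h$breaks)
share  <- h$density * widths  # fraction of the observations in each bin

print(h$counts)             # roughly equal counts, hence the flat frequency plot
print(round(widths, 1))     # wide bins in the tails, narrow bins in the middle
print(round(h$density, 4))  # low in the tails, high in the middle
print(round(share, 3))      # about 0.1 per bin
```

Each bar's area (density times width) is the share of observations in that bin, so the density plot stays interpretable with unequal bins, while bar heights in the frequency plot carry no information about where the data are concentrated.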