Hi,

I have a relatively big dataset and I want to construct some histograms using the histogram function in lattice. One thing I am interested in is the difference between density and percent. I know I can use the hist function, but it seems that this function sometimes gives wrong answers, and that the density is actually a percent, since it is calculated as the count in the bin divided by the total number of points. Let me explain.

If I let the hist function decide the breaks, or I use a small number, or one of the predetermined methods to select breaks, then everything seems to be in order. But if I use, for example, 100 as breaks (I have over 90000 data points, so the number of breaks is not necessarily too large, I would think), the density for the first bin is over 1, although for all the other bins the density is actually a percent, since it is the count for that bin divided by the total number of points I have. So something is wrong here or, most probably, I am doing something wrong.

If I use the function histogram from lattice, it is obvious that there is a difference between the percent parameter and the density parameter. I looked at the function code and, to be honest, I didn't understand it. It seems to call hist internally, or a slightly modified variant of hist. Reading about the trellis object, I saw that I can access various information about the graph it generates, but nothing about the actual data that goes into defining the histogram. How can I access the data from it?

I am not sure if my problem is platform specific (it should not be), but I have R x64 2.13.1 on a Windows machine, in case it matters.

I appreciate your help, thanks,

Monica
Hi Monica

An example abbreviated from ?histogram:

    x = histogram( ~ height, data = singer)
    names(x) # to see what is there
    str(x)   # information

    x$panel.args.common
    $breaks
     [1] 59.36 61.28 63.20 65.12 67.04 68.96 70.88 72.80 74.72 76.64

    $type
    [1] "percent"

    $equal.widths
    [1] TRUE

    $nint
    [1] 9

    # x$panel.args: name as number
    x[[35]]
    [[1]]
    [[1]]$x
      [1] 64 62 66 65 60 61 65 66 65 63 67 65 62 65 68 65 63 65 62 65 66 62 65 63 65 66 65 62 65 66 65 61 65 66 65 62 63 67 60 67 66 62 65 62
     [45] 61 62 66 60 65 65 61 64 68 64 63 62 64 62 64 65 60 65 70 63 67 66 65 62 68 67 67 63 67 66 63 72 62 61 66 64 60 61 66 66 66 62 70 65
     [89] 64 63 65 69 61 66 65 61 63 64 67 66 68 70 65 65 65 64 66 64 70 63 70 64 63 67 65 63 66 66 64 64 70 70 66 66 66 69 67 65 69 72 71 66
    [133] 76 74 71 66 68 67 70 65 72 70 68 64 73 66 68 67 64 68 73 69 71 69 76 71 69 71 66 69 71 71 71 69 70 69 68 70 68 69 72 70 72 69 73 71
    [177] 72 68 68 71 66 68 71 73 73 70 68 70 75 68 71 70 74 70 75 75 69 72 71 70 71 68 70 75 72 66 72 70 69 72 75 67 75 74 72 72 74 72 72 74
    [221] 70 66 68 75 68 70 72 67 70 70 69 72 71 74 75

and so on, to suit your requirements.

HTH

Regards

Duncan

Duncan Mackay
Department of Agronomy and Soil Science
University of New England
ARMIDALE NSW 2351
Email: home mackay at northnet.com.au

At 23:50 31/08/2011, you wrote:
> [...]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

I'm not entirely sure that I understand what your problem is.
A reproducible example would probably have helped.
However I conjecture that the problem boils down to confusing
"probability" with "probability *density*".
Percentages are the (estimated) bin probabilities times 100.
The percentage for the i-th bin is 100*n_i/n where n_i is the count
for the i-th bin and n is the sum of the n_i.
The percentages sum to 100 (equivalent to probabilities summing to 1).
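The *densities* in contrast *integrate* to 1.
The density value for the i-th bin is n_i/(n * w_i) where w_i is the width
of the i-th bin, so that the bar areas w_i * n_i/(n * w_i) sum to 1. A density
value can therefore exceed 1 whenever w_i is less than 1, and it coincides
with the proportion n_i/n only when w_i equals 1. (If the breaks have been
set sensibly, the w_i all have the same value, i.e. the bin widths are all
the same.)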
Does this answer your question? (In an example that I tried the percentages
and the density values are --- not surprisingly!!! --- completely
consistent.)
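
To make the distinction concrete, here is a minimal sketch using base R's hist(). The simulated data (90000 uniform values on [0, 0.5]) and the explicit breaks are assumptions chosen purely for illustration; they are not Monica's data. With 100 bins of width 0.005, every density value is far above 1 even though the proportions are all small.

```r
set.seed(1)
x   <- runif(90000, min = 0, max = 0.5)   # hypothetical data, not from the thread
brk <- seq(0, 0.5, length.out = 101)      # 100 equal-width bins of width 0.005

h <- hist(x, breaks = brk, plot = FALSE)

n <- sum(h$counts)        # total number of points
w <- diff(h$breaks)       # bin widths w_i

percent <- 100 * h$counts / n     # 100 * n_i / n
dens    <- h$counts / (n * w)     # n_i / (n * w_i)

sum(percent)                 # percentages sum to 100
sum(dens * w)                # densities integrate to 1
all.equal(dens, h$density)   # matches what hist() reports as density
max(h$density) > 1           # bins narrower than 1 can give densities above 1
```

If the same data are binned with unit-width bins, the density and the proportion n_i/n coincide, which may be why the two looked identical for most of the breaks that were tried.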
You are correct in observing that it is difficult to dig out the
``histogram values''
(the bar heights) when using lattice. You can actually get at them using
lattice:::hist.constructor(), but it's not for the fainthearted.
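
A gentler route, sketched below, is to pull the raw data and the breaks out of the trellis object and pass them to base hist(). It assumes that the components panel.args and panel.args.common$breaks are present as in Duncan's output, and that lattice bins the data with the same conventions as hist()'s defaults (right-closed intervals, lowest value included); treat that second point as an assumption to check against the plot rather than a guarantee. The names p, ht and bars are made up for this example.

```r
library(lattice)

p <- histogram(~ height, data = singer)

brk <- p$panel.args.common$breaks   # breaks stored in the trellis object
ht  <- p$panel.args[[1]]$x          # data for the first (here, only) panel

# Recompute the bins with base hist(), reusing lattice's breaks
h <- hist(ht, breaks = brk, plot = FALSE)

bars <- data.frame(
  left    = head(brk, -1),                    # left edge of each bin
  right   = brk[-1],                          # right edge of each bin
  count   = h$counts,
  percent = 100 * h$counts / sum(h$counts),
  density = h$density
)
bars
```

For a multi-panel plot the same idea applies per panel: loop over p$panel.args and bin each element's x with the common breaks.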
cheers,
Rolf Turner
P. S. You really should be absolutely certain that you know what you're
talking about before accusing a package of giving ``wrong answers''.
R. T.
On 01/09/11 01:50, Monica Pisica wrote:
> [...]