Hi,

I have a relatively big dataset and I want to construct some histograms using the histogram function in lattice. One thing I am interested in is the difference between density and percent. I know I can use the hist function, but it sometimes seems to give wrong answers: the density looks like a percent, since it is calculated as the count in the bin divided by the total number of points. Let me explain.

If I let hist decide the breaks, or use a small number, or use one of the predetermined methods for selecting breaks, then everything seems to be in order. But if I use, for example, 100 breaks (I have over 90000 data points, so I would not think that number is too large), the density for the first bin is over 1, although for all the other bins the density is actually a percent, since it is the count for that bin divided by the total number of points I have. So either something is wrong here or, most probably, I am doing something wrong.

If I use the histogram function from lattice, it is obvious that there is a difference between the percent parameter and the density parameter. I looked at the function code and, to be honest, I didn't understand it. It seems to call hist internally, or a slightly modified variant of hist. Reading about the trellis object, I saw that I can access various information about the graph it generates, but nothing about the actual data that goes into defining the histogram. How can I access the data from it?

I am not sure if my problem is platform specific (it should not be), but I have R x64 2.13.1 on a Windows machine, in case it matters.

I appreciate your help, thanks,

Monica
Hi Monica

An example abbreviated from ?histogram:

    library(lattice)
    x = histogram( ~ height, data = singer)
    names(x)  # to see what is there
    str(x)    # information

    x$panel.args.common
    $breaks
     [1] 59.36 61.28 63.20 65.12 67.04 68.96 70.88 72.80 74.72 76.64

    $type
    [1] "percent"

    $equal.widths
    [1] TRUE

    $nint
    [1] 9

    # x$panel.args, accessed by name; by position this was x[[35]] here
    x$panel.args
    [[1]]
    [[1]]$x
      [1] 64 62 66 65 60 61 65 66 65 63 67 65 62 65 68 65 63 65 62 65 66 62 65 63 65 66 65 62 65 66 65 61 65 66 65 62 63 67 60 67 66 62 65 62
     [45] 61 62 66 60 65 65 61 64 68 64 63 62 64 62 64 65 60 65 70 63 67 66 65 62 68 67 67 63 67 66 63 72 62 61 66 64 60 61 66 66 66 62 70 65
     [89] 64 63 65 69 61 66 65 61 63 64 67 66 68 70 65 65 65 64 66 64 70 63 70 64 63 67 65 63 66 66 64 64 70 70 66 66 66 69 67 65 69 72 71 66
    [133] 76 74 71 66 68 67 70 65 72 70 68 64 73 66 68 67 64 68 73 69 71 69 76 71 69 71 66 69 71 71 71 69 70 69 68 70 68 69 72 70 72 69 73 71
    [177] 72 68 68 71 66 68 71 73 73 70 68 70 75 68 71 70 74 70 75 75 69 72 71 70 71 68 70 75 72 66 72 70 69 72 75 67 75 74 72 72 74 72 72 74
    [221] 70 66 68 75 68 70 72 67 70 70 69 72 71 74 75

etc., to suit your requirements.

HTH

Regards

Duncan

Duncan Mackay
Department of Agronomy and Soil Science
University of New England
ARMIDALE NSW 2351
Email: home mackay at northnet.com.au

At 23:50 31/08/2011, you wrote:
>[...]
>Reading about the object trellis I saw I can access different info about
>the graph it generates but nothing about the actual data that goes into
>defining the histogram. How can I access the data from it?
>[...]
>______________________________________________
>R-help at r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.
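To make the components listed in the reply above concrete, here is a minimal sketch of recovering the bar heights from a trellis object. It assumes a single-panel plot and that `x$panel.args.common$breaks` holds numeric break points, as in the output shown. Recomputing the bins with base `hist()` is a shortcut that is only assumed to match what lattice draws under its default settings, and the names `vals`, `brks`, `h`, `percent` and `dens` are illustrative.

```r
library(lattice)

# Build the trellis object; assigning it does not draw anything
x <- histogram(~ height, data = singer)

# Raw data and break points stored in the object (single panel assumed)
vals <- x$panel.args[[1]]$x
brks <- x$panel.args.common$breaks

# Recompute the bins with base hist(); plot = FALSE returns the numbers only
h <- hist(vals, breaks = brks, plot = FALSE)

percent <- 100 * h$counts / sum(h$counts)  # bar heights for type = "percent"
dens    <- h$density                       # bar heights for type = "density"

# One row per bin: edges, count, and both kinds of bar height
data.frame(left = head(brks, -1), right = brks[-1],
           count = h$counts, percent = percent, density = dens)
```

For a multi-panel plot the same idea would presumably apply per element of `x$panel.args`, but that case is not covered here.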
I'm not entirely sure that I understand what your problem is. A reproducible example would probably have helped.

However I conjecture that the problem boils down to confusing "probability" with "probability *density*".

Percentages are the (estimated) bin probabilities times 100. The percentage for the i-th bin is 100*n_i/n where n_i is the count for the i-th bin and n is the sum of the n_i. The percentages sum to 100 (equivalent to probabilities summing to 1).

The *densities* in contrast *integrate* to 1. The density value for the i-th bin is n_i/(n * w_i) where w_i is the width of the i-th bin, so that the sum over all bins of density times width is 1. A density value can therefore exceed 1 when a bin is narrow (w_i less than n_i/n). (If the breaks have been set sensibly, the w_i all have the same value, i.e. the bin widths are all the same.)

Does this answer your question? (In an example that I tried the percentages and the density values are --- not surprisingly!!! --- completely consistent.)

You are correct in observing that it is difficult to dig out the "histogram values" (the bar heights) when using lattice. You can actually get at them using lattice:::hist.constructor(), but it's not for the fainthearted.

cheers,

Rolf Turner

P. S. You really should be absolutely certain that you know what you're talking about before accusing a package of giving "wrong answers".

R. T.

On 01/09/11 01:50, Monica Pisica wrote:
> [...]
> But if I decide to use, for example, 100 as a breaks (I have over 90000
> data points so the number of breaks is not necessarily too large I would
> think) the density for the first bin is over 1, although for all the other
> breaks the density is actually a percent since it is the count for that
> bin divided by the total no. of points I have.
> [...]
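The distinction between percentages and densities described above can be checked numerically with base `hist()`. The sketch below uses simulated data, not the original poster's dataset, chosen so that the bins come out much narrower than 1. All variable names are illustrative.

```r
set.seed(1)
x <- rexp(90000, rate = 10)      # simulated values, mostly well below 1

h <- hist(x, breaks = 100, plot = FALSE)
n <- sum(h$counts)
w <- diff(h$breaks)              # bin widths

percent <- 100 * h$counts / n    # bin probabilities times 100
dens    <- h$counts / (n * w)    # density computed by hand

sum(percent)                     # percentages sum to 100
sum(dens * w)                    # densities integrate to 1
all.equal(dens, h$density)       # agrees with the density hist() reports
max(dens) > 1                    # a density can exceed 1 when bins are narrow
```

With bins this narrow, the first bin's density is far above 1 even though its share of the points is well under 100 percent, which matches the behaviour reported in the original question.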