ted.harding at manchester.ac.uk
2009-Nov-07 14:00 UTC
[Rd] Binning of integers with hist() function odd results (P (PR#14047)
On 06-Nov-09 23:30:12, gug at fnal.gov wrote:
> Full_Name: Gerald Guglielmo
> Version: 2.8.1 (2008-12-22)
> OS: OSX Leopard
> Submission from: (NULL) (131.225.103.35)
>
> When I attempt to use the hist() function to bin integers the behavior
> seems very odd as the bin boundary seems inconsistent across the various
> bins. For some bins the upper boundary includes the next integer value,
> while in others it does not. If I add 0.1 to every value, then the hist()
> binning behavior is what I would normally expect.
>
>> h1<-hist(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
>> h1$mids
> [1] 1.5 2.5 3.5 4.5
>> h1$counts
> [1] 3 3 4 5
>> h2<-hist(c(1.1,2.1,2.1,3.1,3.1,3.1,4.1,4.1,4.1,4.1,5.1,5.1,5.1,5.1,5.1))
>> h2$mids
> [1] 1.5 2.5 3.5 4.5 5.5
>> h2$counts
> [1] 1 2 3 4 5
>
> Naively I would have expected the same distribution of counts in the
> two cases, but clearly that is not happening. This is a simple example
> to illustrate the behavior, originally I noticed this while binning a
> large data sample where I had set the breaks=c(0,24,1).

This is the correct, intended behaviour. By default, a value which lies
exactly on the boundary between two bins is counted in the bin just
below that boundary. The one exception is the bottom-most break: values
equal to it are counted in the bin just above it.

Hence 1,2,2 all go into the [1,2] bin; 3,3,3 into (2,3]; 4,4,4,4 into
(3,4]; and 5,5,5,5,5 into (4,5]. Hence the counts 3,3,4,5.

Since you did not set breaks in

  h1<-hist(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))

they were set using the default method, and you can see what they are
with

  h1$breaks
  [1] 1 2 3 4 5

When you add 0.1 to each value, you push the values on the boundaries
up into the next bin. Now each value is inside its bin, and not on any
boundary. Hence 1.1 is in (1,2]; 2.1,2.1 in (2,3]; 3.1,3.1,3.1 in (3,4];
4.1,4.1,4.1,4.1 in (4,5]; and 5.1,5.1,5.1,5.1,5.1 in (5,6], giving
counts 1,2,3,4,5 as you observe.
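The counting rule can be checked directly, without drawing a plot. What
follows is only a minimal sketch: the variable names (x, h_default,
h_left, h_centred) are illustrative and do not come from the original
report, the breaks are passed explicitly as 1:5 rather than left to the
default method, and the half-integer breaks in the last call are an
assumption about what one might want for integer data, not something
the report used.

```r
# Data from the original report.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)

# Right-closed bins with the lowest break included in the first bin:
# [1,2], (2,3], (3,4], (4,5].
h_default <- hist(x, breaks = 1:5, include.lowest = TRUE, right = TRUE,
                  plot = FALSE)
print(h_default$counts)   # 3 3 4 5

# Left-closed bins instead: [1,2), [2,3), [3,4), [4,5].
# The top break is now the one that is included, so 4s and 5s share a bin.
h_left <- hist(x, breaks = 1:5, include.lowest = TRUE, right = FALSE,
               plot = FALSE)
print(h_left$counts)      # 1 2 3 9

# Assumption (not in the original report): breaks on the half-integers,
# so that no integer value can fall on a bin boundary.
h_centred <- hist(x, breaks = seq(0.5, 5.5, by = 1), plot = FALSE)
print(h_centred$counts)   # 1 2 3 4 5
```

With breaks on the integers, every data value sits on a boundary and the
result depends entirely on which side of each bin is closed; with breaks
between the integers that question does not arise.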

The default behaviour described above is defined by the default
options

  include.lowest = TRUE, right = TRUE

where:

  include.lowest: logical; if 'TRUE', an 'x[i]' equal to the
          'breaks' value will be included in the first (or last,
          for 'right = FALSE') bar. This will be ignored (with a
          warning) unless 'breaks' is a vector.

  right: logical; if 'TRUE', the histogram cells are right-closed
          (left open) intervals.

See '?hist'. You can change this behaviour by changing the options.

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 07-Nov-09                                       Time: 13:57:07
------------------------------ XFMail ------------------------------