gug at fnal.gov
2009-Nov-07 15:05 UTC
[Rd] Binning of integers with hist() function odd results (P (PR#14048)
Hi,

Thank you for responding quickly and explaining the behavior. Adding "include.lowest=TRUE, right=FALSE" and manually specifying the breaks resolved the simple test case. Next I updated my more complex data set, which already had manually defined breaks, and that resolved my issues there too. I have now gone in and updated all my functions which use hist(), so I hopefully won't forget this in the future.

On Nov 7, 2009, at 7:57 AM, Ted Harding wrote:

> On 06-Nov-09 23:30:12, gug at fnal.gov wrote:
>> Full_Name: Gerald Guglielmo
>> Version: 2.8.1 (2008-12-22)
>> OS: OSX Leopard
>> Submission from: (NULL) (131.225.103.35)
>>
>> When I attempt to use the hist() function to bin integers the behavior
>> seems very odd as the bin boundary seems inconsistent across the various
>> bins. For some bins the upper boundary includes the next integer value,
>> while in others it does not. If I add 0.1 to every value, then the hist()
>> binning behavior is what I would normally expect.
>>
>> > h1<-hist(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
>> > h1$mids
>> [1] 1.5 2.5 3.5 4.5
>> > h1$counts
>> [1] 3 3 4 5
>> > h2<-hist(c(1.1,2.1,2.1,3.1,3.1,3.1,4.1,4.1,4.1,4.1,5.1,5.1,5.1,5.1,5.1))
>> > h2$mids
>> [1] 1.5 2.5 3.5 4.5 5.5
>> > h2$counts
>> [1] 1 2 3 4 5
>>
>> Naively I would have expected the same distribution of counts in the
>> two cases, but clearly that is not happening. This is a simple example
>> to illustrate the behavior, originally I noticed this while binning a
>> large data sample where I had set the breaks=c(0,24,1).
>
> This is the correct, intended behaviour. By default, values which are
> exactly on the boundary between two bins are counted in the bin which
> is just below the boundary value. The exception is the bottom-most
> break: values on it are counted in the bin just above it.
>
> Hence 1,2,2 all go into the [1,2] bin; 3,3,3 into (2,3];
> 4,4,4,4 into (3,4]; and 5,5,5,5,5 into (4,5]. Hence the counts
> 3,3,4,5.
>
> Since you did not set breaks in
> h1<-hist(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5)),
> they were set using the default method, and you can see what they are
> with
>
> h1$breaks
> [1] 1 2 3 4 5
>
> When you add 0.1 to each value, you push the values on the boundaries
> up into the next bin. Now each value is inside its bin, and not on
> any boundary. Hence 1.1 is in (1,2]; 2.1,2.1 in (2,3];
> 3.1,3.1,3.1 in (3,4]; 4.1,4.1,4.1,4.1 in (4,5]; and
> 5.1,5.1,5.1,5.1,5.1 in (5,6], giving counts 1,2,3,4,5 as you observe.
>
> The default behaviour described above is defined by the default
> options
>
> include.lowest = TRUE, right = TRUE
>
> where:
>
> include.lowest: logical; if 'TRUE', an 'x[i]' equal to the 'breaks'
> value will be included in the first (or last, for 'right =
> FALSE') bar. This will be ignored (with a warning) unless
> 'breaks' is a vector.
>
> right: logical; if 'TRUE', the histograms cells are right-closed
> (left open) intervals.
>
> See '?hist'. You can change this behaviour by changing the options.
>
> Hoping this helps,
> Ted.
>
> --------------------------------------------------------------------
> E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
> Fax-to-email: +44 (0)870 094 0861
> Date: 07-Nov-09 Time: 13:57:07
> ------------------------------ XFMail --------------------------------

-Jerry-
gug at fnal.gov

Pepe's Theory of everything: "Under the right circumstances, things happen."
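
For anyone finding this thread later, here is a minimal sketch of the fix described above, using the simple test case from the report. The explicit breaks 1:6 are my own choice for illustration (the thread does not say which breaks were used), and plot = FALSE is only there so nothing is drawn.

```r
# Integer test data from the original report.
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5)

# Default settings: right-closed bins (a, b], with the lowest break
# also included in the first bin, so 1 and 2 share the bin [1, 2].
h_default <- hist(x, plot = FALSE)
print(h_default$breaks)
print(h_default$counts)

# Left-closed bins [a, b) with explicit breaks (1:6 is an assumed choice).
# With right = FALSE, include.lowest = TRUE closes the last bin, [5, 6].
h_left <- hist(x, breaks = 1:6, include.lowest = TRUE, right = FALSE,
               plot = FALSE)
print(h_left$counts)
```

With left-closed bins each integer falls in its own bin, which matches the counts obtained in the report after adding 0.1 to every value.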