Karl Ove Hufthammer
2011-Jul-25 14:00 UTC
[R] Binning numbers into integer-valued intervals (or: a version of cut or cut2 that makes sense)
Dear list members,

I'm looking for a way to divide numbers into simple (i.e., integer-valued) intervals, and thought the 'cut' function in 'base' or the 'cut2' function in 'Hmisc' would, er, cut it. However, they seem to give rather surprising results.

Since I want the endpoints of the intervals to be integers, I used the 'dig.lab' and 'digits' arguments. One assumption I made: If the number x gets the label (a, b], then x lies in the interval (a, b]. It turns out that this assumption was incorrect. Example:

$ cut(c(20.8, 21.3, 21.7, 23, 25), 2, dig.lab=1)
[1] (21,23] (21,23] (21,23] (23,25] (23,25]
Levels: (21,23] (23,25]

So the first number, 20.8, gets put in the interval (21,23], which seems strange. I can see why this could happen, though, as perhaps the 20.8 is rounded to 21 before binning. But it's even stranger that the *integer* 23 is put in the interval (23,25] instead of in the interval (21,23]. Can anyone explain why?

I then turned to 'cut2' in 'Hmisc'. But again I was surprised by the result:

$ cut2(c(20.8, 21.3, 21.7, 23), g=2, digits=1)
[1] [21,22) [21,22) [22,23] [22,23]
Levels: [21,22) [22,23]

Again 20.8 is placed in an interval that doesn't mathematically contain it. And 21.3 and 21.7 are placed in *different* intervals, instead of both being placed in the interval [21,22). This may perhaps strictly not be a bug, but it's certainly surprising behaviour!

Since obviously neither of the two functions does what I require, is there a different function that does, hidden deep inside some R package? This function should take as input a vector of numbers, and output a vector of non-overlapping (but 'touching') intervals with integer end-points so that each number is in exactly one interval. It should of course also include information on which interval each number belongs to.
Version information (though I also observe this on R 2.13.1 on Windows):

$ sessionInfo()
R version 2.13.1 Patched (2011-07-25 r56494)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=nn_NO.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=nn_NO.UTF-8        LC_COLLATE=nn_NO.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=nn_NO.UTF-8
 [7] LC_PAPER=nn_NO.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=nn_NO.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] Hmisc_3.8-3     survival_2.36-9

loaded via a namespace (and not attached):
[1] cluster_1.14.0  grid_2.13.1     lattice_0.19-30

-- 
Karl Ove Hufthammer
William Dunlap
2011-Jul-25 15:29 UTC
[R] Binning numbers into integer-valued intervals (or: a version of cut or cut2 that makes sense)
Karl Ove Hufthammer wrote (Monday, July 25, 2011 7:01 AM):

> $ cut(c(20.8, 21.3, 21.7, 23, 25), 2, dig.lab=1)
> [1] (21,23] (21,23] (21,23] (23,25] (23,25]
> Levels: (21,23] (23,25]
>
> So the first number, 20.8, gets put in the interval (21,23], which seems
> strange. I can see why this could happen, though, as perhaps the 20.8 is
> rounded to 21 before binning. But it's even stranger that the *integer* 23
> is put in the interval (23,25] instead of in the interval (21,23]. Can
> anyone explain why?

dig.lab does not affect the choice of break points; it only affects how they are converted to character form for the labels. Unfortunately, cut() does not return the actual breakpoints, but if you make them yourself you know what they are. You need to find or make a function akin to pretty() that returns a "nice" set of breakpoints. pretty() itself may do:

> x <- c(20.8, 21.3, 21.7, 23, 25)
> pretty(x, n=2)
[1] 20 22 24 26
> cut(x, breaks=pretty(x, n=2))
[1] (20,22] (20,22] (20,22] (22,24] (24,26]
Levels: (20,22] (22,24] (24,26]

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
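To make the point about breakpoints versus labels concrete, here is a minimal sketch in R. The first part approximately reproduces the breakpoints that cut(x, 2) computes (assuming equal-width breaks over the range of x, with the outer breaks pushed outward by 0.1% of the range, as described in ?cut). The middle break comes out at about 22.9, which is why 23 lands in the upper bin even though the label is rounded to "23". The second part defines bin_integer(), a hypothetical helper (not part of base R or Hmisc) that builds integer breakpoints explicitly, so that every label describes an interval that really contains its values.

```r
x <- c(20.8, 21.3, 21.7, 23, 25)

# Part 1: approximate the breaks cut(x, 2) uses internally.
# Assumption: equal-width breaks over range(x), outer breaks extended by
# 0.1% of the range. Exact details may differ between R versions.
rx <- range(x)
dx <- diff(rx)
raw_breaks <- seq(rx[1], rx[2], length.out = 3)
raw_breaks[c(1, 3)] <- c(rx[1] - dx / 1000, rx[2] + dx / 1000)
print(raw_breaks)   # the middle break is about 22.9, so 23 goes in the upper bin

# Part 2: hypothetical helper that bins into left-closed intervals
# [a, a + width) with integer endpoints.
bin_integer <- function(x, width = 1) {
  stopifnot(is.numeric(x), length(width) == 1,
            width >= 1, width == round(width))
  lo <- floor(min(x, na.rm = TRUE))
  # Number of bins needed so the top break is strictly above max(x);
  # this keeps every interval half-open, including the last one.
  n <- floor((max(x, na.rm = TRUE) - lo) / width) + 1
  breaks <- lo + width * (0:n)
  # A generous dig.lab so that integer breaks are printed in full.
  bins <- cut(x, breaks = breaks, right = FALSE, dig.lab = 10)
  attr(bins, "breaks") <- breaks   # keep the actual breakpoints with the result
  bins
}

b1 <- bin_integer(x)              # unit-width bins
b2 <- bin_integer(x, width = 2)   # bins of width 2
as.character(b1)   # "[20,21)" "[21,22)" "[21,22)" "[23,24)" "[25,26)"
as.character(b2)   # "[20,22)" "[20,22)" "[20,22)" "[22,24)" "[24,26)"
```

With explicit breaks, 21.3 and 21.7 end up in the same interval [21,22), and the breakpoints are available afterwards via attr(b1, "breaks"). Choosing left-closed intervals here is only one option; passing right = TRUE to cut() with suitably shifted breaks would give (a, b] intervals instead.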