I am trying to `cut' a continuous variable into contiguous classes
containing approximately an equal number of observations. I thought
quantile() was the appropriate function to use in order to find the
breakpoints, but I end up with classes of different sizes - see
example below. Does anybody have an explanation for that? And what is
the `recommended' way of computing what I am looking for?

Example:

> ca$age
 [1] 28 42 46 45 34 44 48 45 38 45 49 45 41 46 49 46 44 48 52 48 45 50 53 57 46
[26] 52 54 57 47 52 55 59 50 54 57 60 51 55 46 63 51 59 48 35 53 59 57 37 55 32
[51] 60 43 59 37 30 47 60 38 34 48 32 38 36 49 33 42 38 58 35 43 39 59 39 43 42
[76] 60 40 44
> table(cut(ca$age,breaks=c(-Inf,quantile(ca$age, seq(0,1,length=11)[-1]))))

(-Inf,35] (35,38.4] (38.4,43]   (43,45] (45,46.5] (46.5,49]   (49,52]   (52,55]
        9         7        10         8         5        10         7         7
  (55,59]   (59,63]
       10         5

Thanks in advance,
Giovanni

--
 __________________________________________________
[                                                  ]
[ Giovanni Petris                 GPetris at uark.edu ]
[ Department of Mathematical Sciences              ]
[ University of Arkansas - Fayetteville, AR 72701  ]
[ Ph: (479) 575-6324, 575-8630 (fax)               ]
[ http://definetti.uark.edu/~gpetris/              ]
[__________________________________________________]

On Fri, 6 Feb 2004, Giovanni Petris wrote:

> I am trying to `cut' a continuous variable into contiguous classes
> containing approximately an equal number of observations. I thought
> quantile() was the appropriate function to use in order to find the
> breakpoints, but I end up with classes of different sizes - see
> example below. Does anybody have an explanation for that? And what is
> the `recommended' way of computing what I am looking for?

Your variable is actually quite discrete, which is causing the problem.
For example, you have two 35s, so the lower groups could only be equal
if one 35 was in one group and the other in the other group.

Now, if you want the groups to be equal even at the cost of not
depending just on the value, there are at least two possible approaches:

 - break ties randomly, for example by jitter()ing the data first
 - order the data by age and then take the first 8, next 8, and so on.

	-thomas

> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley at u.washington.edu      University of Washington, Seattle

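The two approaches suggested in the reply above can be sketched roughly as follows. This is only an illustration, not code from the thread: the vector `age` is retyped from the original post, and the variable names, the seed and the choice of ten classes are assumptions made for the example.

```r
# Ages retyped from the original post (78 observations).
age <- c(28, 42, 46, 45, 34, 44, 48, 45, 38, 45, 49, 45, 41, 46, 49, 46, 44,
         48, 52, 48, 45, 50, 53, 57, 46, 52, 54, 57, 47, 52, 55, 59, 50, 54,
         57, 60, 51, 55, 46, 63, 51, 59, 48, 35, 53, 59, 57, 37, 55, 32, 60,
         43, 59, 37, 30, 47, 60, 38, 34, 48, 32, 38, 36, 49, 33, 42, 38, 58,
         35, 43, 39, 59, 39, 43, 42, 60, 40, 44)

k <- 10  # number of classes, as in the original example (assumption)

# Approach 1: break ties randomly by jittering, then cut at the deciles
# of the jittered values. The seed only makes the example reproducible.
set.seed(1)
age_j <- jitter(age)
grp_jitter <- cut(age_j,
                  breaks = quantile(age_j, seq(0, 1, length.out = k + 1)),
                  include.lowest = TRUE, labels = FALSE)

# Approach 2: order by age and assign consecutive blocks of about n/k
# observations; tied ages are separated by their position in the data.
r <- rank(age, ties.method = "first")
grp_rank <- ceiling(r * k / length(age))

print(table(grp_jitter))
print(table(grp_rank))
```

With 78 observations and ten classes, both tables should show 7 or 8 observations per class. The price is the one named in the reply: two people of the same age can land in different classes.
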
On Fri, 6 Feb 2004 09:30:31 -0600 (CST)
Giovanni Petris <GPetris at uark.edu> wrote:

> I am trying to `cut' a continuous variable into contiguous classes
> containing approximately an equal number of observations. I thought
> quantile() was the appropriate function to use in order to find the
> breakpoints, but I end up with classes of different sizes - see
> example below. Does anybody have an explanation for that? And what is
> the `recommended' way of computing what I am looking for?

The cut2 function in the Hmisc package tries to do this the best it can.

Frank

---
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University

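A possible way to try the cut2 suggestion on the posted data is sketched below. It assumes the Hmisc package is installed and uses cut2's `g` argument (number of quantile groups); the vector `age` is retyped from the original post, and the guard and variable names are additions for the example only.

```r
# Ages retyped from the original post (78 observations).
age <- c(28, 42, 46, 45, 34, 44, 48, 45, 38, 45, 49, 45, 41, 46, 49, 46, 44,
         48, 52, 48, 45, 50, 53, 57, 46, 52, 54, 57, 47, 52, 55, 59, 50, 54,
         57, 60, 51, 55, 46, 63, 51, 59, 48, 35, 53, 59, 57, 37, 55, 32, 60,
         43, 59, 37, 30, 47, 60, 38, 34, 48, 32, 38, 36, 49, 33, 42, 38, 58,
         35, 43, 39, 59, 39, 43, 42, 60, 40, 44)

if (requireNamespace("Hmisc", quietly = TRUE)) {
  # Ask for 10 quantile groups; cut2 picks cut points at observed values.
  grp_cut2 <- Hmisc::cut2(age, g = 10)
  print(table(grp_cut2))
} else {
  # Hmisc is not part of base R, so skip quietly when it is missing.
  grp_cut2 <- NULL
  message("Hmisc is not installed; install.packages('Hmisc') to try cut2()")
}
```

Because cut2 still assigns classes by value, tied ages stay together, so the class sizes need not be exactly equal on data as discrete as these.
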
Another problem with the R function "quantile" is that its definition of
"quantiles" may not be what you expect. Consider the following:

> x <- matrix(c(1:4))
> quantile(x,c(0,.25,.5,.75,1))
  0%  25%  50%  75% 100%
1.00 1.75 2.50 3.25 4.00
> x <- matrix(c(1:6))
> quantile(x,c(0,.25,.5,.75,1))
  0%  25%  50%  75% 100%
1.00 2.25 3.50 4.75 6.00
> x <- matrix(c(1:8))
> quantile(x,c(0,.25,.5,.75,1))
  0%  25%  50%  75% 100%
1.00 2.75 4.50 6.25 8.00

With your implicit definition of quantiles (splitting the data set into
classes of equal size), each class should have n/4 observations (1, 1.5
and 2 in the three examples), so that the quantiles should be

> x <- matrix(c(1:4))
> equalSizeClasses(x,c(0,.25,.5,.75,1))
  0%  25%  50%  75% 100%
-Inf 1.50 2.50 3.50 +Inf
> x <- matrix(c(1:6))
> equalSizeClasses(x,c(0,.25,.5,.75,1))
  0%  25%  50%  75% 100%
-Inf 2.00 3.50 5.00 +Inf
> x <- matrix(c(1:8))
> equalSizeClasses(x,c(0,.25,.5,.75,1))
  0%  25%  50%  75% 100%
-Inf 2.50 4.50 6.50 +Inf

Knut

At 09:30 2004-02-06 -0600, Giovanni Petris wrote:

>I am trying to `cut' a continuous variable into contiguous classes
>containing approximately an equal number of observations. I thought
>quantile() was the appropriate function to use in order to find the
>breakpoints, but I end up with classes of different sizes - see
>example below. Does anybody have an explanation for that? And what is
>the `recommended' way of computing what I am looking for?

Knut M. Wittkowski, PhD,DSc
------------------------------------------
The Rockefeller University, GCRC
Experimental Design and Biostatistics
1230 York Ave #121B, Box 322, NY,NY 10021
+1(212)327-7175, +1(212)327-8450 (Fax)
kmw at rockefeller.edu
http://www.rucares.org/clinicalresearch/dept/biometry/
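Note that equalSizeClasses() in the message above is not an existing R function; it stands for the definition being argued for. In R versions released after this thread, quantile() accepts a `type` argument, and type = 2 (inverse empirical CDF with averaging at discontinuities) appears to reproduce the interior breakpoints listed above, whereas the default (type = 7) gives the interpolated values. A small sketch, in which the helper name `equal_size_breaks` is hypothetical:

```r
# Hypothetical helper: breakpoints meant to split x into classes of
# (nearly) equal size, with open-ended outer classes as in the message.
equal_size_breaks <- function(x, probs = c(0.25, 0.5, 0.75)) {
  inner <- quantile(x, probs, type = 2, names = FALSE)
  c(-Inf, inner, Inf)
}

for (n in c(4, 6, 8)) {
  x <- 1:n
  cat("n =", n, "\n")
  print(quantile(x, c(0.25, 0.5, 0.75)))  # default definition (type = 7)
  print(equal_size_breaks(x))             # type = 2 interior breakpoints
}
```

The result can be passed to cut(), for example `cut(x, breaks = equal_size_breaks(x))`. With tied or very discrete data no choice of breakpoints makes the classes exactly equal, which is the point made earlier in the thread.
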