Dan Bolser
2006-Mar-17 16:01 UTC
[R] Binning question (binning rows of a data.frame according to a variable)
Hi,

I have tuples of data in rows of a data.frame, each column is a variable for the 'items' (one per row).

One of the variables is the 'size' of the item (row).

I would like to cut my data.frame into groups such that each group has the same *total size*. So, assuming that we order by size, some groups should have several small items while other groups have a few large items. All the groups should have approximately the same total size.

I have tried various combinations of cut, quantile, and ecdf, and I just can't work out how to do this!

Any help is greatly appreciated!

All the best,
Dan.
Dan Bolser
2006-Mar-17 16:12 UTC
[R] Binning question (binning rows of a data.frame according to a variable)
Dan Bolser wrote:
> Hi,
>
> I have tuples of data in rows of a data.frame, each column is a variable
> for the 'items' (one per row).
>
> One of the variables is the 'size' of the item (row).
>
> I would like to cut my data.frame into groups such that each group has
> the same *total size*. So, assuming that we order by size, some groups
> should have several small items while other groups have a few large
> items. All the groups should have approximately the same total size.
>
> I have tried various combinations of cut, quantile, and ecdf, and I just
> can't work out how to do this!
>
> Any help is greatly appreciated!
>
> All the best,
> Dan.

Perhaps there is a clever way, but I just wrote this in desperation...

    my.groups <- 8
    my.total <- sum(my.res.1$TOT)   ## The 'size' variable in my data.frame
    my.approx.size <- my.total / my.groups

    my.j <- 1                       ## current group number
    my.roll <- 0                    ## running total of 'size'
    my.factor <- numeric()

    for(i in sort(my.res.1$TOT)){
      my.roll <- my.roll + i
      ## move on to the next group once the running total passes its share
      if (my.roll > my.approx.size * my.j)
        my.j <- my.j + 1
      my.factor <- append(my.factor, my.j)
    }

    my.factor <- as.factor(my.factor)

Then...

    > tapply(my.factor,my.factor,length)
      1   2   3   4   5   6   7   8
    152  62  45  34  25  21  14   8

And...

    > tapply(sort(my.res.1$TOT),my.factor,sum)
       1    2    3    4    5    6    7    8
    2880 2848 2912 2893 2832 2906 2776 3029

Which isn't bad.

> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
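The same idea can probably be written without an explicit loop, by dividing the running total of the sorted sizes by the target size per group. The following is only a minimal sketch under stated assumptions: `equal_sum_groups`, `df` and the simulated `TOT` values are hypothetical names and data, not taken from the thread, and the result can differ from the loop above when a single item is larger than the target size.

```r
# Hypothetical helper (not from the thread): assign each item to one of k
# groups so that the groups, taken in order of size, have similar totals.
equal_sum_groups <- function(size, k) {
  ord <- order(size)                          # smallest items first
  target <- sum(size) / k                     # ideal total per group
  grp_sorted <- ceiling(cumsum(size[ord]) / target)
  grp_sorted <- pmin(pmax(grp_sorted, 1), k)  # guard against rounding and zero sizes
  grp <- integer(length(size))
  grp[ord] <- grp_sorted                      # map back to the original row order
  factor(grp, levels = seq_len(k))
}

# Made-up data standing in for my.res.1$TOT
set.seed(1)
df <- data.frame(id = 1:200, TOT = rpois(200, 50) + 1)
df$group <- equal_sum_groups(df$TOT, 8)

tapply(df$TOT, df$group, sum)   # total size per group
table(df$group)                 # number of items per group
```

Because the group is returned in the original row order, it can be stored as a column and used with split() or tapply() directly, without sorting the data.frame first.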
Adaikalavan Ramasamy
2006-Mar-19 05:37 UTC
[R] Binning question (binning rows of a data.frame according to a variable)
Do you by any chance want to sample equally from each group to get an equal representation matrix?

Here is an example of the input:

    mydf <- data.frame( value=1:100, value2=rnorm(100),
                        grp=rep( LETTERS[1:4], c(35, 15, 30, 20) ) )

which has 35 observations from A, 15 from B, 30 from C and 20 from D.

And here is a function that I wrote:

    sample.by.group <- function(df, grp, k, replace=FALSE){

      ## one sample size per group; a single k is recycled
      if(length(k)==1){ k <- rep(k, length(unique(grp))) }

      if(!replace && any(k > table(grp)))
        stop( paste("Cannot take a sample larger than the population when 'replace = FALSE'.\n",
                    "Please specify a value no greater than", min(table(grp)),
                    "or use 'replace = TRUE'.\n") )

      ind   <- model.matrix( ~ -1 + grp )   ## indicator matrix, one column per group
      w.mat <- vector("list", ncol(ind))

      for(i in seq_len(ncol(ind))){
        members <- which( ind[,i]==1 )
        ## sample.int() avoids sample() treating a single row index n as 1:n
        w.mat[[i]] <- members[ sample.int( length(members), k[i], replace=replace ) ]
      }

      out <- df[ unlist(w.mat), ]
      return(out)
    }

And here are some examples of how to use it:

    mydf <- mydf[ sample(1:nrow(mydf)), ]   # scramble it for fun

    out1 <- sample.by.group(mydf, mydf$grp, k=10 )
    table( out1$grp )

    out2 <- sample.by.group(mydf, mydf$grp, k=50, replace=TRUE)   # i.e. bootstrap
    table( out2$grp )

and you can even do bootstrapping or sampling with weights via:

    out3 <- sample.by.group(mydf, mydf$grp, k=c(20, 20, 30, 30), replace=TRUE)
    table( out3$grp )

Regards, Adai

On Fri, 2006-03-17 at 16:01 +0000, Dan Bolser wrote:
> Hi,
>
> I have tuples of data in rows of a data.frame, each column is a variable
> for the 'items' (one per row).
>
> One of the variables is the 'size' of the item (row).
>
> I would like to cut my data.frame into groups such that each group has
> the same *total size*. So, assuming that we order by size, some groups
> should have several small items while other groups have a few large
> items. All the groups should have approximately the same total size.
>
> I have tried various combinations of cut, quantile, and ecdf, and I just
> can't work out how to do this!
>
> Any help is greatly appreciated!
>
> All the best,
> Dan.
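For comparison, the per-group sampling described in this message could also be sketched with split() rather than an indicator matrix. This is an illustrative variant only: the name `sample_rows_by_group` is hypothetical, and it leaves the "sample larger than the population" check to the error that sample.int() itself raises.

```r
# Hypothetical helper: draw k rows per group using split() on the row indices.
sample_rows_by_group <- function(df, grp, k, replace = FALSE) {
  rows <- split(seq_len(nrow(df)), grp)   # row indices for each group
  k <- rep_len(k, length(rows))           # recycle a single k across groups
  picked <- Map(function(idx, n) {
    # sample.int() on the length avoids sample()'s 1:n behaviour for one-row groups
    idx[sample.int(length(idx), n, replace = replace)]
  }, rows, k)
  df[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(42)
mydf <- data.frame(value = 1:100, value2 = rnorm(100),
                   grp = rep(LETTERS[1:4], c(35, 15, 30, 20)))
out <- sample_rows_by_group(mydf, mydf$grp, k = 10)
table(out$grp)   # 10 rows from each of A, B, C and D
```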
Dan Bolser
2006-Mar-20 10:53 UTC
[R] Binning question (binning rows of a data.frame according to a variable)
hadley wickham wrote:
>> Thing is, for one reason or another, the number of marbles per bag may
>> systematically vary with age too. However, I am not interested in the
>> number of marbles per bag, so I would like to group the students into 8
>> groups such that each group has the same total number of marbles. (Each
>> group having a different sized age range, none the less ordered by age).
>
> This sounds very much like a bin-packing problem
> (http://en.wikipedia.org/wiki/Bin_packing_problem), which is NP-hard.
> The wikipedia page mentions some heuristics you may want to look into.
>
> Hadley

Man, I hate NP-hard problems!

Thanks for the link :)

Dan.
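As one illustration of the kind of heuristic mentioned above, here is a simple greedy sketch: take the items from largest to smallest and put each one into the group with the smallest total so far. This particular heuristic and the name `greedy_groups` are assumptions made for illustration, not something proposed in the thread, and the marble counts are made up. Note also that it ignores the requirement that groups stay ordered by age, so it only applies if that constraint can be dropped.

```r
# Hypothetical helper: greedy assignment, largest item first into the lightest group.
greedy_groups <- function(size, k) {
  totals <- numeric(k)                 # running total of each group
  grp <- integer(length(size))
  for (i in order(size, decreasing = TRUE)) {
    j <- which.min(totals)             # lightest group so far (ties: lowest index)
    grp[i] <- j
    totals[j] <- totals[j] + size[i]
  }
  grp
}

marbles <- c(8, 7, 6, 5, 4, 3, 2, 1)   # made-up marble counts
g <- greedy_groups(marbles, 3)
tapply(marbles, g, sum)                # total marbles per group
```

The result is not guaranteed to be optimal; it is just a quick way to get group totals that are usually close to each other.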