Johannes Graumann
2010-Oct-04 14:57 UTC
[R] Splitting a DF into rows according to a column
Hi,

I'm spinning my wheels on this and keep coming around to the same wrong
solution - please have a look and give a hand ...

The premise is: a DF like so

> loremIpsum <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Quisque leo ipsum, ultricies scelerisque volutpat non, volutpat et nulla.
Curabitur consequat ullamcorper tellus id imperdiet. Duis semper malesuada
nulla, blandit lobortis diam fringilla at. Vestibulum nec tellus orci, eu
sollicitudin quam. Phasellus sit amet enim diam. Phasellus mattis hendrerit
varius. Curabitur ut tristique enim. Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Sed convallis, tortor id vehicula facilisis, nunc justo
facilisis tellus, sed eleifend nisi lacus id purus. Maecenas tempus
sollicitudin libero, molestie laoreet metus dapibus eu. Mauris justo ante,
mattis et pulvinar a, varius pretium eros. Curabitur fringilla dui ac dui
rutrum pretium. Donec sed magna adipiscing nisi accumsan congue sed ac est.
Vivamus lorem urna, tristique quis accumsan quis, ullamcorper aliquet
velit."
> tmpDF <- data.frame(Column1=rep(unlist(strsplit(loremIpsum," ")),length.out=510),Column2=runif(510,min=0,max=1e8))

is to be split into DFs with 50 entries in an ordered manner according to
Column2 (the first DF is to contain the rows with the 50 largest numbers, ...).

Here is what I have been doing:

> binSize <- 50
> splitMembership <- pmin(ceiling(order(tmpDF[["Column2"]],decreasing=TRUE)/binSize),floor(nrow(tmpDF)/binSize))
> splitList <- split(tmpDF,splitMembership)

Distribution seems to work ...
> sapply(splitList,nrow)

But this is NOT what I wanted ...
> sapply(splitList,function(x){max(x[["Column2"]])})

This was supposed to give me bins that are Column2-sorted, and bin 1 should
have a higher max than bin 2, bin 2 a higher max than bin 3, ...

Can anyone point out where (my now 3 reimplementations) fail?

Thanks, Stupid Joh
On Oct 4, 2010, at 16:57, Johannes Graumann wrote:

> Hi,
>
> I'm spinning my wheels on this and keep coming around to the same wrong
> solution - please have a look and give a hand ...
>
> The premise is: a DF like so
>
>> loremIpsum <- "Lorem ipsum dolor sit amet, consectetur adipiscing elit.
> Quisque leo ipsum, ultricies scelerisque volutpat non, volutpat et nulla.
> Curabitur consequat ullamcorper tellus id imperdiet. Duis semper malesuada
> nulla, blandit lobortis diam fringilla at. Vestibulum nec tellus orci, eu
> sollicitudin quam. Phasellus sit amet enim diam. Phasellus mattis hendrerit
> varius. Curabitur ut tristique enim. Lorem ipsum dolor sit amet, consectetur
> adipiscing elit. Sed convallis, tortor id vehicula facilisis, nunc justo
> facilisis tellus, sed eleifend nisi lacus id purus. Maecenas tempus
> sollicitudin libero, molestie laoreet metus dapibus eu. Mauris justo ante,
> mattis et pulvinar a, varius pretium eros. Curabitur fringilla dui ac dui
> rutrum pretium. Donec sed magna adipiscing nisi accumsan congue sed ac est.
> Vivamus lorem urna, tristique quis accumsan quis, ullamcorper aliquet
> velit."
>> tmpDF <- data.frame(Column1=rep(unlist(strsplit(loremIpsum," ")),length.out=510),Column2=runif(510,min=0,max=1e8))
>
> is to be split into DFs with 50 entries in an ordered manner according to
> Column2 (the first DF is to contain the rows with the 50 largest numbers, ...).
>
> Here is what I have been doing:
>
>> binSize <- 50
>> splitMembership <- pmin(ceiling(order(tmpDF[["Column2"]],decreasing=TRUE)/binSize),floor(nrow(tmpDF)/binSize))
>> splitList <- split(tmpDF,splitMembership)
>
> Distribution seems to work ...
>> sapply(splitList,nrow)
>
> But this is NOT what I wanted ...
>> sapply(splitList,function(x){max(x[["Column2"]])})
>
> This was supposed to give me bins that are Column2-sorted, and bin 1 should
> have a higher max than bin 2, bin 2 a higher max than bin 3, ...
>
> Can anyone point out where (my now 3 reimplementations) fail?
>
> Thanks, Stupid Joh

Dear Stupid Joh,

Have you considered something along the lines of

o <- order(-x$Column2)
xx <- x[o,]
split(xx, (seq_len(NROW(x))-1) %/% 50)

The above is a bit hard to follow, but it seems to work better with rank()
instead of order():

> splitMembership <-
+ pmin(ceiling(rank(-tmpDF[["Column2"]])/binSize),floor(nrow(tmpDF)/binSize))
> splitList <- split(tmpDF,splitMembership)
> sapply(splitList,nrow)
 1  2  3  4  5  6  7  8  9 10
50 50 50 50 50 50 50 50 50 60
> sapply(splitList,function(x){max(x[["Column2"]])})
       1        2        3        4        5        6
99877498 90567877 81965382 69112280 59814266 52130373
       7        8        9       10
41557660 32630212 21226996 11880032

--
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
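
A note on why rank() helps where order() did not, as far as I can tell: order() returns, at position i, the index of the row that belongs in sorted position i, whereas rank() returns, at position i, the sorted position of row i itself. A membership vector for split() has to be aligned with the rows, so it is the rank that needs binning. The sketch below uses a small made-up vector (the names v, ord, rnk and the bin size of 2 are only illustrative, not from the thread) to show the difference:

```r
# Hypothetical four-element vector, chosen so that order() and rank() differ
v <- c(30, 10, 50, 20)
binSize <- 2

# order(): position i holds the INDEX of the i-th largest element
ord <- order(v, decreasing = TRUE)

# rank(): position i holds the sorted POSITION of element i (1 = largest)
rnk <- rank(-v)

# The two are inverse permutations of each other
stopifnot(all(order(ord) == rnk))

# Binning the order() result assigns bins by row index, which scrambles them
binsByOrder <- split(v, ceiling(ord / binSize))

# Binning the rank() result assigns each element to the bin it belongs in
binsByRank <- split(v, ceiling(rnk / binSize))

maxByOrder <- sapply(binsByOrder, max)
maxByRank  <- sapply(binsByRank, max)

print(maxByOrder)  # bin maxima are not decreasing
print(maxByRank)   # bin 1 holds the largest values, bin 2 the next, ...
```

With tied values rank() defaults to averaged (possibly fractional) ranks; if ties can occur in Column2, something like rank(-v, ties.method = "first") should keep the ranks distinct, though with runif() data ties are very unlikely.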