Jun Shen
2016-Sep-02 17:02 UTC
[R] Improve code efficiency with do.call, rbind and split construction
Dear list,

I have the following line of code to extract the last row of each split data frame and put them back together:

do.call(rbind,
        lapply(split(simout.s1, simout.s1[c('SID', 'DOSENO')]),
               function(x) x[nrow(x), ]))

The problem is that when I have a huge dataset, it takes too long to run (actually it's been > 3 hours and it's still running). The dataset is pretty big: I have 200,000 unique SID and 4 DOSENO, so 800,000 split data frames in total. Is there any way to speed it up? Thanks.

Jun
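For readers without the original data, here is a minimal reproducible stand-in for the pattern above (a made-up simout.s1 with hypothetical SID/DOSENO/CONC columns, since the real data were never posted):

```r
## Made-up stand-in for simout.s1 -- the real data were not posted
set.seed(42)
simout.s1 <- data.frame(SID    = rep(1:3, each = 4),
                        DOSENO = rep(1:2, times = 6),
                        CONC   = runif(12))

## The approach from the question: split by (SID, DOSENO), keep each
## group's last row, then rbind the pieces back together
last.rows <- do.call(rbind,
                     lapply(split(simout.s1, simout.s1[c("SID", "DOSENO")]),
                            function(x) x[nrow(x), ]))
nrow(last.rows)   # 6: one row per (SID, DOSENO) group (3 SID x 2 DOSENO)
```

This works, but the per-group data-frame subsetting and the final rbind() are what make it slow at scale.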
Bert Gunter
2016-Sep-02 17:51 UTC
[R] Improve code efficiency with do.call, rbind and split construction
This is the sort of thing that the dplyr or data.table packages can probably do elegantly and efficiently, so you might consider looking at them. But as I use neither, let me suggest a base R solution. As you supplied no data for a reproducible example, I'll make up my own and hopefully I have understood you correctly. If not, maybe someone else will get it straight. Anyway...

The "trick" is to use tapply() to select the necessary row indices of your data frame and forget about all the do.call and rbind stuff, e.g.

> set.seed(1001)
> df <- data.frame(f = factor(sample(LETTERS[1:4], 100, rep = TRUE)),
+                  g = factor(sample(letters[1:6], 100, rep = TRUE)),
+                  y = runif(100))
> ix <- seq_len(nrow(df))
> ix <- with(df, tapply(ix, list(f, g), function(x) x[length(x)]))
> ix
   a  b   c  d  e  f
A 94 69 100 59 80 87
B 89 57  65 90 75 88
C 85 92  86 95 97 62
D 47 73  72 74 99 96

## ix can now be used as an index into df as:
df[ix, ]

This should help somewhat, but you still have to contend with the tapply() loop at the interpreted level. I'll leave speed comparisons to you.

Cheers,
Bert

## Note: if, in fact, your data frame is arranged in a regular way, with, e.g., your (SID, DOSENO) groups all of the same size and together, then you can calculate the indices you want directly and skip the tapply business. I'm assuming this is not the case... Again, no data...

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Fri, Sep 2, 2016 at 10:02 AM, Jun Shen <jun.shen.ut at gmail.com> wrote:
[snip]
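Bert's closing note -- that regular, contiguous groups let you compute the indices directly -- can be sketched as follows. This is a hypothetical layout (sorted data, a made-up group size k of 4 rows per (SID, DOSENO) group), not the poster's actual data:

```r
## Hypothetical regular layout: data sorted by group, all groups of size k
set.seed(1001)
k  <- 4                                    # assumed rows per (SID, DOSENO) group
df <- data.frame(SID    = rep(1:5, each = 2 * k),
                 DOSENO = rep(rep(1:2, each = k), times = 5),
                 y      = runif(40))

## The last row of each group is then simply every k-th row --
## no grouping machinery needed at all
ix <- seq(k, nrow(df), by = k)
df[ix, ]   # 10 rows: one per (SID, DOSENO) group
```

If the data are not sorted this way, a one-time ordering by the grouping columns first would restore the regularity, at the cost of an order() call.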
ruipbarradas at sapo.pt
2016-Sep-02 17:57 UTC
[R] Improve code efficiency with do.call, rbind and split construction
Hello,

Try ?aggregate, it's probably faster. With a made-up data.frame, since you haven't provided us with a dataset:

simout.s1 <- data.frame(SID = rep(LETTERS[1:2], 10),
                        DOSENO = rep(letters[1:4], each = 5),
                        value = rnorm(20))

res2 <- aggregate(simout.s1$value,
                  list(simout.s1$SID, simout.s1$DOSENO),
                  function(x) x[NROW(x)])
names(res2) <- names(simout.s1)

Use dput to post a data example. Something like the following:

dput(head(simout.s1, 50))  # paste the output of this in your next mail

Hope this helps,

Rui Barradas

Citando Jun Shen <jun.shen.ut at gmail.com>:
[snip]
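One caveat with the aggregate() call above: it returns only the aggregated value column, not the whole last row. A sketch of one way around this (untested against the real data) is to aggregate row numbers instead of values, then subset the original data frame with them:

```r
## Same made-up stand-in as above, since the real simout.s1 was not posted
set.seed(7)
simout.s1 <- data.frame(SID    = rep(LETTERS[1:2], 10),
                        DOSENO = rep(letters[1:4], each = 5),
                        value  = rnorm(20))

## Aggregate row numbers rather than values: the result's x column then
## holds the row index of each group's last row
ix <- aggregate(seq_len(nrow(simout.s1)),
                by  = list(SID = simout.s1$SID, DOSENO = simout.s1$DOSENO),
                FUN = function(i) i[length(i)])

res <- simout.s1[ix$x, ]   # full last rows, all columns retained
```

This keeps every column of the original data frame, whereas aggregating the values directly would require one aggregate() call per column.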
Jun Shen
2016-Sep-02 18:37 UTC
[R] Improve code efficiency with do.call, rbind and split construction
Hi Bert,

This is the best method I have seen this year! do.call/rbind has just gone to the museum :) It took ~30 seconds to get the results. You deserve a medal!!!!

Jun

On Fri, Sep 2, 2016 at 1:51 PM, Bert Gunter <bgunter.4567 at gmail.com> wrote:
[snip]
Charles C. Berry
2016-Sep-02 18:50 UTC
[R] Improve code efficiency with do.call, rbind and split construction
On Fri, 2 Sep 2016, Bert Gunter wrote:

[snip]

> The "trick" is to use tapply() to select the necessary row indices of
> your data frame and forget about all the do.call and rbind stuff. e.g.

I agree the way to go is "select the necessary row indices" but I get there a different way. See below.

[Bert's example snipped]

jx <- which(!duplicated(df[, c("f", "g")], fromLast = TRUE))

xtabs(jx ~ f + g, df[jx, ])  ## Show equivalence to Bert's `ix'

   g
f    a  b   c  d  e  f
  A 94 69 100 59 80 87
  B 89 57  65 90 75 88
  C 85 92  86 95 97 62
  D 47 73  72 74 99 96

Chuck
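Since Bert left speed comparisons to the reader, here is a rough benchmarking sketch of the three approaches on synthetic data (made-up sizes: 500 SID x 4 DOSENO, 10 rows per group; actual timings will vary by machine and data shape):

```r
## Synthetic data with every (SID, DOSENO) combination present
set.seed(1)
d <- data.frame(SID    = rep(1:500, each = 40),
                DOSENO = rep(1:4, times = 5000),
                y      = runif(20000))

## 1. The original split / do.call / rbind approach
t1 <- system.time(
  r1 <- do.call(rbind, lapply(split(d, d[c("SID", "DOSENO")]),
                              function(x) x[nrow(x), ]))
)

## 2. Bert's tapply on row indices
t2 <- system.time({
  i2 <- as.vector(with(d, tapply(seq_len(nrow(d)), list(SID, DOSENO),
                                 function(x) x[length(x)])))
  r2 <- d[i2, ]
})

## 3. Chuck's duplicated() approach
t3 <- system.time({
  i3 <- which(!duplicated(d[c("SID", "DOSENO")], fromLast = TRUE))
  r3 <- d[i3, ]
})

## All three select the same set of rows (only the row order differs)
setequal(i2, i3)                 # TRUE
nrow(r1) == length(i3)           # TRUE: 2000 groups either way
rbind(split = t1["elapsed"], tapply = t2["elapsed"], dup = t3["elapsed"])
```

Note that, unlike the direct-indexing shortcut, the duplicated() approach needs no assumption that groups are contiguous or equally sized: fromLast = TRUE picks the last occurrence of each (SID, DOSENO) pair wherever it sits.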