Charles C. Berry
2016-Sep-02 18:50 UTC
[R] Improve code efficient with do.call, rbind and split contruction
On Fri, 2 Sep 2016, Bert Gunter wrote:

[snip]

> The "trick" is to use tapply() to select the necessary row indices of
> your data frame and forget about all the do.call and rbind stuff. e.g.

I agree the way to go is "select the necessary row indices" but I get there a different way. See below.

> > set.seed(1001)
> > df <- data.frame(f = factor(sample(LETTERS[1:4],100,rep=TRUE)),
> +                  g = factor(sample(letters[1:6],100,rep=TRUE)),
> +                  y = runif(100))
> >
> > ix <- seq_len(nrow(df))
> >
> > ix <- with(df,tapply(ix,list(f,g),function(x)x[length(x)]))
> > ix
>    a  b   c  d  e  f
> A 94 69 100 59 80 87
> B 89 57  65 90 75 88
> C 85 92  86 95 97 62
> D 47 73  72 74 99 96

jx <- which( !duplicated( df[,c("f","g")], fromLast=TRUE ))

xtabs(jx~f+g,df[jx,]) ## Show equivalence to Bert's `ix'

   g
f     a   b   c   d   e   f
  A  94  69 100  59  80  87
  B  89  57  65  90  75  88
  C  85  92  86  95  97  62
  D  47  73  72  74  99  96

Chuck
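
Editor's sketch: to see why the two selections agree, it may help to run both on a data frame small enough to check by hand. The five rows below are made-up illustrative values, not data from the thread, and the name `d` is hypothetical. One (f, g) cell is deliberately left empty to show a difference the 100-row example does not expose: tapply() returns NA for an empty cell, while duplicated() simply has nothing to report for it.

```r
# Made-up illustrative data (not from the thread); cell (B, b) is empty
d <- data.frame(f = factor(c("A", "A", "B", "A", "B")),
                g = factor(c("a", "b", "a", "a", "a")),
                y = c(0.1, 0.2, 0.3, 0.4, 0.5))

# tapply(): last row index within each (f, g) cell; empty cells come back as NA
ix <- with(d, tapply(seq_len(nrow(d)), list(f, g), function(x) x[length(x)]))

# duplicated(): keep a row when no later row repeats its (f, g) pair
jx <- which(!duplicated(d[, c("f", "g")], fromLast = TRUE))

# Same row indices once the NA cell is dropped and ordering is ignored
stopifnot(all(sort(as.vector(ix[!is.na(ix)])) == jx))
```

Note that `ix` is ordered by cell (column-major over the f-by-g table) whereas `jx` follows the original row order, so `d[ix, ]` and `d[jx, ]` can list the same rows in a different sequence, and `d[ix, ]` would include an all-NA row for any empty cell.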
Bert Gunter
2016-Sep-02 20:48 UTC
[R] Improve code efficient with do.call, rbind and split contruction
Chuck:

I think this is quite clever. But note that the which() is unnecessary: logical indexing suffices, e.g.

df[!duplicated(df[,c("f","g")],fromLast = TRUE),]

I thought that your approach would be faster because it moves comparisons from the tapply() to C code. But I was wrong. e.g. for 1e6 rows:

> set.seed(1001)
> df <- data.frame(f = factor(sample(LETTERS[1:4],1e6,rep=TRUE)),
+                  g = factor(sample(letters[1:6],1e6,rep=TRUE)),
+                  y = runif(1e6))

## Using duplicated()
> system.time(z <- df[!duplicated(df[,c("f","g")],fromLast = TRUE),])
   user  system elapsed
  0.175   0.008   0.183

## Using tapply()
> system.time(
+ {ix <- seq_len(nrow(df));
+ z <- df[with(df,tapply(ix,list(f,g),function(x)x[length(x)])),]
+ })
   user  system elapsed
  0.025   0.003   0.028

This illustrates the faultiness of my "intuition." A guess would be that the subscripting to get the factor combinations and the duplicated.data.frame method take the extra time.

Anyway...

Best,

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
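
Editor's sketch: a single system.time() call can be noisy at these durations, so one way to make the comparison a little more robust is to repeat each timing and take the median. The helper `time_it`, the wrapper functions, and the reduced row count are all assumptions made for illustration, not part of the original messages, and the timings will differ by machine and R version.

```r
set.seed(1001)
n <- 1e5  # assumption: smaller than the 1e6 rows in the thread, to keep the run short
df <- data.frame(f = factor(sample(LETTERS[1:4], n, replace = TRUE)),
                 g = factor(sample(letters[1:6], n, replace = TRUE)),
                 y = runif(n))

# Hypothetical helper: median elapsed seconds over several repetitions
time_it <- function(fun, reps = 5) {
  median(replicate(reps, system.time(fun())[["elapsed"]]))
}

# The two selections from the message, wrapped as functions
via_duplicated <- function() {
  df[!duplicated(df[, c("f", "g")], fromLast = TRUE), ]
}
via_tapply <- function() {
  ix <- seq_len(nrow(df))
  df[with(df, tapply(ix, list(f, g), function(x) x[length(x)])), ]
}

timings <- c(duplicated = time_it(via_duplicated),
             tapply     = time_it(via_tapply))
print(timings)
```

For finer-grained measurements a dedicated benchmarking package would be the usual choice; base R is used here only to keep the sketch self-contained.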
Bert Gunter
2016-Sep-03 17:41 UTC
[R] Improve code efficient with do.call, rbind and split contruction
Chuck et al.:

As I said previously, my intuition about the relative efficiency of tapply() and duplicated() in the context of this thread was wrong. But I wondered exactly how and to what extent. So I've fooled around a bit more and think I understand.

Using the example I gave, the key is to replace the duplicated.data.frame method and the inner data.frame subscripting with the duplicated.default method via with() and the interaction() function (paste()-ing instead takes extra time):

> system.time(z <- with(df,df[!duplicated(interaction(f,g),fromLast = TRUE),]))
   user  system elapsed
  0.039   0.006   0.045

> system.time(
+ {ix <- seq_len(nrow(df));
+ z <- with(df,df[tapply(ix,list(f,g),function(x)x[length(x)]),])
+ })
   user  system elapsed
  0.025   0.005   0.029

tapply() still appears slightly more efficient (which is still surprising to me), but only slightly.

Hope this is informative.

Cheers,
Bert
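
Editor's sketch: the interaction() variant works because it collapses the two factors into a single factor with one level per (f, g) combination, so duplicated() can run on one vector instead of a two-column data frame. The check below, on a smaller simulated data frame (1000 rows is an assumption chosen for speed), suggests that all three approaches discussed in the thread pick out the same rows.

```r
set.seed(1001)
df <- data.frame(f = factor(sample(LETTERS[1:4], 1000, replace = TRUE)),
                 g = factor(sample(letters[1:6], 1000, replace = TRUE)),
                 y = runif(1000))

# duplicated.data.frame on the two grouping columns
keep_df <- !duplicated(df[, c("f", "g")], fromLast = TRUE)

# duplicated.default on a single combined factor
keep_int <- with(df, !duplicated(interaction(f, g), fromLast = TRUE))

# tapply(): last row index per (f, g) cell
ix <- with(df, tapply(seq_len(nrow(df)), list(f, g), function(x) x[length(x)]))

# Both duplicated() variants flag exactly the same rows ...
stopifnot(identical(keep_df, keep_int))
# ... and tapply() yields the same indices, up to ordering
stopifnot(all(sort(as.vector(ix)) == which(keep_int)))
```

The two duplicated() variants return rows in original order, while indexing with the tapply() result returns them in cell order, so a direct identical() comparison of the resulting data frames would need a sort first.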