On 09/16/2015 04:41 PM, Bert Gunter wrote:
> Yes! Chuck's use of mapply is exactly the split/combine strategy I was
> looking for. In retrospect, exactly how one should think about it.
> Many thanks to all for a constructive discussion.
>
> -- Bert
>
> Bert Gunter
>
>>>> Use mapply like this on large problems:
>>>>
>>>> unsplit(
>>>>   mapply(
>>>>     function(x,z) eval( x, list( y=z )),
>>>>     expression( A=y*2, B=y+3, C=sqrt(y) ),
>>>>     split( dat$Flow, dat$ASB ),
>>>>     SIMPLIFY=FALSE),
>>>>   dat$ASB)
>>>>
>>>> Chuck

Is there any reason not to use data.table for this purpose, especially if
efficiency is of concern?

---

# load data.table and microbenchmark
library(data.table)
library(microbenchmark)
#
# prepare data
DF <- data.frame(
  ASB = rep_len(factor(LETTERS[1:3]), 3e5),
  Flow = rnorm(3e5)^2)
DT <- as.data.table(DF)
DT[, ASB := as.character(ASB)]
#
# define functions
#
# Chuck's version
fnSplit <- function(dat) {
  unsplit(
    mapply(
      function(x,z) eval( x, list( y=z )),
      expression( A=y*2, B=y+3, C=sqrt(y) ),
      split( dat$Flow, dat$ASB ),
      SIMPLIFY=FALSE),
    dat$ASB)
}
#
# data.table way (IMHO, much easier to read)
fnDataTable <- function(dat) {
  dat[,
      result :=
        if (.BY == "A") {
          2 * Flow
        } else if (.BY == "B") {
          3 + Flow
        } else if (.BY == "C") {
          sqrt(Flow)
        },
      by = ASB]
}
#
# benchmark
#
microbenchmark(fnSplit(DF), fnDataTable(DT))
identical(fnSplit(DF), fnDataTable(DT)[, result])

---

Actually, in Chuck's version the unsplit() part is slow. If the order is
not of concern (e.g., DF is reordered before calling fnSplit), fnSplit is
comparable to the DT-version.

Denes
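
[Not part of the thread: a minimal sketch of the variant Dénes alludes to in
his last remark, i.e. dropping the unsplit() step entirely when the original
row order does not matter. The function name fnSplitNoOrder is made up for
illustration.]

# unsplit()-free variant: returns the results grouped by ASB level
# (all A's, then all B's, then all C's) rather than in the original row order
fnSplitNoOrder <- function(dat) {
  unlist(
    mapply(
      function(x, z) eval(x, list(y = z)),
      expression(A = y * 2, B = y + 3, C = sqrt(y)),
      split(dat$Flow, dat$ASB),
      SIMPLIFY = FALSE),
    use.names = FALSE)
}
# It should agree with fnSplit() once DF has been sorted by ASB, e.g.
# DF2 <- DF[order(DF$ASB), ]; identical(fnSplit(DF2), fnSplitNoOrder(DF2))
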
Dénes:

A fair point! The only reason I have is ignorance -- I have not used
data.table. I am not surprised that it, and perhaps other packages (dplyr
maybe?), can do things in a reasonable way very efficiently. The only
problem is that it requires us to learn yet another package/paradigm.
There may also be issues with its flexibility compared to base R data
structures, but, again, I must plead ignorance here.

It is interesting that, mod the unsplit reconstruction of the original
vectors, Chuck's base R solution is as efficient as data.table's.

Cheers,
Bert

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge is
certainly not wisdom."
   -- Clifford Stoll


On Wed, Sep 16, 2015 at 4:42 PM, Dénes Tóth <toth.denes at ttk.mta.hu> wrote:
>
> Is there any reason not to use data.table for this purpose, especially if
> efficiency is of concern?
>
> [...]
>
> Actually, in Chuck's version the unsplit() part is slow. If the order is
> not of concern (e.g., DF is reordered before calling fnSplit), fnSplit is
> comparable to the DT-version.
>
> Denes
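
[Not part of the thread: since Bert mentions dplyr in passing, here is a
rough sketch of what a dplyr version might look like. It assumes a dplyr
release that provides case_when(), which arrived later than this thread, so
treat it as illustrative rather than a drop-in benchmark entry.]

library(dplyr)

fnDplyr <- function(dat) {
  # all three right-hand sides are evaluated for every row and the matching
  # one selected per row, so this is simple to read but not maximally efficient
  dat %>%
    mutate(result = case_when(
      ASB == "A" ~ 2 * Flow,
      ASB == "B" ~ 3 + Flow,
      ASB == "C" ~ sqrt(Flow)
    ))
}
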
> On 17 Sep 2015, at 01:42, Dénes Tóth <toth.denes at ttk.mta.hu> wrote:
>
> Is there any reason not to use data.table for this purpose, especially if
> efficiency is of concern?
>
> [...]
>
> Actually, in Chuck's version the unsplit() part is slow. If the order is
> not of concern (e.g., DF is reordered before calling fnSplit), fnSplit is
> comparable to the DT-version.

But David's version is faster than Chuck's fnSplit. I modified David's
solution slightly to get a result that is identical to fnSplit.

# David's version
# my modification to return a vector just like fnSplit
fnDavid <- function(dat) {
  z <- mapply(
    function(x,z) eval( x, list( y=z )),
    expression(A= y*2, B=y+3, C=sqrt(y) ),
    split( dat$Flow, dat$ASB ),
    USE.NAMES=FALSE, SIMPLIFY=TRUE
  )
  as.vector(t(z))
}

Added this to Dénes's code. Benchmarking with R package rbenchmark and
testing the results like this:

library(rbenchmark)
benchmark(fnSplit(DF), fnDataTable(DT), fnDavid(DF))
identical(fnSplit(DF), fnDataTable(DT)[, result])
identical(fnSplit(DF), fnDavid(DF))

gave this:

             test replications elapsed relative user.self sys.self user.child sys.child
2 fnDataTable(DT)          100   0.829    1.000     0.762    0.066          0         0
3     fnDavid(DF)          100   1.615    1.948     1.515    0.098          0         0
1     fnSplit(DF)          100   2.878    3.472     2.685    0.190          0         0

> identical(fnSplit(DF), fnDataTable(DT)[, result])
[1] TRUE
> identical(fnSplit(DF), fnDavid(DF))
[1] TRUE

Berend
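
[Not part of the thread: a toy illustration of why the as.vector(t(z)) step
in fnDavid() recovers the original row order here. With SIMPLIFY = TRUE,
mapply() returns a matrix z with one column per level (A, B, C) and one row
per cycle through the levels; transposing and flattening therefore interleaves
the columns as A1, B1, C1, A2, B2, C2, ..., which matches
rep_len(factor(LETTERS[1:3]), n) but not an arbitrary ASB.]

# two "cycles" worth of made-up group results
z <- cbind(A = c(10, 40), B = c(20, 50), C = c(30, 60))
as.vector(t(z))
# [1] 10 20 30 40 50 60   <- A1, B1, C1, A2, B2, C2
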
On Thu, 17 Sep 2015, Berend Hasselman wrote:

> But David's version is faster than Chuck's fnSplit. I modified David's
> solution slightly to get a result that is identical to fnSplit.
>
> [...]
>
>> identical(fnSplit(DF), fnDataTable(DT)[, result])
> [1] TRUE
>> identical(fnSplit(DF), fnDavid(DF))
> [1] TRUE

The above `TRUE' depends on the structure of ASB here. identical(...) is
often FALSE in the general case. A permutation of ASB is enough to show this:

> DF$ASB <- sample(DF$ASB)
> identical(fnSplit(DF), fnDavid(DF))
[1] FALSE

unsplit() is the price you pay to cope with general orderings.

Chuck
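
[Not part of the thread: a minimal base R illustration of Chuck's closing
point. For any grouping factor f, unsplit(split(x, f), f) restores the
original row order, which is what makes fnSplit() correct for an arbitrary
(e.g. permuted) ASB, at the cost of that extra pass over the data.]

x <- c(5, 1, 4, 2, 3)
f <- factor(c("B", "A", "B", "A", "B"))
split(x, f)                           # $A: 1 2   $B: 5 4 3
identical(unsplit(split(x, f), f), x)
# [1] TRUE
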