On 09/16/2015 04:41 PM, Bert Gunter wrote:
> Yes! Chuck's use of mapply is exactly the split/combine strategy I was
> looking for. In retrospect, exactly how one should think about it.
> Many thanks to all for a constructive discussion.
>
> -- Bert
>
> Bert Gunter
>
>>>> Use mapply like this on large problems:
>>>>
>>>> unsplit(
>>>>   mapply(
>>>>     function(x,z) eval( x, list( y=z )),
>>>>     expression( A=y*2, B=y+3, C=sqrt(y) ),
>>>>     split( dat$Flow, dat$ASB ),
>>>>     SIMPLIFY=FALSE),
>>>>   dat$ASB)
>>>>
>>>> Chuck

Is there any reason not to use data.table for this purpose, especially if
efficiency is of concern?

---

# load data.table and microbenchmark
library(data.table)
library(microbenchmark)
#
# prepare data
DF <- data.frame(
  ASB = rep_len(factor(LETTERS[1:3]), 3e5),
  Flow = rnorm(3e5)^2)
DT <- as.data.table(DF)
DT[, ASB := as.character(ASB)]
#
# define functions
#
# Chuck's version
fnSplit <- function(dat) {
  unsplit(
    mapply(
      function(x,z) eval( x, list( y=z )),
      expression( A=y*2, B=y+3, C=sqrt(y) ),
      split( dat$Flow, dat$ASB ),
      SIMPLIFY=FALSE),
    dat$ASB)
}
#
# data.table way (IMHO, much easier to read)
fnDataTable <- function(dat) {
  dat[,
      result :=
        if (.BY == "A") {
          2 * Flow
        } else if (.BY == "B") {
          3 + Flow
        } else if (.BY == "C") {
          sqrt(Flow)
        },
      by = ASB]
}
#
# benchmark
#
microbenchmark(fnSplit(DF), fnDataTable(DT))
identical(fnSplit(DF), fnDataTable(DT)[, result])

---

Actually, in Chuck's version the unsplit() part is slow. If the order is
not of concern (e.g., DF is reordered before calling fnSplit), fnSplit is
comparable to the DT-version.

Denes
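
[Not part of the thread: a minimal sketch of the variant Dénes alludes to in
his last remark, i.e. dropping the unsplit() step entirely when the original
row order does not matter. The function name fnSplitNoOrder is made up for
illustration.]

# unsplit()-free variant: returns the results grouped by ASB level
# (all A's, then all B's, then all C's) rather than in the original row order
fnSplitNoOrder <- function(dat) {
  unlist(
    mapply(
      function(x, z) eval(x, list(y = z)),
      expression(A = y * 2, B = y + 3, C = sqrt(y)),
      split(dat$Flow, dat$ASB),
      SIMPLIFY = FALSE),
    use.names = FALSE)
}
# It should agree with fnSplit() once DF has been sorted by ASB, e.g.
# DF2 <- DF[order(DF$ASB), ]; identical(fnSplit(DF2), fnSplitNoOrder(DF2))
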
Dénes:

A fair point! The only reason I have is ignorance -- I have not used
data.table. I am not surprised that it, and perhaps other packages (dplyr
maybe?), can do things in a reasonable way very efficiently. The only
problem is that it requires us to learn yet another package/paradigm.
There may also be issues with its flexibility compared to base R data
structures, but, again, I must plead ignorance here.

It is interesting that, mod the unsplit reconstruction of the original
vectors, Chuck's base R solution is as efficient as data.table's.

Cheers,
Bert

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge is
certainly not wisdom."
   -- Clifford Stoll


On Wed, Sep 16, 2015 at 4:42 PM, Dénes Tóth <toth.denes at ttk.mta.hu> wrote:
>
> Is there any reason not to use data.table for this purpose, especially if
> efficiency is of concern?
>
> [...]
>
> Actually, in Chuck's version the unsplit() part is slow. If the order is
> not of concern (e.g., DF is reordered before calling fnSplit), fnSplit is
> comparable to the DT-version.
>
> Denes
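
[Not part of the thread: since Bert mentions dplyr in passing, here is a
rough sketch of what a dplyr version might look like. It assumes a dplyr
release that provides case_when(), which arrived later than this thread, so
treat it as illustrative rather than a drop-in benchmark entry.]

library(dplyr)

fnDplyr <- function(dat) {
  # all three right-hand sides are evaluated for every row and the matching
  # one selected per row, so this is simple to read but not maximally efficient
  dat %>%
    mutate(result = case_when(
      ASB == "A" ~ 2 * Flow,
      ASB == "B" ~ 3 + Flow,
      ASB == "C" ~ sqrt(Flow)
    ))
}
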
> On 17 Sep 2015, at 01:42, Dénes Tóth <toth.denes at ttk.mta.hu> wrote:
>
> Is there any reason not to use data.table for this purpose, especially if
> efficiency is of concern?
>
> [...]
>
> Actually, in Chuck's version the unsplit() part is slow. If the order is
> not of concern (e.g., DF is reordered before calling fnSplit), fnSplit is
> comparable to the DT-version.

But David's version is faster than Chuck's fnSplit. I modified David's
solution slightly to get a result that is identical to fnSplit.

# David's version
# my modification to return a vector just like fnSplit
fnDavid <- function(dat) {
  z <- mapply(
    function(x,z) eval( x, list( y=z )),
    expression(A= y*2, B=y+3, C=sqrt(y) ),
    split( dat$Flow, dat$ASB ),
    USE.NAMES=FALSE, SIMPLIFY=TRUE
  )
  as.vector(t(z))
}

Added this to Dénes's code. Benchmarking with R package rbenchmark and
testing the results like this:

library(rbenchmark)
benchmark(fnSplit(DF), fnDataTable(DT), fnDavid(DF))
identical(fnSplit(DF), fnDataTable(DT)[, result])
identical(fnSplit(DF), fnDavid(DF))

gave this:

             test replications elapsed relative user.self sys.self user.child sys.child
2 fnDataTable(DT)          100   0.829    1.000     0.762    0.066          0         0
3     fnDavid(DF)          100   1.615    1.948     1.515    0.098          0         0
1     fnSplit(DF)          100   2.878    3.472     2.685    0.190          0         0

> identical(fnSplit(DF), fnDataTable(DT)[, result])
[1] TRUE
> identical(fnSplit(DF), fnDavid(DF))
[1] TRUE

Berend
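
[Not part of the thread: a toy illustration of why the as.vector(t(z)) step
in fnDavid() recovers the original row order here. With SIMPLIFY = TRUE,
mapply() returns a matrix z with one column per level (A, B, C) and one row
per cycle through the levels; transposing and flattening therefore interleaves
the columns as A1, B1, C1, A2, B2, C2, ..., which matches
rep_len(factor(LETTERS[1:3]), n) but not an arbitrary ASB.]

# two "cycles" worth of made-up group results
z <- cbind(A = c(10, 40), B = c(20, 50), C = c(30, 60))
as.vector(t(z))
# [1] 10 20 30 40 50 60   <- A1, B1, C1, A2, B2, C2
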
On Thu, 17 Sep 2015, Berend Hasselman wrote:

> But David's version is faster than Chuck's fnSplit. I modified David's
> solution slightly to get a result that is identical to fnSplit.
>
> [...]
>
>> identical(fnSplit(DF), fnDataTable(DT)[, result])
> [1] TRUE
>> identical(fnSplit(DF), fnDavid(DF))
> [1] TRUE

The above `TRUE' depends on the structure of ASB here. identical(...) is
often FALSE in the general case. A permutation of ASB is enough to show this:

> DF$ASB <- sample(DF$ASB)
> identical(fnSplit(DF), fnDavid(DF))
[1] FALSE

unsplit() is the price you pay to cope with general orderings.

Chuck
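
[Not part of the thread: a minimal base R illustration of Chuck's closing
point. For any grouping factor f, unsplit(split(x, f), f) restores the
original row order, which is what makes fnSplit() correct for an arbitrary
(e.g. permuted) ASB, at the cost of that extra pass over the data.]

x <- c(5, 1, 4, 2, 3)
f <- factor(c("B", "A", "B", "A", "B"))
split(x, f)                           # $A: 1 2   $B: 5 4 3
identical(unsplit(split(x, f), f), x)
# [1] TRUE
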