Patrick Connolly
2015-Mar-26 06:48 UTC
[R] Using and abusing %>% (was Re: Why can't I access this type?)
On Wed, 25-Mar-2015 at 03:14PM +0100, Henric Winell wrote:

...

|> Well... Opinions may perhaps differ, but apart from '%>%' being
|> butt-ugly it's also fairly slow:

Beauty, it is said, is in the eye of the beholder.  I'm impressed by
the way using %>% reduces or eliminates complicated nested brackets.
In this tiny example it's not obvious, but it becomes very clear if
the objective is to sort the dataframe by three or four columns, do
various lots of aggregation, and then return a largish number of
consecutive columns while omitting the rest.  It's very easy to see
what's going on without the need for intermediate objects.

|> .....
|> Unit: microseconds
|>                                                                       expr
|>   subset(all.states, all.states$Frost > 150, select = c("state", "Frost"))
|>                    all.states[all.states$Frost > 150, c("state", "Frost")]
|>               all.states %>% filter(Frost > 150) %>% select(state, Frost)
|>       min       lq      mean    median        uq      max neval cld
|>   139.112  148.673  163.3960  159.1760  170.7895 1763.200  1000  b
|>   104.039  111.973  127.2138  120.4395  128.6640 1381.809  1000  a
|>  1010.076 1033.519 1133.1469 1107.8480 1175.1800 2932.206  1000   c

It's no surprise that instructing a computer in something closer to
human language is an order of magnitude slower.  I'm sure you'd get
something even quicker using machine code.  I spend 3 or 4 orders of
magnitude more time writing code than running it.  It's much more
important to me to be able to read and modify code than it is to have
it run at optimum speed.

|> Of course, this doesn't matter for interactive one-off use.  But
|> lately I've seen examples of the '%>%' operator creeping into
|> functions in packages.

That could indicate that %>% is seductively easy to use.  It's
probably true that there are places where it should be done the hard
way.

|> However, it would be nice to see a fast pipe operator as part of
|> base R.
|>
|>
|> Henric Winell

--
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
   ___    Patrick Connolly
  {~._.~}          Great minds discuss ideas
  _( Y )_        Average minds discuss events
 (:_~*~_:)       Small minds discuss people
  (_)-(_)        ..... Eleanor Roosevelt
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
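For concreteness, the contrast Patrick describes might look like the
minimal sketch below.  The data frame 'df' and its columns 'grp' and
'val' are invented for illustration; the two forms run the same four
dplyr steps, first nested (read inside-out), then chained with '%>%'
(read top to bottom):

    library(dplyr)
    df <- data.frame(grp = c("a", "a", "b"), val = 1:3)

    ## Nested form: the first step, group_by(), is buried innermost.
    res <- select(
        arrange(
            summarise(group_by(df, grp), total = sum(val)),
            desc(total)),
        grp, total)

    ## Chained form: same steps, no intermediate objects, read in order.
    res <- df %>%
        group_by(grp) %>%
        summarise(total = sum(val)) %>%
        arrange(desc(total)) %>%
        select(grp, total)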
Henric Winell
2015-Mar-27 14:27 UTC
[R] Using and abusing %>% (was Re: Why can't I access this type?)
On 2015-03-26 07:48, Patrick Connolly wrote:

> On Wed, 25-Mar-2015 at 03:14PM +0100, Henric Winell wrote:
>
> ...
>
> |> Well... Opinions may perhaps differ, but apart from '%>%' being
> |> butt-ugly it's also fairly slow:
>
> Beauty, it is said, is in the eye of the beholder.  I'm impressed by
> the way using %>% reduces or eliminates complicated nested brackets.

I didn't dispute whether '%>%' may be useful -- I just pointed out
that it is slow.  However, it is only part of the problem: 'filter()'
and 'select()', although aesthetically pleasing, also seem to be slow:

> library(dplyr)             ## provides %>%, filter() and select()
> library(microbenchmark)
> all.states <- data.frame(state.x77, Name = rownames(state.x77))
>
> f1 <- function()
+     all.states[all.states$Frost > 150, c("Name", "Frost")]
>
> f2 <- function()
+     subset(all.states, Frost > 150, select = c("Name", "Frost"))
>
> f3 <- function() {
+     filt <- subset(all.states, Frost > 150)
+     subset(filt, select = c("Name", "Frost"))
+ }
>
> f4 <- function()
+     all.states %>% subset(Frost > 150) %>%
+         subset(select = c("Name", "Frost"))
>
> f5 <- function()
+     select(filter(all.states, Frost > 150), Name, Frost)
>
> f6 <- function()
+     all.states %>% filter(Frost > 150) %>% select(Name, Frost)
>
> mb <- microbenchmark(
+     f1(), f2(), f3(), f4(), f5(), f6(),
+     times = 1000L
+ )
> print(mb, signif = 3L)
Unit: microseconds
 expr min   lq      mean median   uq  max neval cld
 f1() 115  124  134.8812    129  134 1500  1000 a
 f2() 128  141  147.4694    145  151 1520  1000 a
 f3() 303  328  344.3175    338  348 1740  1000  b
 f4() 458  494  518.0830    510  523 1890  1000   c
 f5() 806  848  887.7270    875  894 3510  1000    d
 f6() 971 1010 1056.5659   1040 1060 3110  1000     e

So, using '%>%' but leaving 'filter()' and 'select()' out of the
equation, as in 'f4()', is only half as bad as the "full" 'dplyr'
idiom in 'f6()'.  In this case, since we're talking microseconds, the
speed-up is negligible -- but that *is* beside the point.

> In this tiny example it's not obvious but it's very clear if the
> objective is to sort the dataframe by three or four columns and
> various lots of aggregation then returning a largish number of
> consecutive columns, omitting the rest.  It's very easy to see
> what's going on without the need for intermediate objects.

Why are you opposed to using intermediate objects?  In this case, as
can be seen from 'f3()', it will also have the benefit of being
faster than either '%>%' or the "full" 'dplyr' idiom.

> |> [...]
>
> It's no surprise that instructing a computer in something closer to
> human language is an order of magnitude slower.

Certainly not true, at least for compiled languages.  In any case,
judging from off-list correspondence, it definitely came as a
surprise to some R users...

Given that '%>%' is so heavily marketed through 'dplyr', where the
latter is said to provide "blazing fast performance for in-memory
data by writing key pieces in C++" and "a fast, consistent tool for
working with data frame like objects, both in memory and out of
memory", I don't think it's far-fetched to expect that it should be
more performant than base R.

> I'm sure you'd get something even quicker using machine code.

Don't be ridiculous.  We're mainly discussing

    all.states[all.states$Frost > 150, c("state", "Frost")]

vs.

    all.states %>% filter(Frost > 150) %>% select(state, Frost)

i.e., pure R code.

> I spend 3 or 4 orders of magnitude more time writing code than
> running it.

You and me both.  But that doesn't mean speed is of no or little
importance.

> It's much more important to me to be able to read and modify than
> it is to have it run at optimum speed.

Good for you.
But surely, if this is your goal, nothing beats intermediate objects.
And like I said, it may still be faster than the 'dplyr' idiom.

> |> Of course, this doesn't matter for interactive one-off use.  But
> |> lately I've seen examples of the '%>%' operator creeping into
> |> functions in packages.
>
> That could indicate that %>% is seductively easy to use.  It's
> probably true that there are places where it should be done the hard
> way.

We all know how easy it is to write ugly and sluggish code in R.  But
'foo[i, j]' is neither ugly nor sluggish and certainly not "the hard
way."

> |> However, it would be nice to see a fast pipe operator as part of
> |> base R.

Heck, it doesn't even have to be fast as long as it's a bit more
elegant than '%>%'.


Henric Winell
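Henric's closing wish can at least be prototyped: a pipe is
expressible in a few lines of base R.  The sketch below is not from
the thread; the operator name '%|>%' is invented, and this toy only
handles right-hand sides written as calls, f(args):

    ## A toy pipe in pure base R ('%|>%' is a made-up name).
    ## It rewrites  lhs %|>% f(a, b)  as  f(lhs, a, b)  and evaluates
    ## the rebuilt call in the caller's frame.
    `%|>%` <- function(lhs, rhs) {
        cl <- match.call()
        f  <- cl$rhs                  # the unevaluated call f(a, b)
        eval(as.call(c(f[[1]], cl$lhs, as.list(f)[-1])), parent.frame())
    }

    ## Usage, with 'all.states' from the benchmark above:
    ## all.states %|>% subset(Frost > 150) %|>%
    ##     subset(select = c("Name", "Frost"))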
Hadley Wickham
2015-Mar-28 04:40 UTC
[R] Using and abusing %>% (was Re: Why can't I access this type?)
> I didn't dispute whether '%>%' may be useful -- I just pointed out
> that it is slow.  However, it is only part of the problem: 'filter()'
> and 'select()', although aesthetically pleasing, also seem to be slow:
>
> [benchmark code for f1() through f6() snipped; results below]
>
> Unit: microseconds
>  expr min   lq      mean median   uq  max neval cld
>  f1() 115  124  134.8812    129  134 1500  1000 a
>  f2() 128  141  147.4694    145  151 1520  1000 a
>  f3() 303  328  344.3175    338  348 1740  1000  b
>  f4() 458  494  518.0830    510  523 1890  1000   c
>  f5() 806  848  887.7270    875  894 3510  1000    d
>  f6() 971 1010 1056.5659   1040 1060 3110  1000     e
>
> So, using '%>%' but leaving 'filter()' and 'select()' out of the
> equation, as in 'f4()', is only half as bad as the "full" 'dplyr'
> idiom in 'f6()'.  In this case, since we're talking microseconds,
> the speed-up is negligible -- but that *is* beside the point.

When benchmarking it's important to consider both the relative and
the absolute difference, and to think about how the cost scales as
the data grows -- the cost of using %>% is fixed, and 500 µs doesn't
seem like a huge performance penalty to me.

Hadley

--
http://had.co.nz/
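Hadley's fixed-cost claim is easy to probe.  The sketch below is not
from the thread: piping into identity() does no data work, so the gap
between the 'plain' and 'piped' timings is roughly the price of '%>%'
itself, and it should stay about the same as the data grow:

    library(magrittr)           ## or library(dplyr); both provide %>%
    library(microbenchmark)

    small <- data.frame(x = runif(1e2))
    big   <- data.frame(x = runif(1e6))

    ## identity() returns its argument unchanged, so any difference
    ## between the plain and piped rows is the pipe's own overhead.
    microbenchmark(
        plain_small = identity(small),
        piped_small = small %>% identity(),
        plain_big   = identity(big),
        piped_big   = big %>% identity(),
        times = 100L
    )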
Patrick Connolly
2015-Mar-28 07:48 UTC
[R] Using and abusing %>% (was Re: Why can't I access this type?)
On Fri, 27-Mar-2015 at 03:27PM +0100, Henric Winell wrote:

|> On 2015-03-26 07:48, Patrick Connolly wrote:
|>
|> > On Wed, 25-Mar-2015 at 03:14PM +0100, Henric Winell wrote:
|> >
|> > ...
|> >
|> > |> Well... Opinions may perhaps differ, but apart from '%>%' being
|> > |> butt-ugly it's also fairly slow:
|> >
|> > Beauty, it is said, is in the eye of the beholder.  I'm impressed by
|> > the way using %>% reduces or eliminates complicated nested brackets.
|>
|> I didn't dispute whether '%>%' may be useful -- I just pointed out

Likewise, I didn't dispute that it might not be as fast as other
ways, but I was disputing the claim that it was ugly.

|> that it is slow.  However, it is only part of the problem:
|> 'filter()' and 'select()', although aesthetically pleasing, also
|> seem to be slow:

So not 'butt ugly' like '%>%'?

|> ....
|> > mb <- microbenchmark(
|> +     f1(), f2(), f3(), f4(), f5(), f6(),
|> +     times = 1000L
|> + )
|> > print(mb, signif = 3L)
|> Unit: microseconds
|>  expr min   lq      mean median   uq  max neval cld
|>  f1() 115  124  134.8812    129  134 1500  1000 a
|>  f2() 128  141  147.4694    145  151 1520  1000 a
|>  f3() 303  328  344.3175    338  348 1740  1000  b
|>  f4() 458  494  518.0830    510  523 1890  1000   c
|>  f5() 806  848  887.7270    875  894 3510  1000    d
|>  f6() 971 1010 1056.5659   1040 1060 3110  1000     e
|>
|> So, using '%>%', but leaving 'filter()' and 'select()' out of the
|> equation, as in 'f4()' is only half as bad as the "full" 'dplyr'
|> idiom in 'f6()'.  In this case, since we're talking microseconds,
|> the speed-up is negligible but that *is* beside the point.

Agreed that the more 'dplyr' is used the slower it gets, but I don't
agree that it's an issue except in packages that should be optimized.
The lack of speed won't stop me using it any more than I'll stop
using dataframes because matrices are much faster.  The OP's example
can be done using matrix syntax:

    state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE]

which is more than an order of magnitude faster than subscripting a
dataframe.  See No. 4 here:

microbenchmark(## 1. using subset()
               subset(all.states, all.states$Frost > 150,
                      select = c("state", "Frost")),
               ## 2. standard R indexing
               all.states[all.states$Frost > 150, c("state", "Frost")],
               ## 3. leave out redundant 'state' column
               all.states[all.states$Frost > 150, "Frost", drop = FALSE],
               ## 4. avoid using 'slow' dataframes altogether
               state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE],
               ## 5. easy, slow way without square brackets or quote marks
               all.states %>% filter(Frost > 150) %>% select(state, Frost),
               times = 1000L
               )
Unit: microseconds
                                                                      expr
 subset(all.states, all.states$Frost > 150, select = c("state", "Frost"))
                  all.states[all.states$Frost > 150, c("state", "Frost")]
                all.states[all.states$Frost > 150, "Frost", drop = FALSE]
             state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE]
              all.states %>% filter(Frost > 150) %>% select(state, Frost)
      min        lq       mean    median        uq      max neval cld
  223.960  229.9290  236.16557  232.4060  241.4165  291.083  1000  c
  177.187  182.6075  203.04666  185.1475  194.4815 7259.760  1000  c
  125.281  130.4835  135.83826  132.6985  141.7375  210.576  1000  b
    6.442   10.3860   10.61733   11.0405   11.4855   25.077  1000 a
 1416.592 1437.7015 1562.91898 1447.5695 1473.4440 9394.071  1000   d

[...]

|>
|> > In this tiny example it's not obvious but it's very clear if the
|> > objective is to sort the dataframe by three or four columns and
|> > various lots of aggregation then returning a largish number of
|> > consecutive columns, omitting the rest.
|> > It's very easy to see what's
|> > going on without the need for intermediate objects.
|>
|> Why are you opposed to using intermediate objects?  In this case,

I'm not opposed to intermediate objects, nor to dogs.  It's just
easier to keep things tidy without either.

|> as can be seen from 'f3()', it will also have the benefit of being
|> faster than either '%>%' or the "full" 'dplyr' idiom.
|>
|> > |> [...]
|> >
|> > It's no surprise that instructing a computer in something closer to
|> > human language is an order of magnitude slower.
|>
|> Certainly not true, at least for compiled languages.  In any case,
|> judging from off-list correspondence, it definitely came as a
|> surprise to some R users...
|>
|> Given that '%>%' is so heavily marketed through 'dplyr', where the
|> latter is said to provide "blazing fast performance for in-memory
|> data by writing key pieces in C++" and "a fast, consistent tool for
|> working with data frame like objects, both in memory and out of
|> memory", I don't think it's far-fetched to expect that it should be
|> more performant than base R.

I've never come across 'marketing' of free software.  Evidently
that's a looser use of the word.

...

|> > I spend 3 or 4 orders of magnitude more time writing code than
|> > running it.
|>
|> You and me both.  But that doesn't mean speed is of no or little
|> importance.

I never claimed it was.  Tardiness hasn't yet become an issue for me.
When it does, I'll revert to the old ways.

|> > It's much more important to me to be able to read and modify than
|> > it is to have it run at optimum speed.
|>
|> Good for you.  But surely, if this is your goal, nothing beats
|> intermediate objects.

Nothing except chaining, that is.  I went 16 years without it and now
find it amazing how useful it is.  As they say: you're never too old
to learn.

|> And like I said, it may still be faster than the 'dplyr' idiom.
|>
|> > |> Of course, this doesn't matter for interactive one-off use.  But
|> > |> lately I've seen examples of the '%>%' operator creeping into
|> > |> functions in packages.
|> >
|> > That could indicate that %>% is seductively easy to use.  It's
|> > probably true that there are places where it should be done the hard
|> > way.
|>
|> We all know how easy it is to write ugly and sluggish code in R.
|> But 'foo[i,j]' is neither ugly nor sluggish and certainly not "the
|> hard way."

I meant to put a ':-)' in there.  Such adjectives as 'easy' and
'hard' are relative.  There's little difference in difficulty at each
step, but integrating the steps and revising them later are
considerably easier using the so-called "'dplyr' idiom" -- especially
if each link in the chain is on a separate line.

|> > |> However, it would be nice to see a fast pipe operator as part of
|> > |> base R.
|>
|> Heck, it doesn't even have to be fast as long as it's a bit more
|> elegant than '%>%'.

IMHO, %>% fits in nicely with %/%, %%, and %in%.  Elegance, like
beauty, is in the eye of the beholder.

|>
|> Henric Winell

--
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
   ___    Patrick Connolly
  {~._.~}          Great minds discuss ideas
  _( Y )_        Average minds discuss events
 (:_~*~_:)       Small minds discuss people
  (_)-(_)        ..... Eleanor Roosevelt
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
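Patrick's last two points can be made concrete with a small sketch
(not from the thread).  Any '%...%' name is an ordinary user-definable
infix operator in R, which is why '%>%' looks syntactically at home
beside %%, %/% and %in%; and a chain with one verb per line is easy to
revise.  The '%+%' operator below is invented for illustration, and
'all.states' is the data frame from Henric's benchmark:

    ## 1. '%...%' names are ordinary user-defined infix operators:
    `%+%` <- function(a, b) paste0(a, b)   # made-up, for illustration
    "pi" %+% "pe"     # "pipe"
    10 %% 3           # 1: modulo, base R
    10 %/% 3          # 3: integer division, base R
    2 %in% c(1, 2)    # TRUE: set membership, base R

    ## 2. One verb per line makes a chain easy to revise: a step can
    ##    be added or disabled without rebalancing nested brackets.
    library(dplyr)
    all.states %>%
        filter(Frost > 150) %>%
        ## arrange(desc(Frost)) %>%    # toggle a step by (un)commenting
        select(Name, Frost)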