Patrick Connolly
2015-Mar-26 06:48 UTC
[R] Using and abusing %>% (was Re: Why can't I access this type?)
On Wed, 25-Mar-2015 at 03:14PM +0100, Henric Winell wrote:

...

|> Well... Opinions may perhaps differ, but apart from '%>%' being
|> butt-ugly it's also fairly slow:

Beauty, it is said, is in the eye of the beholder.  I'm impressed by
the way using %>% reduces or eliminates complicated nested brackets.
In this tiny example it's not obvious, but it becomes very clear if
the objective is to sort the dataframe by three or four columns, do
various lots of aggregation, and then return a largish number of
consecutive columns while omitting the rest.  It's very easy to see
what's going on without the need for intermediate objects.

|> .....
|> Unit: microseconds
|>                                                                       expr
|>   subset(all.states, all.states$Frost > 150, select = c("state", "Frost"))
|>                    all.states[all.states$Frost > 150, c("state", "Frost")]
|>               all.states %>% filter(Frost > 150) %>% select(state, Frost)
|>       min       lq      mean    median        uq      max neval cld
|>   139.112  148.673  163.3960  159.1760  170.7895 1763.200  1000  b
|>   104.039  111.973  127.2138  120.4395  128.6640 1381.809  1000  a
|>  1010.076 1033.519 1133.1469 1107.8480 1175.1800 2932.206  1000   c

It's no surprise that instructing a computer in something closer to
human language is an order of magnitude slower.  I'm sure you'd get
something even quicker using machine code.  I spend 3 or 4 orders of
magnitude more time writing code than running it.  It's much more
important to me to be able to read and modify code than it is to have
it run at optimum speed.

|> Of course, this doesn't matter for interactive one-off use.  But
|> lately I've seen examples of the '%>%' operator creeping into
|> functions in packages.

That could indicate that %>% is seductively easy to use.  It's
probably true that there are places where it should be done the hard
way.

|> However, it would be nice to see a fast pipe operator as part of
|> base R.
|>
|>
|> Henric Winell

--
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
   ___    Patrick Connolly
  {~._.~}          Great minds discuss ideas
  _( Y )_        Average minds discuss events
 (:_~*~_:)       Small minds discuss people
  (_)-(_)        ..... Eleanor Roosevelt
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
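For concreteness, the contrast Patrick describes might look like the
minimal sketch below.  The data frame 'df' and its columns 'grp' and
'val' are invented for illustration; the two forms run the same four
dplyr steps, first nested (read inside-out), then chained with '%>%'
(read top to bottom):

    library(dplyr)
    df <- data.frame(grp = c("a", "a", "b"), val = 1:3)

    ## Nested form: the first step, group_by(), is buried innermost.
    res <- select(
        arrange(
            summarise(group_by(df, grp), total = sum(val)),
            desc(total)),
        grp, total)

    ## Chained form: same steps, no intermediate objects, read in order.
    res <- df %>%
        group_by(grp) %>%
        summarise(total = sum(val)) %>%
        arrange(desc(total)) %>%
        select(grp, total)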
Henric Winell
2015-Mar-27 14:27 UTC
[R] Using and abusing %>% (was Re: Why can't I access this type?)
On 2015-03-26 07:48, Patrick Connolly wrote:

> On Wed, 25-Mar-2015 at 03:14PM +0100, Henric Winell wrote:
>
> ...
>
> |> Well... Opinions may perhaps differ, but apart from '%>%' being
> |> butt-ugly it's also fairly slow:
>
> Beauty, it is said, is in the eye of the beholder.  I'm impressed by
> the way using %>% reduces or eliminates complicated nested brackets.

I didn't dispute whether '%>%' may be useful -- I just pointed out
that it is slow.  However, it is only part of the problem: 'filter()'
and 'select()', although aesthetically pleasing, also seem to be slow:

> library(dplyr)             ## provides %>%, filter() and select()
> library(microbenchmark)
> all.states <- data.frame(state.x77, Name = rownames(state.x77))
>
> f1 <- function()
+     all.states[all.states$Frost > 150, c("Name", "Frost")]
>
> f2 <- function()
+     subset(all.states, Frost > 150, select = c("Name", "Frost"))
>
> f3 <- function() {
+     filt <- subset(all.states, Frost > 150)
+     subset(filt, select = c("Name", "Frost"))
+ }
>
> f4 <- function()
+     all.states %>% subset(Frost > 150) %>%
+         subset(select = c("Name", "Frost"))
>
> f5 <- function()
+     select(filter(all.states, Frost > 150), Name, Frost)
>
> f6 <- function()
+     all.states %>% filter(Frost > 150) %>% select(Name, Frost)
>
> mb <- microbenchmark(
+     f1(), f2(), f3(), f4(), f5(), f6(),
+     times = 1000L
+ )
> print(mb, signif = 3L)
Unit: microseconds
 expr min   lq      mean median   uq  max neval cld
 f1() 115  124  134.8812    129  134 1500  1000 a
 f2() 128  141  147.4694    145  151 1520  1000 a
 f3() 303  328  344.3175    338  348 1740  1000  b
 f4() 458  494  518.0830    510  523 1890  1000   c
 f5() 806  848  887.7270    875  894 3510  1000    d
 f6() 971 1010 1056.5659   1040 1060 3110  1000     e

So, using '%>%' but leaving 'filter()' and 'select()' out of the
equation, as in 'f4()', is only half as bad as the "full" 'dplyr'
idiom in 'f6()'.  In this case, since we're talking microseconds, the
speed-up is negligible -- but that *is* beside the point.

> In this tiny example it's not obvious but it's very clear if the
> objective is to sort the dataframe by three or four columns and
> various lots of aggregation then returning a largish number of
> consecutive columns, omitting the rest.  It's very easy to see
> what's going on without the need for intermediate objects.

Why are you opposed to using intermediate objects?  In this case, as
can be seen from 'f3()', it will also have the benefit of being
faster than either '%>%' or the "full" 'dplyr' idiom.

> |> [...]
>
> It's no surprise that instructing a computer in something closer to
> human language is an order of magnitude slower.

Certainly not true, at least for compiled languages.  In any case,
judging from off-list correspondence, it definitely came as a
surprise to some R users...

Given that '%>%' is so heavily marketed through 'dplyr', where the
latter is said to provide "blazing fast performance for in-memory
data by writing key pieces in C++" and "a fast, consistent tool for
working with data frame like objects, both in memory and out of
memory", I don't think it's far-fetched to expect that it should be
more performant than base R.

> I'm sure you'd get something even quicker using machine code.

Don't be ridiculous.  We're mainly discussing

    all.states[all.states$Frost > 150, c("state", "Frost")]

vs.

    all.states %>% filter(Frost > 150) %>% select(state, Frost)

i.e., pure R code.

> I spend 3 or 4 orders of magnitude more time writing code than
> running it.

You and me both.  But that doesn't mean speed is of no or little
importance.

> It's much more important to me to be able to read and modify than
> it is to have it run at optimum speed.

Good for you.
But surely, if this is your goal, nothing beats intermediate objects.
And like I said, it may still be faster than the 'dplyr' idiom.

> |> Of course, this doesn't matter for interactive one-off use.  But
> |> lately I've seen examples of the '%>%' operator creeping into
> |> functions in packages.
>
> That could indicate that %>% is seductively easy to use.  It's
> probably true that there are places where it should be done the hard
> way.

We all know how easy it is to write ugly and sluggish code in R.  But
'foo[i, j]' is neither ugly nor sluggish and certainly not "the hard
way."

> |> However, it would be nice to see a fast pipe operator as part of
> |> base R.

Heck, it doesn't even have to be fast as long as it's a bit more
elegant than '%>%'.


Henric Winell
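Henric's closing wish can at least be prototyped: a pipe is
expressible in a few lines of base R.  The sketch below is not from
the thread; the operator name '%|>%' is invented, and this toy only
handles right-hand sides written as calls, f(args):

    ## A toy pipe in pure base R ('%|>%' is a made-up name).
    ## It rewrites  lhs %|>% f(a, b)  as  f(lhs, a, b)  and evaluates
    ## the rebuilt call in the caller's frame.
    `%|>%` <- function(lhs, rhs) {
        cl <- match.call()
        f  <- cl$rhs                  # the unevaluated call f(a, b)
        eval(as.call(c(f[[1]], cl$lhs, as.list(f)[-1])), parent.frame())
    }

    ## Usage, with 'all.states' from the benchmark above:
    ## all.states %|>% subset(Frost > 150) %|>%
    ##     subset(select = c("Name", "Frost"))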
Hadley Wickham
2015-Mar-28 04:40 UTC
[R] Using and abusing %>% (was Re: Why can't I access this type?)
> I didn't dispute whether '%>%' may be useful -- I just pointed out
> that it is slow.  However, it is only part of the problem: 'filter()'
> and 'select()', although aesthetically pleasing, also seem to be slow:
>
> [benchmark code for f1() through f6() snipped; results below]
>
> Unit: microseconds
>  expr min   lq      mean median   uq  max neval cld
>  f1() 115  124  134.8812    129  134 1500  1000 a
>  f2() 128  141  147.4694    145  151 1520  1000 a
>  f3() 303  328  344.3175    338  348 1740  1000  b
>  f4() 458  494  518.0830    510  523 1890  1000   c
>  f5() 806  848  887.7270    875  894 3510  1000    d
>  f6() 971 1010 1056.5659   1040 1060 3110  1000     e
>
> So, using '%>%' but leaving 'filter()' and 'select()' out of the
> equation, as in 'f4()', is only half as bad as the "full" 'dplyr'
> idiom in 'f6()'.  In this case, since we're talking microseconds,
> the speed-up is negligible -- but that *is* beside the point.

When benchmarking it's important to consider both the relative and
the absolute difference, and to think about how the cost scales as
the data grows -- the cost of using %>% is fixed, and 500 µs doesn't
seem like a huge performance penalty to me.

Hadley

--
http://had.co.nz/
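Hadley's fixed-cost claim is easy to probe.  The sketch below is not
from the thread: piping into identity() does no data work, so the gap
between the 'plain' and 'piped' timings is roughly the price of '%>%'
itself, and it should stay about the same as the data grow:

    library(magrittr)           ## or library(dplyr); both provide %>%
    library(microbenchmark)

    small <- data.frame(x = runif(1e2))
    big   <- data.frame(x = runif(1e6))

    ## identity() returns its argument unchanged, so any difference
    ## between the plain and piped rows is the pipe's own overhead.
    microbenchmark(
        plain_small = identity(small),
        piped_small = small %>% identity(),
        plain_big   = identity(big),
        piped_big   = big %>% identity(),
        times = 100L
    )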
Patrick Connolly
2015-Mar-28 07:48 UTC
[R] Using and abusing %>% (was Re: Why can't I access this type?)
On Fri, 27-Mar-2015 at 03:27PM +0100, Henric Winell wrote:

|> On 2015-03-26 07:48, Patrick Connolly wrote:
|>
|> > On Wed, 25-Mar-2015 at 03:14PM +0100, Henric Winell wrote:
|> >
|> > ...
|> >
|> > |> Well... Opinions may perhaps differ, but apart from '%>%' being
|> > |> butt-ugly it's also fairly slow:
|> >
|> > Beauty, it is said, is in the eye of the beholder.  I'm impressed by
|> > the way using %>% reduces or eliminates complicated nested brackets.
|>
|> I didn't dispute whether '%>%' may be useful -- I just pointed out

Likewise, I didn't dispute that it might not be as fast as other
ways, but I was disputing the claim that it was ugly.

|> that it is slow.  However, it is only part of the problem:
|> 'filter()' and 'select()', although aesthetically pleasing, also
|> seem to be slow:

So not 'butt ugly' like '%>%'?

|> ....
|> > mb <- microbenchmark(
|> +     f1(), f2(), f3(), f4(), f5(), f6(),
|> +     times = 1000L
|> + )
|> > print(mb, signif = 3L)
|> Unit: microseconds
|>  expr min   lq      mean median   uq  max neval cld
|>  f1() 115  124  134.8812    129  134 1500  1000 a
|>  f2() 128  141  147.4694    145  151 1520  1000 a
|>  f3() 303  328  344.3175    338  348 1740  1000  b
|>  f4() 458  494  518.0830    510  523 1890  1000   c
|>  f5() 806  848  887.7270    875  894 3510  1000    d
|>  f6() 971 1010 1056.5659   1040 1060 3110  1000     e
|>
|> So, using '%>%', but leaving 'filter()' and 'select()' out of the
|> equation, as in 'f4()' is only half as bad as the "full" 'dplyr'
|> idiom in 'f6()'.  In this case, since we're talking microseconds,
|> the speed-up is negligible but that *is* beside the point.

Agreed that the more 'dplyr' is used the slower it gets, but I don't
agree that it's an issue except in packages that should be optimized.
The lack of speed won't stop me using it any more than I'll stop
using dataframes because matrices are much faster.  The OP's example
can be done using matrix syntax:

    state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE]

which is more than an order of magnitude faster than subscripting a
dataframe.  See No. 4 here:

microbenchmark(## 1. using subset()
               subset(all.states, all.states$Frost > 150,
                      select = c("state", "Frost")),
               ## 2. standard R indexing
               all.states[all.states$Frost > 150, c("state", "Frost")],
               ## 3. leave out redundant 'state' column
               all.states[all.states$Frost > 150, "Frost", drop = FALSE],
               ## 4. avoid using 'slow' dataframes altogether
               state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE],
               ## 5. easy, slow way without square brackets or quote marks
               all.states %>% filter(Frost > 150) %>% select(state, Frost),
               times = 1000L
               )
Unit: microseconds
                                                                      expr
 subset(all.states, all.states$Frost > 150, select = c("state", "Frost"))
                  all.states[all.states$Frost > 150, c("state", "Frost")]
                all.states[all.states$Frost > 150, "Frost", drop = FALSE]
             state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE]
              all.states %>% filter(Frost > 150) %>% select(state, Frost)
      min        lq       mean    median        uq      max neval cld
  223.960  229.9290  236.16557  232.4060  241.4165  291.083  1000  c
  177.187  182.6075  203.04666  185.1475  194.4815 7259.760  1000  c
  125.281  130.4835  135.83826  132.6985  141.7375  210.576  1000  b
    6.442   10.3860   10.61733   11.0405   11.4855   25.077  1000 a
 1416.592 1437.7015 1562.91898 1447.5695 1473.4440 9394.071  1000   d

[...]

|>
|> > In this tiny example it's not obvious but it's very clear if the
|> > objective is to sort the dataframe by three or four columns and
|> > various lots of aggregation then returning a largish number of
|> > consecutive columns, omitting the rest.
|> > It's very easy to see what's
|> > going on without the need for intermediate objects.
|>
|> Why are you opposed to using intermediate objects?  In this case,

I'm not opposed to intermediate objects, nor to dogs.  It's just
easier to keep things tidy without either.

|> as can be seen from 'f3()', it will also have the benefit of being
|> faster than either '%>%' or the "full" 'dplyr' idiom.
|>
|> > |> [...]
|> >
|> > It's no surprise that instructing a computer in something closer to
|> > human language is an order of magnitude slower.
|>
|> Certainly not true, at least for compiled languages.  In any case,
|> judging from off-list correspondence, it definitely came as a
|> surprise to some R users...
|>
|> Given that '%>%' is so heavily marketed through 'dplyr', where the
|> latter is said to provide "blazing fast performance for in-memory
|> data by writing key pieces in C++" and "a fast, consistent tool for
|> working with data frame like objects, both in memory and out of
|> memory", I don't think it's far-fetched to expect that it should be
|> more performant than base R.

I've never come across 'marketing' of free software.  Evidently
that's a looser use of the word.

...

|> > I spend 3 or 4 orders of magnitude more time writing code than
|> > running it.
|>
|> You and me both.  But that doesn't mean speed is of no or little
|> importance.

I never claimed it was.  Tardiness hasn't yet become an issue for me.
When it does, I'll revert to the old ways.

|> > It's much more important to me to be able to read and modify than
|> > it is to have it run at optimum speed.
|>
|> Good for you.  But surely, if this is your goal, nothing beats
|> intermediate objects.

Nothing except chaining, that is.  I went 16 years without it and now
find it amazing how useful it is.  As they say: you're never too old
to learn.

|> And like I said, it may still be faster than the 'dplyr' idiom.
|>
|> > |> Of course, this doesn't matter for interactive one-off use.  But
|> > |> lately I've seen examples of the '%>%' operator creeping into
|> > |> functions in packages.
|> >
|> > That could indicate that %>% is seductively easy to use.  It's
|> > probably true that there are places where it should be done the hard
|> > way.
|>
|> We all know how easy it is to write ugly and sluggish code in R.
|> But 'foo[i,j]' is neither ugly nor sluggish and certainly not "the
|> hard way."

I meant to put a ':-)' in there.  Such adjectives as 'easy' and
'hard' are relative.  There's little difference in difficulty at each
step, but integrating the steps and revising them later are
considerably easier using the so-called "'dplyr' idiom" -- especially
if each link in the chain is on a separate line.

|> > |> However, it would be nice to see a fast pipe operator as part of
|> > |> base R.
|>
|> Heck, it doesn't even have to be fast as long as it's a bit more
|> elegant than '%>%'.

IMHO, %>% fits in nicely with %/%, %%, and %in%.  Elegance, like
beauty, is in the eye of the beholder.

|>
|> Henric Winell

--
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
   ___    Patrick Connolly
  {~._.~}          Great minds discuss ideas
  _( Y )_        Average minds discuss events
 (:_~*~_:)       Small minds discuss people
  (_)-(_)        ..... Eleanor Roosevelt
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
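Patrick's last two points can be made concrete with a small sketch
(not from the thread).  Any '%...%' name is an ordinary user-definable
infix operator in R, which is why '%>%' looks syntactically at home
beside %%, %/% and %in%; and a chain with one verb per line is easy to
revise.  The '%+%' operator below is invented for illustration, and
'all.states' is the data frame from Henric's benchmark:

    ## 1. '%...%' names are ordinary user-defined infix operators:
    `%+%` <- function(a, b) paste0(a, b)   # made-up, for illustration
    "pi" %+% "pe"     # "pipe"
    10 %% 3           # 1: modulo, base R
    10 %/% 3          # 3: integer division, base R
    2 %in% c(1, 2)    # TRUE: set membership, base R

    ## 2. One verb per line makes a chain easy to revise: a step can
    ##    be added or disabled without rebalancing nested brackets.
    library(dplyr)
    all.states %>%
        filter(Frost > 150) %>%
        ## arrange(desc(Frost)) %>%    # toggle a step by (un)commenting
        select(Name, Frost)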