thr3ads.net - R devel - [Rd] the pipe |> and line breaks in pipelines [Dec 2020]

Timothy Goodman
2020-Dec-09 21:56 UTC
[Rd] the pipe |> and line breaks in pipelines

I'm thrilled to hear it!  Thank you!

- Tim

P.S. I re-added the r-devel list, since Kevin's reply was sent just to me,
but I thought there might be others interested in knowing about those work
items.  (I hope that's OK, email-etiquette-wise.)

On Wed, Dec 9, 2020 at 1:10 PM Kevin Ushey <kevinushey at gmail.com>
wrote:
> You might be surprised to learn that the RStudio IDE engineers might
> be receptive to such a feature request. :-)
>
> https://github.com/rstudio/rstudio/issues/8589
> https://github.com/rstudio/rstudio/issues/8590
>
> (Spoiler alert: I am one of the RStudio IDE engineers, and I think
> this would be worth doing.)
>
> Best,
> Kevin
>
> On Wed, Dec 9, 2020 at 12:16 PM Timothy Goodman <timsgoodman at gmail.com>
> wrote:
> >
> > Since my larger concern is being able to conveniently select and re-run
> > part of a multiline pipeline, I don't think wrapping in parentheses will
> > help.  I'd have to add a closing paren at the end of the selection, which
> > is no more convenient than having to highlight all but the last pipe.
> > (Admittedly, wrapping in parens would allow my preferred syntax of having
> > pipes at the start of the line, but I don't think that's worth the cost of
> > having to constantly move the trailing paren around.)
> >
> > My back-up plan if I fail to persuade you all is indeed to beg the
> > developers of RStudio to add an option to do the transformation I would
> > want when executing notebook code, but I'm anticipating the objection of
> > "R Notebooks shouldn't transform invalid R code into valid R code."  I was
> > hoping "Let's make this new pipe |> work differently in a case that's
> > currently an error" would be an easier sell.
> >
> > Also, just to reiterate: Only one of my two suggestions really requires
> > caring about newlines.  (That's my preferred solution, but I understand
> > it'd be the bigger change.)  The other suggestion just amounts to ignoring
> > a final |> when code is submitted for execution.
> >
> >  -Tim
> >
> > On Wed, Dec 9, 2020 at 11:58 AM Kevin Ushey <kevinushey at gmail.com>
> > wrote:
> >>
> >> I agree with Duncan that the right solution is to wrap the pipe
> >> expression with parentheses. Having the parser treat newlines
> >> differently based on whether the session is interactive, or on what
> >> type of operator happens to follow a newline, feels like a pretty big
> >> can of worms.
> >>
> >> I think this (or something similar) would accomplish what you want
> >> while still retaining the nice aesthetics of the pipe expression, with
> >> a minimal amount of syntax "noise":
> >>
> >> result <- (
> >>   data
> >>     |> op1()
> >>     |> op2()
> >> )
> >>
> >> For interactive sessions where you wanted to execute only parts of the
> >> pipeline at a time, I could see that being accomplished by the editor
> >> -- it could transform the expression so that it could be handled by R,
> >> either by hoisting the pipe operator(s) up a line, or by wrapping the
> >> to-be-executed expression in parentheses for you. If such a style of
> >> coding became popular enough, I'm sure the developers of such editors
> >> would be interested and willing to support this ...
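
[Editor's sketch] The "hoisting" transformation described above could look roughly like the following. This is only an illustration under stated assumptions: `hoist_pipes` is a hypothetical helper, not part of R or RStudio, and it works on raw text with a regular expression, so it would mishandle pipes inside strings or a previous line that ends in a comment.

```r
# Hypothetical editor-side helper: move a pipe that starts a line to the end
# of the previous line, so that the selected text becomes valid R.
hoist_pipes <- function(code) {
  # Match: newline, optional indentation, a pipe (|> or %>%), optional spaces.
  # Replace with: " <pipe>", then the newline and the original indentation.
  gsub("\\n([ \\t]*)(\\|>|%>%)[ \\t]*", " \\2\n\\1", code, perl = TRUE)
}

selection <- "data\n    |> op1()\n    |> op2()"
cat(hoist_pipes(selection))
# data |>
#     op1() |>
#     op2()
```

A real implementation would more likely work on the editor's token stream than on a regular expression, so that comments and string literals are left alone.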
> >>
> >> Perhaps more importantly, it would be much easier to accomplish than a
> >> change to the behavior of the R parser, and it would be work that
> >> wouldn't have to be maintained by the R Core team.
> >>
> >> Best,
> >> Kevin
> >>
> >> On Wed, Dec 9, 2020 at 11:34 AM Timothy Goodman <timsgoodman at gmail.com>
> >> wrote:
> >> >
> >> > If I type my_data_frame_1 and press Enter (or Ctrl+Enter to execute the
> >> > command in the Notebook environment I'm using) I certainly *would* expect R
> >> > to treat it as a complete statement.
> >> >
> >> > But what I'm talking about is a different case, where I highlight a
> >> > multi-line statement in my notebook:
> >> >
> >> >     my_data_frame_1
> >> >         |> filter(some_conditions_1)
> >> >
> >> > and then press Ctrl+Enter.  Or, I suppose the equivalent would be to run an
> >> > R script containing those two lines of code, or to run a multi-line
> >> > statement like that from the console (which in RStudio I can do by pressing
> >> > Shift+Enter between the lines.)
> >> >
> >> > In those cases, R could either (1) Give an error message [the current
> >> > behavior], or (2) understand that the first line is meant to be piped to
> >> > the second.  The second option would be significantly more useful, and is
> >> > almost certainly what the user intended.
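
[Editor's sketch] The current behavior, option (1), can be checked without running any pipeline by asking the parser alone. The helper name `parses` is made up for this illustration. `%>%` is used so that the check does not depend on the R version; the native pipe `|>` should behave the same way on R 4.1 or later.

```r
# Hypothetical helper: TRUE if `code` is syntactically valid R, FALSE otherwise.
parses <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

# Pipe at the start of the second line: the first line is already a complete
# expression, so the parser rejects the pipe that follows it.
leading  <- "my_data_frame_1\n    %>% filter(some_conditions_1)"

# Pipe at the end of the first line: the expression is visibly incomplete,
# so the parser continues onto the next line.
trailing <- "my_data_frame_1 %>%\n    filter(some_conditions_1)"

parses(leading)   # FALSE
parses(trailing)  # TRUE
```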
> >> >
> >> > (For what it's worth, there are some languages, such as JavaScript, that
> >> > consider the first token of the next line when determining if the previous
> >> > line was complete.  JavaScript's rules around this are overly complicated,
> >> > but a rule like "a pipe following a line break is treated as continuing the
> >> > previous line" would be much simpler.  And while it might be objectionable
> >> > to treat the operator %>% differently from other operators, the addition of
> >> > |>, which isn't truly an operator at all, seems like the right time to
> >> > consider it.)
> >> >
> >> > -Tim
> >> >
> >> > On Wed, Dec 9, 2020 at 3:12 AM Duncan Murdoch <murdoch.duncan at gmail.com>
> >> > wrote:
> >> >
> >> > > The requirement for operators at the end of the line comes from the
> >> > > interactive nature of R.  If you type
> >> > >
> >> > >      my_data_frame_1
> >> > >
> >> > > how could R know that you are not done, and are planning to type the
> >> > > rest of the expression
> >> > >
> >> > >        %>% filter(some_conditions_1)
> >> > >        ...
> >> > >
> >> > > before it should consider the expression complete?  The way languages
> >> > > like C do this is by requiring a statement terminator at the end.  You
> >> > > can also do it by wrapping the entire thing in parentheses ().
> >> > >
> >> > > However, be careful: Don't use braces:  they don't work.  And parens
> >> > > have the side effect of removing invisibility from the result (which is
> >> > > a design flaw or bonus, depending on your point of view).  So I actually
> >> > > wouldn't advise this workaround.
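
[Editor's sketch] Both caveats above can be seen with a few lines of base R. The helper `parses` and the placeholder names `x` and `f` are made up for this illustration; only the parser is exercised, so `%>%` does not need magrittr to be loaded.

```r
# Hypothetical helper: TRUE if `code` is syntactically valid R, FALSE otherwise.
parses <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

braces <- "{\n  x\n  %>% f()\n}"   # inside braces a newline still ends `x`
parens <- "(\n  x\n  %>% f()\n)"   # inside parentheses newlines are skipped

parses(braces)  # FALSE
parses(parens)  # TRUE

# The side effect of parentheses: an invisible result becomes visible.
withVisible(invisible(1))$visible    # FALSE
withVisible((invisible(1)))$visible  # TRUE
```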
> >> > >
> >> > > Duncan Murdoch
> >> > >
> >> > >
> >> > > On 09/12/2020 12:45 a.m., Timothy Goodman wrote:
> >> > > > Hi,
> >> > > >
> >> > > > I'm a data scientist who routinely uses R in my day-to-day work, for tasks
> >> > > > such as cleaning and transforming data, exploratory data analysis, etc.
> >> > > > This includes frequent use of the pipe operator from the magrittr and dplyr
> >> > > > libraries, %>%.  So, I was pleased to hear about the recent work on a
> >> > > > native pipe operator, |>.
> >> > > >
> >> > > > This seems like a good time to bring up the main pain point I encounter
> >> > > > when using pipes in R, and some suggestions on what could be done about
> >> > > > it.  The issue is that the pipe operator can't be placed at the start of a
> >> > > > line of code (except in parentheses).  That's no different than any binary
> >> > > > operator in R, but I find it's a source of difficulty for the pipe because
> >> > > > of how pipes are often used.
> >> > > >
> >> > > > [I'm assuming here that my usage is fairly typical of a lot of users; at
> >> > > > any rate, I don't think I'm *too* unusual.]
> >> > > >
> >> > > > === Why this is a problem ===
> >> > > >
> >> > > > It's very common (for me, and I suspect for many users of dplyr) to write
> >> > > > multi-step pipelines and put each step on its own line for readability.
> >> > > > Something like this:
> >> > > >
> >> > > >    ### Example 1 ###
> >> > > >    my_data_frame_1 %>%
> >> > > >      filter(some_conditions_1) %>%
> >> > > >      inner_join(my_data_frame_2, by = some_columns_1) %>%
> >> > > >      group_by(some_columns_2) %>%
> >> > > >      summarize(some_aggregate_functions_1) %>%
> >> > > >      filter(some_conditions_2) %>%
> >> > > >      left_join(my_data_frame_3, by = some_columns_3) %>%
> >> > > >      group_by(some_columns_4) %>%
> >> > > >      summarize(some_aggregate_functions_2) %>%
> >> > > >      arrange(some_columns_5)
> >> > > >
> >> > > > [I guess some might consider this an overly long pipeline; for me it's
> >> > > > pretty typical.  I *could* split it up by assigning intermediate results to
> >> > > > variables, but much of the value I get from the pipe is that it lets my
> >> > > > code communicate which results are temporary, and which will be used again
> >> > > > later.  Assigning variables for single-use results would remove that
> >> > > > expressiveness.]
> >> > > >
> >> > > > I would prefer (for reasons I'll explain) to be able to write the above
> >> > > > example like this, which isn't valid R:
> >> > > >
> >> > > >    ### Example 2 (not valid R) ###
> >> > > >    my_data_frame_1
> >> > > >      %>% filter(some_conditions_1)
> >> > > >      %>% inner_join(my_data_frame_2, by = some_columns_1)
> >> > > >      %>% group_by(some_columns_2)
> >> > > >      %>% summarize(some_aggregate_functions_1)
> >> > > >      %>% filter(some_conditions_2)
> >> > > >      %>% left_join(my_data_frame_3, by = some_columns_3)
> >> > > >      %>% group_by(some_columns_4)
> >> > > >      %>% summarize(some_aggregate_functions_2)
> >> > > >      %>% arrange(some_columns_5)
> >> > > >
> >> > > > One (minor) advantage is obvious: It lets you easily line up the pipes,
> >> > > > which means that you can see at a glance that the whole block is a single
> >> > > > pipeline, and you'd immediately notice if you inadvertently omitted a pipe,
> >> > > > which otherwise can lead to confusing output.  [It's also aesthetically
> >> > > > pleasing, especially when %>% is replaced with |>, but that's subjective.]
> >> > > >
> >> > > > But the bigger issue happens when I want to re-run just *part* of the
> >> > > > pipeline.  I do this often when debugging: if the output of the pipeline
> >> > > > seems wrong, I re-run the first few steps and check the output, then
> >> > > > include a little more and re-run again, etc., until I locate my mistake.
> >> > > > Working in an interactive notebook environment, this involves using the
> >> > > > cursor to select just the part of the code I want to re-run.
> >> > > >
> >> > > > It's fast and easy to select *entire* lines of code, but unfortunately with
> >> > > > the pipes placed at the end of the line I must instead select everything
> >> > > > *except* the last three characters of the line (the last two characters for
> >> > > > the new pipe).  Then when I want to re-run the same partial pipeline with
> >> > > > the next line of code included, I can't just press SHIFT+Down to select it
> >> > > > as I otherwise would, but instead must move the cursor horizontally to a
> >> > > > position three characters before the end of *that* line (which is generally
> >> > > > different due to varying line lengths).  And so forth each time I want to
> >> > > > include an additional line.
> >> > > >
> >> > > > Moreover, with the staggered positions of the pipes at the end of each
> >> > > > line, it's very easy to accidentally select the final pipe on a line, and
> >> > > > then sit there for a moment wondering if the environment has stopped
> >> > > > responding before realizing it's just waiting for further input (i.e., for
> >> > > > the right-hand side).  These small delays and disruptions add up over the
> >> > > > course of a day.
> >> > > >
> >> > > > This desire to select and re-run the first part of a pipeline is also the
> >> > > > reason why it doesn't suffice to achieve syntax like my "Example 2" by
> >> > > > wrapping the entire pipeline in parentheses.  That's of no use if I want to
> >> > > > re-run a selection that doesn't include the final close-paren.
> >> > > >
> >> > > > === Possible Solutions ===
> >> > > >
> >> > > > I can think of two, but maybe there are others.  The first would make
> >> > > > "Example 2" into valid code, and the second would allow you to run a
> >> > > > selection that included a trailing pipe.
> >> > > >
> >> > > >    Solution 1: Add a special case to how R is parsed, so if the first
> >> > > > (non-whitespace) token after an end-line is a pipe, that pipe gets moved to
> >> > > > before the end-line.
> >> > > >      - Argument for: This lets you write code like example 2, which
> >> > > > addresses the pain point around re-running part of a pipeline, and has
> >> > > > advantages for readability.  Also, since starting a line with a pipe
> >> > > > operator is currently invalid, the change wouldn't break any working code.
> >> > > >      - Argument against: It would make the behavior of %>% inconsistent with
> >> > > > that of other binary operators in R.  (However, this objection might not
> >> > > > apply to the new pipe, |>, which I understand is being implemented as a
> >> > > > syntax transformation rather than a binary operator.)
> >> > > >
> >> > > >    Solution 2: Ignore the pipe operator if it occurs as the final token of
> >> > > > the code being executed.
> >> > > >      - Argument for: This would mean the user could select and re-run the
> >> > > > first few lines of a longer pipeline (selecting *entire* lines), avoiding
> >> > > > the difficulties described above.
> >> > > >      - Argument against: This means that %>% would be valid even if it
> >> > > > occurred without a right-hand side, which is inconsistent with other
> >> > > > operators in R.  (But, as above, this objection might not apply to |>.)
> >> > > > Also, this solution still doesn't enable the syntax of "Example 2", with
> >> > > > its readability benefit.
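
[Editor's sketch] Solution 2 is proposed above as a change to R itself. A similar effect could be approximated outside the parser by having the editor strip a dangling pipe from the selection before submitting it. The function below is a hypothetical illustration of that idea only, not an existing R or RStudio feature, and being regex-based it would also strip a pipe that happens to end a trailing comment or string.

```r
# Hypothetical editor-side helper: drop a pipe (|> or %>%) that is the last
# token of the selected code, along with surrounding whitespace.
drop_trailing_pipe <- function(code) {
  sub("[ \\t]*(\\|>|%>%)\\s*$", "", code, perl = TRUE)
}

# Whole lines selected from the top of a longer pipeline, ending on a pipe.
selection <- "c(1, 4, 9) %>%\n  rev() %>%\n"

drop_trailing_pipe(selection)
# "c(1, 4, 9) %>%\n  rev()"
```

Code that does not end in a pipe is returned unchanged, so such a step could be applied to every selection before execution.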
> >> > > >
> >> > > > Thanks for reading this and considering it.
> >> > > >
> >> > > > - Tim Goodman
> >> > > >
> >> > > >       [[alternative HTML version deleted]]
> >> > > >
> >> > > > ______________________________________________
> >> > > > R-devel at r-project.org mailing list
> >> > > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >> > > >
> >> > >
> >> > >
> >> >
>