thr3ads.net - R devel - [Rd] the pipe |> and line breaks in pipelines [Dec 2020]

Duncan Murdoch

2020-Dec-09 11:12 UTC

[Rd] the pipe |> and line breaks in pipelines

The requirement for operators at the end of the line comes from the 
interactive nature of R.  If you type

     my_data_frame_1

how could R know that you are not done, and are planning to type the 
rest of the expression

       %>% filter(some_conditions_1)
       ...

before it should consider the expression complete?  The way languages 
like C do this is by requiring a statement terminator at the end.  You 
can also do it by wrapping the entire thing in parentheses ().

However, be careful: Don't use braces:  they don't work.  And parens 
have the side effect of removing invisibility from the result (which is 
a design flaw or bonus, depending on your point of view).  So I actually 
wouldn't advise this workaround.
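The parsing behaviour described above can be checked from R itself. The following is a minimal sketch using only base R; the helper name parses_ok is made up for illustration, and the data frame and filter names are the placeholders from the thread (the code is only parsed, never evaluated, so they need not exist).

```r
# Does a piece of source text parse as complete, valid R code?
# (parses_ok is a hypothetical helper, not part of R.)
parses_ok <- function(code) {
  tryCatch({
    parse(text = code)
    TRUE
  }, error = function(e) FALSE)
}

# Operator at the end of the line: the parser knows more input must follow.
parses_ok("my_data_frame_1 %>%\n  filter(some_conditions_1)")    # TRUE

# Operator at the start of the next line: line 1 is already a complete
# expression, so line 2 begins with a stray binary operator.
parses_ok("my_data_frame_1\n  %>% filter(some_conditions_1)")    # FALSE

# Inside parentheses a newline does not end the expression.
parses_ok("(my_data_frame_1\n  %>% filter(some_conditions_1))")  # TRUE

# Inside braces a newline still ends the expression, so this fails.
parses_ok("{my_data_frame_1\n  %>% filter(some_conditions_1)}")  # FALSE

# The side effect of parentheses: an invisible result becomes visible.
withVisible(invisible(1))$visible    # FALSE
withVisible((invisible(1)))$visible  # TRUE
```

At the interactive prompt the second case does not raise an error on the first line; R simply evaluates my_data_frame_1 and then reports a syntax error for the line that starts with the pipe.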

Duncan Murdoch


On 09/12/2020 12:45 a.m., Timothy Goodman wrote:
> Hi,
>
> I'm a data scientist who routinely uses R in my day-to-day work, for tasks
> such as cleaning and transforming data, exploratory data analysis, etc.
> This includes frequent use of the pipe operator from the magrittr and dplyr
> libraries, %>%.  So, I was pleased to hear about the recent work on a
> native pipe operator, |>.
> 
> This seems like a good time to bring up the main pain point I encounter
> when using pipes in R, and some suggestions on what could be done about
> it.  The issue is that the pipe operator can't be placed at the start of a
> line of code (except in parentheses).  That's no different than any binary
> operator in R, but I find it's a source of difficulty for the pipe because
> of how pipes are often used.
> 
> [I'm assuming here that my usage is fairly typical of a lot of users; at
> any rate, I don't think I'm *too* unusual.]
> 
> === Why this is a problem ===
>
> It's very common (for me, and I suspect for many users of dplyr) to write
> multi-step pipelines and put each step on its own line for readability.
> Something like this:
> 
>    ### Example 1 ###
>    my_data_frame_1 %>%
>      filter(some_conditions_1) %>%
>      inner_join(my_data_frame_2, by = some_columns_1) %>%
>      group_by(some_columns_2) %>%
>      summarize(some_aggregate_functions_1) %>%
>      filter(some_conditions_2) %>%
>      left_join(my_data_frame_3, by = some_columns_3) %>%
>      group_by(some_columns_4) %>%
>      summarize(some_aggregate_functions_2) %>%
>      arrange(some_columns_5)
> 
> [I guess some might consider this an overly long pipeline; for me it's
> pretty typical.  I *could* split it up by assigning intermediate results to
> variables, but much of the value I get from the pipe is that it lets my
> code communicate which results are temporary, and which will be used again
> later.  Assigning variables for single-use results would remove that
> expressiveness.]
> 
> I would prefer (for reasons I'll explain) to be able to write the above
> example like this, which isn't valid R:
> 
>    ### Example 2 (not valid R) ###
>    my_data_frame_1
>      %>% filter(some_conditions_1)
>      %>% inner_join(my_data_frame_2, by = some_columns_1)
>      %>% group_by(some_columns_2)
>      %>% summarize(some_aggregate_functions_1)
>      %>% filter(some_conditions_2)
>      %>% left_join(my_data_frame_3, by = some_columns_3)
>      %>% group_by(some_columns_4)
>      %>% summarize(some_aggregate_functions_2)
>      %>% arrange(some_columns_5)
> 
> One (minor) advantage is obvious: It lets you easily line up the pipes,
> which means that you can see at a glance that the whole block is a single
> pipeline, and you'd immediately notice if you inadvertently omitted a pipe,
> which otherwise can lead to confusing output.  [It's also aesthetically
> pleasing, especially when %>% is replaced with |>, but that's subjective.]
> 
> But the bigger issue happens when I want to re-run just *part* of the
> pipeline.  I do this often when debugging: if the output of the pipeline
> seems wrong, I re-run the first few steps and check the output, then
> include a little more and re-run again, etc., until I locate my mistake.
> Working in an interactive notebook environment, this involves using the
> cursor to select just the part of the code I want to re-run.
> 
> It's fast and easy to select *entire* lines of code, but unfortunately with
> the pipes placed at the end of the line I must instead select everything
> *except* the last three characters of the line (the last two characters for
> the new pipe).  Then when I want to re-run the same partial pipeline with
> the next line of code included, I can't just press SHIFT+Down to select it
> as I otherwise would, but instead must move the cursor horizontally to a
> position three characters before the end of *that* line (which is generally
> different due to varying line lengths).  And so forth each time I want to
> include an additional line.
> 
> Moreover, with the staggered positions of the pipes at the end of each
> line, it's very easy to accidentally select the final pipe on a line, and
> then sit there for a moment wondering if the environment has stopped
> responding before realizing it's just waiting for further input (i.e., for
> the right-hand side).  These small delays and disruptions add up over the
> course of a day.
> 
> This desire to select and re-run the first part of a pipeline is also the
> reason why it doesn't suffice to achieve syntax like my "Example 2" by
> wrapping the entire pipeline in parentheses.  That's of no use if I want to
> re-run a selection that doesn't include the final close-paren.
> 
> === Possible Solutions ===
>
> I can think of two, but maybe there are others.  The first would make
> "Example 2" into valid code, and the second would allow you to run a
> selection that included a trailing pipe.
> 
>    Solution 1: Add a special case to how R is parsed, so if the first
> (non-whitespace) token after an end-line is a pipe, that pipe gets moved to
> before the end-line.
>      - Argument for: This lets you write code like example 2, which
> addresses the pain point around re-running part of a pipeline, and has
> advantages for readability.  Also, since starting a line with a pipe
> operator is currently invalid, the change wouldn't break any working code.
>      - Argument against: It would make the behavior of %>% inconsistent with
> that of other binary operators in R.  (However, this objection might not
> apply to the new pipe, |>, which I understand is being implemented as a
> syntax transformation rather than a binary operator.)
> 
>    Solution 2: Ignore the pipe operator if it occurs as the final token of
> the code being executed.
>      - Argument for: This would mean the user could select and re-run the
> first few lines of a longer pipeline (selecting *entire* lines), avoiding
> the difficulties described above.
>      - Argument against: This means that %>% would be valid even if it
> occurred without a right-hand side, which is inconsistent with other
> operators in R.  (But, as above, this objection might not apply to |>.)
> Also, this solution still doesn't enable the syntax of "Example 2", with
> its readability benefit.
> 
> Thanks for reading this and considering it.
> 
> - Tim Goodman
> 
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

Timothy Goodman

2020-Dec-09 19:33 UTC

[Rd] the pipe |> and line breaks in pipelines

If I type my_data_frame_1 and press Enter (or Ctrl+Enter to execute the
command in the Notebook environment I'm using) I certainly *would* expect R
to treat it as a complete statement.

But what I'm talking about is a different case, where I highlight a
multi-line statement in my notebook:

    my_data_frame_1
        |> filter(some_conditions_1)

and then press Ctrl+Enter.  Or, I suppose the equivalent would be to run an
R script containing those two lines of code, or to run a multi-line
statement like that from the console (which in RStudio I can do by pressing
Shift+Enter between the lines.)

In those cases, R could either (1) Give an error message [the current
behavior], or (2) understand that the first line is meant to be piped to
the second.  The second option would be significantly more useful, and is
almost certainly what the user intended.

(For what it's worth, there are some languages, such as JavaScript, that
consider the first token of the next line when determining if the previous
line was complete.  JavaScript's rules around this are overly complicated,
but a rule like "a pipe following a line break is treated as continuing the
previous line" would be much simpler.  And while it might be objectionable
to treat the operator %>% differently from other operators, the addition of
|>, which isn't truly an operator at all, seems like the right time to
consider it.)
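
The rule quoted above ("a pipe following a line break is treated as continuing the previous line") can be approximated outside the parser, as a text rewrite applied before parsing. This is only a rough sketch to show the idea, not how R or any editor implements it: join_leading_pipes is a hypothetical helper, it assumes R >= 4.1 for the native pipe, and because it works on raw text it would also rewrite pipes that appear inside strings or after a trailing comment.

```r
# Hypothetical helper (not part of R): move a pipe that starts a line to the
# end of the previous line, so the text parses under R's existing rules.
join_leading_pipes <- function(code) {
  # Whitespace containing a line break, followed by |> or %>%, becomes a
  # single space followed by the same pipe.
  gsub("\\s*\\n\\s*(\\|>|%>%)", " \\1", code, perl = TRUE)
}

# A pipeline written with the pipes at the start of each line.
src <- "c(1, 4, 9)\n  |> sqrt()\n  |> sum()"

fixed <- join_leading_pipes(src)
fixed                                 # "c(1, 4, 9) |> sqrt() |> sum()"

# The rewritten text is ordinary R and can be parsed and evaluated.
result <- eval(parse(text = fixed))   # sum(sqrt(c(1, 4, 9)))
result                                # 6
```

Selecting only the first two lines of src and passing them through the same helper gives a complete expression as well, which is the partial re-run workflow described in the original message.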

-Tim

