thr3ads.net - R devel - [Rd] the pipe |> and line breaks in pipelines [Dec 2020]

Timothy Goodman

2020-Dec-09 05:45 UTC

[Rd] the pipe |> and line breaks in pipelines

Hi,

I'm a data scientist who routinely uses R in my day-to-day work, for tasks
such as cleaning and transforming data, exploratory data analysis, etc.
This includes frequent use of the pipe operator from the magrittr and dplyr
libraries, %>%.  So, I was pleased to hear about the recent work on a
native pipe operator, |>.

This seems like a good time to bring up the main pain point I encounter
when using pipes in R, and some suggestions on what could be done about
it.  The issue is that the pipe operator can't be placed at the start of a
line of code (except in parentheses).  That's no different than any binary
operator in R, but I find it's a source of difficulty for the pipe because
of how pipes are often used.

[I'm assuming here that my usage is fairly typical of a lot of users; at
any rate, I don't think I'm *too* unusual.]

=== Why this is a problem ===
It's very common (for me, and I suspect for many users of dplyr) to write
multi-step pipelines and put each step on its own line for readability.
Something like this:

  ### Example 1 ###
  my_data_frame_1 %>%
    filter(some_conditions_1) %>%
    inner_join(my_data_frame_2, by = some_columns_1) %>%
    group_by(some_columns_2) %>%
    summarize(some_aggregate_functions_1) %>%
    filter(some_conditions_2) %>%
    left_join(my_data_frame_3, by = some_columns_3) %>%
    group_by(some_columns_4) %>%
    summarize(some_aggregate_functions_2) %>%
    arrange(some_columns_5)

[I guess some might consider this an overly long pipeline; for me it's
pretty typical.  I *could* split it up by assigning intermediate results to
variables, but much of the value I get from the pipe is that it lets my
code communicate which results are temporary, and which will be used again
later.  Assigning variables for single-use results would remove that
expressiveness.]

I would prefer (for reasons I'll explain) to be able to write the above
example like this, which isn't valid R:

  ### Example 2 (not valid R) ###
  my_data_frame_1
    %>% filter(some_conditions_1)
    %>% inner_join(my_data_frame_2, by = some_columns_1)
    %>% group_by(some_columns_2)
    %>% summarize(some_aggregate_functions_1)
    %>% filter(some_conditions_2)
    %>% left_join(my_data_frame_3, by = some_columns_3)
    %>% group_by(some_columns_4)
    %>% summarize(some_aggregate_functions_2)
    %>% arrange(some_columns_5)

One (minor) advantage is obvious: It lets you easily line up the pipes,
which means that you can see at a glance that the whole block is a single
pipeline, and you'd immediately notice if you inadvertently omitted a pipe,
which otherwise can lead to confusing output.  [It's also aesthetically
pleasing, especially when %>% is replaced with |>, but that's
subjective.]

But the bigger issue happens when I want to re-run just *part* of the
pipeline.  I do this often when debugging: if the output of the pipeline
seems wrong, I re-run the first few steps and check the output, then
include a little more and re-run again, etc., until I locate my mistake.
Working in an interactive notebook environment, this involves using the
cursor to select just the part of the code I want to re-run.

It's fast and easy to select *entire* lines of code, but unfortunately with
the pipes placed at the end of the line I must instead select everything
*except* the last three characters of the line (the last two characters for
the new pipe).  Then when I want to re-run the same partial pipeline with
the next line of code included, I can't just press SHIFT+Down to select it
as I otherwise would, but instead must move the cursor horizontally to a
position three characters before the end of *that* line (which is generally
different due to varying line lengths).  And so forth each time I want to
include an additional line.

Moreover, with the staggered positions of the pipes at the end of each
line, it's very easy to accidentally select the final pipe on a line, and
then sit there for a moment wondering if the environment has stopped
responding before realizing it's just waiting for further input (i.e., for
the right-hand side).  These small delays and disruptions add up over the
course of a day.

This desire to select and re-run the first part of a pipeline is also the
reason why it doesn't suffice to achieve syntax like my "Example 2" by
wrapping the entire pipeline in parentheses.  That's of no use if I want to
re-run a selection that doesn't include the final close-paren.
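
[For illustration, the behaviour described here can be checked with base R's
parse(), which tests syntax without evaluating anything.  This is only a
minimal sketch: the helper name `parses` is made up for this note, and the
pipeline is shortened from Example 2.]

```r
# parse() only checks syntax, so magrittr/dplyr need not be loaded.
parses <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

# Leading pipes with no wrapper: syntax error at the first %>%.
bare <- "my_data_frame_1
  %>% filter(some_conditions_1)
  %>% arrange(some_columns_5)"

# The same code wrapped in parentheses is one valid expression.
wrapped <- paste0("(\n", bare, "\n)")

# A partial selection that leaves out the closing paren is incomplete.
partial <- "(
my_data_frame_1
  %>% filter(some_conditions_1)"

parses(bare)     # FALSE
parses(wrapped)  # TRUE
parses(partial)  # FALSE
```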

=== Possible Solutions ===
I can think of two, but maybe there are others.  The first would make
"Example 2" into valid code, and the second would allow you to run a
selection that included a trailing pipe.

  Solution 1: Add a special case to how R is parsed, so if the first
(non-whitespace) token after an end-line is a pipe, that pipe gets moved to
before the end-line.
    - Argument for: This lets you write code like example 2, which
addresses the pain point around re-running part of a pipeline, and has
advantages for readability.  Also, since starting a line with a pipe
operator is currently invalid, the change wouldn't break any working code.
    - Argument against: It would make the behavior of %>% inconsistent with
that of other binary operators in R.  (However, this objection might not
apply to the new pipe, |>, which I understand is being implemented as a
syntax transformation rather than a binary operator.)

  Solution 2: Ignore the pipe operator if it occurs as the final token of
the code being executed.
    - Argument for: This would mean the user could select and re-run the
first few lines of a longer pipeline (selecting *entire* lines), avoiding
the difficulties described above.
    - Argument against: This means that %>% would be valid even if it
occurred without a right-hand side, which is inconsistent with other
operators in R.  (But, as above, this objection might not apply to |>.)
Also, this solution still doesn't enable the syntax of "Example 2", with
its readability benefit.
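
[As a rough sketch of what the two solutions would do, both can be
approximated today by rewriting the selected text before it is parsed.  This
is not a parser change, and the function names below are hypothetical.  It
assumes R 4.1.0 or later so that the native pipe |> can be parsed, and the
regular expressions do not distinguish pipes inside strings or comments.]

```r
# Matches either the magrittr pipe or the native pipe.
pipe_pattern <- "(%>%|\\|>)"

# Solution 1 (sketch): move a pipe that starts a line to the end of the
# previous line.
move_leading_pipes <- function(code) {
  gsub(paste0("\n[[:space:]]*", pipe_pattern), " \\1\n", code)
}

# Solution 2 (sketch): drop a pipe that is the final token of the selection.
drop_trailing_pipe <- function(code) {
  sub(paste0(pipe_pattern, "[[:space:]]*$"), "", code)
}

# Apply both rewrites, then parse and evaluate the result.
run_selection <- function(code, envir = parent.frame()) {
  fixed <- drop_trailing_pipe(move_leading_pipes(code))
  eval(parse(text = fixed), envir = envir)
}

# Leading pipes, as in Example 2 (a toy vector stands in for the data frame):
run_selection("c(3, 1, 2)\n  |> sort()\n  |> rev()")

# A whole-line selection that ends with a dangling pipe:
run_selection("c(3, 1, 2) |>\n  sort() |>\n")
```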

Thanks for reading this and considering it.

- Tim Goodman

Stefan Evert

2020-Dec-09 11:08 UTC

I'm not a pipe user, so I may be overlooking some issue, but wouldn't
simply putting identity() on the last line solve your main problem?

### Example 1 ###
 my_data_frame_1 %>%
   filter(some_conditions_1) %>%
   inner_join(my_data_frame_2, by = some_columns_1) %>%
   group_by(some_columns_2) %>%
   summarize(some_aggregate_functions_1) %>%
   filter(some_conditions_2) %>%
   left_join(my_data_frame_3, by = some_columns_3) %>%
   group_by(some_columns_4) %>%
   summarize(some_aggregate_functions_2) %>%
   arrange(some_columns_5) %>%
   identity()
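
[One possible reading of this suggestion, sketched with the native pipe and a
toy vector in place of the data frames (assuming R 4.1.0 or later): because
every step line now ends with a pipe, whole steps can be commented out
without editing the lines around them.]

```r
# identity() returns its argument unchanged, so it is a harmless last step.
result <- c(3, 1, 2) |>
  sort() |>
  # rev() |>
  identity()

result
```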

I agree that it would be visually more pleasing to have the pipe symbols lined
up at the start of each line, but I don't think it's worth breaking
R's principle of evaluating any line with a complete expression.

With your solution 1, R wouldn't be able to execute any complete command
because it would have to wait and see if the next line happens to start with
%>%.

With your solution 2, 
  
  my_data_frame_1 %>%

would be a complete expression (because an extra trailing %>% is allowed on
the last line of a pipe) and hence execute immediately rather than wait for the
next line.

Best,
Stefan

Duncan Murdoch

2020-Dec-09 11:12 UTC

The requirement for operators at the end of the line comes from the 
interactive nature of R.  If you type

     my_data_frame_1

how could R know that you are not done, and are planning to type the 
rest of the expression

       %>% filter(some_conditions_1)
       ...

before it should consider the expression complete?  The way languages 
like C do this is by requiring a statement terminator at the end.  You 
can also do it by wrapping the entire thing in parentheses ().

However, be careful: Don't use braces:  they don't work.  And parens 
have the side effect of removing invisibility from the result (which is 
a design flaw or bonus, depending on your point of view).  So I actually 
wouldn't advise this workaround.
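
[A small base-R sketch of the points above, using parse() to inspect syntax
without evaluating anything.  The variable names are made up for this note,
and `f` is never called, so it need not exist.]

```r
# Without a wrapper, a line break ends the expression: this is two
# expressions, `x` and then unary `+ 1`.
n_bare <- length(parse(text = "x\n  + 1"))

# Inside parentheses the line break is ignored: one expression.
n_paren <- length(parse(text = "(x\n  + 1)"))

# Inside braces each line still ends an expression, so a line starting
# with %>% is a syntax error.
braces_ok <- !inherits(
  try(parse(text = "{\n  x\n  %>% f()\n}"), silent = TRUE),
  "try-error"
)

# Parentheses make an otherwise invisible result visible.
vis_plain <- withVisible(invisible(1))$visible
vis_paren <- withVisible((invisible(1)))$visible

c(n_bare = n_bare, n_paren = n_paren)
c(braces_ok = braces_ok, vis_plain = vis_plain, vis_paren = vis_paren)
```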

Duncan Murdoch



Gabor Grothendieck

2020-Dec-09 11:22 UTC

On Wed, Dec 9, 2020 at 4:03 AM Timothy Goodman <timsgoodman at gmail.com> wrote:
> But the bigger issue happens when I want to re-run just *part* of the
> pipeline.

Insert one of the following into the pipeline. It does not require that you
edit any lines.   It only involves inserting a line.

  print %>%
  { str(.); . } %>%
  { . ->> .save } %>%
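
[The lines above rely on magrittr's `.` placeholder.  As a rough base-R
analogue, assuming R 4.1.0 or later for the native pipe, the same
insert-a-line idea might look like the sketch below.  `peek` and `save_as`
are hypothetical helper names, and the numeric vector stands in for a data
frame.]

```r
# Print the structure of the intermediate value, then pass it along unchanged.
peek <- function(x) {
  str(x)
  x
}

# Store the intermediate value under `name`, then pass it along unchanged.
save_as <- function(x, name, envir = globalenv()) {
  assign(name, x, envir = envir)
  x
}

result <- c(3, 1, 2) |>
  sort() |>
  peek() |>
  save_as(".save") |>
  rev()

result
get(".save", envir = globalenv())
```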
