Hi,

I'm a data scientist who routinely uses R in my day-to-day work, for tasks such as cleaning and transforming data, exploratory data analysis, etc. This includes frequent use of the pipe operator from the magrittr and dplyr libraries, %>%. So, I was pleased to hear about the recent work on a native pipe operator, |>.

This seems like a good time to bring up the main pain point I encounter when using pipes in R, and some suggestions on what could be done about it. The issue is that the pipe operator can't be placed at the start of a line of code (except in parentheses). That's no different from any other binary operator in R, but I find it's a source of difficulty for the pipe because of how pipes are often used.

[I'm assuming here that my usage is fairly typical of a lot of users; at any rate, I don't think I'm *too* unusual.]

=== Why this is a problem ===

It's very common (for me, and I suspect for many users of dplyr) to write multi-step pipelines and put each step on its own line for readability. Something like this:

### Example 1 ###
my_data_frame_1 %>%
    filter(some_conditions_1) %>%
    inner_join(my_data_frame_2, by = some_columns_1) %>%
    group_by(some_columns_2) %>%
    summarize(some_aggregate_functions_1) %>%
    filter(some_conditions_2) %>%
    left_join(my_data_frame_3, by = some_columns_3) %>%
    group_by(some_columns_4) %>%
    summarize(some_aggregate_functions_2) %>%
    arrange(some_columns_5)

[I guess some might consider this an overly long pipeline; for me it's pretty typical. I *could* split it up by assigning intermediate results to variables, but much of the value I get from the pipe is that it lets my code communicate which results are temporary, and which will be used again later. Assigning variables for single-use results would remove that expressiveness.]

I would prefer (for reasons I'll explain) to be able to write the above example like this, which isn't valid R:

### Example 2 (not valid R) ###
my_data_frame_1
    %>% filter(some_conditions_1)
    %>% inner_join(my_data_frame_2, by = some_columns_1)
    %>% group_by(some_columns_2)
    %>% summarize(some_aggregate_functions_1)
    %>% filter(some_conditions_2)
    %>% left_join(my_data_frame_3, by = some_columns_3)
    %>% group_by(some_columns_4)
    %>% summarize(some_aggregate_functions_2)
    %>% arrange(some_columns_5)

One (minor) advantage is obvious: It lets you easily line up the pipes, which means that you can see at a glance that the whole block is a single pipeline, and you'd immediately notice if you inadvertently omitted a pipe, which otherwise can lead to confusing output. [It's also aesthetically pleasing, especially when %>% is replaced with |>, but that's subjective.]

But the bigger issue happens when I want to re-run just *part* of the pipeline. I do this often when debugging: if the output of the pipeline seems wrong, I re-run the first few steps and check the output, then include a little more and re-run again, etc., until I locate my mistake. Working in an interactive notebook environment, this involves using the cursor to select just the part of the code I want to re-run.

It's fast and easy to select *entire* lines of code, but unfortunately with the pipes placed at the end of the line I must instead select everything *except* the last three characters of the line (the last two characters for the new pipe). Then when I want to re-run the same partial pipeline with the next line of code included, I can't just press SHIFT+Down to select it as I otherwise would, but instead must move the cursor horizontally to a position three characters before the end of *that* line (which is generally different due to varying line lengths). And so forth each time I want to include an additional line.

Moreover, with the staggered positions of the pipes at the end of each line, it's very easy to accidentally select the final pipe on a line, and then sit there for a moment wondering if the environment has stopped responding before realizing it's just waiting for further input (i.e., for the right-hand side). These small delays and disruptions add up over the course of a day.

This desire to select and re-run the first part of a pipeline is also the reason why it doesn't suffice to achieve syntax like my "Example 2" by wrapping the entire pipeline in parentheses. That's of no use if I want to re-run a selection that doesn't include the final close-paren.

=== Possible Solutions ===

I can think of two, but maybe there are others. The first would make "Example 2" into valid code, and the second would allow you to run a selection that included a trailing pipe.

Solution 1: Add a special case to how R is parsed, so if the first (non-whitespace) token after an end-line is a pipe, that pipe gets moved to before the end-line.
  - Argument for: This lets you write code like Example 2, which addresses the pain point around re-running part of a pipeline, and has advantages for readability. Also, since starting a line with a pipe operator is currently invalid, the change wouldn't break any working code.
  - Argument against: It would make the behavior of %>% inconsistent with that of other binary operators in R. (However, this objection might not apply to the new pipe, |>, which I understand is being implemented as a syntax transformation rather than a binary operator.)

Solution 2: Ignore the pipe operator if it occurs as the final token of the code being executed.
  - Argument for: This would mean the user could select and re-run the first few lines of a longer pipeline (selecting *entire* lines), avoiding the difficulties described above.
  - Argument against: This means that %>% would be valid even if it occurred without a right-hand side, which is inconsistent with other operators in R. (But, as above, this objection might not apply to |>.) Also, this solution still doesn't enable the syntax of "Example 2", with its readability benefit.

Thanks for reading this and considering it.

- Tim Goodman

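For illustration only, the effect of Solution 2 can be approximated outside the parser, by trimming a dangling pipe from the selected text before it is evaluated. This is a rough sketch, not a proposal for how R itself would do it: `strip_trailing_pipe()` and `run_selection()` are hypothetical helpers, base functions stand in for the dplyr verbs, and the `|>` in the sample selection assumes R 4.1 or later.

```r
# Hypothetical helper: drop a pipe operator left dangling at the end of a
# selection, together with any whitespace around it.
strip_trailing_pipe <- function(code) {
  sub("\\s*(%>%|\\|>)\\s*$", "", code)
}

# Hypothetical helper: evaluate the selection after trimming the dangling pipe.
run_selection <- function(code, envir = parent.frame()) {
  eval(parse(text = strip_trailing_pipe(code)), envir = envir)
}

# Two whole lines selected from a longer pipeline; the second ends with a pipe.
selection <- "c(3, 1, 2) |>\n  sort() |>\n"

strip_trailing_pipe(selection)  # the same code without the final pipe
run_selection(selection)        # evaluates c(3, 1, 2) |> sort()
```

An editor or notebook front end could apply the same trimming to "run selection" without any change to the language.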
I'm not a pipe user, so I may be overlooking some issue, but wouldn't simply putting identity() on the last line solve your main problem?

### Example 1 ###
my_data_frame_1 %>%
    filter(some_conditions_1) %>%
    inner_join(my_data_frame_2, by = some_columns_1) %>%
    group_by(some_columns_2) %>%
    summarize(some_aggregate_functions_1) %>%
    filter(some_conditions_2) %>%
    left_join(my_data_frame_3, by = some_columns_3) %>%
    group_by(some_columns_4) %>%
    summarize(some_aggregate_functions_2) %>%
    arrange(some_columns_5) %>%
    identity()

I agree that it would be visually more pleasing to have the pipe symbols lined up at the start of each line, but I don't think it's worth breaking R's principle of evaluating any line with a complete expression.

With your solution 1, R wouldn't be able to execute any complete command because it would have to wait and see if the next line happens to start with %>%.

With your solution 2,

    my_data_frame_1 %>%

would be a complete expression (because an extra trailing %>% is allowed on the last line of a pipe) and hence execute immediately rather than wait for the next line.

Best,
Stefan

> On 9 Dec 2020, at 06:45, Timothy Goodman <timsgoodman at gmail.com> wrote:
>
> [...]
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

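A small sketch of one reading of the identity() suggestion: with identity() as a permanent last step, every real step ends in a pipe, so any of them (including the former last one) can be commented out without leaving a dangling operator. The example uses the native pipe and base functions in place of the dplyr verbs, so it assumes R 4.1 or later; the data is made up.

```r
# Full pipeline, terminated by identity().
full <- c(3, 1, 2) |>
  sort() |>
  rev() |>
  identity()

# Same pipeline with the last real step commented out.
# No other line needs to be edited, because identity() absorbs the pipe.
partial <- c(3, 1, 2) |>
  sort() |>
  # rev() |>
  identity()

full     # 3 2 1
partial  # 1 2 3
```

This helps when steps are disabled by commenting; it does not by itself make a selection that ends in a pipe runnable.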
The requirement for operators at the end of the line comes from the interactive nature of R. If you type

    my_data_frame_1

how could R know that you are not done, and are planning to type the rest of the expression

    %>% filter(some_conditions_1)
    ...

before it should consider the expression complete? The way languages like C do this is by requiring a statement terminator at the end. You can also do it by wrapping the entire thing in parentheses ().

However, be careful: Don't use braces: they don't work. And parens have the side effect of removing invisibility from the result (which is a design flaw or bonus, depending on your point of view). So I actually wouldn't advise this workaround.

Duncan Murdoch

On 09/12/2020 12:45 a.m., Timothy Goodman wrote:
> [...]

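The parsing behaviour described above can be checked from R itself with parse(), without running any pipeline. A minimal sketch in base R; `n_expr()` and `parses()` are small helpers defined here for the demonstration, and `x` and `f` are placeholders that are never evaluated.

```r
# Count the top-level expressions the parser sees in a piece of code.
n_expr <- function(code) length(parse(text = code))

n_expr("x +\n  1")  # 1: the trailing operator tells the parser to keep reading
n_expr("x\n+ 1")    # 2: `x` is already complete, `+ 1` is a separate expression

# Does the code parse at all?
parses <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

parses("x\n%>% f()")    # FALSE: a line cannot start with %>%
parses("(x\n%>% f())")  # TRUE: newlines inside parentheses are ignored
parses("{x\n%>% f()}")  # FALSE: inside braces the newline still ends `x`

# Parentheses also make an invisible result visible.
withVisible(invisible(1))$visible    # FALSE
withVisible((invisible(1)))$visible  # TRUE
```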
On Wed, Dec 9, 2020 at 4:03 AM Timothy Goodman <timsgoodman at gmail.com> wrote:
> But the bigger issue happens when I want to re-run just *part* of the
> pipeline.

Insert one of the following into the pipeline. It does not require that you edit any lines. It only involves inserting a line.

    print %>%

    { str(.); . } %>%

    { . ->> .save } %>%
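As a concrete sketch of the third variant, here it is dropped into a short pipeline. This assumes the magrittr package is installed; the data and the sort()/rev() steps are made up for the example.

```r
library(magrittr)  # assumed installed; provides %>% and the `.` placeholder

result <- c(3, 1, 2) %>%
  sort() %>%
  { . ->> .save } %>%  # inserted line: stash the intermediate value, pass it on
  rev()

result  # 3 2 1, the pipeline's final value is unchanged
.save   # 1 2 3, the value as it was right after sort()
```

The other two variants behave the same way with respect to the pipeline's result: print shows the intermediate value and returns it, and { str(.); . } shows its structure and then passes `.` along.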