Ivan Krylov
2022-Jul-15 17:21 UTC
[Rd] Feature Request: Allow Underscore Separated Numbers
On Fri, 15 Jul 2022 11:25:32 -0400 <avi.e.gross at gmail.com> wrote:

> R normally delays evaluation so chunks of code are handed over
> untouched to functions that often play with the text directly without
> evaluating it until, perhaps, much later.

Do they play with the text, or with the syntax tree after it went
through the parser? While it's true that R saves the source text of the
functions for ease of debugging, it's not guaranteed that a given
object will have source references, and typical NSE functions operate
on language objects which are tree-like structures containing R values,
not source text.

You are, of course, right that any changes to the syntax of the
language must be carefully considered, but if anyone wants to play with
this idea, it can be implemented in a very simple manner:

--- src/main/gram.y (revision 82598)
+++ src/main/gram.y (working copy)
@@ -2526,7 +2526,7 @@
     YYTEXT_PUSH(c, yyp);
     /* We don't care about other than ASCII digits */
     while (isdigit(c = xxgetc()) || c == '.' || c == 'e' || c == 'E'
-           || c == 'x' || c == 'X' || c == 'L')
+           || c == 'x' || c == 'X' || c == 'L' || c == '_')
     {
         count++;
         if (c == 'L') /* must be at the end. Won't allow 1Le3 (at present). */
@@ -2533,6 +2533,9 @@
         {   YYTEXT_PUSH(c, yyp);
             break;
         }
+        if (c == '_') { /* allow an underscore anywhere inside the literal */
+            continue;
+        }
 
         if (c == 'x' || c == 'X') {
             if (count > 2 || last != '0') break; /* 0x must be first */

To an NSE function, the underscored literals are indistinguishable from
normal ones, because they don't see the literals:

stopifnot(all.equal(\() 1000000, \() 1_000_000))
f <- function(x, y) stopifnot(all.equal(substitute(x), substitute(y)))
f(1e6, 1_000_000)

Although it's true that the source references change as a result:

lapply(
  list(\() 1000000, \() 1_000_000),
  \(.) as.character(getSrcref(.))
)
# [[1]]
# [1] "\\() 1000000"
#
# [[2]]
# [1] "\\() 1_000_000"

This patch is somewhat simplistic: it allows both multiple underscores
in succession and underscores at the end of the number literal. Perl
does so too, but with a warning:

perl -wE'say "true" if 1__000_ == 1000'
# Misplaced _ in number at -e line 1.
# Misplaced _ in number at -e line 1.
# true

-- 
Best regards,
Ivan
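
For anyone who wants to experiment with a stricter rule than the patch above, here is a minimal sketch in Python of one possible alternative: an underscore is accepted only singly and only between two digits, roughly the rule described in PEP 515. The names STRICT_LITERAL and parse_literal are hypothetical, and the grammar is a simplified assumption rather than what gram.y actually accepts.

```python
import re

# Assumption for illustration: a simplified literal grammar (decimal, hex,
# optional fraction, exponent and trailing L), NOT the real R lexer.
# An underscore is allowed only singly and only between two digits.
_DEC = r"[0-9](?:_?[0-9])*"
_HEX = r"[0-9A-Fa-f](?:_?[0-9A-Fa-f])*"
STRICT_LITERAL = re.compile(
    rf"(?:0[xX]{_HEX}L?"
    rf"|(?:{_DEC}(?:\.(?:{_DEC})?)?|\.{_DEC})(?:[eE][+-]?{_DEC})?L?)"
)


def parse_literal(text: str) -> float:
    """Return the value of a numeric literal, rejecting misplaced underscores."""
    if STRICT_LITERAL.fullmatch(text) is None:
        raise ValueError(f"misplaced '_' or malformed literal: {text!r}")
    # Underscores carry no value, so drop them (and the integer suffix).
    cleaned = text.replace("_", "").rstrip("L")
    if cleaned[:2].lower() == "0x":
        return float(int(cleaned, 16))
    return float(cleaned)


if __name__ == "__main__":
    for candidate in ("1_000_000", "1e6", "0xFF_FF", "1__000_", "1_", "0x_FF"):
        try:
            print(candidate, "->", parse_literal(candidate))
        except ValueError as err:
            print(candidate, "-> rejected:", err)
```

Whether doubled or trailing underscores should be rejected, warned about as in Perl, or silently accepted as in the patch is a design decision; the sketch only shows that the stricter rule is easy to state.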
Allowing underscores in numeric literals is becoming a very common
feature in computing languages. All of these languages (and more) now
support it:

python: https://peps.python.org/pep-0515/
javascript: https://v8.dev/features/numeric-separators
julia: https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/#Floating-Point-Numbers
java: https://docs.oracle.com/javase/7/docs/technotes/guides/language/underscores-literals.html#:~:text=In%20Java%20SE%207%20and,the%20readability%20of%20your%20code.
ruby: https://docs.ruby-lang.org/en/2.0.0/syntax/literals_rdoc.html#label-Numbers
perl: https://perldoc.perl.org/perldata#Scalar-value-constructors
rust: https://doc.rust-lang.org/rust-by-example/primitives/literals.html
C#: https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/builtin-types/floating-point-numeric-types#real-literals
go: https://go.dev/ref/spec#Integer_literals

Its use in this context also dates back to at least Ada 83
(http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html#:~:text=A%20decimal%20literal%20is%20a,the%20base%20is%20implicitly%20ten).&text=An%20underline%20character%20inserted%20between,value%20of%20this%20numeric%20literal.)

Many other communities see the benefit of this feature; I think R's
community would benefit from it as well.
avi.e.gross at gmail.com
2022-Jul-15 21:05 UTC
[Rd] Feature Request: Allow Underscore Separated Numbers
Yes, Ivan, obviously someone can try out a change and check if it causes
problems. And although I would think most delayed evaluation either is
never invoked or is done as you describe, using internal functions on
trees, I suspect there exists some that is not.

For example, I can write code in a mini-language of my own that I will
then analyze. What stops me from leaving the quotes off a regular
expression because I am going to read the text exactly as is and
manipulate it, as long as the RE contains nothing, such as a comma, that
keeps it from being accepted as an argument to a function? Inside I may
have something like a pattern to match a file name starting with
anything, then an underscore, then a digit or two and finally a file
suffix. I would not want anything to parse that and remove the
underscore that is part of a filename. The argument is meant to be
atomic.

Many things in the tidyverse do variations on delayed evaluation and
some seem to work a piece at a time. An example would be how mutate()
allows multiple clauses of the form new=f(old), where later clauses use
columns created in earlier clauses that did not exist before and can
only be used if the preceding part went well. I may be wrong on how it
is done, but it strikes me as possible that they read in the raw text
until they match an end of some kind, like a top-level comma or a
top-level close-parenthesis. My GUESS is that only then might they
evaluate that chunk, after substitutions or other ploys to use some
namespace. Will a column name like evil__666__ survive?

Again, I am not AGAINST any proposal, but the people who have to pay the
price, in terms of needing to arrange or pay for development,
documentation, testing and so on, are the ones who need to be convinced.
My point is that in some ways R is a different kind of programming
language than, say, python.

I experimented briefly in python and note that their implementation of
this feature is fairly robust. I mean casting a string to an int works
as expected, as in:

    a = int("1" "_" "122")

which returns 1122. Be warned, though, that the current python
implementation generates an error if you have two or more underscores in
a row, as in:

    a=1__1
    SyntaxError: invalid decimal literal
    a=1___1
    SyntaxError: invalid decimal literal

It does not tolerate one or more underscores at the end either, giving
the same error, and it really gets mad at an initial underscore like _1,
where it asks if you mean "_". A single underscore is a valid variable
name, as are multiple consecutive underscores, and it is often used as
an I DON'T CARE in code like this, although any variable can be used
since the last assignment determines the value:

    (_,_,a) = (1,2,3)
    _
    2
    a
    3

(In the above, you are seeing commands and output alternating, if not
clear.)

And as it happens, half of python's variables contain runs of
underscores, to the point where some say member names like __name__ and
__init__ are called dunder name and dunder init, as in double double
underscore.

And note that python is not that much younger than R/S, and this feature
was added fairly late, in version 3.6, about 5 years ago, long after
version 3.0 made many programs written for version 2.x incompatible.

My point is not python, but that someone may want to see how the
underscore-in-a-number feature is actually implemented in any of the
languages that now allow it, and carefully document exactly in what
circumstances it is allowed in R and also where, if anywhere, it differs
from other such places. If it can be done with a very few localized
changes, great. My objections about making regular expressions more
complex by needing to handle underscores are likely not a major
obstacle, as python supports those too.

Luckily, my opinion is just my own, as I have no direct stake in the
outcome. I personally handle large numbers fine.
Avi
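
The Python behaviour discussed in this message can be checked with a short sketch. It parses a small piece of source with the standard ast module and shows that the separator affects only the numeric literal, while an identifier such as evil__666__ and the text inside a string are left alone. It also checks which strings int() accepts. The helper names (parsed_constants, parsed_names, int_accepts) and the sample source are made up for illustration, and this says nothing about how R or tidyverse functions process their arguments.

```python
import ast


def parsed_constants(source: str) -> list:
    """Values of every constant the parser produced for `source`."""
    tree = ast.parse(source)
    return [n.value for n in ast.walk(tree) if isinstance(n, ast.Constant)]


def parsed_names(source: str) -> set:
    """Identifiers the parser produced for `source`."""
    tree = ast.parse(source)
    return {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}


def int_accepts(text: str) -> bool:
    """True if int() can convert `text`, False if it raises ValueError."""
    try:
        int(text)
    except ValueError:
        return False
    return True


# Hypothetical sample: an underscored name, an underscored number, and a
# string holding a file-name pattern that contains an underscore.
SOURCE = '''
evil__666__ = 1_000_000
pattern = ".*_[0-9]{1,2}[.]csv"
'''

if __name__ == "__main__":
    print(parsed_constants(SOURCE))
    print(parsed_names(SOURCE))
    for text in ("1_122", "1__1", "1_", "_1"):
        print(repr(text), int_accepts(text))
```

The separator is resolved when the numeric token is read, so only code that inspects the raw source text, rather than the parsed tree, could ever see it.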