Bill Dunlap
2022-Jul-15 19:34 UTC
[Rd] Feature Request: Allow Underscore Separated Numbers
The token '._1' (period underscore digit) is currently parsed as a symbol
(name). It would become a number if underscore were ignored as in the first
proposal. The just-between-digits alternative would avoid this change.

-Bill

On Fri, Jul 15, 2022 at 12:26 PM Jim Hester <james.f.hester at gmail.com> wrote:

> I think keeping it simple and less restrictive is the best approach,
> for ease of implementation, limiting future maintenance, and so users
> have the flexibility to format these however they wish. So I would
> probably lean towards allowing multiple delimiters anywhere (including
> trailing) or possibly just between digits.
>
> On Fri, Jul 15, 2022 at 2:26 PM Duncan Murdoch <murdoch.duncan at gmail.com>
> wrote:
> >
> > Thanks for posting that list. The Python document is the only one I've
> > read so far; it has a really nice summary
> > (https://peps.python.org/pep-0515/#prior-art) of the differences in
> > implementations among 10 languages. Which choice would you recommend,
> > and why?
> >
> > - I think Ivan's quick solution doesn't quite match any of them.
> > - C, Fortran and C++ have special support in R, but none of them use
> >   underscore separators.
> > - C++ does support separators, but uses "'", not "_", and some ancient
> >   forms of Fortran ignore embedded spaces.
> >
> > Duncan Murdoch
> >
> > On 15/07/2022 1:58 p.m., Jim Hester wrote:
> > > Allowing underscores in numeric literals is becoming a very common
> > > feature in computing languages. All of these languages (and more) now
> > > support it
> > >
> > > python: https://peps.python.org/pep-0515/
> > > javascript: https://v8.dev/features/numeric-separators
> > > julia: https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/#Floating-Point-Numbers
> > > java: https://docs.oracle.com/javase/7/docs/technotes/guides/language/underscores-literals.html#:~:text=In%20Java%20SE%207%20and,the%20readability%20of%20your%20code.
> > > ruby: https://docs.ruby-lang.org/en/2.0.0/syntax/literals_rdoc.html#label-Numbers
> > > perl: https://perldoc.perl.org/perldata#Scalar-value-constructors
> > > rust: https://doc.rust-lang.org/rust-by-example/primitives/literals.html
> > > C#: https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/builtin-types/floating-point-numeric-types#real-literals
> > > go: https://go.dev/ref/spec#Integer_literals
> > >
> > > Its use in this context also dates back to at least Ada 83
> > > (http://archive.adaic.com/standards/83lrm/html/lrm-02-04.html#:~:text=A%20decimal%20literal%20is%20a,the%20base%20is%20implicitly%20ten).&text=An%20underline%20character%20inserted%20between,value%20of%20this%20numeric%20literal.)
> > >
> > > Many other communities see the benefit of this feature, I think R's
> > > community would benefit from it as well.
> > >
> > > On Fri, Jul 15, 2022 at 1:22 PM Ivan Krylov <krylov.r00t at gmail.com> wrote:
> > >>
> > >> On Fri, 15 Jul 2022 11:25:32 -0400
> > >> <avi.e.gross at gmail.com> wrote:
> > >>
> > >>> R normally delays evaluation so chunks of code are handed over
> > >>> untouched to functions that often play with the text directly without
> > >>> evaluating it until, perhaps, much later.
> > >>
> > >> Do they play with the text, or with the syntax tree after it went
> > >> through the parser? While it's true that R saves the source text of the
> > >> functions for ease of debugging, it's not guaranteed that a given
> > >> object will have source references, and typical NSE functions operate
> > >> on language objects which are tree-like structures containing R values,
> > >> not source text.
> > >>
> > >> You are, of course, right that any changes to the syntax of the
> > >> language must be carefully considered, but if anyone wants to play with
> > >> this idea, it can be implemented in a very simple manner:
> > >>
> > >> --- src/main/gram.y (revision 82598)
> > >> +++ src/main/gram.y (working copy)
> > >> @@ -2526,7 +2526,7 @@
> > >>      YYTEXT_PUSH(c, yyp);
> > >>      /* We don't care about other than ASCII digits */
> > >>      while (isdigit(c = xxgetc()) || c == '.' || c == 'e' || c == 'E'
> > >> -           || c == 'x' || c == 'X' || c == 'L')
> > >> +           || c == 'x' || c == 'X' || c == 'L' || c == '_')
> > >>      {
> > >>          count++;
> > >>          if (c == 'L') /* must be at the end. Won't allow 1Le3 (at present). */
> > >> @@ -2533,6 +2533,9 @@
> > >>          { YYTEXT_PUSH(c, yyp);
> > >>              break;
> > >>          }
> > >> +        if (c == '_') { /* allow an underscore anywhere inside the literal */
> > >> +            continue;
> > >> +        }
> > >>
> > >>          if (c == 'x' || c == 'X') {
> > >>              if (count > 2 || last != '0') break; /* 0x must be first */
> > >>
> > >> To an NSE function, the underscored literals are indistinguishable from
> > >> normal ones, because they don't see the literals:
> > >>
> > >> stopifnot(all.equal(\() 1000000, \() 1_000_000))
> > >> f <- function(x, y) stopifnot(all.equal(substitute(x), substitute(y)))
> > >> f(1e6, 1_000_000)
> > >>
> > >> Although it's true that the source references change as a result:
> > >>
> > >> lapply(
> > >>   list(\() 1000000, \() 1_000_000),
> > >>   \(.) as.character(getSrcref(.))
> > >> )
> > >> # [[1]]
> > >> # [1] "\\() 1000000"
> > >> #
> > >> # [[2]]
> > >> # [1] "\\() 1_000_000"
> > >>
> > >> This patch is somewhat simplistic: it allows both multiple underscores
> > >> in succession and underscores at the end of the number literal. Perl
> > >> does so too, but with a warning:
> > >>
> > >> perl -wE'say "true" if 1__000_ == 1000'
> > >> # Misplaced _ in number at -e line 1.
> > >> # Misplaced _ in number at -e line 1.
> > >> # true
> > >>
> > >> --
> > >> Best regards,
> > >> Ivan
> > >>
> > >> ______________________________________________
> > >> R-devel at r-project.org mailing list
> > >> https://stat.ethz.ch/mailman/listinfo/r-devel
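The difference between the two placement rules mentioned in this message can be illustrated with a minimal Python sketch. It is not R's parser and not either patch from this thread: the simplified decimal grammar (no hex, no `L` suffix) and the function names `is_number_ignoring_underscores` and `is_number_between_digits_only` are assumptions made only for illustration.

```python
import re

# Hypothetical, simplified grammar for a decimal literal without underscores.
# It deliberately ignores hex literals and the integer suffix 'L'.
PLAIN_NUMBER = re.compile(r"([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?")

# An underscore is "misplaced" if it is not preceded by an ASCII digit
# or not followed by one.
MISPLACED = re.compile(r"(?<![0-9])_|_(?![0-9])")


def is_number_ignoring_underscores(token: str) -> bool:
    """Rule A (assumed): drop every underscore, then test the plain grammar."""
    return PLAIN_NUMBER.fullmatch(token.replace("_", "")) is not None


def is_number_between_digits_only(token: str) -> bool:
    """Rule B (assumed): underscores are legal only between two digits."""
    if MISPLACED.search(token):
        return False
    return PLAIN_NUMBER.fullmatch(token.replace("_", "")) is not None
```

Under these assumptions, `._1` counts as a number for the first function but not for the second, while `1_000_000` is accepted by both. The sketch does not model the further restriction to the significand; it only contrasts "anywhere" with "between digits".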
Ivan Krylov
2022-Jul-16 09:24 UTC
[Rd] Feature Request: Allow Underscore Separated Numbers
On Fri, 15 Jul 2022 12:34:24 -0700
Bill Dunlap <williamwdunlap at gmail.com> wrote:

> The token '._1' (period underscore digit) is currently parsed as a
> symbol (name). It would become a number if underscore were ignored
> as in the first proposal. The just-between-digits alternative would
> avoid this change.

Thanks for spotting this! Here's a patch that allows underscores only
between digits and only inside the significand of a number:

--- src/main/gram.y (revision 82598)
+++ src/main/gram.y (working copy)
@@ -2526,7 +2526,7 @@
     YYTEXT_PUSH(c, yyp);
     /* We don't care about other than ASCII digits */
     while (isdigit(c = xxgetc()) || c == '.' || c == 'e' || c == 'E'
-           || c == 'x' || c == 'X' || c == 'L')
+           || c == 'x' || c == 'X' || c == 'L' || c == '_')
     {
         count++;
         if (c == 'L') /* must be at the end. Won't allow 1Le3 (at present). */
@@ -2538,11 +2538,16 @@
             if (count > 2 || last != '0') break; /* 0x must be first */
             YYTEXT_PUSH(c, yyp);
             while(isdigit(c = xxgetc()) || ('a' <= c && c <= 'f') ||
-                  ('A' <= c && c <= 'F') || c == '.') {
+                  ('A' <= c && c <= 'F') || c == '.' || c == '_') {
                 if (c == '.') {
                     if (seendot) return ERROR;
                     seendot = 1;
                 }
+                if (c == '_') {
+                    /* disallow underscores following 0x or followed by non-digit */
+                    if (nd == 0 || typeofnext() >= 2) break;
+                    continue;
+                }
                 YYTEXT_PUSH(c, yyp);
                 nd++;
             }
@@ -2588,6 +2593,11 @@
                     break;
                 seendot = 1;
             }
+            /* underscores in significand followed by a digit must be skipped */
+            if (c == '_') {
+                if (seenexp || typeofnext() >= 2) break;
+                continue;
+            }
             YYTEXT_PUSH(c, yyp);
             last = c;
         }

--
Best regards,
Ivan
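The rule stated in this message (underscores only between digits, and only inside the significand) can be sketched as a small stand-alone scanner. This is a rough Python illustration of the stated rule for decimal literals, not a translation of the gram.y patch: the function `scan_decimal`, its return convention, and the omission of hex literals and the `L` suffix are all assumptions.

```python
def is_digit(ch: str) -> bool:
    """True for a single ASCII digit; False for '' or any other character."""
    return "0" <= ch <= "9"


def scan_decimal(src: str) -> tuple[str, int]:
    """Scan a decimal literal at the start of `src` (hypothetical helper).

    Returns (literal text with underscores removed, characters consumed).
    Returns ("", 0) if no digit was consumed, i.e. there is no number here.
    """
    out = []
    i = 0
    seen_dot = False
    seen_exp = False
    while i < len(src):
        c = src[i]
        prev = src[i - 1] if i > 0 else ""
        nxt = src[i + 1] if i + 1 < len(src) else ""
        if is_digit(c):
            out.append(c)
        elif c == "." and not seen_dot and not seen_exp:
            seen_dot = True
            out.append(c)
        elif c in "eE" and not seen_exp and any(is_digit(ch) for ch in out):
            # An exponent needs at least one digit, optionally after a sign.
            j = i + 1
            if j < len(src) and src[j] in "+-":
                j += 1
            if j >= len(src) or not is_digit(src[j]):
                break
            seen_exp = True
            out.extend(src[i:j])  # copy 'e' and the optional sign
            i = j
            continue
        elif c == "_":
            # Skip the underscore only in the significand and only when it
            # sits directly between two digits; otherwise the literal ends.
            if seen_exp or not is_digit(prev) or not is_digit(nxt):
                break
        else:
            break
        i += 1
    if not any(is_digit(ch) for ch in out):
        return "", 0
    return "".join(out), i
```

With this sketch, `scan_decimal("1_000_000")` consumes the whole input, while a doubled, trailing, or exponent underscore simply ends the literal at that point, leaving the rest of the input for the caller to handle. Whether the remainder should then be an error or a separate token is a design choice this sketch does not make.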