avi.e.gross at gmail.com
2022-Jul-15 15:25 UTC
[Rd] Feature Request: Allow Underscore Separated Numbers
André,

I am not saying a change cannot be done and am not familiar enough with the internals of R. If you just want the interpreter to evaluate CONSTANTS in the code as what you consider syntactic sugar and replace 1_000 with 1000, that sounds superficially possible. But is it? R normally delays evaluation so chunks of code are handed over untouched to functions that often play with the text directly without evaluating it until, perhaps, much later. And I have pointed out how much work is done with things like regular expressions or reading things in from a file that is not done in the REPL but in functions behind the scenes. So if there is any way for a number to slide in without being modified, or places where you want the darn underscores preserved, you may well cause a glitch.

Languages that design in the ability have obviously dealt with issues and presumably anyone writing code anew can use a new definition in their work so they handle such numbers. I am not saying such a change cannot be done, simply that existing languages are careful about making changes as they strive to retain compatibility.

So even assuming your statement about not needing to change as.numeric or read.csv functions is true, aren't you introducing a change in which the users will inadvertently use the feature in strings or files and assume it is a globally recognized feature? I use CSV files and other such formats quite a bit as a way to exchange data between R and other environments and unless they all change and allow underscores in numbers, there can be issues. So, yes, you are suggesting nothing in R will write out numbers with underscores. But if others do and you import the data into R with a reader that does not understand, we have anomalies.

I am not arguing with anyone about this. Like many proposed features, it sounds reasonable just by itself.
But for a language that was crafted and then modified many times, the burden is often on those wanting a change to convince us that it can be done benignly, effectively and cheaply AND that it is more worthwhile than a thousand other pending ideas already submitted.

I have never used str2lang() in my life directly so would changing that really help if as.numeric() and other such functions were left alone and did not call it? What if I read in a .CSV a line at a time and use various methods including regular expressions to split the line into parts and then make the parts into numbers based on some primitive algorithm that maps digits 0-9 into small integers 0-9 and then positionally multiplies digits to the left by 10 for each level and adds them up? Will that algorithm know about underscores and not only ignore them but keep track of how many times it multiplies the other parts by 10? Sure, we can write a new algorithm with added complexity but in my view, we can solve the problem in the few cases it matters without such a change.

Had this been built in originally, maybe not a problem. But consider the enormous expense of UNICODE and the truly major upheaval needed to get it working at a time when lots of code using pointers had a reasonable expectation that all characters took up the same number of bytes, and calculating the length of a string could be done by simply subtracting one pointer from another. Now, you actually have to read the entire string and count code points, or keep the length as a part of the structure that is changed any time it changes and so on. But arguably UNICODE support is now required in many cases.

So, yes, underscores in numbers may become commonplace and cause headaches for a while. But mathematically, I don't see them as needed and see many ways to allow a programmer to see what a number is without any problems in the few times they want it.
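A rough sketch of the kind of hand-rolled digit parser described above, using only base R. The names naive_parse_int() and parse_int_underscore() are made up for illustration and are not existing functions; the sketch handles non-negative decimal integers only.

```r
# Hypothetical helper: accumulate digits left to right, multiplying the
# running total by 10 at each position. It knows nothing about "_".
naive_parse_int <- function(s) {
  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  total <- 0
  for (ch in chars) {
    digit <- match(ch, as.character(0:9)) - 1  # NA for any non-digit
    if (is.na(digit)) stop("unexpected character: ", ch)
    total <- total * 10 + digit
  }
  total
}

# Hypothetical underscore-aware variant: drop "_" first, so separators
# never count as a decimal position.
parse_int_underscore <- function(s) {
  naive_parse_int(gsub("_", "", s, fixed = TRUE))
}

naive_parse_int("1000")
parse_int_underscore("1_000")
```

The unmodified parser stops with an error on "1_000", which is the sort of anomaly a reader that does not understand underscores would hit; the tolerant variant costs one extra gsub() call.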
Cut and paste in code can easily take out any snippet accurately and pluck it into a function that displays it with commas or whatever. But definitely, lazy humans constantly make mistakes and even with this would still make some.

But if R developers seem confident this change can be done, go for it! Numeric literals, like other constants, have often been something compiled languages have optimized out of the way, such as combining multiple instances of the same one into one memory location.

Avi

From: GILLIBERT, Andre <Andre.Gillibert at chu-rouen.fr>
Sent: Friday, July 15, 2022 2:31 AM
To: avi.e.gross at gmail.com; r-devel at r-project.org
Subject: RE: [Rd] Feature Request: Allow Underscore Separated Numbers

On 2022-07-14 8:21 p.m., avi.e.gross at gmail.com wrote:
> Devin,
>
> I cannot say anyone wants to tweak R after the fact to accept numeric
> items with underscores as that might impact all kinds of places.
>

I am not sure that the feature request of Devin Marlin was correctly understood. I guess that he thought about adding syntactic sugar to numeric literals in the language. Functions such as as.numeric() or read.csv() would not be changed.

The main difference would be to make valid code that currently is a "syntax error", such as:

    > 3*100_000
    Error: unexpected input in "3*100_"

Breaking code with that feature is possible but improbable. Indeed, code expecting that str2lang("3*100_000") raises a syntax error (catching the error with try) would break. Most code generating other code and then parsing it with str2lang() should be fine, because it would generate old-style code with normal numeric constants.

--
Sincerely
André GILLIBERT
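To make the behaviour discussed above concrete, here is a small base R sketch. It assumes an R version whose parser does not accept underscores in numeric literals (true at the time of this thread); the variable names are illustrative only.

```r
# as.numeric() converts strings and does not go through the R parser:
# it does not understand "_" and returns NA (with a warning).
x <- suppressWarnings(as.numeric("100_000"))
is.na(x)

# Parsing the same text as R code goes through the parser instead.
# Without underscore support this is a syntax error, caught here.
res <- tryCatch(str2lang("3*100_000"),
                error = function(e) conditionMessage(e))

# A string-level workaround that needs no language change:
y <- as.numeric(gsub("_", "", "100_000", fixed = TRUE))

# Displaying a number with separators for human readers:
format(1234567, big.mark = ",")
```

Only the str2lang() call would change its result if the parser learned about underscores; the as.numeric() line would behave the same either way, which is the distinction drawn in the quoted message.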
Ivan Krylov
2022-Jul-15 17:21 UTC
[Rd] Feature Request: Allow Underscore Separated Numbers
On Fri, 15 Jul 2022 11:25:32 -0400
<avi.e.gross at gmail.com> wrote:

> R normally delays evaluation so chunks of code are handed over
> untouched to functions that often play with the text directly without
> evaluating it until, perhaps, much later.

Do they play with the text, or with the syntax tree after it went through the parser? While it's true that R saves the source text of the functions for ease of debugging, it's not guaranteed that a given object will have source references, and typical NSE functions operate on language objects which are tree-like structures containing R values, not source text.

You are, of course, right that any changes to the syntax of the language must be carefully considered, but if anyone wants to play with this idea, it can be implemented in a very simple manner:

--- src/main/gram.y     (revision 82598)
+++ src/main/gram.y     (working copy)
@@ -2526,7 +2526,7 @@
     YYTEXT_PUSH(c, yyp);
     /* We don't care about other than ASCII digits */
     while (isdigit(c = xxgetc()) || c == '.' || c == 'e' || c == 'E'
-           || c == 'x' || c == 'X' || c == 'L')
+           || c == 'x' || c == 'X' || c == 'L' || c == '_')
     {
         count++;
         if (c == 'L') /* must be at the end.  Won't allow 1Le3 (at present). */
@@ -2533,6 +2533,9 @@
         {   YYTEXT_PUSH(c, yyp);
             break;
         }
+        if (c == '_') { /* allow an underscore anywhere inside the literal */
+            continue;
+        }

         if (c == 'x' || c == 'X') {
             if (count > 2 || last != '0') break;  /* 0x must be first */

To an NSE function, the underscored literals are indistinguishable from normal ones, because they don't see the literals:

    stopifnot(all.equal(\() 1000000, \() 1_000_000))
    f <- function(x, y) stopifnot(all.equal(substitute(x), substitute(y)))
    f(1e6, 1_000_000)

Although it's true that the source references change as a result:

    lapply(
     list(\() 1000000, \() 1_000_000),
     \(.) as.character(getSrcref(.))
    )
    # [[1]]
    # [1] "\\() 1000000"
    #
    # [[2]]
    # [1] "\\() 1_000_000"

This patch is somewhat simplistic: it allows both multiple underscores in succession and underscores at the end of the number literal. Perl does so too, but with a warning:

    perl -wE'say "true" if 1__000_ == 1000'
    # Misplaced _ in number at -e line 1.
    # Misplaced _ in number at -e line 1.
    # true

--
Best regards,
Ivan
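A stricter placement rule (single underscores, only between digits) could be prototyped at the string level before touching gram.y. The following is only a sketch for plain decimal integers; valid_underscore_literal() is a hypothetical name, not an existing R function, and the rule itself is one possible choice rather than anything decided in this thread.

```r
# Hypothetical validator: one or more digits, then zero or more groups of
# "_" followed by at least one digit. Rejects "__", leading and trailing "_".
valid_underscore_literal <- function(s) {
  grepl("^[0-9]+(_[0-9]+)*$", s)
}

candidates <- c("1_000_000", "1__000", "1000_", "_1000")
valid_underscore_literal(candidates)
```

Under this rule the Perl example 1__000_ would be rejected outright instead of accepted with a warning.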