romain.francois at dbmail.com
2009-Mar-20 18:56 UTC
[Rd] Why does the lexical analyzer drop comments ?
It happens in the token function in gram.c:

    c = SkipSpace();
    if (c == '#') c = SkipComment();

and then SkipComment goes like this:

    static int SkipComment(void)
    {
        int c;
        while ((c = xxgetc()) != '\n' && c != R_EOF) ;
        if (c == R_EOF) EndOfFile = 2;
        return c;
    }

which effectively drops comments. Would it be possible to keep the
information somewhere?

The source code says this:

    *  The function yylex() scans the input, breaking it into
    *  tokens which are then passed to the parser.  The lexical
    *  analyser maintains a symbol table (in a very messy fashion).

so my question is: could we use this symbol table to keep track of, say,
COMMENT tokens?

Why would I even care about that? I'm writing a package that will perform
syntax highlighting of R source code based on the output of the parser,
and it seems a waste to drop the comments. And also, when you print a
function to the R console, you don't get the comments, and some of them
might be useful to the user.

Am I mad if I contemplate looking into this?

Romain

--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr
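(To make the second point concrete: deparse() rebuilds code from the parse
tree, in which no comment token survives, and printing a function falls
back on deparsing whenever source references are not kept. A minimal
illustration:)

    f <- function(x) {
        # double the input
        x * 2
    }
    # the comment lives only in the source text; the deparsed body is
    # reconstructed from the parse tree and comes back without it
    cat(deparse(f), sep = "\n")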
On 3/20/2009 2:56 PM, romain.francois at dbmail.com wrote:

> [...]
> Am I mad if I contemplate looking into this?

Comments are syntactically the same as whitespace. You don't want them to
affect the parsing. If you're doing syntax highlighting, you can determine
the whitespace by looking at the srcref records, and then parse that to
determine what isn't being counted as tokens. (I think you'll find a few
things there besides whitespace, but it is a fairly limited set, so it
shouldn't be too hard to recognize.)

The Rd parser is different, because in an Rd file whitespace is
significant, so it gets kept.

Duncan Murdoch
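(A sketch of the approach Duncan describes, assuming options(keep.source =
TRUE) so that parse() attaches source references: each srcref records the
start and end position of one top-level expression, and the attached
srcfile keeps the text verbatim, so everything that falls outside the
expression ranges, comments included, can be recovered.)

    old <- options(keep.source = TRUE)
    src <- c("x <- 1  # an inline comment",
             "# a full-line comment",
             "y <- 2")
    exprs <- parse(text = src)
    # one srcref per top-level expression: start/end line and column
    refs <- attr(exprs, "srcref")
    # the srcfile retains the lines verbatim, comments and all
    getSrcLines(attr(exprs, "srcfile"), 1, length(src))
    # the text actually covered by the expressions; whatever lies outside
    # these ranges is whitespace or comment
    sapply(refs, function(r) paste(as.character(r), collapse = "\n"))
    options(old)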
Hi Romain,

I've been thinking for quite a long time about how to keep comments when
parsing R code, and finally hit on a trick, with inspiration from one of
my friends: mask the comments in special assignments to "cheat" the R
parser.

    # keep.comment: whether to keep the comments or not
    # keep.blank.line: preserve blank lines or not?
    # begin.comment and end.comment: special identifiers that mark the
    #   original comments as 'begin.comment = "#[ comments ]end.comment"';
    #   these marks are removed after the modified code is parsed
    tidy.source <- function(source = "clipboard", keep.comment = TRUE,
        keep.blank.line = FALSE, begin.comment, end.comment, ...) {
        # parse and deparse the code
        tidy.block = function(block.text) {
            exprs = parse(text = block.text)
            n = length(exprs)
            res = character(n)
            for (i in 1:n) {
                dep = paste(deparse(exprs[i]), collapse = "\n")
                # strip the surrounding 'expression(...)' wrapper
                res[i] = substring(dep, 12, nchar(dep) - 1)
            }
            return(res)
        }
        text.lines = readLines(source, warn = FALSE)
        if (keep.comment) {
            # random identifier for comments
            identifier = function() paste(sample(LETTERS), collapse = "")
            if (missing(begin.comment)) begin.comment = identifier()
            if (missing(end.comment)) end.comment = identifier()
            # remove leading and trailing white space
            text.lines = gsub("^[[:space:]]+|[[:space:]]+$", "", text.lines)
            # make sure the identifiers are not in the code, or the
            # original code might be modified
            while (length(grep(sprintf("%s|%s", begin.comment, end.comment),
                text.lines))) {
                begin.comment = identifier()
                end.comment = identifier()
            }
            head.comment = substring(text.lines, 1, 1) == "#"
            # add identifiers to comment lines to cheat the R parser
            if (any(head.comment)) {
                text.lines[head.comment] = gsub("\"", "'",
                    text.lines[head.comment])
                text.lines[head.comment] = sprintf("%s=\"%s%s\"",
                    begin.comment, text.lines[head.comment], end.comment)
            }
            # keep blank lines?
            blank.line = text.lines == ""
            if (any(blank.line) && keep.blank.line)
                text.lines[blank.line] = sprintf("%s=\"%s\"",
                    begin.comment, end.comment)
            text.tidy = tidy.block(text.lines)
            # remove the identifiers
            text.tidy = gsub(sprintf("%s = \"|%s\"", begin.comment,
                end.comment), "", text.tidy)
        } else {
            text.tidy = tidy.block(text.lines)
        }
        cat(paste(text.tidy, collapse = "\n"), "\n", ...)
        invisible(text.tidy)
    }

The above function can deal with comments that occupy whole lines, e.g.

    f = tempfile()
    writeLines('
    # rotation of the word "Animation"
    # in a loop; change the angle and color
    # step by step
    for (i in 1:360) {
    # redraw the plot again and again
    plot(1,ann=FALSE,type="n",axes=FALSE)
    # rotate; use rainbow() colors
    text(1,1,"Animation",srt=i,col=rainbow(360)[i],cex=7*i/360)
    # pause for a while
    Sys.sleep(0.01)}
    ', f)

Then parse the code in the file 'f':

    > tidy.source(f)
    # rotation of the word 'Animation'
    # in a loop; change the angle and color
    # step by step
    for (i in 1:360) {
        # redraw the plot again and again
        plot(1, ann = FALSE, type = "n", axes = FALSE)
        # rotate; use rainbow() colors
        text(1, 1, "Animation", srt = i, col = rainbow(360)[i],
            cex = 7 * i/360)
        # pause for a while
        Sys.sleep(0.01)
    }

Of course this function has its limitations: it does not support inline
comments, or comments attached to incomplete lines of code. Peter's
example

    f #here
    ( #here
    a #here (possibly)
    = #here
    1 #this one belongs to the argument, though
    ) #but here as well

will be parsed as

    f (a = 1)

I'm quite interested in syntax highlighting of R code and saw your
previous discussions in other posts (with Jose Quesada, etc.). I'd like
to do something for your package if I can be of some help.
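(The heart of the trick, reduced to a single comment line, with fixed
markers BEGIN/END standing in for the randomly generated identifiers; a
stripped-down sketch, not the function above:)

    line <- '# a comment with "quotes"'
    # double quotes become single quotes so the comment can sit in a string
    masked <- sprintf('BEGIN="%sEND"', gsub('"', "'", line))
    masked                          # now an ordinary, parseable assignment
    expr <- parse(text = masked)    # the parser keeps it as real code
    # after deparsing, strip the markers to recover the comment
    gsub('BEGIN = "|END"', "", deparse(expr[[1]]))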
Regards,
Yihui

--
Yihui Xie <xieyihui at gmail.com>
Phone: +86-(0)10-82509086 Fax: +86-(0)10-82509086
Mobile: +86-15810805877
Homepage: http://www.yihui.name
School of Statistics, Room 1037, Mingde Main Building,
Renmin University of China, Beijing, 100872, China