John Wiedenhoeft
2008-Nov-08 12:20 UTC
[R] Parsing regular expressions differently - feature request
Hi there,

I rejoiced when I realized that you can use Perl regex from within R. However, as the FAQ states: "Some functions, particularly those involving regular expression matching, themselves use metacharacters, which may need to be escaped by the backslash mechanism. In those cases you may need a quadruple backslash to represent a single literal one."

I was wondering if that is really necessary for perl=TRUE. Wouldn't it be possible to parse a string differently in a regex context, e.g. automatically insert \\ for each \ , such that you can use the Perl syntax directly? For example, if you want to input a newline as a character, you would use \n anyway. At the moment one writes \\n to make it clear to R that you mean \n, which in turn makes it clear to PCRE that you mean newline... this is pretty annoying. How likely is it that you want to pass a real newline character to PCRE directly?

If it's at all possible to pass everything between " and " directly to PCRE without expanding it internally in R, please add this to a future version (as an option like noescape=TRUE perhaps?)! I would love to use R instead of Perl for working with regex, without having to do two levels of escape all the time.

Thanks,
John
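The two levels of escaping described above can be seen directly in a short R session. This is only an illustrative sketch; the variable names `pat`, `x` and `path` are made up for the example and are not from the original post.

```r
# Level 1: the R parser processes escapes in the string literal.
pat <- "\\n"          # source text has two backslashes; the value is backslash + n
nchar(pat)            # 2
nchar("\n")           # 1 (a real newline character)

x <- "line one\nline two"

# Level 2: the regex engine processes its own escapes.
# Both patterns match here: PCRE reads backslash-n as "newline",
# and a real newline character in the pattern matches itself.
grepl("\\n", x, perl = TRUE)   # TRUE
grepl("\n",  x, perl = TRUE)   # TRUE

# The "quadruple backslash" case from the FAQ: matching one literal backslash.
path <- "C:\\temp"             # the value is C:\temp
gsub("\\\\", "/", path, perl = TRUE)   # "C:/temp"
```

For the newline case the extra backslash is therefore optional, but for escapes that R itself does not know (such as \d or \w) the doubled form is required in an ordinary string literal.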
Gabor Grothendieck
2008-Nov-08 13:51 UTC
[R] Parsing regular expressions differently - feature request
Some feature to simplify entry of backslashes has been mentioned many times and keeps coming up from time to time. It would be useful not only for regexps but also for LaTeX and Windows path names, and I too hope that it will be addressed.
Duncan Murdoch
2008-Nov-08 14:41 UTC
[R] Parsing regular expressions differently - feature request
On 08/11/2008 7:20 AM, John Wiedenhoeft wrote:
> I was wondering if that is really necessary for perl=TRUE? wouldn't it be
> possible to parse a string differently in a regex context, e.g. automatically
> insert \\ for each \ , such that you can use the perl syntax directly? For
> example, if you want to input a newline as a character, you would use \n
> anyway. At the moment one says \\n to make it clear to R that you mean \n to
> make clear that you mean newline... this is pretty annoying. How likely is it
> that you want to pass a real newline character to PCRE directly?

No, that's not possible. At the level where the parsing takes place R has no idea of its eventual use, so it can't tell that some strings are going to be interpreted as Perl, and others not.

As Gabor mentioned, there have been various discussions of adding a new syntax for strings that are parsed literally, without processing any escapes, but no consensus on the right syntax to use.

There are currently some fragile tricks that let you avoid escapes, e.g. using scan() to read a line:

    > re <- scan(what="", n=1)
    1: [^\\]
    Read 1 item
    > re
    [1] "[^\\\\]"

(I call this fragile because it works in scripts processed at console level, but not if you type the same thing into a function.)

So I agree, it would be nice to have new syntax to allow this. Last time this came up, I argued for something like \verb in LaTeX where the delimiter could be specified differently in each use. Duncan TL suggested triple quotes, as in Python. I think now that triple quotes would be better than the particular form I suggested.
Duncan Murdoch
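The point about parsing can be illustrated with a small sketch: by the time any function receives a string, the escapes have already been processed, so the function cannot tell how the literal was typed. The second half uses the raw character constant syntax r"(...)", which was added in R 4.0.0, long after this thread; it assumes R >= 4.0.0 and will not parse on older versions. The names `a`, `b`, `raw_pat` and `raw_path` are made up for the example.

```r
# Two different spellings in the source produce the same value,
# so grepl() and friends cannot know which one was typed.
a <- "\\d+"
b <- paste0("\\", "d+")
identical(a, b)                 # TRUE
strsplit(a, "")[[1]]            # three characters: backslash, d, +

# Assumes R >= 4.0.0: raw character constants skip escape processing,
# so the pattern can be written as it would be in Perl.
raw_pat <- r"(\d+)"
identical(raw_pat, a)           # TRUE
grepl(r"(^\d+$)", "2008", perl = TRUE)   # TRUE

# The same syntax covers Windows path names.
raw_path <- r"(C:\temp\new)"
identical(raw_path, "C:\\temp\\new")     # TRUE
```

The raw form is purely a parser feature: the resulting object is an ordinary character vector with no record of how it was entered.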
William Dunlap
2008-Nov-18 18:36 UTC
[R] Parsing regular expressions differently - feature request
Duncan Murdoch murdoch at stats.uwo.ca Sat Nov 8 15:41:34 CET 2008 wrote:
> No, that's not possible. At the level where the parsing takes place R
> has no idea of its eventual use, so it can't tell that some strings are
> going to be interpreted as Perl, and others not.
>
> As Gabor mentioned, there have been various discussions of adding a new
> syntax for strings that are parsed literally, without processing any
> escapes, but no consensus on the right syntax to use.
> ... [scan() example elided] ...
> So I agree, it would be nice to have new syntax to allow this. Last
> time this came up, I argued for something like \verb in LaTeX where the
> delimiter could be specified differently in each use. Duncan TL
> suggested triple quotes, as in Python. I think now that triple quotes
> would be better than the particular form I suggested.
>
> Duncan Murdoch

Would a string with this alternate quoting be tagged (e.g., with a class that inherits from character) so that the deparser could display it in the style in which it was input? Functions which generate file names using the native Windows notation would like to have them displayed without the extra backslashes. However, adding a new class for this could mess up other things.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
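A rough sketch of the display issue raised here, and of one way a tagging class might look. The class name `verbatim` and the helper `as_verbatim()` are hypothetical, invented for this illustration; nothing like them exists in base R, and this is not a proposal for how R should implement it.

```r
p <- "C:\\Program Files\\R"

print(p)                 # [1] "C:\\Program Files\\R"   (escaped form)
writeLines(p)            # C:\Program Files\R           (actual characters)
writeLines(deparse(p))   # "C:\\Program Files\\R"       (source form, with quotes)

# deparse() output parses back to the same value.
identical(eval(parse(text = deparse(p))), p)   # TRUE

# Hypothetical tagging class whose print method shows the characters as typed.
as_verbatim <- function(x) structure(x, class = c("verbatim", "character"))
print.verbatim <- function(x, ...) {
  writeLines(unclass(x))
  invisible(x)
}

v <- as_verbatim(p)
print(v)                 # C:\Program Files\R

# One way such a class can be fragile: common operations drop the tag.
class(paste0(v, "\\bin"))   # "character"
```

The last line hints at the kind of side effect mentioned above: the tag survives only as long as no function strips attributes, so display would silently revert to the escaped form.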