Roey Angel
2012-Mar-02 08:36 UTC
[R] Why do my regular expressions require a double escape \\ to get a literal??
Hi,

I was recently unfortunate enough to have to use regular expressions to sort out some data in R. I'm working on a data file which contains taxonomic data of bacteria in hierarchical order. A sample of this file can be generated using:

    tax.data <- read.table(header=F, con <- textConnection('
    G9SS7BA01D15EC Bacteria(100) Cyanobacteria(84) unclassified
    G9SS7BA01C9UIR Bacteria(100) Proteobacteria(94) Alphaproteobacteria(89)
    G9SS7BA01CM00D Bacteria(100) Proteobacteria(99) Alphaproteobacteria(99)
    '))
    close(con)

What I am trying to do is remove the parentheses and the number inside them (which could contain a decimal point). I assumed that the following command would do it, but instead I got an error:

    tax.data <- as.data.frame(apply(tax.data, 2, function(x) gsub('\(.*\)','',x)))
    Error: '\(' is an unrecognized escape in character string starting "\("

It doesn't matter whether I use perl = TRUE or not. To solve it I need to use a double escape sign '\\' before the opening and closing parentheses:

    tax.data <- as.data.frame(apply(tax.data, 2, function(x) gsub('\\(.*\\)','',x)))

This yields the desired result, but I wonder why it behaves that way. No other regular expression system I'm used to (e.g. Perl, Shell) works like that.

I'm using R 2.14 (but also R 2.10) and I get the same results on Ubuntu and Windows XP.

I'd appreciate any explanation.

Thanks in advance,
baffled Roey

--
Dr. Roey Angel
Max-Planck-Institute for Terrestrial Microbiology
Karl-von-Frisch-Strasse 10
D-35043 Marburg, Germany
Office: +49 (0)6421/178-832
Mobile: +49 (0)176/612-785-88
Berend Hasselman
2012-Mar-02 10:00 UTC
[R] Why do my regular expressions require a double escape \\ to get a literal??
On 02-03-2012, at 09:36, Roey Angel wrote:

> tax.data <- as.data.frame(apply(tax.data, 2, function(x) gsub('\(.*\)','',x)))
> Error: '\(' is an unrecognized escape in character string starting "\("
> [...]
> I'd appreciate any explanation.

See the section "Character vectors" in the R Intro manual, and ?Quotes.

The regular expression is passed to gsub as a string, and strings have their own escape sequences. For a single \ to reach the regular expression parser, the backslash itself has to be escaped at the string stage: \\

Berend
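The two stages described in this reply can be made visible with a small sketch. The vector `x` is a hypothetical stand-in for one column of `tax.data`, and the variable names are assumptions for illustration only:

```r
# The string literal '\\(' holds two characters: a backslash and "(".
pattern <- '\\(.*\\)'
n_chars <- nchar('\\(')

# cat() shows the string after R's string parser has processed it,
# i.e. exactly what the regular expression engine receives.
cat(pattern, "\n")

# Hypothetical stand-in for one column of tax.data.
x <- c("Bacteria(100)", "Cyanobacteria(84)", "unclassified")
cleaned <- gsub(pattern, "", x)
print(cleaned)
```

Printing with `cat()` rather than `print()` matters here: `print()` re-escapes the backslash for display, which can make it look as though the pattern really contains two backslashes.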
Berend Hasselman
2012-Mar-02 14:31 UTC
[R] Why do my regular expressions require a double escape \\ to get a literal??
On 02-03-2012, at 14:13, Roey Angel wrote:

> Hi Bernard, thanks for the quick reply.
> Of course, I understand that an escape is needed because parentheses are reserved symbols in regular expressions.
> My problem is that if I just use \( I get the error:
>
> Error: '\(' is an unrecognized escape in character string starting "\("
>
> so in order to get a literal ( I need to use \\(
> which is odd cause I've never encountered that in any other language and also all the R manuals don't mention that.

It is not odd, as already mentioned in the previous reply. I have encountered this elsewhere (e.g. awk). You need the \\ because the expression between your quotes is interpreted twice: first as a character string (in which \( is illegal but \\ is legal), and then as a regular expression, in which the literal ( and ) you want to match must be escaped because they are metacharacters.

If you don't like writing the \\, use this instead:

    as.data.frame(apply(tax.data, 2, function(x) gsub('[(].*[)]','',x)))

i.e. put the ( and ) in a character class.

Berend
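As a rough check of the character-class suggestion, the sketch below compares it with the double-backslash form on a few sample values. The vector `x` and the value `99.5` are made up for illustration, and the third pattern (digits and dots only) is an assumption about what "a number that could contain a decimal point" should match, not something proposed in the thread:

```r
# Hypothetical sample values; 99.5 is invented to exercise the decimal-point case.
x <- c("Bacteria(100)", "Proteobacteria(94)",
       "Alphaproteobacteria(99.5)", "unclassified")

# Escaped parentheses: each \\ in the string becomes one \ for the regex engine.
escaped <- gsub('\\(.*\\)', '', x)

# Character classes: ( and ) are literal inside [ ], so no backslash is needed.
bracketed <- gsub('[(].*[)]', '', x)

# Stricter variant (an assumption): only digits and dots between the parentheses.
numeric_only <- gsub('[(][0-9.]+[)]', '', x)

print(bracketed)
```

The stricter variant may be preferable if a field could ever contain parentheses around something other than a confidence score, since `.*` is greedy and removes everything from the first ( to the last ) in the string.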