Judith Flores
2009-Aug-25 20:17 UTC
[R] Regular expression to define contents between parentheses
Hello dear R-helpers,

I haven't been able to figure out, or find in the R-help archives, how to delete all the characters contained in groups of parentheses. I have a vector that looks more or less like this:

myvector <- c("something (80 km/h, sd) & more (6 kg/L,sd)", "somethingelse (48 m/s, sd) & moretoo (50g/L , sd)")

I want to extract all the text that is not contained in parentheses. The goal is to obtain the following new vector:

subvector <- c("something & more", "somethingelse & moretoo")

I tried the following, but this pattern matches everything between the first opening parenthesis and the last closing parenthesis, which makes sense, but it's not what I need:

subvector <- gsub("\\((.*)\\)", "", myvector)

Your help will be very appreciated.

Thank you,

Judith
Bert Gunter
2009-Aug-25 20:41 UTC
[R] Regular expression to define contents between parentheses
Judith:

I believe that this is, indeed, tough; it might require Perl regexes to do entirely within the regular expression language. You might also wish to check out the gsubfn package to see if it could help.

However, a reasonably simple alternative approach that I think will work is to use strsplit():

1. Split on "(".
2. lapply() over the resulting list of vectors and remove all elements from each vector that contain a ")" using, e.g., grep().
3. sapply() paste() on the now "cleaned" list to get back the cleaned-up strings.

I leave it to you to work out details -- or point out why I'm wrong. Alternatively, wait for someone smarter to reply -- which I'm sure will occur given the clarity with which you posed your problem.

Cheers,

Bert Gunter
Genentech Nonclinical Biostatistics

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
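The strsplit() approach outlined above might be sketched as follows. This is only one possible reading of the three steps, with one deliberate variation: dropping every piece that contains a ")" would also discard the text that follows the closing parenthesis (such as " & more "), so the sketch strips each piece up to and including its first ")" instead. The helper name strip_closed and the final whitespace clean-up are assumptions added for illustration.

```r
myvector <- c("something (80 km/h, sd) & more (6 kg/L,sd)",
              "somethingelse (48 m/s, sd) & moretoo (50g/L , sd)")

# Step 1: split each string at every literal "(".
pieces <- strsplit(myvector, "(", fixed = TRUE)

# Step 2 (variation): in each piece, delete everything up to and
# including the first ")". Pieces without a ")" are left unchanged.
# strip_closed is a hypothetical helper name, not part of base R.
strip_closed <- function(p) sub("^[^)]*\\)", "", p)
cleaned <- lapply(pieces, strip_closed)

# Step 3: paste the remaining pieces of each string back together.
result <- sapply(cleaned, paste, collapse = "")

# Extra tidy-up (an assumption): collapse runs of spaces and trim the ends.
result <- trimws(gsub(" +", " ", result))

print(result)
# [1] "something & more"        "somethingelse & moretoo"
```

Note that this sketch assumes the parentheses are balanced and not nested.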
Alexander Shenkin
2009-Aug-25 20:48 UTC
[R] Regular expression to define contents between parentheses
Hi Judith,

This probably isn't the only way to do it, but:

gsub("\\(.*?\\)", "", myvector, perl=TRUE)

seems to do the trick. The problem is that regular expression quantifiers are greedy by default, so you were matching everything between the first and last parens, as you noticed. Putting the question mark after the star makes it a "minimal" matching operation. Apparently this is only implemented in Perl regexes, or at least in that syntax, hence the 'perl=TRUE'.

hth,
allie
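To make the greedy versus minimal distinction concrete, here is a small comparison on the sample vector from the original question. The variable names greedy, lazy, and tidy are chosen only for this illustration, and the last step (collapsing leftover spaces) is an added assumption about the desired output rather than part of the suggestion above.

```r
myvector <- c("something (80 km/h, sd) & more (6 kg/L,sd)",
              "somethingelse (48 m/s, sd) & moretoo (50g/L , sd)")

# Greedy: ".*" runs from the first "(" to the LAST ")" in each string.
greedy <- gsub("\\(.*\\)", "", myvector)
print(greedy)
# [1] "something "     "somethingelse "

# Minimal: ".*?" stops at the first ")" it can reach, so each
# parenthesised group is removed separately.
lazy <- gsub("\\(.*?\\)", "", myvector, perl = TRUE)
print(lazy)
# [1] "something  & more "        "somethingelse  & moretoo "

# The spaces that surrounded each group are left behind; one way to
# clean them up is to collapse runs of spaces and trim the ends.
tidy <- trimws(gsub(" +", " ", lazy))
print(tidy)
# [1] "something & more"        "somethingelse & moretoo"
```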
Gabor Grothendieck
2009-Aug-25 21:02 UTC
[R] Regular expression to define contents between parentheses
Instead of using .*, use [^)]* so that you only match up to the next ). It also seems that you want to trim spaces, so add space-star at the beginning:

gsub(" *\\([^)]*\\)", "", myvector)
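As a quick check of the negated-character-class idea, the sketch below applies it to the sample vector from the original question. It consumes only the spaces in front of each group; the second call shows, for comparison, that also consuming the spaces after each group removes the space before "&". The names subvector and both_sides are used only for this illustration.

```r
myvector <- c("something (80 km/h, sd) & more (6 kg/L,sd)",
              "somethingelse (48 m/s, sd) & moretoo (50g/L , sd)")

# "[^)]*" matches any run of characters that are not ")", so each match
# ends at the nearest closing parenthesis. The leading " *" also removes
# the spaces that precede each group.
subvector <- gsub(" *\\([^)]*\\)", "", myvector)
print(subvector)
# [1] "something & more"        "somethingelse & moretoo"

# For comparison: trimming spaces on both sides of each group also
# swallows the space that separates the first word from "&".
both_sides <- gsub(" *\\([^)]*\\) *", "", myvector)
print(both_sides)
# [1] "something& more"        "somethingelse& moretoo"
```

This version needs no perl = TRUE, but like the other approaches it assumes the parentheses are not nested.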