Judith Flores
2009-Aug-25 20:17 UTC
[R] Regular expression to define contents between parentheses
Hello dear R-helpers,
I haven't been able to figure out, or find in the R-help archives, how to
delete all the characters contained in groups of parentheses. I have a
vector that looks more or less like this:
myvector<-c("something (80 km/h, sd) & more (6 kg/L,sd)",
"somethingelse (48 m/s, sd) & moretoo (50g/L , sd)")
I want to extract all the strings that are not contained in parentheses. The
goal would be to obtain the following new vector:
subvector<-c("something & more", "somethingelse & moretoo")
I tried the following, but this pattern seems to enclose everything between
the first opening parenthesis and the last closing parenthesis, which makes
sense, but it's not what I need:
subvector<-gsub("\\((.*)\\)","",myvector)
Your help will be much appreciated.
Thank you,
Judith
Bert Gunter
2009-Aug-25 20:41 UTC
[R] Regular expression to define contents between parentheses
Judith:
I believe that this is, indeed, tough; it might require Perl regexes to do
entirely within the regular expression language. You might also wish to
check out the gsubfn package to see if it could help.
However, a reasonably simple alternative approach that I think will work is
to use strsplit():
1. Split on "("
2. lapply on the resulting list of vectors and remove all elements from each
vector that contain a ")" using, e.g. grep().
3. sapply paste() on the now "cleaned" list to get back the cleaned up
strings.
I leave it to you to work out details -- or point out why I'm wrong.
Alternatively, wait for someone smarter to reply -- which I'm sure will
occur given the clarity with which you posed your problem.
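A possible working-out of the details, as a sketch rather than a definitive implementation. Note one deviation from the steps as written: splitting on "(" alone leaves "& more" in the same piece as the preceding ")", so dropping every piece that contains ")" would also drop the text to keep. The variant below instead splits on either parenthesis and keeps the odd-numbered pieces. It assumes the parentheses are balanced and not nested; the names pieces, outside and glued are illustrative, and trimws() needs R 3.2.0 or later.

```r
myvector <- c("something (80 km/h, sd) & more (6 kg/L,sd)",
              "somethingelse (48 m/s, sd) & moretoo (50g/L , sd)")

# 1. Split each string at every "(" or ")".
pieces <- strsplit(myvector, "[()]")

# 2. With balanced, non-nested parentheses, pieces 1, 3, 5, ... lie
#    outside the parentheses and pieces 2, 4, ... lie inside them.
outside <- lapply(pieces, function(p) p[seq_along(p) %% 2 == 1])

# 3. Paste the kept pieces back together, one string per input element.
glued <- vapply(outside, paste, character(1), collapse = "")

# Tidy up: trim the ends and collapse runs of spaces left by the removal.
subvector <- gsub(" +", " ", trimws(glued))
subvector
# [1] "something & more"        "somethingelse & moretoo"
```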
Cheers,
Bert Gunter
Genentech Nonclinical Biostatistics
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Alexander Shenkin
2009-Aug-25 20:48 UTC
[R] Regular expression to define contents between parentheses
Hi Judith,
This probably isn't the only way to do it, but:
gsub("\\(.*?\\)", "", myvector, perl=TRUE)
seems to do the trick.
The problem is that quantifiers such as * are greedy by default, so you
were matching everything between the first and last parens, as you noticed.
Putting the question mark after the * makes it a "minimal" (non-greedy)
match. Apparently this is only implemented in Perl regexes, or at least in
that syntax, hence the 'perl=TRUE'.
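To make the greedy versus minimal difference concrete, here is a small side-by-side sketch on the sample data. The variable names greedy and lazy are just illustrative. Note that the minimal match removes only the parenthesised groups themselves, so the spaces around them remain and may still need trimming.

```r
myvector <- c("something (80 km/h, sd) & more (6 kg/L,sd)",
              "somethingelse (48 m/s, sd) & moretoo (50g/L , sd)")

# Greedy: .* runs from the first "(" to the last ")" in each string.
greedy <- gsub("\\(.*\\)", "", myvector)
greedy
# [1] "something "     "somethingelse "

# Minimal: .*? stops at the first ")" it can, so each group is removed
# separately. The surrounding spaces are left in place.
lazy <- gsub("\\(.*?\\)", "", myvector, perl = TRUE)
lazy
# [1] "something  & more "        "somethingelse  & moretoo "
```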
hth,
allie
Gabor Grothendieck
2009-Aug-25 21:02 UTC
[R] Regular expression to define contents between parentheses
Instead of .*, use [^)]* so that the match only extends up to the next ).
It also seems that you want to trim spaces, so add " *" at the beginning
and end:
gsub(" *\\([^)]*\\) *", "", myvector)
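A quick check of the negated-class pattern above on the sample data, offered as a sketch. With these particular strings, consuming spaces on both sides of each group also removes the space in front of "&"; consuming only the leading spaces gives the vector asked for in the original post. The names both_sides and leading_only are illustrative.

```r
myvector <- c("something (80 km/h, sd) & more (6 kg/L,sd)",
              "somethingelse (48 m/s, sd) & moretoo (50g/L , sd)")

# Spaces consumed on both sides of each parenthesised group.
both_sides <- gsub(" *\\([^)]*\\) *", "", myvector)
both_sides
# [1] "something& more"        "somethingelse& moretoo"

# Spaces consumed only in front of each group.
leading_only <- gsub(" *\\([^)]*\\)", "", myvector)
leading_only
# [1] "something & more"        "somethingelse & moretoo"
```

Which variant is appropriate depends on where the spaces sit in the real data; the second one relies on every group being preceded, rather than followed, by the space to drop.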