thr3ads.net - R help - [R] help with regular expressions in R [Aug 2009]

Mark Kimpel

2009-Aug-20 15:30 UTC

[R] help with regular expressions in R

I'm having trouble achieving the results I want using a regular expression.
I want to eliminate all characters that fall within square brackets, as well
as the brackets themselves, returning an "". I'm not sure if it's R's use of
double-backslash escapes or something else that is tripping me up. If I use
only one backslash I get
1: '\[' is an unrecognized escape in a character string
2: '\]' is an unrecognized escape in a character string
3: unrecognized escapes removed from "\[*.\]"

Below is my self-contained code followed by sessionInfo().

Thanks in advance for your help. I'm going to be doing a lot of text mining
in the near future. I have an excellent O'Reilly book on regexes. What is
the best reference for R's special treatment of these animals?
Mark


myCharVec <- c("[the rain in spain]", "(the rain in spain)")
gsub('\\[*.\\]', '', myCharVec)

#what I get
# [1] "[the rain in spai"   "(the rain in spain)"

#what I want
# [1] ""   "(the rain in spain)"

> sessionInfo()
R version 2.10.0 Under development (unstable) (2009-08-12 r49193)
x86_64-unknown-linux-gnu

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] RWeka_0.3-20 tm_0.4

loaded via a namespace (and not attached):
[1] grid_2.10.0 rJava_0.6-3 slam_0.1-3


------------------------------------------------------------
Mark W. Kimpel MD  ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine

15032 Hunter Court, Westfield, IN  46074

(317) 490-5129 Work, & Mobile & VoiceMail

"The real problem is not whether machines think but whether men do."
-- B. F. Skinner
******************************************************************


jim holtman

2009-Aug-20 15:42 UTC

[R] help with regular expressions in R

How about this:

> myCharVec <- c("[the rain in spain]", "(the rain in spain)")
> gsub('\\[.*\\]', '', myCharVec)
[1] ""                    "(the rain in spain)"

You had "*." when you should have ".*".
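
For anyone following along, here is a small sketch (not part of the original replies) of what each pattern actually matches on the poster's input. It uses only base R (`regexpr()`, `regmatches()`, `gsub()`); the names `typo_pat` and `fixed_pat` are made up for illustration.

```r
# Same input as in the original post.
myCharVec <- c("[the rain in spain]", "(the rain in spain)")

# '\\[*.\\]' reads as: zero or more "[", then exactly one character, then "]".
# In the first string the only place that fits is the trailing "n]".
typo_pat <- "\\[*.\\]"
matched <- regmatches(myCharVec, regexpr(typo_pat, myCharVec))
print(matched)
typo_result <- gsub(typo_pat, "", myCharVec)
print(typo_result)

# '\\[.*\\]' reads as: one "[", then any run of characters, then "]".
fixed_pat <- "\\[.*\\]"
fixed_result <- gsub(fixed_pat, "", myCharVec)
print(fixed_result)
```

Looking at the matched substring first, rather than at the result of the substitution, tends to make this kind of quantifier mix-up easier to spot.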


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

davidr at rhotrading.com

2009-Aug-20 15:43 UTC

[R] help with regular expressions in R

Possibly just a typo:

> gsub('\\[.*\\]', '', myCharVec)
           ^^
[1] ""                    "(the rain in spain)"

HTH,
-- David



Chuck Taylor

2009-Aug-20 15:46 UTC

[R] help with regular expressions in R

Mark,

Try this:

> myCharVec
[1] "[the rain in spain]" "(the rain in spain)"
> gsub("\\[.*\\]", "", myCharVec)
[1] ""                    "(the rain in spain)"

You need two backslashes to "escape" each square bracket. The regular
expression "\\[.*\\]" translates to "a [ followed by 0 or more instances
of any character followed by ]".
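
A short sketch of why the backslash is doubled, added here as an illustration rather than as part of the original reply. R's string parser consumes one level of backslashes before the regex engine sees the pattern. The bracket-expression form `alt` and the `fixed = TRUE` call are alternatives not mentioned in the thread, and the variable names are hypothetical.

```r
x <- c("[the rain in spain]", "(the rain in spain)")

# The R literal "\\[" holds two characters: a backslash and a "[".
pat <- "\\[.*\\]"
n_chars <- nchar(pat)   # length of the pattern after R has parsed the literal
cat(pat, "\n")          # shows the text the regex engine receives

# A literal "[" or "]" can also be written as a one-character bracket
# expression, which needs no backslashes at all.
alt <- "[[].*[]]"
res_pat <- gsub(pat, "", x)
res_alt <- gsub(alt, "", x)

# fixed = TRUE switches off regex interpretation, so "[" is just a character.
res_fixed <- gsub("[", "", x, fixed = TRUE)
print(res_fixed)
```

The single-backslash form fails before any regex is compiled: "\[" is not a valid escape in an R string literal, which is what the messages in the original post are reporting.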

Best regards,
Chuck Taylor
TIBCO Spotfire
Seattle

-----Original Message-----

I want to eliminate all characters that fall within square brackets as well
as the brackets themselves, returning an "". ...

#what I want
[1] ""   "(the rain in spain)"

Mark Kimpel

2009-Aug-20 16:28 UTC

[R] help with regular expressions in R

Well, I guess I'm not quite there yet. What I gave earlier was a simplified
example, and did not accurately reflect the complexity of the task.

This is my real world example. As you can see, what I need to do is delete
an arbitrary number of characters, including brackets and parens enclosing
them, multiple times within the same string. Help?

myCharVec <- "medicare [link  220.30.05]  ssa (1-800-772-1213). 2008 [link 145.30.05] amounts  (2d) gross income (magi) here. (2e)"
myCharVec
myCharVec <- gsub('\\[.*\\]', '', myCharVec)
myCharVec
myCharVec <- gsub('\\(.*\\)', '', myCharVec)
myCharVec

#what I want
# "medicare  ssa . 2008  amounts gross income here."

> myCharVec <- "medicare [link  220.30.05]  ssa (1-800-772-1213). 2008 [link 145.30.05] amounts  (2d) gross income (magi) here. (2e)"
> myCharVec
[1] "medicare [link  220.30.05]  ssa (1-800-772-1213). 2008 [link 145.30.05] amounts  (2d) gross income (magi) here. (2e)"
> myCharVec <- gsub('\\[.*\\]', '', myCharVec)
> myCharVec
[1] "medicare  amounts  (2d) gross income (magi) here. (2e)"
> myCharVec <- gsub('\\(.*\\)', '', myCharVec)
> myCharVec
[1] "medicare  amounts  "
>
> #what I want
> # "medicare  ssa . 2008  amounts gross income here."

On Thu, Aug 20, 2009 at 11:39 AM, William Dunlap <wdunlap@tibco.com>
wrote:
>
> > -----Original Message-----
> > From: r-help-bounces@r-project.org
> > [mailto:r-help-bounces@r-project.org] On Behalf Of Mark Kimpel
> > Sent: Thursday, August 20, 2009 8:31 AM
> > To: r-help@r-project.org
> > Subject: [R] help with regular expressions in R
> > ...
> > myCharVec <- c("[the rain in spain]", "(the rain in spain)")
> > gsub('\\[*.\\]', '', myCharVec)
>
> Change the '*.' to '.*'.
>
> Your expression matches 0 or more left square brackets,
> followed by 1 character, followed by a right square bracket.
>
> "\\[.*\\]" matches a left square bracket, followed by 0 or more
> characters, followed by a right square bracket.
>
> Bill Dunlap
> TIBCO Software Inc - Spotfire Division
> wdunlap tibco.com
>

Mark Kimpel

2009-Aug-20 17:11 UTC

[R] help with regular expressions in R

Thanks guys. I've pulled my O'Reilly book and will begin reviewing it.

On Thu, Aug 20, 2009 at 12:37 PM, Phil Spector <spector@stat.berkeley.edu> wrote:
> Mark -
>   It looks like you're running into the greediness of regular expressions.
> When R sees ".*" it tries to find the longest match, which also grabs
> some of the stuff you want to keep.  You can either replace .* with something
> like [^\\])]* (i.e. zero or more of any character *except* "]" or ")" ),
> or use perl=TRUE, which allows the question mark ("?") to mean the shortest
> match instead of the longest.  Here's what I'd use:
>
>  gsub('[\\[(].*?[\\])]','',myCharVec,perl=TRUE)
>
> In English:  substitute the shortest string starting with "[" or "(" and
> ending with "]" or ")" with nothing.
>
>   Hope this helps.
>                                                     - Phil
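
To make the greedy versus shortest-match difference concrete, here is a sketch on the real-world string from earlier in the thread, joined onto one line. It is an illustration added to the archive, not part of Phil's reply. The negated class is written POSIX-style with "]" placed first inside the brackets, and the final whitespace tidy-up is an assumption about the spacing the poster wanted; all variable names are hypothetical.

```r
s <- "medicare [link  220.30.05]  ssa (1-800-772-1213). 2008 [link 145.30.05] amounts  (2d) gross income (magi) here. (2e)"

# Greedy: ".*" runs from the first "[" all the way to the last "]".
greedy <- gsub("\\[.*\\]", "", s)

# Shortest match with perl = TRUE, as suggested in the reply above.
lazy <- gsub("[\\[(].*?[\\])]", "", s, perl = TRUE)

# Negated bracket expression: "anything that is not a closer", so the match
# cannot run past the first "]" or ")". No lazy quantifier is needed.
negated <- gsub("[[(][^])]*[])]", "", s)

# Assumed tidy-up: collapse runs of spaces, then trim both ends.
tidy <- gsub(" +", " ", negated)
tidy <- gsub("^ +| +$", "", tidy)

print(greedy)
print(lazy)
print(tidy)
```

Both non-greedy variants pair any opener with any closer, so "[...)" would also be removed, and neither handles nested brackets. For input like the example, where groups are flat and well formed, they give the same result.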
