thr3ads.net - R help - [R] Regexp subexpression [Mar 2006]

Dieter Menne

2006-Mar-25 16:22 UTC

[R] Regexp subexpression

I can't get Perl-style subexpression matching translated to R. Following, for example,
B. Ripley's post at

http://finzi.psych.upenn.edu/R/Rhelp02a/archive/58984.html

I am using sub, but it looks like an ugly substitute. Assume I want to
extract the first alpha part and the first numeric part, but only if they
are in sequence.

Do I really have to call sub twice, first extracting the first variable,
then the second? The third example should return nothing, because it's
inverted, but it returns the whole string. I know I could check that
separately, but is there no better way?

  patid=c("ALAN334","AzD44","44AZD")
  txt =sub("([[:alpha:]]+)([[:digit:]])+","\\1",patid)
  num =sub("([[:alpha:]]+)([[:digit:]])+","\\2",patid)

It would be nice if the following data frame would be returned:

txt     num
ALAN    334
AzD     44
NA      NA (or "", "", but not so nice)

Dieter

Dieter Menne

2006-Mar-25 16:32 UTC

Dieter Menne <dieter.menne <at> menne-biomed.de> writes:
> 
>   patid=c("ALAN334","AzD44","44AZD")
>   txt =sub("([[:alpha:]]+)([[:digit:]])+","\\1",patid)
>   num =sub("([[:alpha:]]+)([[:digit:]])+","\\2",patid)
> 
Sorry, a ")" was in the wrong place. Here is the corrected version.

   patid=c("ALAN334","AzD44","44AZD")
   txt =sub("([[:alpha:]]+)([[:digit:]]+)","\\1",patid)
   num =sub("([[:alpha:]]+)([[:digit:]]+)","\\2",patid)


Dieter
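
As an illustration of the behaviour described above, the corrected calls can be run as they stand. This is only a sketch for checking the claim that the inverted ID comes back unchanged; the variable pat is introduced here purely to avoid repeating the pattern.

```r
patid <- c("ALAN334", "AzD44", "44AZD")
pat <- "([[:alpha:]]+)([[:digit:]]+)"

# sub() replaces the matched part with the requested group and leaves
# a string untouched when the pattern does not match at all.
txt <- sub(pat, "\\1", patid)
num <- sub(pat, "\\2", patid)

print(data.frame(patid, txt, num))
# "44AZD" contains no letters followed by digits, so both txt and num
# hold the unchanged input "44AZD" rather than NA or "".
```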

Gabor Grothendieck

2006-Mar-25 17:12 UTC

In the third case there is no match so there are no
substitutions.  Handle it separately:

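# anchor the pattern so the letters must start the string
pat = "^([[:alpha:]]+)([[:digit:]]+)"
result <- cbind(txt = sub(pat, "\\1", patid), num = sub(pat, "\\2", patid))
# regexpr() returns -1 for strings that do not match
result[regexpr(pat, patid) < 0,] <- NA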
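
A self-contained run of this approach, sketched under the assumption that the goal is the data frame from the original question. The names no_match and result_df, and the conversion of num to integer, are additions for illustration and not part of the posted code.

```r
patid <- c("ALAN334", "AzD44", "44AZD")
pat <- "^([[:alpha:]]+)([[:digit:]]+)"

# regexpr() gives the start position of the match, or -1 when there is none.
no_match <- regexpr(pat, patid) < 0

result <- cbind(txt = sub(pat, "\\1", patid), num = sub(pat, "\\2", patid))
result[no_match, ] <- NA

# Convert the character matrix to a data frame; making num an integer
# column is an optional extra step.
result_df <- as.data.frame(result, stringsAsFactors = FALSE)
result_df$num <- as.integer(result_df$num)
print(result_df)
```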



Dieter Menne

2006-Mar-25 17:24 UTC

Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
> 
> In the third case there is no match so there are no
> substitutions.  Handle it separately:
> 
> pat = "^([[:alpha:]]+)([[:digit:]]+)"
> result <- cbind(txt = sub(pat, "\\1", patid), num = sub(pat, "\\2", patid))
> result[regexpr(pat, patid) < 0,] <- NA
> 
Thanks, Gabor, that is something like a compressed version of mine. My main
question was whether I was missing something obvious, because I found the
double sub messy. I am surprised that there is no function like

pat = "^([[:alpha:]]+)([[:digit:]]+)"
mygrep(pat, patid)

returning a list with all subexpressions.

Dieter
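
Later versions of base R provide regexec() and regmatches(), which come close to the function wished for here. The following is a minimal sketch, assuming an R installation recent enough to include them; mygrep is the hypothetical name from the post, not an existing R function.

```r
# Hypothetical helper named after the wished-for mygrep() in the post.
mygrep <- function(pattern, x) {
  # regexec() records the positions of the full match and of every group;
  # regmatches() turns those positions into substrings.
  hits <- regmatches(x, regexec(pattern, x))
  # Drop the first element (the full match) so only the groups remain.
  # Strings that do not match yield character(0).
  lapply(hits, function(h) h[-1])
}

patid <- c("ALAN334", "AzD44", "44AZD")
pat <- "^([[:alpha:]]+)([[:digit:]]+)"
groups <- mygrep(pat, patid)
str(groups)
```

Returning a list keeps the non-matching case visible as an empty vector, at the cost of needing a further step to get a rectangular result.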

Gabor Grothendieck

2006-Mar-25 17:38 UTC

We could use sapply to reduce it slightly:

result <- sapply(sprintf("\\%d", 1:2), sub, pattern = pat, x = patid)
result[regexpr(pat, patid) < 0,] <- NA
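
For readers wondering why this works, here is the same call spelled out step by step. It is a sketch only; the variable refs and the renaming of the columns are additions for illustration.

```r
patid <- c("ALAN334", "AzD44", "44AZD")
pat <- "^([[:alpha:]]+)([[:digit:]]+)"

# sprintf() builds the two replacement strings "\\1" and "\\2".
refs <- sprintf("\\%d", 1:2)

# pattern and x are passed by name, so each element of refs fills the
# remaining positional argument of sub(), which is `replacement`.
result <- sapply(refs, sub, pattern = pat, x = patid)

# The columns are named after the replacement strings; renaming is optional.
colnames(result) <- c("txt", "num")
result[regexpr(pat, patid) < 0, ] <- NA
print(result)
```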



Dieter Menne

2006-Mar-25 18:04 UTC

Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
> 
> We could use sapply to reduce it slightly:
> 
> result <- sapply(sprintf("\\%d", 1:2), sub, pattern = pat, x = patid)
> result[regexpr(pat, patid) < 0,] <- NA
> 
Looks like we should make a generalized wrapper for it.

Dieter
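
One possible shape for such a wrapper is sketched below. It is not code from the thread: the name capture_groups and its col_names argument are hypothetical, and it relies on regexec() and regmatches() from later versions of base R instead of repeated sub() calls. It assumes the pattern contains exactly as many groups as there are column names; otherwise every row comes back as NA.

```r
# Hypothetical wrapper; the name and the col_names argument are illustrative.
capture_groups <- function(pattern, x, col_names) {
  n <- length(col_names)
  hits <- regmatches(x, regexec(pattern, x))
  rows <- lapply(hits, function(h) {
    # A successful match holds the full match plus n groups.
    if (length(h) == n + 1) h[-1] else rep(NA_character_, n)
  })
  # One row per input string, one column per capture group.
  out <- matrix(as.character(unlist(rows)), ncol = n, byrow = TRUE,
                dimnames = list(NULL, col_names))
  as.data.frame(out, stringsAsFactors = FALSE)
}

patid <- c("ALAN334", "AzD44", "44AZD")
pat <- "^([[:alpha:]]+)([[:digit:]]+)"
res <- capture_groups(pat, patid, c("txt", "num"))
print(res)
```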

Duncan Temple Lang

2006-Mar-25 18:10 UTC

Dieter Menne wrote:
> Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:
>
>> In the third case there is no match so there are no
>> substitutions.  Handle it separately:
>>
>> pat = "^([[:alpha:]]+)([[:digit:]]+)"
>> result <- cbind(txt = sub(pat, "\\1", patid), num = sub(pat, "\\2", patid))
>> result[regexpr(pat, patid) < 0,] <- NA
>
> Thanks, Gabor, that is something like a compressed version of mine. My main
> question was whether I was missing something obvious, because I found the
> double sub messy. I am surprised that there is no function like
>
> pat = "^([[:alpha:]]+)([[:digit:]]+)"
> mygrep(pat, patid)
>
> returning a list with all subexpressions.

I was also surprised by that a long time ago, and I have code that I
will get around to putting into R to allow the matching subexpressions
to be returned as a character vector, which avoids having to do silly
tricks with strsplit() to break them up.
--
Duncan Temple Lang                    duncan at wald.ucdavis.edu
Department of Statistics              work:  (530) 752-4782
4210 Mathematical Sciences Building   fax:   (530) 752-7099
One Shields Ave.
University of California at Davis
Davis,
CA 95616,
USA

Gabor Grothendieck

2006-Mar-25 19:24 UTC

Here is one more variation. This time we add an alternative, .*,
to soak up the entire string whenever the main pattern would
otherwise have failed, so that the substitution always occurs and
gives us empty strings instead of the same string back:
> pat = "^([[:alpha:]]+)([[:digit:]]+)|.*"
> sapply(sprintf("\\%d", 1:2), sub, pattern = pat, x = patid)
     \\1    \\2
[1,] "ALAN" "334"
[2,] "AzD"  "44"
[3,] ""     ""

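If NAs are needed, note that this pat matches every string, so the
earlier test result[regexpr(pat, patid) < 0,] <- NA only works with the
previous pattern (the one without the |.* alternative). With this pattern,
assign the matrix to result and use result[result == ""] <- NA instead.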
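
The variation can be checked with a short, self-contained run. This is a sketch under the default regular expression engine; the name pat_all and the column renaming are additions for illustration.

```r
patid <- c("ALAN334", "AzD44", "44AZD")
pat_all <- "^([[:alpha:]]+)([[:digit:]]+)|.*"

result <- sapply(sprintf("\\%d", 1:2), sub, pattern = pat_all, x = patid)
colnames(result) <- c("txt", "num")

# Because of the .* alternative this pattern matches every string, so
# regexpr() cannot be used to detect the failures here.
stopifnot(all(regexpr(pat_all, patid) > 0))

# Both groups require at least one character, so an empty string can only
# come from the .* branch and can safely be turned into NA.
result[result == ""] <- NA
print(result)
```

The trade-off is that the pattern itself no longer distinguishes matches from failures; the empty strings carry that information instead.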

