I can't get the Perl subexpression translated to R. Following, for example, B. Ripley's

http://finzi.psych.upenn.edu/R/Rhelp02a/archive/58984.html

I am using sub, but it looks like an ugly substitute. Assume I want to extract the first alpha part and the first numeric part, but only if they are in sequence.

Do I really have to use the sub twice, first extracting the first variable, then the second? The third example should return nothing, because it's inverted, but it returns the whole string. I know I could check that separately, but is there no better way?

    patid=c("ALAN334","AzD44","44AZD")
    txt =sub("([[:alpha:]]+)([[:digit:]])+","\\1",patid)
    num =sub("([[:alpha:]]+)([[:digit:]])+","\\2",patid)

It would be nice if the following data frame would be returned:

    txt   num
    ALAN  334
    AzD   44
    NA    NA    (or "", "", but not so nice)

Dieter
Dieter Menne <dieter.menne <at> menne-biomed.de> writes:

> patid=c("ALAN334","AzD44","44AZD")
> txt =sub("([[:alpha:]]+)([[:digit:]])+","\\1",patid)
> num =sub("([[:alpha:]]+)([[:digit:]])+","\\2",patid)

Sorry, a ")" was at the wrong place. Here is the corrected version.

    patid=c("ALAN334","AzD44","44AZD")
    txt =sub("([[:alpha:]]+)([[:digit:]]+)","\\1",patid)
    num =sub("([[:alpha:]]+)([[:digit:]]+)","\\2",patid)

Dieter
In the third case there is no match, so there are no substitutions and sub returns the input unchanged. Handle it separately:

    pat = "^([[:alpha:]]+)([[:digit:]]+)"
    result <- cbind(txt = sub(pat, "\\1", patid), num = sub(pat, "\\2", patid))
    result[regexpr(pat, patid) < 0,] <- NA

On 3/25/06, Dieter Menne <dieter.menne at menne-biomed.de> wrote:
> Do I really have to use the sub twice, first extracting the first variable,
> then the second? The third example should return nothing, because it's
> inverted, but it returns the whole string. I know I could check that
> separately, but is there no better way?
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
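For readers who want to run the suggestion end to end, here is a minimal sketch that builds the data frame asked for in the original post. It assumes the corrected pattern and the sample `patid` vector from the thread; using `data.frame` instead of `cbind` is my own choice so that the result prints with `txt` and `num` columns.

```r
# Sample IDs from the thread; the third has digits first and must not match.
patid <- c("ALAN334", "AzD44", "44AZD")
pat <- "^([[:alpha:]]+)([[:digit:]]+)"

# sub() returns its input unchanged when nothing matches, so extract first ...
result <- data.frame(txt = sub(pat, "\\1", patid),
                     num = sub(pat, "\\2", patid),
                     stringsAsFactors = FALSE)

# ... then set the rows where the pattern did not match at all to NA.
result[regexpr(pat, patid) < 0, ] <- NA
print(result)
```

Note that `sub` only replaces the matched part of each string, so any text after the digits would be carried over into both columns; the sample IDs have no such trailing text.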
Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:

> In the third case there is no match so there are no
> substitutions. Handle it separately:
>
> pat = "^([[:alpha:]]+)([[:digit:]]+)"
> result <- cbind(txt = sub(pat, "\\1", patid), num = sub(pat, "\\2", patid))
> result[regexpr(pat, patid) < 0,] <- NA

Thanks, Gabor, that is something like a compressed version of mine. My main question was whether I was missing something obvious, because I found the double sub messy. I am surprised that there is not

    pat = "^([[:alpha:]]+)([[:digit:]]+)"
    mygrep(pat, patid)

returning a list with all subexpressions.

Dieter
We could use sapply to reduce it slightly:

    result <- sapply(sprintf("\\%d", 1:2), sub, pattern = pat, x = patid)
    result[regexpr(pat, patid) < 0,] <- NA
Gabor Grothendieck <ggrothendieck <at> gmail.com> writes:

> We could use sapply to reduce it slightly:
>
> result <- sapply(sprintf("\\%d", 1:2), sub, pattern = pat, x = patid)
> result[regexpr(pat, patid) < 0,] <- NA

Looks like we should make a generalized wrapper for it.

Dieter
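One possible shape for such a wrapper, as a rough sketch only: the name `extract_groups` and its argument `n` (the number of parenthesised groups) are hypothetical and do not come from the thread. It simply packages the sapply-style loop over backreferences with the NA step.

```r
# Hypothetical helper (not part of R): one column per parenthesised group,
# NA rows where `pattern` does not match. `n` is the number of groups (1-9).
extract_groups <- function(pattern, x, n) {
  # Call sub() once per backreference "\\1", "\\2", ...
  out <- vapply(sprintf("\\%d", seq_len(n)),
                function(ref) sub(pattern, ref, x),
                character(length(x)))
  # Force a length(x)-by-n matrix, also when x has a single element.
  out <- matrix(out, nrow = length(x))
  # sub() leaves non-matching strings untouched, so mark them explicitly.
  out[regexpr(pattern, x) < 0, ] <- NA
  out
}

patid <- c("ALAN334", "AzD44", "44AZD")
res <- extract_groups("^([[:alpha:]]+)([[:digit:]]+)", patid, 2)
print(res)
```

The sketch inherits a limitation of `sub`: only the matched part of each string is replaced, so the pattern is assumed to cover the whole string.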
Dieter Menne wrote:

> Thanks, Gabor, that is something like a compressed version of mine. My main
> question was whether I was missing something obvious, because I found the
> double sub messy. I am surprised that there is not
>
> pat = "^([[:alpha:]]+)([[:digit:]]+)"
> mygrep(pat, patid)
>
> returning a list with all subexpressions.

I was also surprised by that a long time back, and I have code that I will get around to putting into R to allow the matching subexpressions to be returned as a character vector, to avoid having to do silly tricks with strsplit() to break them up.

--
Duncan Temple Lang                    duncan at wald.ucdavis.edu
Department of Statistics              work: (530) 752-4782
4210 Mathematical Sciences Building   fax: (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA
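For comparison, later versions of base R provide `regexec()` and `regmatches()`, which together return the matched subexpressions directly. The sketch below is not the code Duncan refers to; it only shows how those two functions can produce the list the thread is asking for, with the NA padding and the column names `txt` and `num` added as my own assumptions.

```r
patid <- c("ALAN334", "AzD44", "44AZD")
pat <- "^([[:alpha:]]+)([[:digit:]]+)"

# regexec() records the positions of the full match and of each group;
# regmatches() turns them into strings and gives character(0) for no match.
m <- regmatches(patid, regexec(pat, patid))

# Drop the full match (element 1) and pad non-matching entries with NA.
groups <- t(vapply(m,
                   function(g) if (length(g)) g[-1] else rep(NA_character_, 2),
                   character(2)))
colnames(groups) <- c("txt", "num")
print(groups)
```

Unlike the `sub` approach, this extracts the groups themselves, so unmatched text around the match does not leak into the result.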
Here is one more variation. This time we add an alternative, .*, to soak up the entire string when the match would otherwise have failed, so that the substitution occurs regardless and gives us empty strings instead of the same string back:

    > pat = "^([[:alpha:]]+)([[:digit:]]+)|.*"
    > sapply(sprintf("\\%d", 1:2), sub, pattern = pat, x = patid)
         \\1    \\2
    [1,] "ALAN" "334"
    [2,] "AzD"  "44"
    [3,] ""     ""

If NAs are needed, note that this pattern now matches every string, so the test has to use the original pattern without the |.* alternative:

    result[regexpr("^([[:alpha:]]+)([[:digit:]]+)", patid) < 0,] <- NA
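A small runnable sketch of this variation, including the NA step. The variable names `strict` and `lenient` are mine, not from the thread. The point it illustrates is that a pattern ending in `|.*` matches every string, so `regexpr` on that pattern can never flag a failure; the NA test below therefore uses the pattern without the alternative.

```r
patid <- c("ALAN334", "AzD44", "44AZD")
strict <- "^([[:alpha:]]+)([[:digit:]]+)"
lenient <- paste0(strict, "|.*")   # the variation with the catch-all branch

result <- sapply(sprintf("\\%d", 1:2), sub, pattern = lenient, x = patid)
colnames(result) <- c("txt", "num")

# The lenient pattern always matches, so it cannot be used to detect failures.
always_matches <- all(regexpr(lenient, patid) >= 0)

# Use the strict pattern to decide which rows become NA.
result[regexpr(strict, patid) < 0, ] <- NA
print(result)
```

An alternative, if empty strings can never be legitimate values, would be to replace `""` entries with NA directly.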