Bert Gunter
2016-Sep-05 05:41 UTC
[R] element wise pattern recognition and string substitution
Well, he did provide an example, and...

> z <- c('TX.WT.CUT.mean','mg.tx.cv')
> sub("^.+?\\.(.+)\\.[^.]+$","\\1",z)
[1] "WT.CUT" "tx"

## seems to do what was requested.

Jeff would have to amplify on his initial statement however: do you mean that separate patterns can always be combined via "|" ? Or something deeper?

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

On Sun, Sep 4, 2016 at 9:30 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
> Your opening assertion is false.
>
> Provide a reproducible example and someone will demonstrate.
> --
> Sent from my phone. Please excuse my brevity.
>
> On September 4, 2016 9:06:59 PM PDT, Jun Shen <jun.shen.ut at gmail.com> wrote:
>> Dear list,
>>
>> I have a vector of strings that cannot be described by one pattern. So
>> let's say I construct a vector of patterns in the same length as the
>> vector of strings, can I do the element wise pattern recognition and
>> string substitution.
>>
>> For example,
>>
>> pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
>> pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
>>
>> patterns <- c(pattern1,pattern2)
>> strings <- c('TX.WT.CUT.mean','mg.tx.cv')
>>
>> Say I want to extract "WT.CUT" from the first string and "tx" from the
>> second string. If I do
>>
>> sub(patterns, '\\2', strings), only the first pattern will be used.
>>
>> looping the patterns doesn't work the way I want. Appreciate any
>> comments.
>> Thanks.
>>
>> Jun
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
Jeff Newmiller
2016-Sep-05 15:37 UTC
[R] element wise pattern recognition and string substitution
Yes, sorry I did not look closer... regex can match any finite language, so there are no data sets you can feed to R that cannot be matched. [1] You may find it hard to see the pattern, or you may want to build the pattern programmatically to alleviate tedium for yourself, but regexes are not the constraint.

[1] http://www.cs.nuim.ie/~jpower/Courses/Previous/parsing/node18.html
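To make the "build the pattern programmatically" remark concrete, here is a minimal sketch that is not from the thread. It assumes the finite set of strings is known up front and that the dot is the only regex metacharacter in the data; the names `literal`, `pat` and `hits` are made up for illustration.

```r
# A finite set of strings can always be matched by one anchored alternation.
strings <- c('TX.WT.CUT.mean', 'mg.tx.cv')

# Assumption: the only metacharacter in the data is ".", so wrapping each
# dot in a bracket expression is enough to make the strings literal.
literal <- gsub(".", "[.]", strings, fixed = TRUE)

# Join the literal alternatives with "|" and anchor the whole thing.
pat <- paste0("^(", paste(literal, collapse = "|"), ")$")

# The two known strings match; an unrelated string does not.
hits <- grepl(pat, c(strings, "other.string"))
print(pat)
print(hits)
```

This only shows that matching is always possible; it says nothing yet about extracting a particular piece of each string.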
Bert Gunter
2016-Sep-05 16:01 UTC
[R] element wise pattern recognition and string substitution
Jeff:

It is not obvious to me that the ability to *match* an arbitrary pattern (including one of several different ones via "|", per the link you included) implies that sub() and friends can extract it, e.g. via the \N backreference construct or otherwise. I would appreciate it if you or someone else could show me how this can be done.

Cheers,
Bert
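One possible answer to the question above, offered as a sketch rather than as anything stated in the thread: give each alternative its own capture group and put every group in the replacement. Only the alternative that matched fills its group; in R's sub() the others contribute an empty string. The patterns `alt4` and `alt3` are anchored variants of Jun's two patterns written for this illustration, and they tell the strings apart only by the number of dot-separated fields.

```r
strings <- c('TX.WT.CUT.mean', 'mg.tx.cv')

# Hypothetical alternatives: four fields (keep the middle two) or
# three fields (keep the middle one). Both are fully anchored so that
# at most one of them can match a given string.
alt4 <- "^[^.]*\\.([^.]*\\.[^.]*)\\.[^.]*$"
alt3 <- "^[^.]*\\.([^.]*)\\.[^.]*$"
combined <- paste(alt4, alt3, sep = "|")

# Group 1 belongs to alt4 and group 2 to alt3; the unused group is empty,
# so "\\1\\2" yields whichever piece was actually captured.
result <- sub(combined, "\\1\\2", strings)
print(result)
```

The approach relies on the alternatives being mutually exclusive. If two alternatives could match the same string, the result depends on which one the regex engine prefers.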
Jun Shen
2016-Sep-05 16:56 UTC
[R] element wise pattern recognition and string substitution
Thanks for the reply, Bert.

Your solution solves the example. I actually have a more general situation: each string is formed by concatenating the values of several variables with dots, but the values themselves may contain dots, and the number of dots is not consistent across the values of a variable. So I am thinking of defining a vector of patterns, one per string, and hoping to find a way to apply each pattern to the corresponding element of the string vector. The only way I can think of is a "for" loop, which can be slow. This also happens inside a function I am writing, so I wonder whether there is a more efficient way. Thanks a lot.

Jun
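The element-wise application described here can be sketched with mapply(), which pairs the i-th pattern with the i-th string. This is an illustration using the patterns and strings from the original post, not a solution proposed in the thread; the name `extracted` is made up.

```r
pattern1 <- "([^.]*)\\.([^.]*\\.[^.]*)\\.(.*)"
pattern2 <- "([^.]*)\\.([^.]*)\\.(.*)"
patterns <- c(pattern1, pattern2)
strings <- c('TX.WT.CUT.mean', 'mg.tx.cv')

# sub() uses only the first element of a pattern vector, so apply it
# once per (pattern, string) pair instead.
extracted <- mapply(function(p, s) sub(p, "\\2", s),
                    patterns, strings, USE.NAMES = FALSE)
print(extracted)
```

mapply() is still a loop internally, so it is tidier than an explicit "for" loop but not necessarily faster. If only a few distinct patterns occur, calling sub() once per distinct pattern on the matching subset of strings is likely to do less work.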
Bert Gunter
2016-Sep-05 19:11 UTC
[R] element wise pattern recognition and string substitution
Jun: You need to provide a clear specification via regular expressions of the patterns you wish to match -- at least for me to decipher it. Others may be smarter than I, though...

Jeff: Thanks. I have now convinced myself that it can be done (a "proof" of sorts):

If pat1, pat2, ..., patm are m different patterns (in a vector of patterns) to be matched in a vector of n strings, where only one of the patterns will match in any string, then use paste() (probably via do.call()) or otherwise to paste them together separated by "|" to form the concatenated pattern, pat. Then

sub(paste0("^.*(",pat, ").*$"),"\\1",thevector)

should extract the matching text in each string (perhaps with a little fiddling due to precedence rules; note the leading "^[^b]*" rather than "^.*" in the example); e.g.

> z <-c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")
> pat1 <- "a+\\.*a+"
> pat2 <-"b+\\.*b+"
> pat <- c(pat1,pat2)
> pat <- do.call(paste,c(as.list(pat), sep="|"))
> pat
[1] "a+\\.*a+|b+\\.*b+"
> sub(paste0("^[^b]*(",pat,").*$"), "\\1", z)
[1] "a.a" "bb" "b.bbb"

Cheers,
Bert
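As a possible alternative to wrapping the combined pattern in "^...(" and ").*$", the same example can be run through regexpr() and regmatches(), which return the matched text directly. This is a sketch for comparison, not something suggested in the thread; the names `m` and `found` are made up.

```r
z <- c(".fg.h.g.a.a", "bb..dd.ef.tgf.", "foo...b.bbb.tgy")

# Same two patterns as in the example above, joined with "|".
pat <- paste(c("a+\\.*a+", "b+\\.*b+"), collapse = "|")

# regexpr() locates the first match in each string;
# regmatches() pulls out the matched substrings.
m <- regexpr(pat, z)
found <- regmatches(z, m)
print(found)
```

No leading or trailing context has to be matched, so the question of how greedy the prefix is does not arise. One caveat: regmatches() drops elements that have no match, so the result can be shorter than the input vector.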
Ista Zahn
2016-Sep-07 13:30 UTC
[R] element wise pattern recognition and string substitution
On Mon, Sep 5, 2016 at 12:56 PM, Jun Shen <jun.shen.ut at gmail.com> wrote:
> Thanks for the reply, Bert.
>
> Your solution solves the example. I actually have a more general situation
> where I have this dot concatenated string from multiple variables. The
> problem is those variables may have values with dots in there.

If you concatenated the variables yourself you could go back a step and use another separator, i.e., one that doesn't appear in the original variables. The separator does not need to be a single character, e.g., "__.__" would be fine. This will make later parsing with regular expressions much easier.
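A minimal sketch of the separator suggestion, assuming the strings are built with paste() from three variables. The variable names and values are hypothetical, chosen to mirror the strings in the original post.

```r
# Hypothetical source variables; the second one contains a dot in one value.
var1 <- c("TX", "mg")
var2 <- c("WT.CUT", "tx")
var3 <- c("mean", "cv")

# Use a separator that cannot occur inside any value.
sep <- "__.__"
joined <- paste(var1, var2, var3, sep = sep)

# Splitting on the fixed separator recovers the fields without any
# pattern that depends on how many dots a value contains.
parts <- strsplit(joined, sep, fixed = TRUE)
second <- vapply(parts, `[`, character(1), 2)
print(second)
```

With fixed = TRUE the separator is treated as a literal string, so the dot inside "__.__" needs no escaping.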