Giulio Di Giovanni
2010-Jun-28 23:17 UTC
[R] Identify and extract a whole word of variable length using regular expressions
Hi everybody,

I'm quite weak with regular expressions, and I need some help. I have strings of the type

> a
[1,] "ppe46 Rv3018c MT3098/MT3101 MTV012.32c"
[2,] "ppe16 Rv1135c MT1168"
[3,] "ppe21 Rv1548c MT1599 MTCY48.17"
[4,] "ppe12 Rv0755c MT0779"
[5,] "PE_PGRS51 Rv3367"
[etc., for several hundred rows]

I want to have instead only:

[1,] "Rv3018c"
[2,] "Rv1135c"
[3,] "Rv1548c"
[4,] "Rv0755c"
[5,] "Rv3367"

Besides these examples, the only thing I know for sure is that the "magic" substrings I want to extract are whole words, all starting with "Rv". So "Rvxxxxx", preceded and followed by a space, and of variable length. I don't have any other information.

Do you know how to pick them? I checked for their presence using grep and the expression "\\<Rv*\\>", I tried some string functions from Hmisc, and I also tried going the other way, substituting empty strings for everything except the Rv word, but I didn't get very far.

Could you please give me some suggestions?

Thanks a lot,
Giulio
Phil Spector
2010-Jun-28 23:22 UTC
[R] Identify and extract a whole word of variable length using regular expressions
Giulio -

This

> sub('^.* ?(Rv[^ ]*) ?.*$','\\1',a)
[1] "Rv3018c" "Rv1135c" "Rv1548c" "Rv0755c" "Rv3367"

seems to do what you want.

- Phil Spector
Statistical Computing Facility
Department of Statistics
UC Berkeley
spector at stat.berkeley.edu
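For anyone who wants to try this without the original data, here is a minimal self-contained sketch. It assumes `a` is a plain character vector holding the five sample strings from the question (the original appears to be a one-column matrix), and the names `question_hits` and `rv` are made up for illustration. It also shows why the pattern from the question finds nothing: in `Rv*` the `*` applies only to the `v`, so the pattern asks for a whole word made of "R" followed by zero or more "v" characters, not "Rv followed by anything".

```r
# Sample data copied from the question, as a plain character vector (assumption)
a <- c("ppe46 Rv3018c MT3098/MT3101 MTV012.32c",
       "ppe16 Rv1135c MT1168",
       "ppe21 Rv1548c MT1599 MTCY48.17",
       "ppe12 Rv0755c MT0779",
       "PE_PGRS51 Rv3367")

# The pattern from the question: "R", then zero or more "v", as a whole word.
# None of the sample strings contains such a word, so every element is FALSE.
question_hits <- grepl("\\<Rv*\\>", a)

# The sub() approach: the pattern spans the whole string, and the replacement
# keeps only the captured group, i.e. "Rv" plus the non-space characters after it.
rv <- sub("^.* ?(Rv[^ ]*) ?.*$", "\\1", a)
print(rv)
```

One thing to keep in mind: sub() returns a string unchanged when the pattern does not match, so a row with no "Rv" word would come back in full instead of being dropped or set to NA.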
Gabor Grothendieck
2010-Jun-28 23:28 UTC
[R] Identify and extract a whole word of variable length using regular expressions
You can use strapply in the gsubfn package to pick out strings by content. The regular expression says: match a word boundary, followed by R, followed by v, followed by zero or more non-space characters:

library(gsubfn)
strapply(a, "\\bRv\\S*", c, perl = TRUE, simplify = TRUE)
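If installing gsubfn is not an option, the same regular expression can probably be used with base R alone. The sketch below swaps in regexpr() and regmatches() for strapply; that is a different technique from the one suggested above, not something taken from the thread. It assumes `a` is a plain character vector with the sample strings from the question, and `rv_words` and `mixed` are illustrative names only.

```r
# Sample data copied from the question, as a plain character vector (assumption)
a <- c("ppe46 Rv3018c MT3098/MT3101 MTV012.32c",
       "ppe16 Rv1135c MT1168",
       "ppe21 Rv1548c MT1599 MTCY48.17",
       "ppe12 Rv0755c MT0779",
       "PE_PGRS51 Rv3367")

# regexpr() locates the first match in each string;
# regmatches() then extracts the matched text.
# perl = TRUE enables the \b (word boundary) and \S (non-space) escapes.
rv_words <- regmatches(a, regexpr("\\bRv\\S*", a, perl = TRUE))
print(rv_words)

# Strings without a match are dropped from the result, not kept as NA
mixed <- regmatches(c("no match here", "x Rv1 y"),
                    regexpr("\\bRv\\S*", c("no match here", "x Rv1 y"), perl = TRUE))
print(mixed)
```

Because non-matching strings are dropped, the result can be shorter than the input. If the positions need to line up with the original rows, check for matches first with grepl() using the same pattern.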