Giulio Di Giovanni
2009-Sep-16 13:53 UTC
[R] How to extract a specific substring from a string (regular expressions) ? See details inside
Hi all,

I have thousands of strings like these:

"1159_1; YP_177963; PPE FAMILY PROTEIN"
"1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575"
"1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE"

and various others.

I want to extract the protein code from each string (in this example: YP_177963, CAA15575, CAA17111). I have found only one criterion that identifies the protein codes in ALL my strings. I need a sequence of characters selected this way:

start: the first alphabetic capital letter followed after three characters by a digit
end: the last following digit before a non-digit character, or nothing

Tricky, isn't it? I'm not an expert, and I have played a lot with regular expressions and the sub() command without much success. I also tried substring.location in the Hmisc package (but there I don't know how to use regular expressions). Maybe there are other, more useful functions, or maybe it is just a matter of using regular expressions in a better way.

Can anybody help me?

Thanks a lot in advance.
Henrique Dallazuanna
2009-Sep-16 14:14 UTC
[R] How to extract a specific substring from a string (regular expressions) ? See details inside
Try this:

library(gsubfn)
strapply(x, "[A-Z]{3}[0-9]+")

--
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
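For readers without gsubfn installed, the same pattern can be tried in base R with regexpr() and regmatches(). This is only a sketch: the vector x is an assumption built from the three example strings in the question. Note that [A-Z]{3} requires three capital letters in a row, so this pattern does not pick up YP_177963, whose third character is an underscore.

```r
# Assumed input: the three example strings from the original question.
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE")

# regexpr() locates the first match in each string (-1 when there is none).
m <- regexpr("[A-Z]{3}[0-9]+", x)

# regmatches() returns only the matched substrings, so strings without
# a match are dropped from the result.
codes <- regmatches(x, m)
print(codes)
```

Because non-matching strings are dropped, the result can be shorter than x; keep track of m if the positions matter.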
jim holtman
2009-Sep-16 14:15 UTC
[R] How to extract a specific substring from a string (regular expressions) ? See details inside
This should do it for you:

> pat <- ".*(\\b[A-Z]..[0-9]+).*"
> grep(pat, x)
[1] 1 3 5
> sub(pat, '\\1', x)
[1] "YP_177963" ""          "CAA15575"  ""          "CAA17111"

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
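One thing to keep in mind with the sub() approach is that sub() returns its input unchanged when the pattern does not match, so a non-empty string without a code would come back whole. A possible variant, sketched here with a hypothetical helper extract_code() (not from the thread), returns NA for such strings instead. It assumes perl = TRUE for the \b word boundary and takes the first match in each string, whereas the greedy leading .* in the sub() pattern picks the last one.

```r
# Hypothetical helper (not from the thread): extract the first protein-like
# code from each string, returning NA where nothing matches.
extract_code <- function(x, pattern = "\\b[A-Z]..[0-9]+") {
  out <- rep(NA_character_, length(x))
  m <- regexpr(pattern, x, perl = TRUE)  # position of first match, -1 if none
  hit <- m != -1
  out[hit] <- regmatches(x, m)           # one substring per matching element
  out
}

# Assumed input: the question's strings with empty elements in between,
# which is what the indices 1 3 5 returned by grep() suggest.
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE")

codes <- extract_code(x)
print(codes)
```

The result has the same length as x, which makes it easy to store next to the original strings in a data frame.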
Gabor Grothendieck
2009-Sep-16 14:47 UTC
[R] How to extract a specific substring from a string (regular expressions) ? See details inside
Assuming the rule is an upper-case alphabetic character followed by two other characters followed by a string of digits, try this:

> library(gsubfn)
> strapply(x, "[A-Z][^ ][^ ][0-9]+")
[[1]]
[1] "YP_177963"

[[2]]
[1] "CAA15575"

[[3]]
[1] "CAA17111"

If you prefer the output as one long vector of strings, try this:

> strapply(x, "[A-Z][^ ][^ ][0-9]+", simplify = c)
[1] "YP_177963" "CAA15575"  "CAA17111"

If the string that denotes a protein can be part of a word which itself does not denote a protein, then we will need something like this:

> strapply(x, "\\b[A-Z][^ ][^ ][0-9]+\\b", perl = TRUE)
[[1]]
[1] "YP_177963"

[[2]]
[1] "CAA15575"

[[3]]
[1] "CAA17111"

However, I would expect this second solution using perl's \b to be much slower because the first one uses tcl code underneath whereas the second uses R code.

See http://gsubfn.googlecode.com for more.
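The word-boundary pattern can also be tried without gsubfn. The following is a rough base R sketch, not the gsubfn implementation: gregexpr() with perl = TRUE finds every match in each string, and regmatches() returns them as a list, similar in shape to the strapply() output. The vector x is assumed to hold the three example strings from the question.

```r
# Assumed input: the three example strings from the original question.
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE")

# Capital letter, two non-space characters, then digits, with word
# boundaries on both sides (perl = TRUE enables PCRE's \b).
pat <- "\\b[A-Z][^ ][^ ][0-9]+\\b"

# One list element per input string, holding all matches found in it.
matches <- regmatches(x, gregexpr(pat, x, perl = TRUE))

# Flatten to a single character vector, analogous to simplify = c.
codes <- unlist(matches)
print(codes)
```

Strings with no match contribute an empty character vector to the list, so they vanish after unlist(); inspect lengths(matches) if that matters.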