thr3ads.net - R help - [R] How to extract a specific substring from a string (regular expressions) ? See details inside [Sep 2009]

Giulio Di Giovanni

2009-Sep-16 13:53 UTC

[R] How to extract a specific substring from a string (regular expressions) ? See details inside

Hi all,

I have thousands of strings like these ones:

 

"1159_1; YP_177963; PPE FAMILY PROTEIN"

"1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575"

"1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE"

 

and various others..

 

I'd like to extract the protein code from each string (in this example:
YP_177963, CAA15575, CAA17111).

I found only one common criterion to identify the protein codes in ALL my
strings:

I need a sequence of characters selected in this way:

 

start:

the first capital letter that is followed, three characters later, by a digit

 

end: 

the last digit of the run that follows, i.e. the digit just before a non-digit character or the end of the string.

 

Tricky, isn't it?

Well, I'm not an expert. I have experimented a lot with regular expressions and
the sub() command without much success, and also with substring.location in the
Hmisc package (but there I don't know how to use regular expressions).

Maybe there are other, more suitable functions, or maybe it is just a matter of
using regular expressions in a better way...

 

Can anybody help me?

 

Thanks a lot in advance...



Henrique Dallazuanna

2009-Sep-16 14:14 UTC

Try this:

library(gsubfn)
strapply(x, "[A-Z]{3}[0-9]+")
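
A quick way to check this pattern against the three sample strings, without installing gsubfn, is base R's regmatches() with gregexpr(). This is only a sketch: the vector name x and the way it is built here are assumptions, since the thread never shows how x was created. Note that [A-Z]{3} requires three capital letters in a row, so a code containing an underscore, such as YP_177963, is not matched.

```r
# Sample strings from the question; building `x` this way is an assumption.
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE")

# Three capital letters followed by one or more digits.
hits <- regmatches(x, gregexpr("[A-Z]{3}[0-9]+", x))

# Number of matches found in each string.
n_hits <- lengths(hits)

print(hits)
print(n_hits)
```

Here the first string yields no match at all, while the second and third yield CAA15575 and CAA17111.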



-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O

jim holtman

2009-Sep-16 14:15 UTC

This should do it for you:
> pat <- ".*(\\b[A-Z]..[0-9]+).*"
> grep(pat, x)
[1] 1 3 5
> sub(pat, '\\1', x)
[1] "YP_177963" ""          "CAA15575"  ""          "CAA17111"

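One thing to keep in mind with the sub() approach is that sub() returns its input unchanged when the pattern does not match, so a non-empty string without a code would come back whole. A possible alternative, sketched here in base R under the assumption that x holds the sample strings plus one empty element, is to locate the first match with regexpr() and fill in NA where nothing is found. The names x, pat, found and codes are illustrative only.

```r
# Sample strings from the question plus an empty element (assumed input).
x <- c("1159_1; YP_177963; PPE FAMILY PROTEIN",
       "1100_13; SECRETED L-ALANINE DEHYDROGENASE ALD CAA15575",
       "1141_24; gi;2894249;emb;CAA17111.1; PROBABLE ISOCITRATE DEHYDROGENASE",
       "")

# Capital letter at a word boundary, any two characters, then a run of digits.
pat <- "\\b[A-Z]..[0-9]+"

# regexpr() gives the position of the first match in each string, or -1.
m <- regexpr(pat, x, perl = TRUE)
found <- as.integer(m) > 0

# Start with NA everywhere, then fill in the strings that did match.
codes <- rep(NA_character_, length(x))
codes[found] <- regmatches(x, m)

print(codes)
```

With this input, codes holds the three protein codes followed by NA for the empty string.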


-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

Gabor Grothendieck

2009-Sep-16 14:47 UTC

Assuming the rule is an upper case alphabetic character followed by
two other characters followed by a string of digits then try this:
> library(gsubfn)
> strapply(x, "[A-Z][^ ][^ ][0-9]+")
[[1]]
[1] "YP_177963"

[[2]]
[1] "CAA15575"

[[3]]
[1] "CAA17111"

If you prefer the output as one long vector of strings try this:
> strapply(x, "[A-Z][^ ][^ ][0-9]+", simplify = c)
[1] "YP_177963" "CAA15575"  "CAA17111"

If the string that denotes a protein can be part of a word which
itself does not denote a protein then we will need something like
this:
> strapply(x, "\\b[A-Z][^ ][^ ][0-9]+\\b", perl = TRUE)
[[1]]
[1] "YP_177963"

[[2]]
[1] "CAA15575"

[[3]]
[1] "CAA17111"

However, I would expect this second solution using perl's \b to be
much slower because the first one uses tcl code underneath whereas the
second uses R code.
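
To see what the word boundaries change, here is a small base R sketch using regmatches() and gregexpr() rather than gsubfn. The input string s is hypothetical: it was made up so that a code-like run (XAB123) sits inside a longer word next to a free-standing code.

```r
# Hypothetical input: "XAB123" is embedded in a longer word, "CAA15575" stands alone.
s <- "sampleXAB123 CAA15575"

# Without word boundaries, the embedded run is picked up as well.
loose <- regmatches(s, gregexpr("[A-Z][^ ][^ ][0-9]+", s, perl = TRUE))

# With \\b on both sides, only the free-standing token matches.
strict <- regmatches(s, gregexpr("\\b[A-Z][^ ][^ ][0-9]+\\b", s, perl = TRUE))

print(loose[[1]])
print(strict[[1]])
```

The loose pattern returns both XAB123 and CAA15575, while the bounded pattern returns only CAA15575.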

See http://gsubfn.googlecode.com for more.

