Joris Meys
2010-May-28 12:21 UTC
[R] How to get values out of a string using regular expressions?
Dear all,
I have a vector of filenames which begins like this:
X <- c("OrthoP1_DNA_str.aln", "OrthoP10_DNA_str.aln", "OrthoP100_DNA_str.aln",
       "OrthoP101_DNA_str.aln", "OrthoP102_DNA_str.aln", "OrthoP103_DNA_str.aln",
       "OrthoP104_DNA_str.aln", "OrthoP105_DNA_str.aln", "OrthoP106_DNA_str.aln",
       "OrthoP107_DNA_str.aln")
Using
grep("(\\d+)", X, perl=TRUE, value=TRUE)
I get the complete strings back, whereas I want the vector:
c(1,10,100,101,102,103,104,105,106,107)
In Perl, wrapping the pattern in parentheses allows for extracting only the
numbers (using a construct with $1, for those who know Perl).
I want to do the same in R, but can't find a way of doing that without
extensive string manipulation. The problem is that the length of the numbers
differs, so I can't use substr.
I tried
> strsplit(X,"\\d+")
[[1]]
[1] "OrthoP"       "_DNA_str.aln"
which gives me exactly what I want to throw away. So:
> strsplit(X,"\\D+")
[[1]]
[1] "" "1"
[[2]]
[1] "" "10"
gives something I can use, but it still requires a lot of list manipulation
afterwards to get the right vector. Is there an option or a function I'm
missing somewhere?
Cheers
Joris
--
Joris Meys
Statistical Consultant
Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
Coupure Links 653
B-9000 Gent
tel : +32 9 264 59 87
Joris.Meys@Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
Gabor Grothendieck
2010-May-28 12:25 UTC
[R] How to get values out of a string using regular expressions?
Try this:
as.numeric(gsub("\\D", "", X))
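The one-liner above deletes every non-digit character, which works here because each filename contains exactly one run of digits. As a minimal sketch (not part of the original reply), the snippet below compares it with two alternatives that are closer to the Perl $1 idiom asked about: sub() with a backreference, and regmatches() with regexpr(). The shortened vector X and the names nums_gsub, nums_sub and nums_rm are chosen for illustration only.

```r
# Shortened version of the filename vector from the question
X <- c("OrthoP1_DNA_str.aln", "OrthoP10_DNA_str.aln", "OrthoP100_DNA_str.aln")

# Approach from the reply: strip every non-digit character
nums_gsub <- as.numeric(gsub("\\D", "", X))

# Capture-group approach, analogous to Perl's $1: match the whole string
# and replace it with the first run of digits via the backreference \\1
nums_sub <- as.numeric(sub("^\\D*(\\d+).*$", "\\1", X))

# Extract the first match of [0-9]+ from each element
# (elements without any match are dropped from the result)
nums_rm <- as.numeric(regmatches(X, regexpr("[0-9]+", X)))

print(nums_gsub)
print(nums_sub)
print(nums_rm)
```

All three give 1, 10 and 100 for this input. They differ once a name holds more than one run of digits: for a hypothetical name such as "OrthoP1_v2.aln", gsub() would return 12, while the sub() version would return only the first number, 1.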