Hello,
I would like to have a function that retrieves matching strings in the same
way as java.util.regex (Java 1.4.2).
Example:
f('^.*(xx?)\\.([0-9]*)$','abcxx.785')
=>
c('xx','785')
First of all: is it possible to achieve this with
grep(... perl=TRUE, value=TRUE)?
As I would call this function very often with large data, I'm reluctant to
use SJava for performance reasons.
Am I wrong to assume that calling Java would be slower or use more memory
than a native R function?
Does someone already have a solution for this? :)
Thanks,
Marc Mamin
Gabor Grothendieck
2004-Oct-27  14:02 UTC
[R] regexp,grep: capturing more than one substring
Marc Mamin <M.Mamin <at> intershop.de> writes:
: 
: Hello,
: 
: I would like to have a function that retrieves matching strings in the
: same way as java.util.regex (Java 1.4.2).
: 
: Example:
: 
: f('^.*(xx?)\\.([0-9]*)$','abcxx.785')
: =>
: c('xx','785')
: 
: First of all: is it possible to achieve this with
: grep(... perl=TRUE, value=TRUE)?
Actually you don't even need perl= to do that.  The
function below pastes together a replacement string like
"\\1 \\2", where n determines how many backreferences it
contains.  It then calls gsub with the regexp in r and
finally splits the result into individual strings.
The calculation of n, the number of backreferences, is
not foolproof: it simply counts the "(" characters in r.
Specify your own n if your expression has parentheses
that are not capturing groups, such as an escaped literal
parenthesis.  Specifying n might also speed it up a bit,
e.g. n = 2 in the example.  The value of sep= should be
a delimiter that does not occur in your strings.
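To make those two steps concrete, here is a minimal sketch of how the
default n is counted and how the replacement string is put together.
The names repl, r2 and n2 are used for illustration only and are not
part of the function below.

```r
# Pattern from the example.
r <- '^.*(xx?)\\.([0-9]*)$'

# Default n: delete everything that is not "(" and count what is left.
n <- nchar(gsub("[^(]", "", r))

# Replacement string: one backreference per group, joined by sep.
sep <- "\10"   # octal escape for the backspace character
repl <- paste("\\", 1:n, sep = "", collapse = sep)

# A pattern with an escaped, literal "(" is over-counted:
# it has one capturing group, but two "(" characters.
r2 <- 'f\\((x+)\\)'
n2 <- nchar(gsub("[^(]", "", r2))
```

Here n is 2 and repl holds the two backreferences separated by a
backspace.  For r2 the count is also 2 although only one group
captures, which is the case where you would pass n = 1 yourself.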
s can be a vector of strings.  In any case the function
returns a list of character vectors, one list element for
each element of s.  If s is a single string, it returns
a one-element list holding the matches as a vector.  In
that case you may wish to call it like this:
f(...args...)[[1]], as shown in the example.
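# r: regexp with capturing groups; s: character vector to match against
# n: number of groups (default: count of "(" in r); sep: delimiter not in s
f <- function(r, s, n = nchar(gsub("[^(]", "", r)), sep = "\10") {
    # replace each string by its groups joined with sep
    x <- gsub(r, paste("\\", 1:n, sep = "", collapse = sep), s)
    # split on sep: one character vector per element of s
    strsplit(x, split = sep)
}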
f( '^.*(xx?)\\.([0-9]*)$', 'abcxx.785' )[[1]]
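For comparison, here is a minimal sketch of another way to get the
groups, assuming an R version whose base package provides regexec()
and regmatches().  The helper name capture_groups is hypothetical and
the pattern is a simplified one chosen for illustration.

```r
# Hypothetical helper: return the captured groups for each element of s
# as a list of character vectors.
capture_groups <- function(r, s) {
    # regexec() records the position of the full match and of each group
    m <- regexec(r, s)
    # regmatches() extracts the substrings; drop the first (full match)
    lapply(regmatches(s, m), function(x) x[-1])
}

res <- capture_groups('^([a-z]+)\\.([0-9]+)$', c('abc.785', 'no match'))
```

Unlike f, this gives an empty vector for a string that does not match,
whereas gsub leaves such a string unchanged.  It also needs neither n
nor sep.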