thr3ads.net - R help - [R] Regex exercise [Aug 2010]

Bert Gunter

2010-Aug-20 20:55 UTC

[R] Regex exercise

For regular expression aficionados, I'd like a cleverer solution to
the following problem (my solution works just fine for my needs; I'm
just trying to improve my regex skills):

Given the string (entered, say, at a readline prompt):

 "1   2 -5, 3- 6 4  8 5-7 10"   ## only integers will be entered

parse it to produce the numeric vector:

c(1, 2, 3, 4, 5, 3, 4, 5, 6, 4, 8, 5, 6, 7, 10)

Note that "-" in the expression is used to indicate a range of values
instead of ":"

Here's my UNclever solution:

First convert more than one space to a single space and then replace
"<any spaces>-<any spaces>" by ":" by:
> x1 <- gsub(" *- *",":",gsub(" +"," ",resp))  #giving
> x1
[1] "1 2:5, 3:6 4 8 5:7 10"    ## Note that the comma remains

Next convert the single string into a character vector via strsplit by
splitting on anything but ":" or a digit:
> x2 <- strsplit(x1,split="[^:[:digit:]]+")[[1]]   #giving
> x2
[1] "1"    "2:5"  "3:6"  "4"    "8"    "5:7"  "10"

Finally, parse() the vector, eval() each element, and unlist() the
resulting list of numeric vectors:
> unlist(lapply(parse(text=x2),eval)) #giving, as desired,
 [1]  1  2  3  4  5  3  4  5  6  4  8  5  6  7 10
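
The three steps above can be collected into one function. This is only a minimal sketch: the name parse_ranges_eval is hypothetical (it does not appear in the thread), and the input check is an added assumption so that nothing except digits, spaces, commas and hyphens can ever reach eval(parse()).

```r
# Hypothetical helper wrapping the normalise / split / eval steps.
parse_ranges_eval <- function(resp) {
  # Guard (an addition, not in the original post): refuse anything other
  # than digits, spaces, commas and hyphens before text is evaluated.
  if (grepl("[^-0-9 ,]", resp)) stop("unexpected character in input")
  x1 <- gsub(" *- *", ":", gsub(" +", " ", resp))    # "2 -5" becomes "2:5"
  x2 <- strsplit(x1, split = "[^:[:digit:]]+")[[1]]  # split on non-digit, non-colon
  x2 <- x2[nzchar(x2)]                               # drop empty pieces
  unlist(lapply(parse(text = x2), eval))             # evaluate "2:5" etc.
}

res <- parse_ranges_eval("1   2 -5, 3- 6 4  8 5-7 10")
res
```

For the example string this should reproduce the vector shown above; input containing any other character stops with an error instead of being evaluated.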


This seems far too clumsy and circumlocutory not to have a more
elegant solution from a true regex expert.

(Special note to Thomas Lumley: This seems one of the few instances
where eval(parse(...)) may actually be appropriate.)

Cheers to all,

Bert

-- 
Bert Gunter
Genentech Nonclinical Biostatistics

RICHARD M. HEIBERGER

2010-Aug-20 21:16 UTC

Bert,

we can save a lot of time by using paste and then only one call to eval and
parse.
> x2 <- c("1", "2:5", "3:6", "4", "8", "5:7", "10")
> system.time(for (i in 1:100) unlist(lapply(parse(text=x2),eval)))
   user  system elapsed
   0.06    0.00    0.03
> system.time(for (i in 1:100)
+   eval(parse(text=paste("c(",paste(x2,collapse=","),")"))))
   user  system elapsed
   0.01    0.00    0.03
>

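
To see why a single parse is enough: paste() first assembles one call to c() as text, so the parser runs once rather than once per element. A small sketch using the same x2 (the names expr_text, res_single and res_each are made up for illustration):

```r
x2 <- c("1", "2:5", "3:6", "4", "8", "5:7", "10")

# Build one expression string instead of seven separate ones.
expr_text <- paste("c(", paste(x2, collapse = ","), ")")
expr_text
# [1] "c( 1,2:5,3:6,4,8,5:7,10 )"

# One parse() and one eval() then give the whole vector.
res_single <- eval(parse(text = expr_text))

# Element-by-element version, for comparison.
res_each <- unlist(lapply(parse(text = x2), eval))
stopifnot(all(res_single == res_each))
```

Both versions give the same values; they differ only in how many times parse() and eval() are called.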
Rich

Thomas Lumley

2010-Aug-20 21:33 UTC

On Fri, 20 Aug 2010, Bert Gunter wrote:
> Given the string (entered, say, at a readline prompt):
>
> "1   2 -5, 3- 6 4  8 5-7 10"   ## only integers will be entered
Presumably only non-negative integers
> (Special note to Thomas Lumley: This seems one of the few instances
> where eval(parse..)) may actually be appropriate.)
>
Yes, implementing a new minilanguage is a valid use.  It isn't necessary,
though, and the following could probably be improved on:

s<-"1   2 -5, 3- 6 4  8 5-7 10"

npos<-gregexpr("[0-9]+",s)[[1]]
numbers<-as.numeric(substring(s,npos,attr(npos,"match.length")+npos-1))
hyphens<-findInterval(gregexpr("-",s)[[1]],npos)
nn<-as.list(numbers)
## SIMPLIFY=FALSE keeps a list even when all ranges have the same length
nn[hyphens+1]<-mapply(seq,numbers[hyphens]+1,numbers[hyphens+1],SIMPLIFY=FALSE)
unlist(nn)
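
How the findInterval() step works: gregexpr() gives the start position of every number, and findInterval() reports, for each hyphen, the index of the last number that starts before it. That number is the left end of a range and the next number is the right end. The sketch below reruns the same approach on the example string and shows the intermediate values; SIMPLIFY = FALSE keeps mapply() returning a list even when every range happens to have the same length, and the name result is only for illustration.

```r
s <- "1   2 -5, 3- 6 4  8 5-7 10"

npos <- gregexpr("[0-9]+", s)[[1]]   # start position of each number
as.integer(npos)
# [1]  1  5  8 11 14 16 19 21 23 25

numbers <- as.numeric(substring(s, npos, npos + attr(npos, "match.length") - 1))
numbers
# [1]  1  2  5  3  6  4  8  5  7 10

# For each hyphen, the index of the number that starts just before it.
hyphens <- findInterval(gregexpr("-", s)[[1]], npos)
hyphens
# [1] 2 4 8

# numbers[hyphens] are range starts, numbers[hyphens + 1] are range ends.
# The start is already in the list, so each filled-in range begins at start + 1.
nn <- as.list(numbers)
nn[hyphens + 1] <- mapply(seq, numbers[hyphens] + 1, numbers[hyphens + 1],
                          SIMPLIFY = FALSE)
result <- unlist(nn)
```

Note that this relies on the input containing at least one hyphen and on every range being strictly increasing.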


      -thomas

Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle

Michael Hannon

2010-Aug-20 23:39 UTC

Howdy.  I don't know that I can produce anything less circumlocutory, but I
note that your "x2" form has a simple-enough structure that it can be further
parsed with regular expressions, i.e., as opposed to using parse and eval.  I
don't know that this is an improvement -- just a variation on the theme.

I've appended an example.

-- Mike

#### Original vector
x <- "1  2 -5, 3- 6 4  8 5-7 10"; x

#### Convert ranges to standard R form
x1 <- gsub("[ ]*-[ ]*", ":", x); x1

#### Get rid of the comma
x2 <- gsub(",", " ", x1); x2

#### Remove extra spaces
x3 <- gsub("[ ]+", " ", x2); x3

#### Split off elements, now in standard form
x4 <- unlist(strsplit(x3, " ")); x4

#### Use regular expression for simple parse of elements
x5 <- sapply(x4, function(a) {
          n1 <- gsub("([[:digit:]]):[[:digit:]]", "\\1", a)
          n2 <- gsub("[[:digit:]]:([[:digit:]])", "\\1", a)
          n1:n2}, USE.NAMES=FALSE); x5
x6 <- unlist(x5); x6

##########################################################
> #### Original vector
> x <- "1  2 -5, 3- 6 4  8 5-7 10"; x
[1] "1  2 -5, 3- 6 4  8 5-7 10"
> 
> #### Convert ranges to standard R form
> x1 <- gsub("[ ]*-[ ]*", ":", x); x1
[1] "1  2:5, 3:6 4  8 5:7 10"
> 
> #### Get rid of the comma
> x2 <- gsub(",", " ", x1); x2
[1] "1  2:5  3:6 4  8 5:7 10"
> 
> #### Remove extra spaces
> x3 <- gsub("[ ]+", " ", x2); x3
[1] "1 2:5 3:6 4 8 5:7 10"
> 
> #### Split off elements, now in standard form
> x4 <- unlist(strsplit(x3, " ")); x4
[1] "1"   "2:5" "3:6" "4"   "8"   "5:7" "10" 
> 
> #### Use regular expression for simple parse of elements
> x5 <- sapply(x4, function(a) {
+           n1 <- gsub("([[:digit:]]):[[:digit:]]", "\\1", a)
+           n2 <- gsub("[[:digit:]]:([[:digit:]])", "\\1", a)
+           n1:n2}, USE.NAMES=FALSE); x5
[[1]]
[1] 1

[[2]]
[1] 2 3 4 5

[[3]]
[1] 3 4 5 6

[[4]]
[1] 4

[[5]]
[1] 8

[[6]]
[1] 5 6 7

[[7]]
[1] 10
> x6 <- unlist(x5); x6
 [1]  1  2  3  4  5  3  4  5  6  4  8  5  6  7 10
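
One caveat on the element-parsing step above: the two patterns match a single digit on each side of the colon. That is all the example needs, but an element such as "10:12" would not be split correctly (the first pattern would turn it into "102"). A possible variant, not from the original post, splits each element on ":" instead; expand_element is a hypothetical name.

```r
# Hypothetical helper: expand "a:b" to a, a+1, ..., b and leave "a" as a.
expand_element <- function(a) {
  ends <- as.numeric(strsplit(a, ":", fixed = TRUE)[[1]])
  seq(ends[1], ends[length(ends)])
}

# Same elements as x4 above, plus one multi-digit range.
x4 <- c("1", "2:5", "3:6", "4", "8", "5:7", "10", "10:12")
x6 <- unlist(lapply(x4, expand_element))
x6
# [1]  1  2  3  4  5  3  4  5  6  4  8  5  6  7 10 10 11 12
```

Splitting on the colon avoids any assumption about how many digits each endpoint has.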

Greg Snow

2010-Aug-23 20:21 UTC

How about:

x <- "1  2 -5, 3- 6 4  8 5-7 10"; x

library(gsubfn)

strapply( x, '(([0-9]+) *- *([0-9]+))|([0-9]+)', 
	function(one,two,three,four) {
		if( nchar(four) > 0 ) return(as.numeric(four) )
		return( seq( from=as.numeric(two), to=as.numeric(three) ) )
	}
)[[1]]



If x is a vector of strings and you remove the [[1]] then you will get a list
with each element corresponding to a string in x (unlisting will give a single
vector).

This could easily be extended to handle floating point numbers instead of just
integers, and even negative numbers (as long as you have a clear rule to
distinguish between a negative number and the end of a range).
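
For readers without gsubfn, roughly the same token-by-token idea can be sketched in base R with gregexpr() and regmatches(): match either "a - b" or a lone integer, then expand each match. This is an illustration under assumptions, not a drop-in replacement for strapply(); parse_ranges is a hypothetical name and only non-negative integers are handled.

```r
# Hypothetical base-R sketch: one list element per input string.
parse_ranges <- function(x) {
  # Each token is either a range ("2 -5") or a single integer ("4").
  tokens <- regmatches(x, gregexpr("[0-9]+ *- *[0-9]+|[0-9]+", x))
  lapply(tokens, function(tok) {
    unlist(lapply(tok, function(tk) {
      ends <- as.numeric(strsplit(tk, " *- *")[[1]])  # one or two endpoints
      seq(ends[1], ends[length(ends)])
    }))
  })
}

x <- c("1  2 -5, 3- 6 4  8 5-7 10", "12-14 20")
r <- parse_ranges(x)
r[[1]]
r[[2]]
```

As with the strapply() version, the result is a list with one element per string in x, and unlist() gives a single vector.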

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

