For regular expression aficionados, I'd like a cleverer solution to
the following problem (my solution works just fine for my needs; I'm
just trying to improve my regex skills):

Given the string (entered, say, at a readline prompt):

"1 2 -5, 3- 6 4 8 5-7 10"  ## only integers will be entered

parse it to produce the numeric vector:

c(1, 2, 3, 4, 5, 3, 4, 5, 6, 4, 8, 5, 6, 7, 10)

Note that "-" in the expression is used to indicate a range of values
instead of ":".

Here's my UNclever solution:

First convert more than one space to a single space and then replace
"<any spaces>-<any spaces>" by ":" by:

> x1 <- gsub(" *- *",":",gsub(" +"," ",resp))  #giving
> x1
[1] "1 2:5, 3:6 4 8 5:7 10"  ## Note that the comma remains

Next convert the single string into a character vector via strsplit by
splitting on anything but ":" or a digit:

> x2 <- strsplit(x1,split="[^:[:digit:]]+")[[1]]  #giving
> x2
[1] "1" "2:5" "3:6" "4" "8" "5:7" "10"

Finally, parse() the vector, eval() each element, and unlist() the
resulting list of numeric vectors:

> unlist(lapply(parse(text=x2),eval))  #giving, as desired,
[1] 1 2 3 4 5 3 4 5 6 4 8 5 6 7 10

This seems far too clumsy and circumlocutory not to have a more
elegant solution from a true regex expert.

(Special note to Thomas Lumley: This seems one of the few instances
where eval(parse..)) may actually be appropriate.)

Cheers to all,

Bert

--
Bert Gunter
Genentech Nonclinical Biostatistics
Bert,

we can save a lot of time by using paste and then only one call to
eval and parse.

> x2 <- c("1", "2:5", "3:6", "4", "8", "5:7", "10")
> system.time(for (i in 1:100) unlist(lapply(parse(text=x2),eval)))
   user  system elapsed
   0.06    0.00    0.03
> system.time(for (i in 1:100) eval(parse(text=paste("c(",paste(x2,collapse=","),")"))))
   user  system elapsed
   0.01    0.00    0.03

Rich
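For anyone who keeps the eval(parse()) route, here is a minimal sketch that folds the steps above into one function with a single parse call. The function name expand_ranges_eval and the character check in front of the parser are assumptions added for illustration; neither appears in the posted solutions.

```r
# Hypothetical helper (name is illustrative): expand input such as
# "1 2 -5, 3- 6" using a single call to eval(parse()).
expand_ranges_eval <- function(resp) {
  # Assumption: only digits, spaces, commas and hyphens are legal input.
  # Reject anything else before text reaches the R parser.
  if (grepl("[^0-9 ,-]", resp)) stop("unexpected character in input")
  x1 <- gsub(" *- *", ":", resp)                  # "2 -5" becomes "2:5"
  x2 <- strsplit(x1, "[^:[:digit:]]+")[[1]]       # split on spaces and commas
  x2 <- x2[nzchar(x2)]                            # drop an empty leading piece
  eval(parse(text = paste0("c(", paste(x2, collapse = ","), ")")))
}

expand_ranges_eval("1 2 -5, 3- 6 4 8 5-7 10")
```

The check matters because eval(parse()) will run whatever it is handed; malformed input that passes the check (a leading hyphen, say) still fails, but with a parse error rather than by executing anything.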
On Fri, 20 Aug 2010, Bert Gunter wrote:

> Given the string (entered, say, at a readline prompt):
>
> "1 2 -5, 3- 6 4 8 5-7 10"  ## only integers will be entered

Presumably only non-negative integers.

> (Special note to Thomas Lumley: This seems one of the few instances
> where eval(parse..)) may actually be appropriate.)

Yes, implementing a new minilanguage is a valid use. It isn't
necessary, though, and the following could probably be improved on:

s<-"1 2 -5, 3- 6 4 8 5-7 10"
## start positions of every run of digits, and the numbers themselves
npos<-gregexpr("[0-9]+",s)[[1]]
numbers<-as.numeric(substring(s,npos,attr(npos,"match.length")+npos-1))
## for each hyphen, the index of the number immediately before it
hyphens<-findInterval(gregexpr("-",s)[[1]],npos)
nn<-as.list(numbers)
## replace each right-hand endpoint by the remainder of its range;
## SIMPLIFY=FALSE keeps a list even when all ranges have equal length
nn[hyphens+1]<-mapply(seq,numbers[hyphens]+1,numbers[hyphens+1],SIMPLIFY=FALSE)
unlist(nn)

     -thomas

Thomas Lumley
Professor of Biostatistics
University of Washington, Seattle
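The position-based idea above can be wrapped in a function, sketched here under two assumptions that are mine rather than the poster's: every hyphen sits between two numbers, and each range's right endpoint is larger than its left. The name expand_ranges_pos is hypothetical, and the guards for input without numbers or without hyphens are additions.

```r
# Hypothetical wrapper (name is illustrative) around the position-based approach.
expand_ranges_pos <- function(s) {
  m <- gregexpr("[0-9]+", s)
  npos <- m[[1]]                                  # start position of each number
  if (npos[1] == -1) return(numeric(0))           # no numbers in the input
  numbers <- as.numeric(regmatches(s, m)[[1]])
  hpos <- gregexpr("-", s, fixed = TRUE)[[1]]     # position of each hyphen
  nn <- as.list(numbers)
  if (hpos[1] != -1) {
    left <- findInterval(hpos, npos)              # number just before each hyphen
    # Replace each right-hand endpoint by the rest of its range.
    nn[left + 1] <- mapply(seq, numbers[left] + 1, numbers[left + 1],
                           SIMPLIFY = FALSE)
  }
  unlist(nn)
}

expand_ranges_pos("1 2 -5, 3- 6 4 8 5-7 10")
expand_ranges_pos("4 8 10")                       # no ranges at all
```

Because nothing is ever parsed as R code, stray characters in the input are simply ignored instead of being evaluated.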
> Next convert the single string into a character vector via strsplit by
> splitting on anything but ":" or a digit:
>
> > x2 <- strsplit(x1,split="[^:[:digit:]]+")[[1]]  #giving
> > x2
> [1] "1" "2:5" "3:6" "4" "8" "5:7" "10"
>
> This seems far too clumsy and circumlocutory not to have a more
> elegant solution from a true regex expert.

Howdy. I don't know that I can produce anything less circumlocutory,
but I note that your "x2" form has a simple-enough structure that it
can be further parsed with regular expressions, i.e., as opposed to
using parse and eval. I don't know that this is an improvement -- just
a variation on the theme. I've appended an example.
-- Mike

#### Original vector
x <- "1 2 -5, 3- 6 4 8 5-7 10"; x

#### Convert ranges to standard R form
x1 <- gsub("[ ]*-[ ]*", ":", x); x1

#### Get rid of the comma
x2 <- gsub(",", " ", x1); x2

#### Remove extra spaces
x3 <- gsub("[ ]+", " ", x2); x3

#### Split off elements, now in standard form
x4 <- unlist(strsplit(x3, " ")); x4

#### Use regular expression for simple parse of elements
x5 <- sapply(x4, function(a) {
    n1 <- gsub("([[:digit:]]+):[[:digit:]]+", "\\1", a)
    n2 <- gsub("[[:digit:]]+:([[:digit:]]+)", "\\1", a)
    n1:n2}, USE.NAMES=FALSE); x5

x6 <- unlist(x5); x6

##########################################################

> #### Original vector
> x <- "1 2 -5, 3- 6 4 8 5-7 10"; x
[1] "1 2 -5, 3- 6 4 8 5-7 10"
>
> #### Convert ranges to standard R form
> x1 <- gsub("[ ]*-[ ]*", ":", x); x1
[1] "1 2:5, 3:6 4 8 5:7 10"
>
> #### Get rid of the comma
> x2 <- gsub(",", " ", x1); x2
[1] "1 2:5 3:6 4 8 5:7 10"
>
> #### Remove extra spaces
> x3 <- gsub("[ ]+", " ", x2); x3
[1] "1 2:5 3:6 4 8 5:7 10"
>
> #### Split off elements, now in standard form
> x4 <- unlist(strsplit(x3, " ")); x4
[1] "1" "2:5" "3:6" "4" "8" "5:7" "10"
>
> #### Use regular expression for simple parse of elements
> x5 <- sapply(x4, function(a) {
+     n1 <- gsub("([[:digit:]]+):[[:digit:]]+", "\\1", a)
+     n2 <- gsub("[[:digit:]]+:([[:digit:]]+)", "\\1", a)
+     n1:n2}, USE.NAMES=FALSE); x5
[[1]]
[1] 1

[[2]]
[1] 2 3 4 5

[[3]]
[1] 3 4 5 6

[[4]]
[1] 4

[[5]]
[1] 8

[[6]]
[1] 5 6 7

[[7]]
[1] 10

> x6 <- unlist(x5); x6
[1] 1 2 3 4 5 3 4 5 6 4 8 5 6 7 10
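The per-element parse can also be done with capture groups instead of two substitutions. The following is only a sketch of that variation, assuming tokens already in the "a" or "a:b" form of x4; the variable names and the extra "12:15" token (included to exercise multi-digit endpoints) are illustrative additions.

```r
# Tokens in the "a" or "a:b" form; "12:15" is an added multi-digit case.
tokens <- c("1", "2:5", "3:6", "4", "8", "5:7", "10", "12:15")

# regexec() returns the full match followed by each parenthesised group:
# element 2 is the left endpoint, element 4 the right endpoint ("" if absent).
m <- regmatches(tokens, regexec("^([0-9]+)(:([0-9]+))?$", tokens))
from <- as.numeric(vapply(m, `[`, character(1), 2))
to   <- as.numeric(vapply(m, `[`, character(1), 4))
to[is.na(to)] <- from[is.na(to)]        # a lone number is a range of length one

expanded <- unlist(mapply(seq, from, to, SIMPLIFY = FALSE))
expanded
```

Anchoring the pattern with ^ and $ means a malformed token yields NA endpoints and an error from seq() instead of a silently wrong range.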
How about:

x <- "1 2 -5, 3- 6 4 8 5-7 10"; x
library(gsubfn)
strapply( x, '(([0-9]+) *- *([0-9]+))|([0-9]+)',
    function(one,two,three,four) {
        ## 'four' is non-empty when the match is a single number
        if( nchar(four) > 0 ) return(as.numeric(four) )
        ## otherwise 'two' and 'three' are the endpoints of a range
        return( seq( from=as.numeric(two), to=as.numeric(three) ) )
    }
)[[1]]

If x is a vector of strings and you remove the [[1]] then you will get
a list with each element corresponding to a string in x (unlisting
will give a single vector).

This could be easily extended to handle floating point numbers instead
of just integers and even negative numbers (as long as you have a
clear rule to distinguish between a negative number and the end of a
range).

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Bert Gunter
> Sent: Friday, August 20, 2010 2:55 PM
> To: r-help at r-project.org
> Subject: [R] Regex exercise
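If gsubfn is not available, roughly the same one-pattern tokenizing can be sketched in base R with gregexpr() and regmatches(). This is an approximation of the idea rather than a drop-in replacement for strapply(); the function name expand_ranges is hypothetical, and it assumes non-negative integers only.

```r
# Hypothetical base-R helper (name is illustrative). Returns a list with
# one numeric vector per input string.
expand_ranges <- function(x) {
  # One match per token: either "a - b" (a range) or a lone integer.
  tokens <- regmatches(x, gregexpr("[0-9]+( *- *[0-9]+)?", x))
  lapply(tokens, function(tok) {
    ends <- strsplit(tok, " *- *")      # "2 -5" gives c("2", "5"); "4" gives "4"
    unlist(lapply(ends, function(e) {
      e <- as.numeric(e)
      seq(e[1], e[length(e)])           # a lone number expands to itself
    }))
  })
}

expand_ranges("1 2 -5, 3- 6 4 8 5-7 10")[[1]]
expand_ranges(c("1-3", "7 9-10"))       # one list element per string
```

Returning a list keeps the results for each input string separate; calling unlist() on the result gives a single vector.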