Davis, Brian
2010-Jul-16 17:59 UTC
[R] Deleting a variable number of characters from a string
I have a text processing problem I'm hoping someone can help me solve. The issue is this.

I have a character string from which I need to delete a variable number of characters. The string itself contains the number of characters to be deleted. That number is preceded by either a "+" or a "-".

A toy example:

Suppose I have

x<-c("A-1CB-2GHX", "*+11gAgggTgtgggH")
> x
[1] "A-1CB-2GHX"       "*+11gAgggTgtgggH"

What I need as output is
"ABX" "*H"

I know I can use gsub to remove the control character and the number portion with

gsub("(\\-|\\+)([0-9]+)", replacement="", x)

However, I can't figure out how to delete the variable number of characters after the number portion of the string.

Any ideas?

In case this helps
> sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-pc-mingw32

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

Brian
Hi Davis,

Please try ??regex

gsub("(\\-|\\+)([0-9]+)(\\w*)(\\w)", replacement="\\4", x)

-----
A R learner.
Gabor Grothendieck
2010-Jul-16 20:23 UTC
[R] Deleting a variable number of characters from a string
On Fri, Jul 16, 2010 at 1:59 PM, Davis, Brian <Brian.Davis at uth.tmc.edu> wrote:
> I have a text processing problem I'm hoping someone can help me solve. The issue is this.
>
> I have a character string from which I need to delete a variable number of characters. The string itself contains the number of characters to be deleted. That number is preceded by either a "+" or a "-".
>
> A toy example:
>
> Suppose I have
>
> x<-c("A-1CB-2GHX", "*+11gAgggTgtgggH")
>> x
> [1] "A-1CB-2GHX"       "*+11gAgggTgtgggH"
>
> What I need as output is
> "ABX" "*H"
>
> I know I can use gsub to remove the control character and the number portion with
>
> gsub("(\\-|\\+)([0-9]+)", replacement="", x)
>
> However, I can't figure out how to delete the variable number of characters after the number portion of the string.
>

Using gsubfn in the gsubfn package we match

- the - or + via [-+],
- the digits via \\d+ and
- the remaining characters via [^-+]*

parenthesizing the digits and remaining characters so that they form back references which are passed to the function as args 1 and 2 respectively. gsubfn supports a formula notation for functions. The function specified here in that notation has arguments d and s, and its body strips the first d characters from s and returns the rest to be substituted back in:

> library(gsubfn)
> gsubfn("[-+](\\d+)([^-+]*)", d + s ~ substring(s, as.numeric(d) + 1), x)
[1] "ABX" "*H"

See http://gsubfn.googlecode.com for more.
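
For comparison, the same idea can probably be written without the gsubfn package. The following is only a minimal sketch in base R, not something posted in this thread. It assumes an R version that provides the replacement form of regmatches (newer than the 2.11.1 shown in the question), and the helper name strip_counted is made up for illustration. Like the gsubfn call, it treats everything up to the next "-" or "+" as the text following a count.

```r
x <- c("A-1CB-2GHX", "*+11gAgggTgtgggH")

# Hypothetical helper (name chosen for this sketch only).
strip_counted <- function(x) {
  # A sign, the digits, then every character up to the next sign.
  pat <- "[-+][0-9]+[^-+]*"
  m <- gregexpr(pat, x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    # The count that follows the sign in each match.
    n <- as.numeric(sub("^[-+]([0-9]+).*$", "\\1", hits))
    # The text that follows the count.
    s <- sub("^[-+][0-9]+", "", hits)
    # Drop the first n characters and keep whatever is left.
    substring(s, n + 1)
  })
  x
}

strip_counted(x)
# [1] "ABX" "*H"
```

If a count is larger than the number of characters available before the next sign, substring() returns an empty string for that match, so the whole run is removed. Strings with no sign in them are returned unchanged.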