Davis, Brian
2010-Jul-16  17:59 UTC
[R] Deleting a variable number of characters from a string
I have a text processing problem I'm hoping someone can help me solve.  The
issue is this.
I have a character string from which I need to delete a variable number of
characters.  The string itself contains the number of characters
to be deleted.  The number of characters to be deleted is preceded by either a
"+" or a "-".
A toy example:
Suppose I have
> x <- c("A-1CB-2GHX", "*+11gAgggTgtgggH")
> x
[1] "A-1CB-2GHX"       "*+11gAgggTgtgggH"
What I need as output is
"ABX" "*H"
I know I can use gsub to remove the control character and the number portion
with
gsub("(\\-|\\+)([0-9]+)", replacement="", x)
However, I can't figure out how to delete the variable number of characters
after the number portion of the string.
Any ideas?
In case this helps:
> sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
Brian
Hi Davis, 
Please try ??regex
gsub("(\\-|\\+)([0-9]+)(\\w*)(\\w)", replacement="\\4", x)
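One caveat about the pattern above, offered as a sketch rather than a correction: it reproduces the toy output, but it never reads the number. It simply keeps the last word character of each run. The input `y` below is a hypothetical extra case (not from the original post) where the count matters; the desired answer would be "ABD".

```r
x <- c("A-1CB-2GHX", "*+11gAgggTgtgggH")
pat <- "(\\-|\\+)([0-9]+)(\\w*)(\\w)"

# Same call as above: matches the toy example
res_x <- gsub(pat, "\\4", x)
print(res_x)
# [1] "ABX" "*H"

# Hypothetical input: delete 1 character ("C"), keep "BD"
y <- "A-1CBD"
res_y <- gsub(pat, "\\4", y)
print(res_y)
# [1] "AD"   (the "B" is lost because the count is ignored)
```

So this works only when exactly one character is meant to survive after each counted run.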
-----
A R learner.
Gabor Grothendieck
2010-Jul-16  20:23 UTC
[R] Deleting a variable number of characters from a string
On Fri, Jul 16, 2010 at 1:59 PM, Davis, Brian <Brian.Davis at uth.tmc.edu> wrote:
> I have a text processing problem I'm hoping someone can help me solve.  The issue is this.
>
> I have a character string in which I need to delete a variable number of characters from the string.  The string itself contains the number of characters to be deleted.  The number of characters to be deleted is preceded by either a "+" or a "-".
>
> A toy example:
>
> Suppose I have
>
> x<-c("A-1CB-2GHX", "*+11gAgggTgtgggH")
>> x
> [1] "A-1CB-2GHX"       "*+11gAgggTgtgggH"
>
> What I need as output is
> "ABX" "*H"
>
> I know I can use gsub to remove the control character and the number portion with
>
> gsub("(\\-|\\+)([0-9]+)", replacement="", x)
>
> However, I can't figure out how to delete the variable number of characters after the number portion of the string.
>

Using gsubfn in the gsubfn package we match

- the - or + via [-+],
- the digits via \\d+ and
- the remaining characters via [^-+]*

parenthesizing the digits and remaining characters so that they form
back references which are passed to the function as args 1 and 2
respectively.  gsubfn supports a formula notation for functions, and the
function specified in that notation has arguments d and s and a
body which strips the first d characters from s and returns the rest to be
substituted back in:

> library(gsubfn)
> gsubfn("[-+](\\d+)([^-+]*)", d + s ~ substring(s, as.numeric(d) + 1), x)
[1] "ABX" "*H"

See http://gsubfn.googlecode.com for more.
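For anyone who would rather not add a package, the same idea can be sketched in base R with gregexpr() and the replacement form of regmatches(). This is only an illustration under assumptions: `strip_counted` is a made-up helper name, `regmatches<-` needs R 2.14.0 or later (newer than the R 2.11.1 in the question), and, like the gsubfn pattern, it assumes the characters to be deleted contain no "-" or "+" and do not begin with a digit.

```r
# Hypothetical helper: for each "+N..." or "-N..." token, drop the sign,
# the number N, and the N characters that follow it.
strip_counted <- function(x) {
  # One token = sign, digits, then everything up to the next sign
  m <- gregexpr("[-+][0-9]+[^-+]*", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tok) {
    n    <- as.numeric(sub("^[-+]([0-9]+).*$", "\\1", tok))  # the count
    rest <- sub("^[-+][0-9]+", "", tok)                      # text after the count
    substring(rest, n + 1)                                   # keep what follows the n deleted chars
  })
  x
}

x <- c("A-1CB-2GHX", "*+11gAgggTgtgggH")
res <- strip_counted(x)
print(res)
# [1] "ABX" "*H"
```

The substring(rest, n + 1) step is the same trick as in the gsubfn call: skipping n characters is the same as starting at position n + 1.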