Joshua Banta
2014-Jan-02 03:55 UTC
[R] Data parsing question: adding characters within a string of characters
Dear Listserve,

I have a data-parsing question for you. I recognize this is more in the domain of Perl/Python, but I don't know those languages! On the other hand, I am pretty good overall with R, so I'd rather get the job done within the R "ecosphere."

Here is what I want to do. Consider the following data:

    string <- "ATCGCCCGTA[AGA]TAACCG"

I want to alter string so that it looks like this:

    ATCGCCCGTA[A][G][A]TAACCG

In other words, I want to design a piece of code that will scan a character string, find bracketed groups of characters, break up each character within the bracket into its own individual bracketed character, and then put the group of individually bracketed characters back into the character string. The lengths of the character strings enclosed by a bracket will vary, but in every case I want to do the same thing.

So, for example, another string may look like this:

    string2 <- "ATTATACGCA[AAATGCCCCA]GCTA[AT]GCATTA"

I want to alter string2 so that it looks like this:

    "ATTATACGCA[A][A][A][T][G][C][C][C][C][A]GCTA[A][T]GCATTA"

Thank you all in advance and have a great 2014!

-----------------------------------
Josh Banta, Ph.D
Assistant Professor
Department of Biology
The University of Texas at Tyler
Tyler, TX 75799
Tel: (903) 565-5655
http://plantevolutionaryecology.org
Frede Aakmann Tøgersen
2014-Jan-02 11:19 UTC
[R] Data parsing question: adding characters within a string of characters
Hi Joshua,

This is one way to do it. I'm not sure whether this is an efficient implementation for your needs; it depends on the size of your data.

    string1 <- "ATCGCCCGTA[AGA]TAACCG"
    string2 <- "ATTATACGCA[AAATGCCCCA]GCTA[AT]GCATTA"

    foo <- function(genes) {
      # wrap each character in its own brackets: c("A", "G") -> "[A][G]"
      mypaste <- function(x) paste("[", paste(x, collapse = "]["), "]", sep = "")
      # split the string at the brackets into its pieces
      tmp <- strsplit(genes, "[[:punct:]]")[[1]]
      # positions of the opening and closing brackets
      str <- gregexpr("\\[", genes)[[1]]
      stp <- gregexpr("\\]", genes)[[1]]
      # contents of each bracketed group
      tmp2 <- substring(genes, str + 1, stp - 1)
      # find the bracketed pieces among all pieces and rewrite them
      ndx <- match(tmp2, tmp)
      tmp[ndx] <- lapply(strsplit(tmp2, ""), mypaste)
      result <- paste(tmp, collapse = "")
      return(result)
    }

    > foo(string2)
    [1] "ATTATACGCA[A][A][A][T][G][C][C][C][C][A]GCTA[A][T]GCATTA"
    > foo(string1)
    [1] "ATCGCCCGTA[A][G][A]TAACCG"

Yours sincerely / Med venlig hilsen

Frede Aakmann Tøgersen
Specialist, M.Sc., Ph.D.
Plant Performance & Modeling
Technology & Service Solutions
T +45 9730 5135
M +45 2547 6050
frtog at vestas.com
http://www.vestas.com
Company reg. name: Vestas Wind Systems A/S
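A note on foo() above: it handles one string per call, and match() returns only the first matching piece, so a bracketed group that repeats an earlier piece (for example "AT[AT]") could be rewritten in the wrong place. The following is a minimal sketch of a vectorised alternative built on base R's gregexpr() and `regmatches<-`. The function name split_brackets_vec and the pattern are illustrative assumptions, not part of the posted solution, and the sketch assumes brackets are balanced and not nested.

```r
# Hypothetical helper (name is an assumption): works on a whole character vector.
split_brackets_vec <- function(x) {
  # find every "[...]" group in every element of x
  m <- gregexpr("\\[[^]]+\\]", x)
  # replace each matched group with its characters individually bracketed
  regmatches(x, m) <- lapply(regmatches(x, m), function(hits) {
    inner <- substr(hits, 2, nchar(hits) - 1)   # drop the outer brackets
    vapply(strsplit(inner, ""),
           function(chars) paste0("[", chars, "]", collapse = ""),
           character(1))
  })
  x
}

split_brackets_vec(c("ATCGCCCGTA[AGA]TAACCG", "AT[AT]"))
# [1] "ATCGCCCGTA[A][G][A]TAACCG" "AT[A][T]"
```

Because `regmatches<-` replaces matches by position rather than by value, repeated pieces are not confused with one another.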
Duncan Murdoch
2014-Jan-02 12:27 UTC
[R] Data parsing question: adding characters within a string of characters
R is fine for that sort of operation, using regular expressions for matching and sub() or gsub() for substitution. For example, this code finds all the bracketed strings of 1 or more ATCG letters:

    matches <- gregexpr("[[][ATCG]+]", string)

In the result, which looks like this for your example string,

    [[1]]
    [1] 11
    attr(,"match.length")
    [1] 5
    attr(,"useBytes")
    [1] TRUE

the 11 is the start of the bracketed expression and the 5 is the length of the match. (There may be other starts and lengths if there are multiple bracketed expressions.) So use substr to extract the matches.

You need to be a little careful putting the string back together after adding the extra brackets, because `substr<-` won't replace a string with one of a different length. I use this version instead:

    `mysubstr<-` <- function(x, start, stop, value)
      paste0(substr(x, 1, start - 1), value, substr(x, stop + 1, nchar(x)))

I'll leave the details of the substitutions to you...

Duncan Murdoch
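One way the remaining details might be filled in is sketched below. It follows the outline in this post (gregexpr() for positions, substr() for extraction, a length-changing replacement function for reassembly), but the wrapper name split_brackets and the right-to-left loop are assumptions for illustration. The pattern is written with escaped brackets, which is intended to match the same text as the pattern shown earlier in the post.

```r
# Length-changing replacement function, as described in the post.
`mysubstr<-` <- function(x, start, stop, value)
  paste0(substr(x, 1, start - 1), value, substr(x, stop + 1, nchar(x)))

# Hypothetical wrapper (name is an assumption): handles a single string.
split_brackets <- function(x) {
  m <- gregexpr("\\[[ATCG]+\\]", x)[[1]]
  if (m[1] == -1) return(x)                    # no bracketed groups found
  starts <- as.integer(m)
  stops  <- starts + attr(m, "match.length") - 1
  # work from the last match to the first so earlier positions stay valid
  for (i in rev(seq_along(starts))) {
    inner <- substr(x, starts[i] + 1, stops[i] - 1)
    value <- paste0("[", strsplit(inner, "")[[1]], "]", collapse = "")
    mysubstr(x, starts[i], stops[i]) <- value
  }
  x
}

split_brackets("ATCGCCCGTA[AGA]TAACCG")
# [1] "ATCGCCCGTA[A][G][A]TAACCG"
```

Replacing from right to left matters because each substitution lengthens the string; going left to right would shift the start positions of the matches that follow.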
Gabor Grothendieck
2014-Jan-02 12:55 UTC
[R] Data parsing question: adding characters within a string of characters
Here is a one-line solution:

    library(gsubfn)

    > gsubfn("\\[([^]]+)\\]", ~ paste(paste0("[", strsplit(x, "")[[1]], "]"), collapse = ""), string)
    [1] "ATCGCCCGTA[A][G][A]TAACCG"

    > gsubfn("\\[([^]]+)\\]", ~ paste(paste0("[", strsplit(x, "")[[1]], "]"), collapse = ""), string2)
    [1] "ATTATACGCA[A][A][A][T][G][C][C][C][C][A]GCTA[A][T]GCATTA"
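If adding a package is not an option, a similar one-liner can be sketched in base R with a Perl-style lookahead. This is a different technique from the gsubfn call above, not an equivalent of it: it assumes the sequences contain only uppercase letters and that every "]" closes a matching "[". The function name split_brackets_regex is hypothetical.

```r
# Hypothetical helper (name is an assumption): base R only, vectorised over x.
split_brackets_regex <- function(x) {
  # Insert "][" after every letter that is followed by one or more letters
  # and then a closing bracket, i.e. every letter inside a bracketed group
  # except the last one.
  gsub("([A-Z])(?=[A-Z]+\\])", "\\1][", x, perl = TRUE)
}

split_brackets_regex("ATTATACGCA[AAATGCCCCA]GCTA[AT]GCATTA")
# [1] "ATTATACGCA[A][A][A][T][G][C][C][C][C][A]GCTA[A][T]GCATTA"
```

Letters outside the brackets are left alone because the run of letters after them ends at "[" or at the end of the string, never at "]".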