I have a two-part question.

Part 1)
I am trying to remove characters from a string based on the position of a key character in another string. I have a solution that works, but it requires a for-loop. A vectorized way of doing this has eluded me.

CleanRead <- function(x, y) {

  if (!is.character(x))
    x <- as.character(x)
  if (!is.character(y))
    y <- as.character(y)

  idx <- grep("\\*", x, value=FALSE)
  starpos <- gregexpr("\\*", x[idx])

  ysplit <- strsplit(y[idx], '')
  n <- length(idx)
  for(i in 1:n) {
    ysplit[[i]][starpos[[i]]] = ""
  }

  y[idx] <- unlist(lapply(ysplit, paste, sep='', collapse=''))
  return(y)
}

x <- c("AA*.*A,,,", "**a.a*,,,A", "C*c..", "**aA")
y <- c("abcdefghi", "abcdefghij", "abcde", "abcd")

CleanRead(x,y)
[1] "abdfghi" "cdeghij" "acde"    "cd"

Is there a better way to do this?

Part 2)
My next step in the string processing is to take the characters in the output of CleanRead and subtract 33 from the ASCII value of each character to obtain an integer. Again I have a solution that works: it splits the string into characters, converts them to factors (starting at ASCII 34), and uses unclass to get the integer value (kind of an atoi(x)-33 all in one step).

I looked for the C equivalent of atoi, but the only help I could find (R-help 2003) suggested using as.numeric. However, the help file (and testing) shows you get 'NA'.

Am I missing an easier way to do this?

Thanks in advance,

Brian
On Wed, Jul 21, 2010 at 1:02 PM, Davis, Brian <Brian.Davis at uth.tmc.edu> wrote:
> I have a two part question
>
> Part 1)
> I am trying to remove characters in a string based on the position of a key character in another string.
> [...]
> Part 2)
> My next step in the string processing is to take the characters in the output of CleanRead and subtract 33 from the ascii value of the character to obtain an integer.
> [...]

This splits x and y into vectors of single characters, extracts those characters from y for which the corresponding character of x is not *, and then matches the result against letters to return a number (the position of each character in the alphabet).

  f <- function(x, y) match(y[x != "*"], letters)
  mapply(f, strsplit(x, ""), strsplit(y, ""))
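As a variation on the reply above, here is a minimal sketch that keeps the same strsplit/mapply idea but returns the cleaned strings themselves, matching the output of the original CleanRead. The helper name clean_read is hypothetical, and the sketch assumes each element of x has the same number of characters as the corresponding element of y, as in the sample data.

```r
# Sample inputs from the original post.
x <- c("AA*.*A,,,", "**a.a*,,,A", "C*c..", "**aA")
y <- c("abcdefghi", "abcdefghij", "abcde", "abcd")

# Hypothetical helper (not from the thread): drop from each element of y
# the characters whose positions hold "*" in the matching element of x.
# Assumes x[i] and y[i] have the same number of characters.
clean_read <- function(x, y) {
  keep <- function(xc, yc) paste(yc[xc != "*"], collapse = "")
  mapply(keep, strsplit(x, ""), strsplit(y, ""), USE.NAMES = FALSE)
}

cleaned <- clean_read(x, y)
print(cleaned)
```

Note that mapply still iterates over the elements internally, so this is more compact than the explicit for-loop rather than truly vectorized.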
On 07/21/2010 10:02 AM, Davis, Brian wrote:
> I have a two part question
>
> Part 1) I am trying to remove characters in a string based on the
> position of a key character in another string. I have a solution that
> works but it requires a for-loop. A vectorized way of doing this has
> eluded me.

Hi Brian --

This sounds like processing short reads from DNA sequencing experiments. The Bioconductor project has well-developed tools for doing these types of operations. See the Bioconductor mailing list, the Biostrings, ShortRead, IRanges, ... packages including their vignettes, and perhaps some of the recent course / training material accessible from the web site.

  http://bioconductor.org/

Also, Thomas Girke's group has a straightforward resource describing use of these tools at

  http://manuals.bioinformatics.ucr.edu/home/ht-seq

If you explore this avenue, then please post messages to the Bioconductor mailing list, where a suitable audience of experienced users will give you prompt advice.

Martin

> [...]
>
> Am I missing an easier way to do this?
>
> Thanks in advance,
>
> Brian
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
Hi Brian,

On 07/21/2010 10:02 AM, Davis, Brian wrote:
[...]
> Part 2)
> My next step in the string processing is to take the characters in the output of CleanRead and subtract 33 from the ascii value of the character to obtain an integer. Again I have a solution that works, involving splitting the string into characters then converting them to factors (starting at ascii 34) and using unclass to get the integer value. (kindof a atoi(x)-33 all in one step)
>
> I looked for the C equivalent of atoi, but the only help I could find (R-help 2003) suggested using as.numeric. However, the help file (and testing) shows you get 'NA'.
>
> Am I missing an easier way to do this?

For this I found that converting to raw with charToRaw(), then to integer with as.integer(), then subtracting, then converting back to character is pretty fast:

  > rawToChar(as.raw(as.integer(charToRaw("lajlkcjppstyyslkajalcmlkjla")) - 33))
  [1] "K@IKJBIOORSXXRKJ@I@KBLKJIK@"

Cheers,
H.

--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
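If the goal is the integer values themselves rather than a shifted string, as Part 2 of the question seems to ask, the final rawToChar() step can simply be dropped. The sketch below uses base R's utf8ToInt(), which accepts one string at a time, and assumes plain ASCII input; the variable names cleaned and scores are illustrative only.

```r
# Output of CleanRead from the original post.
cleaned <- c("abdfghi", "cdeghij", "acde", "cd")

# One integer vector per string: character code minus 33.
# utf8ToInt() takes a single string, hence the lapply().
scores <- lapply(cleaned, function(s) utf8ToInt(s) - 33L)

# For ASCII input the raw-vector route gives the same numbers.
scores_raw <- lapply(cleaned, function(s) as.integer(charToRaw(s)) - 33L)

print(scores[[4]])
```

For the last string, "cd", this yields 66 and 67 (codes 99 and 100, each minus 33).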