Hi,

I am a newbie in R and am working on some DNA data represented as strings of A, C, T and G (plus wildcard characters such as M and X). I use the Bioconductor package in R.

Currently I need to convert a string of the form "ACCTGMX" to "1223400", i.e. A is replaced by 1, C by 2, T by 3, G by 4 and any other character by 0. I looked at 'replace' and also at a function called 'copySubstitute' in the Biobase package, but that one only works on files. The data here is a string ("ACCTGMX") and we need to convert it to another string ("1223400").

At the moment I use the strsplit function to split "ACCTGM" into "A" "C" "C" "T" "G" "M" and then use 'which' to assign the corresponding numbers. Is there a faster way to do this, or some function I can make use of?

Please advise.

Thank you.

--
View this message in context: http://r.789695.n4.nabble.com/ACCTGMX-to-1223400-in-R-tp2294636p2294636.html
Sent from the R help mailing list archive at Nabble.com.
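
For reference, the split-and-look-up approach described in the question might look roughly like the sketch below. The poster's actual code was not shown, so this is only an assumption about its shape: the function name encode_split is hypothetical, and a named lookup vector stands in for the 'which' calls.

```r
# Hypothetical helper, not from the original post: split each string into
# single characters, look each one up in a named vector, and paste back.
encode_split <- function(x) {
  codes <- c(A = "1", C = "2", T = "3", G = "4")
  vapply(strsplit(x, "", fixed = TRUE), function(chars) {
    out <- codes[chars]      # characters without a code come back as NA
    out[is.na(out)] <- "0"   # everything else becomes "0"
    paste(out, collapse = "")
  }, character(1))
}

encode_split("ACCTGMX")
# [1] "1223400"
```

This works on a whole character vector at once, but it still loops over every string at the R level, which is why the replies look for a vectorised alternative.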

On Jul 19, 2010, at 5:31 PM, John1983 wrote:

> Hi,
>
> I am a newbie in R and was working on some DNA data represented as strings
> of A,C,T and G (also wild-character like M and X). I use the Bioconductor
> package in R.

Well, I guess it's sort of a "meta" package, but it is really more of a subculture. It also has its own mailing list.

> Currently I need to convert a string of the form "ACCTGMX" to "1223400"
> i.e. A is replaced by 1, C with 2, T with 3, G with 4 and any other
> character with a 0. I checked with 'replace' and also with a function
> called 'copySubstitute' found in the Biobase package but this is only for
> files.
> The data here is a string ("ACCTGMX" ) and we need to convert it to yet
> another string ("1223400"). Now I use the strsplit function to split
> "ACCTGM" into "A" "C" "C" "T" "G" "M" and then use 'which' to assign the
> corresponding numbers.
> Is there a faster way to do this or some function I can make use of?

    > tst <- rep("ACCTGMX", 5)
    > newtst <- gsub("A", "1", tst)
    > newtst <- gsub("C", "2", newtst)
    > newtst <- gsub("T", "3", newtst)
    > newtst <- gsub("G", "4", newtst)
    > newtst <- gsub("[[:alpha:]]", "0", newtst)
    > newtst
    [1] "1223400" "1223400" "1223400" "1223400" "1223400"

There is also a rollapply function in the zoo package and an strapply function in the gsubfn package that might be even more powerful, but I am insufficiently talented to give you a one-liner using them.

> Please advise.
>
> Thank you.

--
David Winsemius, MD
West Hartford, CT
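
The chain of gsub calls above can be folded into a small function so the mapping lives in one place. This is only a sketch: the name encode_gsub and the map argument are made up for illustration. The catch-all replacement has to run last, after the four bases have already become digits, otherwise it would overwrite them.

```r
# Hypothetical wrapper around the gsub chain; 'map' is an assumed argument.
encode_gsub <- function(x, map = c(A = "1", C = "2", T = "3", G = "4")) {
  for (base in names(map)) {
    # fixed = TRUE: treat each base as a literal character, not a regex
    x <- gsub(base, map[[base]], x, fixed = TRUE)
  }
  # Any letter still left is not one of the mapped bases
  gsub("[[:alpha:]]", "0", x)
}

encode_gsub(rep("ACCTGMX", 2))
# [1] "1223400" "1223400"
```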

Here is another way of doing it, with 'chartr'. I only assume that you have the uppercase letters, but you can add to the strings to cover any other characters:

    > tst <- rep("ACCTGMX", 5)
    > chartr("ACTGBDEFHIJKLMNOPQRSUVWXYZ", "12340000000000000000000000", tst)
    [1] "1223400" "1223400" "1223400" "1223400" "1223400"

On Mon, Jul 19, 2010 at 5:31 PM, John1983 <sandhya_prabhakaran at yahoo.com> wrote:
> Currently I need to convert a string of the form "ACCTGMX" to
> "1223400" i.e. A is replaced by 1, C with 2, T with 3, G with 4 and any
> other character with a 0.
> [...]
> Is there a faster way to do this or some function I can make use of?

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
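
Typing the two 26-character strings by hand is easy to get wrong, and chartr stops with an error if the first string is longer than the second. One possible way around that is to build both strings from LETTERS, as in this sketch. The function name encode_chartr is hypothetical, and, like the reply above, it assumes uppercase input only.

```r
# Hypothetical helper: build the chartr translation strings programmatically.
encode_chartr <- function(x) {
  bases  <- c("A", "C", "T", "G")
  others <- setdiff(LETTERS, bases)            # the 22 remaining capitals
  old <- paste(c(bases, others), collapse = "")
  new <- paste(c("1", "2", "3", "4", rep("0", length(others))), collapse = "")
  stopifnot(nchar(old) == nchar(new))          # chartr needs matching lengths
  chartr(old, new, x)
}

encode_chartr(rep("ACCTGMX", 2))
# [1] "1223400" "1223400"
```

Characters outside A to Z (lowercase letters, digits, gaps) pass through unchanged here, so they would need to be added to both strings if they can occur in the data.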

On Mon, Jul 19, 2010 at 5:31 PM, John1983 <sandhya_prabhakaran at yahoo.com> wrote:
> Currently I need to convert a string of the form "ACCTGMX" to
> "1223400" i.e. A is replaced by 1, C with 2, T with 3, G with 4 and any
> other character with a 0.
> [...]
> Is there a faster way to do this or some function I can make use of?

Here are a few alternatives.

The first uses chartr, which translates the ith character in the first string to the ith character in the second string. If speed is a consideration, note that this alternative is the fastest by far.

The second translates just A, C, T and G using chartr and then uses gsub to translate everything else to 0. Like the first, this alternative only uses core R functionality. It is intermediate in speed and simplicity between the other two.

The third uses gsubfn, which is like gsub but allows the replacement to be a list. In that case, if the match equals a name in the list it is replaced with that component, and if no name is matched then the unnamed component at the end is used as the replacement. This one has the advantage that it is particularly simple to specify.

    # 1
    chartr("ABCDEFGHIJKLMNOPQRSTUVWXYZ", "10200040000000000003000000", "ACCTGMX")

    # 2
    gsub("[^1-4]", "0", chartr("ACTG", "1234", "ACCTGMX"))

    # 3
    library(gsubfn)
    gsubfn(".", list(A = 1, C = 2, T = 3, G = 4, 0), "ACCTGMX")
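
The speed claim is easy to check on your own data. The sketch below compares the two base-R alternatives with system.time, with the bases ordered A, C, T, G in the second one so that T maps to 3 and G to 4. The test data, the repeat count and the names via_chartr and via_gsub are all assumptions made for illustration, and the gsubfn variant is left out so the example runs without extra packages. Timings vary by machine, so none are shown.

```r
set.seed(1)
# Hypothetical test data: 10,000 random strings of length 50
alphabet <- c("A", "C", "T", "G", "M", "X")
dna <- replicate(10000,
                 paste(sample(alphabet, 50, replace = TRUE), collapse = ""))

# Alternative 1: one chartr call covering all 26 capitals
new <- rep("0", 26)
new[match(c("A", "C", "T", "G"), LETTERS)] <- c("1", "2", "3", "4")
old_str <- paste(LETTERS, collapse = "")
new_str <- paste(new, collapse = "")
via_chartr <- function(x) chartr(old_str, new_str, x)

# Alternative 2: chartr for the four bases, then gsub for everything else
via_gsub <- function(x) gsub("[^1-4]", "0", chartr("ACTG", "1234", x))

# Both must agree before the timings mean anything
stopifnot(identical(via_chartr(dna), via_gsub(dna)))

system.time(for (i in 1:20) via_chartr(dna))
system.time(for (i in 1:20) via_gsub(dna))
```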