Joshua Banta
2014-Jan-04 00:41 UTC
[R] Parse character strings so they "align" (line up/match) properly
Dear listserv,

I don't know if this question is more appropriate for the Bioconductor listserv or the general R listserv. I am asking it here because I believe this problem can be solved using regular R commands in the base package. I suspect you all will be very helpful.

I have genetic sequence data in the following form. Each letter represents a nucleotide.

    ref.sequence <- "ATAGCCGCA"
    sequence1 <- "AT[G][C][C]AGCCG[T]CA"
    sequence2 <- "ATAGCCGC[C][A][C]A"
    sequence3 <- "AT[GCC]AGCCGCA"

The brackets indicate nucleotide "insertions" relative to the reference sequence ("ref.sequence"). Some sequences may have some or all of the insertions, some may not.

What I want is for all of the positions to "align" (line up) properly. Therefore, the sequences lacking a particular insertion should get scored with a dash (or dashes) at that position. I want to end up with this:

    ref.sequence should look like this: "AT---AGCCG-C---A"
    sequence1 should look like this:    "AT[G][C][C]AGCCG[T]C---A"
    sequence2 should look like this:    "AT---AGCCG-C[C][A][C]A"
    sequence3 should look like this:    "AT[G][C][C]AGCCG-C---A"

So how can I make this happen efficiently?

Thanks very much in advance,

-----------------------------------
Josh Banta, Ph.D
Assistant Professor
Department of Biology
The University of Texas at Tyler
Tyler, TX 75799
Tel: (903) 565-5655
http://plantevolutionaryecology.org
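For illustration, here is one possible base-R sketch of the alignment described above, not a definitive answer. The function names parse_insertions and align_insertions are hypothetical. It assumes that nucleotides are plain letters, that every sequence reduces to the same bases once the bracketed insertions are removed, and that a shorter insertion is padded with dashes on its right.

```r
# Split one sequence into its reference bases and the insertion string that
# follows each reference position (slot 0 = before the first base).
parse_insertions <- function(seq) {
  # A token is either a bracketed insertion such as "[G]" or "[GCC]",
  # or a single unbracketed base.
  tokens <- regmatches(seq, gregexpr("\\[[A-Za-z]+\\]|[A-Za-z]", seq))[[1]]
  is_ins <- substr(tokens, 1, 1) == "["
  slot   <- cumsum(!is_ins)              # reference bases seen so far
  bases  <- tokens[!is_ins]
  ins    <- character(length(bases) + 1) # slots 0..n live at index slot + 1
  for (k in which(is_ins)) {
    inner <- substr(tokens[k], 2, nchar(tokens[k]) - 1)  # drop the brackets
    ins[slot[k] + 1] <- paste0(ins[slot[k] + 1], inner)
  }
  list(bases = bases, ins = ins)
}

# Pad every sequence so that each insertion slot has the same width.
align_insertions <- function(seqs) {
  parsed <- lapply(seqs, parse_insertions)
  ref <- parsed[[1]]$bases
  # Assumption: all sequences share the same bases outside the brackets.
  stopifnot(all(vapply(parsed, function(p) identical(p$bases, ref),
                       logical(1))))
  # Widest insertion observed in each slot, across all sequences.
  width <- Reduce(pmax, lapply(parsed, function(p) nchar(p$ins)))
  format_one <- function(p) {
    shown <- gsub("([A-Za-z])", "[\\1]", p$ins)      # "GCC" -> "[G][C][C]"
    pad   <- strrep("-", width - nchar(p$ins))       # dashes for missing bases
    slots <- paste0(shown, pad)
    paste0(slots[1], paste0(p$bases, slots[-1], collapse = ""))
  }
  vapply(parsed, format_one, character(1))
}

seqs <- c(ref.sequence = "ATAGCCGCA",
          sequence1    = "AT[G][C][C]AGCCG[T]CA",
          sequence2    = "ATAGCCGC[C][A][C]A",
          sequence3    = "AT[GCC]AGCCGCA")
aligned <- align_insertions(seqs)
print(aligned)
```

Under those assumptions, the four strings in aligned should match the desired output listed in the post. If two sequences carried different insertions of different lengths in the same slot, the right-padding choice would need to be revisited.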