Joshua Banta
2014-Jan-04 00:41 UTC
[R] Parse character strings so they "align" (line up/match) properly
Dear listserv,
I don't know if this question is more appropriate for the Bioconductor
listserv or the general R listserv. I am asking it here because I believe this
problem can be solved using regular R commands in the base package. I suspect
you all will be very helpful.
I have genetic sequence data in the following form. Each letter represents a
nucleotide.
ref.sequence <- "ATAGCCGCA"
sequence1 <- "AT[G][C][C]AGCCG[T]CA"
sequence2 <- "ATAGCCGC[C][A][C]A"
sequence3 <- "AT[GCC]AGCCGCA"
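The brackets indicate nucleotide "insertions" relative to the
reference sequence ("ref.sequence"). A given sequence may carry some, all,
or none of the insertions. An insertion may be written one base per bracket
pair (as in sequence1) or grouped in a single bracket pair (as in sequence3).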
What I want is for all of the positions to "align" (line up) properly.
Therefore, a sequence lacking a particular insertion should be padded with one
dash per inserted nucleotide at that position.
I want to end up with this:
ref.sequence should look like this: "AT---AGCCG-C---A"
sequence1 should look like this: "AT[G][C][C]AGCCG[T]C---A"
sequence2 should look like this: "AT---AGCCG-C[C][A][C]A"
sequence3 should look like this: "AT[G][C][C]AGCCG-C---A"
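For concreteness, the padding rule above could be sketched in base R roughly
as follows. This is only an illustrative sketch: the helper names
parse_insertions and align_insertions are made up here (they are not from base
R or Bioconductor), it assumes every sequence reduces to the same number of
reference bases once the brackets are removed, it assumes a shorter insertion is
left-justified and padded on the right, and it makes no attempt at efficiency.

```r
# Hypothetical helper: split a bracket-annotated sequence into its reference
# bases and the insertion (possibly empty) that follows each reference position.
parse_insertions <- function(x) {
  chars <- strsplit(x, "")[[1]]
  bases <- character(0)
  ins <- ""                 # ins[k + 1] holds the bases inserted after position k
  in_bracket <- FALSE
  for (ch in chars) {
    if (ch == "[") {
      in_bracket <- TRUE
    } else if (ch == "]") {
      in_bracket <- FALSE
    } else if (in_bracket) {
      # extend the insertion that follows the most recent reference base
      ins[length(ins)] <- paste0(ins[length(ins)], ch)
    } else {
      bases <- c(bases, ch)
      ins <- c(ins, "")
    }
  }
  list(bases = bases, ins = ins)
}

# Hypothetical helper: pad every insertion slot to the widest insertion seen
# at that slot across all sequences, writing one bracket pair per inserted base.
align_insertions <- function(seqs) {
  parsed <- lapply(seqs, parse_insertions)
  n_ref <- unique(vapply(parsed, function(p) length(p$bases), integer(1)))
  stopifnot(length(n_ref) == 1)   # assumption: same reference length everywhere
  width <- do.call(pmax, lapply(parsed, function(p) nchar(p$ins)))
  vapply(parsed, function(p) {
    bracketed <- gsub("(.)", "[\\1]", p$ins)          # "GCC" -> "[G][C][C]"
    padding <- strrep("-", width - nchar(p$ins))      # dashes for missing bases
    paste0(c("", p$bases), bracketed, padding, collapse = "")
  }, character(1))
}

seqs <- c(ref.sequence = "ATAGCCGCA",
          sequence1 = "AT[G][C][C]AGCCG[T]CA",
          sequence2 = "ATAGCCGC[C][A][C]A",
          sequence3 = "AT[GCC]AGCCGCA")
aligned <- align_insertions(seqs)
print(aligned)
```

The reference sequence is passed in alongside the others so that it gets the
same dashes; strrep() needs R 3.3.0 or later.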
So how can I make this happen efficiently?
Thanks very much in advance,
-----------------------------------
Josh Banta, Ph.D
Assistant Professor
Department of Biology
The University of Texas at Tyler
Tyler, TX 75799
Tel: (903) 565-5655
http://plantevolutionaryecology.org