thr3ads.net - R help - [R] Parse character strings so they "align" (line up/match) properly [Jan 2014]

Joshua Banta

2014-Jan-04 00:41 UTC

[R] Parse character strings so they "align" (line up/match) properly

Dear listserv,

I don't know if this question is more appropriate for the Bioconductor
listserv or the general R listserv. I am asking it here because I believe this
problem can be solved using regular R commands in the base package. I suspect
you all will be very helpful.

I have genetic sequence data in the following form. Each letter represents a
nucleotide.

ref.sequence <- "ATAGCCGCA"
sequence1 <- "AT[G][C][C]AGCCG[T]CA"
sequence2 <- "ATAGCCGC[C][A][C]A"
sequence3 <- "AT[GCC]AGCCGCA"

The brackets mark nucleotide "insertions" relative to the reference
sequence ("ref.sequence"). Any given sequence may carry all, some, or none
of these insertions.

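What I want is for all of the positions to "align" (line up) properly.
Therefore, a sequence lacking a particular insertion should get one dash per
missing nucleotide at that position, and a multi-nucleotide insertion such as
"[GCC]" should be written one nucleotide per bracket ("[G][C][C]").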

I want to end up with this:

ref.sequence should look like this: "AT---AGCCG-C---A"
sequence1 should look like this: "AT[G][C][C]AGCCG[T]C---A"
sequence2 should look like this: "AT---AGCCG-C[C][A][C]A"
sequence3 should look like this: "AT[G][C][C]AGCCG-C---A"

So how can I make this happen efficiently?

Thanks very much in advance,
-----------------------------------
Josh Banta, Ph.D
Assistant Professor
Department of Biology
The University of Texas at Tyler
Tyler, TX 75799
Tel: (903) 565-5655
http://plantevolutionaryecology.org
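
One possible way to approach the question above in base R is sketched below. It is not a tested or definitive solution. The function names parse_insertions and align_insertions are made up for illustration, and the sketch assumes that (a) every sequence has the same number of unbracketed bases as the reference, (b) insertions only ever appear inside square brackets, and (c) when a sequence carries a shorter insertion than another sequence at the same position, the dashes go after its inserted bases.

```r
# Hypothetical helper: split one sequence into its plain bases and the
# insertion (possibly "") found in each "slot". Slot k holds whatever was
# inserted after k - 1 plain bases, so there are length(bases) + 1 slots.
parse_insertions <- function(s) {
  # Tokens are either a bracketed run such as "[GCC]" or a single letter.
  tokens <- regmatches(s, gregexpr("\\[[A-Za-z]*\\]|[A-Za-z]", s))[[1]]
  is_ins <- substr(tokens, 1, 1) == "["
  bases  <- tokens[!is_ins]
  ins_tokens <- gsub("\\[|\\]", "", tokens[is_ins])
  # Number of plain bases seen before each insertion token, plus one.
  slot <- cumsum(!is_ins)[is_ins] + 1
  ins <- character(length(bases) + 1)
  for (k in seq_along(slot)) {
    ins[slot[k]] <- paste0(ins[slot[k]], ins_tokens[k])
  }
  list(bases = bases, ins = ins)
}

# Hypothetical helper: pad every sequence so that all insertion slots have
# the same width across sequences.
align_insertions <- function(seqs) {
  parsed  <- lapply(seqs, parse_insertions)
  n_bases <- vapply(parsed, function(p) length(p$bases), integer(1))
  # Assumption (a): all sequences share the same count of plain bases.
  stopifnot(length(unique(n_bases)) == 1)

  # Widest insertion observed in each slot, across all sequences.
  ins_len <- vapply(parsed, function(p) nchar(p$ins),
                    integer(n_bases[[1]] + 1))
  width <- apply(ins_len, 1, max)

  render <- function(p) {
    slots <- mapply(function(x, w) {
      chars <- if (nchar(x) > 0) strsplit(x, "")[[1]] else character(0)
      # One bracket per inserted base, then dashes up to the slot width.
      paste0(c(sprintf("[%s]", chars), rep("-", w - length(chars))),
             collapse = "")
    }, p$ins, width)
    # Interleave slot 1, base 1, slot 2, base 2, ..., last slot.
    paste0(c(rbind(slots, c(p$bases, ""))), collapse = "")
  }
  vapply(parsed, render, character(1))
}

seqs <- c(ref.sequence = "ATAGCCGCA",
          sequence1    = "AT[G][C][C]AGCCG[T]CA",
          sequence2    = "ATAGCCGC[C][A][C]A",
          sequence3    = "AT[GCC]AGCCGCA")
aligned <- align_insertions(seqs)
print(aligned)
```

On the four example strings from the post, this should reproduce the desired output listed above (for instance "AT---AGCCG-C---A" for ref.sequence). For real data with deletions or substitutions, a proper alignment tool such as those in Bioconductor would likely be a better fit than this string-based sketch.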
