thr3ads.net - R help - [R] Performing basic Multiple Sequence Alignment in R? [Dec 2010]


Tal Galili

2010-Dec-21 09:21 UTC

[R] Performing basic Multiple Sequence Alignment in R?

Hello everyone,

I am not sure if this should go on the general R mailing list (for example,
if there is a text mining solution that might work here) or the bioconductor
mailing list (since I wasn't able to find a solution to my question on
searching their lists) - so this time I tried both, and in the future I'll
know better (in case it should go to only one of the two).


The task I'm trying to achieve is to align several sequences together.
I don't have a basic pattern to match to.  All that I know is that the
"True" pattern should be of length 30 and that the sequences I'm looking
at have had missing values introduced to them at random points.
Here is an example of such sequences, where on the left we see the
real location of the missing values, and on the right we see the sequence
that we will be able to observe.  My goal is to reconstruct the left column
using only the sequences I've got in the right column (based on the fact
that many of the letters in each position are the same).

                     Real_sequence           The_sequence_we_see
1   CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
2   CGCAATACTAGC-AGGTGACTTCC-CT-CG   CGCAATACTAGCAGGTGACTTCCCTCG
3   CGCAATGATCAC--GGTGGCTCCCGGTGCG  CGCAATGATCACGGTGGCTCCCGGTGCG
4   CGCAATACTAACCA-CTAACT--CGCTGCG   CGCAATACTAACCACTAACTCGCTGCG
5   CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
6   CGCTATACTAACAA-GTG-CTTAGGC-CTG   CGCTATACTAACAAGTGCTTAGGCCTG
7   CCCA-C-CTAA-ACGGTGACTTACGCTCCG   CCCACCTAAACGGTGACTTACGCTCCG
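
The relationship between the two columns can be checked directly: dropping every "-" from a left-hand entry should give the corresponding right-hand entry. A minimal base-R sketch, using three rows copied from the table (the variable names are illustrative only):

```r
# Three "real" sequences copied from the table above (each 30 characters)
real <- c("CGCAATACTAAC-AGCTGACTTACGCACCG",
          "CGCAATACTAGC-AGGTGACTTCC-CT-CG",
          "CGCAATGATCAC--GGTGGCTCCCGGTGCG")

# Removing every "-" gives the sequence we actually observe
observed <- gsub("-", "", real, fixed = TRUE)

# Number of missing values per sequence
n_missing <- nchar(real) - nchar(observed)

data.frame(Real_sequence = real, The_sequence_we_see = observed, n_missing)
```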


Here is example code to reproduce the above example:

ATCG <- c("A", "T", "C", "G")
set.seed(40)

# The "true" pattern: 30 letters drawn at random
original.seq <- sample(ATCG, 30, replace = TRUE)

# 200 copies of the true pattern, one per row
seqS <- matrix(original.seq, 200, 30, byrow = TRUE)

# Overwrite a random number (1 to number.of.changes) of positions in x
# with letters drawn from letters.to.change.with
change.letters <- function(x, number.of.changes = 15,
                           letters.to.change.with = ATCG)
{
    number.of.changes <- sample(seq_len(number.of.changes), 1)

    new.letters <- sample(letters.to.change.with, number.of.changes,
                          replace = TRUE)

    where.to.change.the.letters <- sample(seq_along(x), number.of.changes,
                                          replace = FALSE)

    x[where.to.change.the.letters] <- new.letters

    return(x)
}

change.letters(original.seq)

# Mark 1 to 3 random positions as missing ("-")
insert.missing.values <- function(x) change.letters(x, 3, "-")

insert.missing.values(original.seq)

seqS2 <- t(apply(seqS, 1, change.letters))

seqS3 <- t(apply(seqS2, 1, insert.missing.values))

seqS4 <- apply(seqS3, 1, function(x) paste(x, collapse = ""))

library(stringr)

# Remove every "-" (str_replace() would remove only the first one)
all.seqS <- str_replace_all(seqS4, "-", "")

# how do we align this?

data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)


I understand that if all I had was a string and a pattern I would be able to
use

library(Biostrings)

pairwiseAlignment(...)



But in the case I present we are dealing with many sequences to align to one
another (instead of aligning them to one pattern).

Is there a known method for doing this in R?


Thanks,

Tal




Mike Marchywka

2010-Dec-21 12:44 UTC


I don't have an answer; I'm trying to solicit more input with additional
questions.



> From: tal.galili at gmail.com
> Date: Tue, 21 Dec 2010 11:21:03 +0200
> To: r-help at r-project.org; bioconductor at r-project.org
> Subject: [R] Performing basic Multiple Sequence Alignment in R?
>
> Hello everyone,
>
> I am not sure if this should go on the general R mailing list (for example,
> if there is a text mining solution that might work here) or the
> bioconductor mailing list (since I wasn't able to find a solution to my
> question on searching their lists) - so this time I tried both, and in the
> future I'll
> know better (in case it should go to only one of the two).
>
I take it you don't want an R interface for Clustal. I seem
to recall, from doing this a few years ago, that alignment by
exact string matching was a bit of a research area (I think you
can find papers on CiteSeer, for example). It does seem you are asking
about exact string matches for alignment markers (your left sequences
appear exactly someplace on the right), but your overall interests
are not really clear. I never got my code fully working, but I was
happy that I could do different strains of E. coli (or something in
the 5-10 Mbp genome range) very quickly (seconds, as I recall), and
you could presumably also find similar items that had
moved a long way.

Earlier, someone came here with a task and was pointed to bio packages.
I thought there might be something in computational linguistics or mining
better suited to their needs, but no one ever volunteered anything.

>
> The task I'm trying to achieve is to align several sequences together.
> I don't have a basic pattern to match to. All that I know is that the
> "True" pattern should be of length "30" and that the
> sequences I'm looking
> at, have had missing values introduced to them at random points.

Alternatively, I guess someone could make an R interface for the various
BLAST programs. Sometimes the help desk at NCBI can get questions like this
to the right person internally.

> Here is an example of such sequences, were on the left we see what is the
> real location of the missing values, and on the right we see the sequence
> that we will be able to observe. My goal is to reconstruct the left column
> using only the sequences I've got on the right column (based on the
> fact that many of the letters in each position are the same)
>
> Real_sequence The_sequence_we_see
> 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
> 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG
> 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG
> 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG
> 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
> 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG
> 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG
>

David Winsemius

2010-Dec-21 13:06 UTC


Tal: I'm trimming the BioC posting. On the R lists, cross-posting is
considered spamming. (Please re-read the Posting Guide.)


On Dec 21, 2010, at 4:21 AM, Tal Galili wrote:
> Hello everyone,
>
> I am not sure if this should go on the general R mailing list (for  
> example,
> if there is a text mining solution that might work here) or the  
> bioconductor
> mailing list (since I wasn't able to find a solution to my question on
> searching their lists) - so this time I tried both, and in the  
> future I'll
> know better (in case it should go to only one of the two).
>
>
> The task I'm trying to achieve is to align several sequences together.
> I don't have a basic pattern to match to.  All that I know is that the
> "True" pattern should be of length "30" and that the
> sequences I'm
> looking
> at, have had missing values introduced to them at random points.
> Here is an example of such sequences, were on the left we see what  
> is the
> real location of the missing values, and on the right we see the  
> sequence
> that we will be able to observe.  My goal is to reconstruct the left  
> column
> using only the sequences I've got on the right column (based on the  
> fact
> that many of the letters in each position are the same)
>
>                     Real_sequence           The_sequence_we_see
> 1   CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
> 2   CGCAATACTAGC-AGGTGACTTCC-CT-CG   CGCAATACTAGCAGGTGACTTCCCTCG
> 3   CGCAATGATCAC--GGTGGCTCCCGGTGCG  CGCAATGATCACGGTGGCTCCCGGTGCG
> 4   CGCAATACTAACCA-CTAACT--CGCTGCG   CGCAATACTAACCACTAACTCGCTGCG
> 5   CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
> 6   CGCTATACTAACAA-GTG-CTTAGGC-CTG   CGCTATACTAACAAGTGCTTAGGCCTG
> 7   CCCA-C-CTAA-ACGGTGACTTACGCTCCG   CCCACCTAAACGGTGACTTACGCTCCG
>
The agrep function allows one to specify which sorts of differences to
consider in calculating a Levenshtein edit distance. Insertions are
one possible distance component. You could take a look at its code (in
C in the sources) and perhaps rejigger it to report the locations of
the deletions.

 > agrep(seqdat$The_sequence_we_see[1], seqdat$Real_sequence,  
         max.distance=list(deletions=0, substitutions=0, insertions=0))
integer(0)
 > agrep(seqdat$The_sequence_we_see[1], seqdat$Real_sequence,
         max.distance=list(deletions=0, substitutions=0, insertions=1))
[1] 1
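
A related base-R route, not mentioned in the thread and assuming an R version that provides utils::adist(): called with counts = TRUE, adist() also returns the edit transcript in a "trafos" attribute, a string made of M (match), I (insertion), D (deletion) and S (substitution). The sketch below uses that transcript to put "-" back into a shortened sequence relative to a reference. gap_against is a hypothetical helper name and the toy strings are made up for illustration.

```r
# Hypothetical helper: place "-" in `x` wherever `ref` has a character that
# `x` lacks, following the edit transcript that turns `x` into `ref`.
gap_against <- function(x, ref) {
  d <- adist(x, ref, counts = TRUE)
  trafo <- strsplit(attr(d, "trafos")[1], "")[[1]]
  chars <- strsplit(x, "")[[1]]
  out <- character(length(trafo))
  i <- 0
  for (k in seq_along(trafo)) {
    if (trafo[k] == "I") {
      out[k] <- "-"        # character present only in ref: gap in x
    } else {
      i <- i + 1           # M, S and D each consume one character of x
      out[k] <- chars[i]
    }
  }
  paste(out, collapse = "")
}

gapped <- gap_against("ACT", "ACGT")
print(gapped)
```

This aligns one sequence against one reference, so it is only a pairwise building block and not a multiple alignment. When several transcripts have the same cost, which one adist() reports should not be relied upon.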

-- 
David.
David Winsemius, MD
West Hartford, CT

aishualex

2012-Apr-10 21:19 UTC


Hello everyone,
I would like help building a user-defined function in R for multiple sequence
alignment.
I tried using pairwiseAlignment() on the multiple sequences, but the
results seem to be wrong.
I have 10 sequences which I have to align (by building profiles).
Is there a function to build a position frequency matrix?
Thanks,
alex
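
One way to tabulate a position frequency matrix in base R, assuming the sequences are already aligned to a common length with "-" marking gaps. position_frequency_matrix is a hypothetical helper name, and the three input rows are copied from the first message in this thread. Biostrings also provides consensusMatrix() for a similar purpose.

```r
# Hypothetical helper: count how often each letter occurs in each column of
# a set of already aligned, equal-length sequences.
position_frequency_matrix <- function(aligned,
                                      alphabet = c("A", "C", "G", "T", "-")) {
  stopifnot(length(unique(nchar(aligned))) == 1)
  cols <- do.call(rbind, strsplit(aligned, ""))   # one row per sequence
  pfm <- vapply(seq_len(ncol(cols)),
                function(j) tabulate(match(cols[, j], alphabet),
                                     nbins = length(alphabet)),
                integer(length(alphabet)))
  dimnames(pfm) <- list(alphabet, seq_len(ncol(cols)))
  pfm
}

aligned <- c("CGCAATACTAAC-AGCTGACTTACGCACCG",
             "CGCAATACTAGC-AGGTGACTTCC-CT-CG",
             "CGCAATGATCAC--GGTGGCTCCCGGTGCG")

pfm <- position_frequency_matrix(aligned)

# Most frequent symbol per column (ties go to the first symbol in `alphabet`)
consensus <- paste(rownames(pfm)[apply(pfm, 2, which.max)], collapse = "")
```

Rows of pfm are symbols and columns are alignment positions, so each column sums to the number of sequences.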

