Hello everyone, I am not sure if this should go on the general R mailing list (for example, if there is a text mining solution that might work here) or the bioconductor mailing list (since I wasn't able to find a solution to my question on searching their lists) - so this time I tried both, and in the future I'll know better (in case it should go to only one of the two). The task I'm trying to achieve is to align several sequences together. I don't have a basic pattern to match to. All that I know is that the "True" pattern should be of length "30" and that the sequences I'm looking at, have had missing values introduced to them at random points. Here is an example of such sequences, were on the left we see what is the real location of the missing values, and on the right we see the sequence that we will be able to observe. My goal is to reconstruct the left column using only the sequences I've got on the right column (based on the fact that many of the letters in each position are the same) Real_sequence The_sequence_we_see 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG Here is an example code to reproduce the above example: ATCG <- c("A","T","C","G") set.seed(40) original.seq <- sample(ATCG, 30, T) seqS <- matrix(original.seq,200,30, T) change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG) { number.of.changes <- sample(seq_len(number.of.changes), 1) new.letters <- sample(letters.to.change.with , number.of.changes, T) where.to.change.the.letters <- sample(seq_along(x) , number.of.changes, F) x[where.to.change.the.letters] <- new.letters return(x) } change.letters(original.seq) insert.missing.values <- function(x) change.letters(x, 3, "-") insert.missing.values(original.seq) seqS2 <- t(apply(seqS, 1, change.letters)) seqS3 <- t(apply(seqS2, 1, insert.missing.values)) seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")}) require(stringr) # library(help=stringr) all.seqS <- str_replace(seqS4,"-" , "") # how do we allign this? data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS) I understand that if all I had was a string and a pattern I would be able to use library(Biostrings) pairwiseAlignment(...) But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern). Is there a known method for doing this in R? Thanks, Tal ----------------Contact Details:------------------------------------------------------- Contact me: Tal.Galili@gmail.com | 972-52-7275845 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English) ---------------------------------------------------------------------------------------------- [[alternative HTML version deleted]]
Mike Marchywka
2010-Dec-21 12:44 UTC
[R] Performing basic Multiple Sequence Alignment in R?
I don't have an answer, trying to solicit more input with additional questions.> From: tal.galili at gmail.com > Date: Tue, 21 Dec 2010 11:21:03 +0200 > To: r-help at r-project.org; bioconductor at r-project.org > Subject: [R] Performing basic Multiple Sequence Alignment in R? > > Hello everyone, > > I am not sure if this should go on the general R mailing list (for example, > if there is a text mining solution that might work here) or the bioconductor > mailing list (since I wasn't able to find a solution to my question on > searching their lists) - so this time I tried both, and in the future I'll > know better (in case it should go to only one of the two). >I take it you don't want an R interface for clustal and I seem to recall, from doing this a few years ago, that alignment by exact string matching was a bit of a research area ( I think you can find papers on citeseer for example). It does seem you are asking about exact string matches for alignment markers- your left sequences appear exactly someplace on the right- but your overall interests are not real clear. I never got my code fully working but I was happy that I could do different strains of e coli ( or something in the 5-10 Mbp genome range ) very quickly ( seconds as I recall ) and you could also presumably find similar items that had moved a long way. Earlier someone came here with a task and was pointed to bio packages but I thought there may be something in computational linguistics or mining better suited to needs but no one ever volunteered anything.> > The task I'm trying to achieve is to align several sequences together. > I don't have a basic pattern to match to. All that I know is that the > "True" pattern should be of length "30" and that the sequences I'm looking > at, have had missing values introduced to them at random points.Alternatively I guess someone could make an R interface for various BLAST's, sometimes the help desk at NCBI can get questions like this to the right person internally.> Here is an example of such sequences, were on the left we see what is the > real location of the missing values, and on the right we see the sequence > that we will be able to observe. My goal is to reconstruct the left column > using only the sequences I've got on the right column (based on the fact > that many of the letters in each position are the same) > > Real_sequence The_sequence_we_see > 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG > 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG > 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG > 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG > 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG > 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG > 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG >
David Winsemius
2010-Dec-21 13:06 UTC
[R] Performing basic Multiple Sequence Alignment in R?
Tal; I'm trimming the BioC posting. In the R lists it is considered spamming to cross post. (Please re-read the Posting Guide.) On Dec 21, 2010, at 4:21 AM, Tal Galili wrote:> Hello everyone, > > I am not sure if this should go on the general R mailing list (for > example, > if there is a text mining solution that might work here) or the > bioconductor > mailing list (since I wasn't able to find a solution to my question on > searching their lists) - so this time I tried both, and in the > future I'll > know better (in case it should go to only one of the two). > > > The task I'm trying to achieve is to align several sequences together. > I don't have a basic pattern to match to. All that I know is that the > "True" pattern should be of length "30" and that the sequences I'm > looking > at, have had missing values introduced to them at random points. > Here is an example of such sequences, were on the left we see what > is the > real location of the missing values, and on the right we see the > sequence > that we will be able to observe. My goal is to reconstruct the left > column > using only the sequences I've got on the right column (based on the > fact > that many of the letters in each position are the same) > > Real_sequence The_sequence_we_see > 1 CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG > 2 CGCAATACTAGC-AGGTGACTTCC-CT-CG CGCAATACTAGCAGGTGACTTCCCTCG > 3 CGCAATGATCAC--GGTGGCTCCCGGTGCG CGCAATGATCACGGTGGCTCCCGGTGCG > 4 CGCAATACTAACCA-CTAACT--CGCTGCG CGCAATACTAACCACTAACTCGCTGCG > 5 CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG > 6 CGCTATACTAACAA-GTG-CTTAGGC-CTG CGCTATACTAACAAGTGCTTAGGCCTG > 7 CCCA-C-CTAA-ACGGTGACTTACGCTCCG CCCACCTAAACGGTGACTTACGCTCCG >The agrep function allows one to specify which sort of differences to consider in calculating a Levenshtein edit distance. Insertions are one possible distance component. You could take a look at its code (in C in hte sources) and perhaps rejigger it to spit out the location of the deletions. > agrep(seqdat$The_sequence_we_see[1], seqdat$Real_sequence, max.distance=list(deletions=0, substitutions=0, insertions=0)) integer(0) > agrep(seqdat$The_sequence_we_see[1], seqdat$Real_sequence, max.distance=list(deletions=0, substitutions=0, insertions=1)) [1] 1 -- David.> > Here is an example code to reproduce the above example: > > ATCG <- c("A","T","C","G") > set.seed(40) > > original.seq <- sample(ATCG, 30, T) > > seqS <- matrix(original.seq,200,30, T) > > change.letters <- function(x, number.of.changes = 15, > letters.to.change.with = ATCG) > { > > number.of.changes <- sample(seq_len(number.of.changes), 1) > > new.letters <- sample(letters.to.change.with , number.of.changes, > T) > > where.to.change.the.letters <- sample(seq_along(x) , > number.of.changes, F) > > x[where.to.change.the.letters] <- new.letters > > return(x) > } > > change.letters(original.seq) > > insert.missing.values <- function(x) change.letters(x, 3, "-") > > insert.missing.values(original.seq) > > seqS2 <- t(apply(seqS, 1, change.letters)) > > seqS3 <- t(apply(seqS2, 1, insert.missing.values)) > > seqS4 <- apply(seqS3,1, function(x) {paste(x, collapse = "")}) > require(stringr) > # library(help=stringr) > > all.seqS <- str_replace(seqS4,"-" , "") > > # how do we allign this? > > data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS) > > > I understand that if all I had was a string and a pattern I would be > able to > use > > library(Biostrings) > > pairwiseAlignment(...) > > > > But in the case I present we are dealing with many sequences to > align to one > another (instead of aligning them to one pattern). > > Is there a known method for doing this in R? > > > Thanks, > > Tal > > > > ----------------Contact > Details:------------------------------------------------------- > Contact me: Tal.Galili at gmail.com | 972-52-7275845 > Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il > (Hebrew) | > www.r-statistics.com (English) > ---------------------------------------------------------------------------------------------- > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.David Winsemius, MD West Hartford, CT
hello everyone, i wanted help to build a* user defined function in R for multiple sequence alignment*. i tried using pairwisealignment () for the multiple sequences, but the results seem to be wrong. i have 10 sequences which i have to align ( by building profiles). is there a function to build a *postion frequency matrix*? thanks, alex -- View this message in context: http://r.789695.n4.nabble.com/Performing-basic-Multiple-Sequence-Alignment-in-R-tp3137526p4547081.html Sent from the R help mailing list archive at Nabble.com.