G FANG
2010-Jun-24  01:55 UTC
[R] how to group a large list of strings into categories based on string similarity?
Hi,
I want to group a large list (20 million) of strings into categories
based on string similarity.
The specific problem is: given a list of DNA sequences such as the following
ACTCCCGCCGTTCGCGCGCAGCATGATCCTG
ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN
CAGGATCATGCTGCGCGCGAACGGCGGGAGT
CAGGATCATGCTGCGCGCGAANNNNNNNNNN
CAGGATCATGCTGCGCGCGNNNNNNNNNNNN
......
.....
NNNNNNNCCGTTCGCGCGCAGCATGATCCTG
NNNNNNNNNNNNCGCGCGCAGCATGATCCTG
NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT
NNNNNNNNNNNNNNCGCGCAGCATGATCCTG
NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT
NNNNNNNNNNTTCGCGCGCAGCATGATCCTG
'N' is the missing letter
It can be seen that some strings are the same except for those N's
(i.e. N can match with any base)
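To make the matching rule concrete, here is a minimal base-R sketch of the pairwise test it implies: two equal-length sequences agree if, at every position, the letters are the same or at least one of them is N. The helper name `compatible` is made up for illustration and is not part of any package.

```r
# Hypothetical helper: TRUE when two equal-length sequences agree at every
# position where neither of them has an 'N'.
compatible <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  ca <- strsplit(a, "", fixed = TRUE)[[1]]
  cb <- strsplit(b, "", fixed = TRUE)[[1]]
  all(ca == cb | ca == "N" | cb == "N")
}

# Same sequence, second copy masked by N's on the right
same <- compatible("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
                   "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN")

# Two different sequences with no N's
diff <- compatible("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
                   "CAGGATCATGCTGCGCGCGAACGGCGGGAGT")
```

Here `same` is TRUE and `diff` is FALSE. Comparing all pairs this way would not scale to 20 million strings; the sketch only pins down what "similar" means.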
Given this list of strings, I want to have
1) a vector with one entry per row (string), assigning each string
an id such that similar strings (those that differ only at N's) have the
same id
2) a mapping list from unique strings ('unique' in terms of
the similarity defined above) to the ids
I am a MATLAB user shifting to R. Please advise on efficient ways to do this.
Thanks!
Gang
Martin Morgan
2010-Jun-24  02:46 UTC
[R] how to group a large list of strings into categories based on string similarity?
On 06/23/2010 06:55 PM, G FANG wrote:
> I want to group a large list (20 million) of strings into categories
> based on string similarity?
> [...]
> 'N' is the missing letter
>
> It can be seen that some strings are the same except for those N's
> (i.e. N can match with any base)
> [...]

The Bioconductor Biostrings package has many tools for this sort of
operation. See

  http://bioconductor.org/packages/release/Software.html

Maybe a one-time install

  source('http://bioconductor.org/biocLite.R')
  biocLite('Biostrings')

then

  library(Biostrings)
  x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
         "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
         "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
         "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
         "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
         "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
         "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")
  names(x) <- seq_along(x)
  dna <- DNAStringSet(x)
  ## trim flanking N's repeatedly until no width changes
  while (!all(width(dna) ==
              width(dna <- trimLRPatterns("N", "N", dna)))) {}
  ## one label per input; identical trimmed sequences share a label
  names(dna)[rank(dna)]

although there might be a faster way (e.g., match 8, 4, 2, 1 N's).
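As a rough base-R alternative to the trimming loop above (not the Biostrings method itself, and untested at the 20-million scale), a regular expression can strip all leading and trailing N's in a single pass, after which `match()` against the unique trimmed strings gives an integer id per input. The names `trimmed`, `uniq`, `id`, and `mapping` are made up for this sketch.

```r
x <- c("ACTCCCGCCGTTCGCGCGCAGCATGATCCTG",
       "ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN",
       "CAGGATCATGCTGCGCGCGAACGGCGGGAGT",
       "CAGGATCATGCTGCGCGCGAANNNNNNNNNN",
       "NCAGGATCATGCTGCGCGCGAANNNNNNNNN",
       "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN",
       "NNNCAGGATCATGCTGCGCGCGAANNNNNNN")

# Strip every leading and trailing N in one pass
trimmed <- gsub("^N+|N+$", "", x)

# Unique trimmed sequences, in order of first appearance
uniq <- unique(trimmed)

# 1) one integer id per input string; equal trimmed strings share an id
id <- match(trimmed, uniq)

# 2) mapping from each unique trimmed sequence to its id
mapping <- setNames(seq_along(uniq), uniq)

id
# 1 2 3 4 4 5 4
```

Like the loop, this only groups strings that are identical once the flanking N's are removed, so a sequence and a shorter N-masked version of it (the first two entries of `x`, for example) still receive different ids.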
Also, your sequences likely come from a fasta file
(Biostrings::readFASTA), a text file with a column of sequences
(ShortRead::readXStringColumns), or alignment software
(ShortRead::readAligned / ShortRead::readFastq). If you go this route
you'll want to address questions to the Bioconductor mailing list

  http://bioconductor.org/docs/mailList.html

Martin

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793