2010 Jun 24
How to group a large list of strings into categories based on string similarity?
Hi,
I want to group a large list (20 million) of strings into categories
based on string similarity.
The specific problem is: given a list of DNA sequences such as the ones below
ACTCCCGCCGTTCGCGCGCAGCATGATCCTG
ACTCCCGCCGTTCGCGCGCNNNNNNNNNNNN
CAGGATCATGCTGCGCGCGAACGGCGGGAGT
CAGGATCATGCTGCGCGCGAANNNNNNNNNN
CAGGATCATGCTGCGCGCGNNNNNNNNNNNN
......
.....
NNNNNNNCCGTTCGCGCGCAGCATGATCCTG
NNNNNNNNNNNNCGCGCGCAGCATGATCCTG
NNNNNNNNNNNNGCGCGCGAACGGCGGGAGT
NNNNNNNNNNNNNNCGCGCAGCATGATCCTG
NNNNNNNNNNNTGCGCGCGAACGGCGGGAGT
NNNNNNNNNNTTCGCGCGCAGCATGATCCTG
'N' stands for a missing letter.
It can be seen that some strings are the same except for t...
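
To make the 'N' convention concrete, here is a minimal sketch of a pairwise check that treats 'N' as matching any letter. The function name `compatible` and its `missing` parameter are hypothetical names chosen for illustration, and the equal-length requirement is an assumption based on the sample sequences above. It only compares two strings and is not a grouping method for the full list.

```python
def compatible(a: str, b: str, missing: str = "N") -> bool:
    """Return True if a and b have equal length and agree at every
    position where neither string has the missing-letter marker."""
    if len(a) != len(b):
        return False
    return all(x == y or x == missing or y == missing for x, y in zip(a, b))


# Sequences taken from the list above.
full = "CAGGATCATGCTGCGCGCGAACGGCGGGAGT"
masked = "CAGGATCATGCTGCGCGCGNNNNNNNNNNNN"
other = "ACTCCCGCCGTTCGCGCGCAGCATGATCCTG"

print(compatible(full, masked))  # True: they agree wherever masked has a letter
print(compatible(full, other))   # False: the first letters already differ
```

Note that this relation is not transitive: two masked strings can each be compatible with a third without sharing a single known letter with each other, so turning it into categories still needs a separate decision about how groups are formed.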