Dr. Thomas Isenbarger
2004-Dec-08 21:12 UTC
[R] similarity matrix conversion to dissimilarity
I have a matrix of similarity scores that I want to convert into a matrix of dissimilarity scores so that I can apply some clustering methods to the data. That is, high values in my matrix signify similarity and low values (zero being the lowest) signify no similarity. What functions/options in R or its packages are available for making this kind of transformation of a matrix?

Specifically, I am a molecular biologist. I have a set of 700+ nucleotide sequences I want to group into clusters based on sequence similarities. There is a wide range of sequences in the set, some of which are homologous to other sequences in the set. I want to use clustering to identify these groups.

If the sequences were related and could be trimmed to the same length, I would do an alignment and then use phylip (or some other distance method) to create a distance matrix, but since my sequences are unrelated and cannot be trimmed to the same length, I am at a loss for what to do.

For a set with so many unrelated sequences of different lengths, the only thing I have been able to do is an all-against-all BLAST to create the matrix, but this gives high scores for similarities, not high scores for dissimilarities. The only thought I had was to use the reciprocal of the BLAST score as some perverse measure of distance.

I am not subscribed to the list, so can I ask for responses directly to my email address?

Thank you,
Tom Isenbarger
--
isen at plantpath.wisc.edu
thomas a isenbarger
(608) 265-0850
Dear Sir:

I posed a similar question a few months back and received many responses. Check the searchable archives at CRAN for those helpful emails. I did a search for 'similarity matrix' and many results were returned.

Harold

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[replying to your personal address as well as the list; but I think you should subscribe to the list since this topic may well be pursued further]

Clearly any function which "inverts" the measure of "similarity" (i.e. decreases as "similarity" increases) could be used as a measure of dissimilarity in general. Indeed you imply as much yourself. There is quite a wide choice ... "reciprocal" could be one.
However, reading between your lines, it seems that you do not have a substantive interpretation for "dissimilarity". Yet apparently you have one for "similarity". Otherwise, on what basis do you claim that your similarity matrix expresses *substantive* similarity?

But, if you can attach an interpretation (in some substantive terms) to your measure of similarity, can you not then negate the propositions that this expresses and obtain a measure of dissimilarity? In that case, the function could be programmed in R (though it may not be a function of your "similarity", and you would need to derive it from the data). If not, why not?

Or, if your measure of "similarity" in fact does not carry a substantive interpretation, then one could assert that any decreasing function of "similarity" could be used, and would be as meaningful as your measure of "similarity". Again, this can be programmed in R.

Again reading between your lines, it could be inferred that in the situation you describe ("unrelated sequences" which "cannot be trimmed to the same length"), while you can derive a measure of similarity which matches established concepts for similarity in your field, you cannot match the concepts for dissimilarity. If that is the case, R cannot help you with the conceptual problem.

This may not appear helpful, but it is a sincere attempt to clarify the issues.

Best wishes,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861 [NB: New number!]
Date: 08-Dec-04 Time: 23:10:55
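
For readers who want to try the "any decreasing function" idea discussed above, here is a minimal sketch in base R. It is only an illustration under stated assumptions, not a method proposed by anyone in this thread: the toy matrix `sim`, the helper name `to_dissim`, the choice of the two transforms, and the decision to symmetrize by averaging are all hypothetical. The reciprocal is written as 1 / (1 + s) because a plain 1 / s would give infinite distances wherever the similarity is zero.

```r
# Toy similarity matrix (made-up scores, NOT real BLAST output).
# Higher = more similar; 0 = no detectable similarity.
ids <- paste0("seq", 1:4)
sim <- matrix(c(100, 80,  0,  5,
                 80, 90, 10,  0,
                  0, 10, 95, 60,
                  5,  0, 60, 85),
              nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

# Hypothetical helper: turn a similarity matrix into a "dist" object
# using one of two decreasing transforms.
to_dissim <- function(sim, method = c("max_minus", "reciprocal")) {
  method <- match.arg(method)
  # Assumption: average the two directions so the matrix is symmetric
  # (all-against-all scores need not be symmetric).
  sim <- (sim + t(sim)) / 2
  d <- switch(method,
              max_minus  = max(sim) - sim,   # linear flip of the scale
              reciprocal = 1 / (1 + sim))    # offset avoids division by zero
  # as.dist() keeps the lower triangle and ignores the diagonal.
  as.dist(d)
}

d  <- to_dissim(sim, "max_minus")
hc <- hclust(d, method = "average")  # average-linkage hierarchical clustering
groups <- cutree(hc, k = 2)          # cut the tree into two clusters
print(groups)
```

Which transform is appropriate depends on what the similarity score means in substantive terms, which is exactly the point raised above; the code only shows that the mechanical step is short once that choice has been made.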