Hi useRs,

(Disclaimer: my question is more statistical than specific to the R system.)

I am using the "tm" package in R to create a Document-Term Matrix with Tf-Idf weights.

A) Once that is done, I create a distance matrix using the "euclidean" distance measure.
B) After this, I use hierarchical clustering with the "ward" method to find an "appropriate" separation in the data.

For A above, what are generally the best practices for distance measures on Tf-Idf? I used the cosine similarity measure, but that creates NaN/Inf values, which have to be converted to zero.

For B above, I used "ward" since the Details section of the documentation alluded to it being the most commonly used method and one that gives better results.

I understand that such a question requires extensive research, since the underlying data (emails in my case) may have a great influence on the results. I have used a part-of-speech tagger to extract nouns and use them as the dictionary of features, in order to weed out trivial words.

Any feedback, links to online knowledge resources, or accounts of your own experience would be greatly appreciated.

Thank you for your time.

Regards,
Harsh Singhal
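
P.S. To make the NaN issue in A concrete, here is a minimal sketch in base R. The matrix values are made up for illustration (they are not from my emails), and I am assuming the NaN values come from all-zero rows, i.e. documents that contain none of the dictionary terms. The sketch drops those rows, scales the rest to unit length, and then clusters. On unit-length rows, squared Euclidean distance equals 2 * (1 - cosine similarity), so the two measures rank document pairs the same way. The method name "ward.D2" is what recent R versions call the Ward variant meant for unsquared Euclidean distances; older versions only offered "ward".

```r
# Toy document-term matrix with made-up tf-idf weights (illustration only).
# Row "d3" is all zeros, e.g. an email with no nouns from the dictionary.
m <- matrix(c(0.5, 0.0, 1.2,
              0.4, 0.1, 1.0,
              0.0, 0.0, 0.0,
              0.0, 2.0, 0.1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("d", 1:4), c("t1", "t2", "t3")))

# Naive cosine similarity: the zero row divides 0 by 0 and yields NaN.
norms <- sqrt(rowSums(m^2))
sim_naive <- (m %*% t(m)) / (norms %o% norms)
any(is.nan(sim_naive))

# Drop empty documents first, then scale each remaining row to unit length.
keep <- norms > 0
u <- m[keep, , drop = FALSE] / norms[keep]

# Cosine distance = 1 - cosine similarity (no NaN once zero rows are gone).
d_cos <- as.dist(1 - u %*% t(u))

# Euclidean distance on the normalised rows; its square is 2 * cosine distance.
d_euc <- dist(u, method = "euclidean")
all.equal(as.vector(d_euc^2), as.vector(2 * d_cos))

# Ward clustering on the Euclidean distances, cut into two groups.
hc <- hclust(d_euc, method = "ward.D2")
groups <- cutree(hc, k = 2)
print(groups)
```

Whether dropping empty documents is preferable to converting the NaN values to zero is part of what I am asking; the sketch only shows that the two steps are not equivalent, since a zero similarity still places the empty document in the tree.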