Jeffrey Salinger
2009-Oct-19 02:56 UTC
[R] Identifying similar but not identical rows in a dataframe
I would like to identify _almost_ duplicated rows in a data frame. For example, I might declare as duplicates any pair of rows that agree in about 80% of their columns.

With tens of thousands of rows and upwards of 20 columns, an iterative approach that tests every pair of rows can be time consuming. duplicated() with its incomparables argument sounds like the ticket, but previous discussion on this list indicates that specifying an incomparable value when calling duplicated() on a data frame is not yet implemented.

Any suggestions about how to implement this efficiently would be appreciated. All data are numerical, and each datum could, for example, be reduced to a byte representation in a string, so a fuzzy matching approach with agrep() might be possible.

Thanks.
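To make the question concrete, here is a minimal sketch of one possible approach. It is not an established solution from this list. The function name near_duplicates() and its arguments are hypothetical, and it assumes "alike" means exact equality of values, with NA counted as a mismatch. The idea is a pigeonhole argument: if two rows may differ in at most k columns, then after dividing the columns into k + 1 disjoint blocks the two rows must be identical on at least one block. Only rows sharing an exact key on some block need to be compared in full.

```r
# Hypothetical helper: find pairs of rows that agree in at least
# `min_frac` of their columns (exact equality, NA counts as a mismatch).
near_duplicates <- function(df, min_frac = 0.8) {
  stopifnot(min_frac > 0, min_frac <= 1)
  m <- as.matrix(df)
  p <- ncol(m)
  need <- ceiling(min_frac * p - 1e-9)  # columns that must agree
  max_diff <- p - need                  # mismatches allowed
  n_blocks <- max_diff + 1L             # pigeonhole: one block matches exactly
  block <- rep_len(seq_len(n_blocks), p)

  # Candidate pairs: rows sharing an identical key on at least one block.
  cand <- lapply(seq_len(n_blocks), function(b) {
    cols <- which(block == b)
    key <- do.call(paste, c(lapply(cols, function(j) m[, j]), sep = "\r"))
    groups <- split(seq_len(nrow(m)), key)
    groups <- groups[lengths(groups) > 1L]
    pairs <- lapply(groups, function(g) t(combn(g, 2)))
    do.call(rbind, unname(pairs))
  })
  cand <- do.call(rbind, cand)
  empty <- data.frame(row1 = integer(0), row2 = integer(0), frac = numeric(0))
  if (is.null(cand)) return(empty)
  cand <- unique(cand)

  # Verify each candidate pair against all columns.
  agree <- rowSums(m[cand[, 1], , drop = FALSE] == m[cand[, 2], , drop = FALSE],
                   na.rm = TRUE)
  keep <- agree >= need
  out <- data.frame(row1 = cand[keep, 1], row2 = cand[keep, 2],
                    frac = agree[keep] / p)
  out <- out[order(out$row1, out$row2), ]
  rownames(out) <- NULL
  out
}

# Small made-up example: 5 rows, 5 columns.
df <- data.frame(
  a = c(1, 1, 9, 1, 9),
  b = c(2, 2, 8, 2, 8),
  c = c(3, 3, 7, 0, 7),
  d = c(4, 4, 6, 0, 0),
  e = c(5, 6, 5, 5, 5)
)
res <- near_duplicates(df, min_frac = 0.8)
print(res)
```

In this toy data, rows 1 and 2 and rows 3 and 5 each agree in 4 of 5 columns, so those two pairs are reported. The sketch avoids comparing all pairs only when block keys are reasonably distinctive. If many rows share the same values on a block, the candidate set grows toward the full quadratic case.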