thr3ads.net - R help - [R] Identifying similar but not identical rows in a dataframe [Oct 2009]

Jeffrey Salinger

2009-Oct-19 02:56 UTC

[R] Identifying similar but not identical rows in a dataframe

I would like to identify _almost_ duplicated rows in a data frame.  For example,
I might declare as duplicates any pair of rows that agree in about 80% of their
columns.  When working with tens of thousands of rows and upwards of 20 columns,
an iterative approach that tests every pair of rows can be time consuming.

duplicated() with the incomparables argument sounds like the ticket.  But
previous discussion in this forum indicates that specifying an
incomparable value when using duplicated() on a data frame is not yet
implemented.

Any suggestions about how to implement this efficiently would be appreciated.  

All data are numerical, and each datum could, for example, be reduced to a byte
representation in a string.  A fuzzy matching approach with agrep() might be
possible.
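
For reference, the pairwise test described above can at least be vectorised
within each row, so that only one R-level loop over rows remains.  The
following is only a minimal sketch, assuming an all-numeric data frame with no
missing values; the function name near_duplicates and its threshold argument
are made up for illustration.  It still examines every pair of rows, so it
does not remove the quadratic cost.

```r
# Hypothetical helper: list pairs of rows that agree in at least
# `threshold` (a fraction) of their columns.
# Assumes df is all numeric; rows with NA never count as a match.
near_duplicates <- function(df, threshold = 0.8) {
  m <- t(as.matrix(df))   # transpose so each original row is a column of m
  n <- ncol(m)
  if (n < 2L) return(NULL)
  out <- vector("list", n - 1L)
  for (i in seq_len(n - 1L)) {
    j <- (i + 1L):n
    # m[, i] is recycled down each column of m[, j], so this compares
    # row i against all later rows in one vectorised step
    frac <- colMeans(m[, j, drop = FALSE] == m[, i])
    keep <- which(frac >= threshold)   # which() drops NA comparisons
    if (length(keep) > 0L) {
      out[[i]] <- data.frame(row1 = i,
                             row2 = j[keep],
                             agreement = unname(frac[keep]),
                             row.names = NULL)
    }
  }
  do.call(rbind, out)     # NULL if no pair reaches the threshold
}

# Toy data: rows 1 and 2 agree in 4 of 5 columns
toy <- data.frame(a = c(1, 1, 9), b = c(2, 2, 8), c = c(3, 3, 7),
                  d = c(4, 4, 6), e = c(5, 0, 5))
near_duplicates(toy, threshold = 0.75)
```

Exact equality (==) is used here; if the values are floating point and should
match only approximately, the comparison would need a tolerance instead.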

Thanks.



