Dear list,

I have a data frame of survey respondents, a little like this:

set.seed(20081215)
n <- 100
dat <- data.frame(id=1:100,
                  addr1=sample(LETTERS, n, replace=TRUE),
                  addr2=sample(LETTERS, n, replace=TRUE),
                  addr3=sample(LETTERS, n, replace=TRUE))
head(dat)

  id addr1 addr2 addr3
1  1     R     H     Q
2  2     H     C     K
3  3     I     P     S
4  4     A     H     L
5  5     P     Q     P

I wish to detect potential duplicates in the data frame. In my example, people can have up to three addresses. If two people have the same address, then there is a chance that the two entries are duplicates (for instance, persons 1, 2, and 4 in the sample data have the same entry "H", so I want to be sure they aren't duplicates). Person 5 has the same address "P" for addr1 and addr3, but this is not a duplicate, since one person may give the same address in several bits of information. I'm only concerned about multiple people sharing the same address.

It's easy to find duplicates within individual columns, but I'm not sure how to do so across columns. Any advice you have would be more than welcome. Thanks!

Regards,

Andrew C. Ward

CAPE Centre
Department of Chemical Engineering
The University of Queensland
Brisbane Qld 4072 Australia
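To make the gap in the question concrete, here is a minimal sketch of the within-column check that the post calls easy. It assumes only the five rows printed in the sample output; `dat5`, `within_col` and `same_col_ids` are illustrative names, not objects from the original post.

```r
# Five rows copied from the sample output printed in the post
dat5 <- data.frame(id    = 1:5,
                   addr1 = c("R", "H", "I", "A", "P"),
                   addr2 = c("H", "C", "P", "H", "Q"),
                   addr3 = c("Q", "K", "S", "L", "P"),
                   stringsAsFactors = FALSE)

# Within-column check: for each address column, the values that
# occur more than once inside that same column
within_col <- lapply(dat5[-1], function(col) unique(col[duplicated(col)]))
within_col

# ids that share "H" when only addr2 is inspected
same_col_ids <- dat5$id[dat5$addr2 == "H"]
same_col_ids
```

On these five rows the per-column check reports "H" only in addr2, for persons 1 and 4. Person 2, who lists "H" in addr1, is missed, which is the across-column case the question is about.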
I think you mean duplicated *rows*, not columns, despite your subject line. See ?duplicated, which has a data.frame method.

On Mon, 15 Dec 2008, Andrew C. Ward wrote:

> It's easy to find duplicates within individual columns, but
> I'm not sure how to do so across columns.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
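For readers following the pointer to ?duplicated, here is a minimal sketch of what its data.frame method reports. The five rows come from the sample output in the original post; the sixth respondent is a hypothetical addition, made up here only so that one full address row repeats.

```r
# Five rows copied from the sample output in the original post
dat5 <- data.frame(id    = 1:5,
                   addr1 = c("R", "H", "I", "A", "P"),
                   addr2 = c("H", "C", "P", "H", "Q"),
                   addr3 = c("Q", "K", "S", "L", "P"),
                   stringsAsFactors = FALSE)

# Hypothetical sixth respondent with the same three addresses as person 2
dat6 <- rbind(dat5,
              data.frame(id = 6, addr1 = "H", addr2 = "C", addr3 = "K",
                         stringsAsFactors = FALSE))

# The data.frame method of duplicated() compares whole rows.
# The id column is dropped first, because ids are unique by construction.
dup_rows <- duplicated(dat6[-1])

# ids whose complete address row repeats an earlier row
dat6$id[dup_rows]
```

This flags only person 6, whose row matches person 2 in all three address columns. Persons 1, 2 and 4, who share just the single address "H", are not flagged, so a whole-row check answers a narrower question than the one posed.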
Andrew,

Is this what you seek?

## every distinct address that appears in any of the address columns
all.addresses <- Reduce( union, dat[-1] )
## for each address, the ids of everyone who lists it in at least one column
who.is.here <- sapply( all.addresses,
                       function(x) dat$id[ rowSums( dat[-1] == x ) != 0 ],
                       simplify=FALSE )

If not, try to give us more detail.

HTH,

Chuck

Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
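As a worked illustration of the suggestion above, the sketch below runs the same two statements on the five rows printed in the original post and then keeps only addresses listed by more than one person. The final filtering step and the names `dat5` and `shared` are additions for illustration and are not part of the posted code.

```r
# Five rows copied from the sample output in the original post
dat5 <- data.frame(id    = 1:5,
                   addr1 = c("R", "H", "I", "A", "P"),
                   addr2 = c("H", "C", "P", "H", "Q"),
                   addr3 = c("Q", "K", "S", "L", "P"),
                   stringsAsFactors = FALSE)

# Every distinct address across the three address columns
all.addresses <- Reduce(union, dat5[-1])

# For each address, the ids that list it in at least one column.
# rowSums(...) != 0 counts a row once even if the address appears in
# two of its columns, so person 5 shows up only once under "P".
who.is.here <- sapply(all.addresses,
                      function(x) dat5$id[rowSums(dat5[-1] == x) != 0],
                      simplify = FALSE)

# Added step (not in the post): keep addresses shared by 2+ people
shared <- who.is.here[sapply(who.is.here, length) > 1]
shared
```

On these five rows the filter leaves "H" (persons 1, 2, 4), "P" (persons 3, 5) and "Q" (persons 1, 5). Each group is a set of candidate duplicates to inspect by hand; sharing an address does not by itself show that two entries are the same person.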