david hilton shanabrook
2010-Jul-13 22:42 UTC
[R] Checking for duplicate rows in data frame efficiently
I wrote something to check for duplicate rows in a data frame, but it is too inefficient. Is there a way to do this without the nested loops?

This code correctly indicates rows 1-7, 1-8, 2-9 and 7-8 are duplicates.

> m <- matrix(c(1,1,1,1,1, 2,2,2,2,2, 6,6,6,6,6, 3,3,3,3,3, 4,4,4,4,4, 5,5,5,5,5, 1,1,1,1,1, 1,1,1,1,1, 2,2,2,2,2, 7,7,7,7,7), ncol=5, byrow=TRUE)
> df <- data.frame(m)
> df
   X1 X2 X3 X4 X5
1   1  1  1  1  1
2   2  2  2  2  2
3   6  6  6  6  6
4   3  3  3  3  3
5   4  4  4  4  4
6   5  5  5  5  5
7   1  1  1  1  1
8   1  1  1  1  1
9   2  2  2  2  2
10  7  7  7  7  7
>
> compareTwoRows <- function(row1, row2){
+   numCol <- 5
+   logicalRow <- row1==row2
+   duplicate <- sum(logicalRow)==numCol
+   return(as.numeric(duplicate))}
>
> same <- matrix(0, byrow=TRUE, ncol=10,nrow=10)
>
> for (j in 1:9)
+   for (k in (j+1):10)
+     same[j,k] <- compareTwoRows(df[j,],df[k,])
>
> same
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0    0    0    0    0    0    1    1    0     0
 [2,]    0    0    0    0    0    0    0    0    1     0
 [3,]    0    0    0    0    0    0    0    0    0     0
 [4,]    0    0    0    0    0    0    0    0    0     0
 [5,]    0    0    0    0    0    0    0    0    0     0
 [6,]    0    0    0    0    0    0    0    0    0     0
 [7,]    0    0    0    0    0    0    0    1    0     0
 [8,]    0    0    0    0    0    0    0    0    0     0
 [9,]    0    0    0    0    0    0    0    0    0     0
[10,]    0    0    0    0    0    0    0    0    0     0
Henrique Dallazuanna
2010-Jul-14 00:18 UTC
[R] Checking for duplicate rows in data frame efficiently
See ?duplicated

-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
T.D. Rudolph
2010-Jul-14 00:44 UTC
[R] Checking for duplicate rows in data frame efficiently
Henrique is correct: duplicated(df) returns a logical vector with one TRUE or FALSE per row. TRUE marks a row that repeats an earlier row; the first occurrence of each row is FALSE.

df[duplicated(df),]   # shows the repeated rows (second and later occurrences)
df[!duplicated(df),]  # shows the unique rows (first occurrence of each)

Note the ! (logical negation) in the second line. A minus sign would not work there: -duplicated(df) turns the logical vector into 0 and -1 values, so the subset would only drop row 1.
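
The original question also asked which rows match which (1-7, 1-8, 2-9, 7-8), and duplicated() alone does not report that. The following is a minimal sketch of one loop-free way to recover the pairs, using the data from the original post. The names dup, in_dup_group, row_key, groups and pairs are made up for illustration, and the key-pasting step assumes the separator "\r" never occurs in the data and that the values survive conversion to character without colliding (it may not be safe for floating-point columns that differ only in the last digits).

```r
# Same example data as in the original post.
m <- matrix(c(1,1,1,1,1, 2,2,2,2,2, 6,6,6,6,6, 3,3,3,3,3, 4,4,4,4,4,
              5,5,5,5,5, 1,1,1,1,1, 1,1,1,1,1, 2,2,2,2,2, 7,7,7,7,7),
            ncol = 5, byrow = TRUE)
df <- data.frame(m)

# TRUE for every row that repeats an earlier row.
dup <- duplicated(df)
which(dup)

# TRUE for every row that has a copy anywhere, first occurrences included.
in_dup_group <- dup | duplicated(df, fromLast = TRUE)
which(in_dup_group)

# Build one key string per row (assumption: "\r" is not in the data),
# then group the row numbers that share a key.
row_key <- do.call(paste, c(df, sep = "\r"))
groups <- split(seq_len(nrow(df)), row_key)
groups <- unname(groups[lengths(groups) > 1])   # keep only real duplicates

# Expand each group of identical rows into all of its row pairs.
pairs <- do.call(rbind, lapply(groups, function(idx) t(combn(idx, 2))))
pairs
```

Here which(dup) gives rows 7, 8 and 9, and which(in_dup_group) gives rows 1, 2, 7, 8 and 9. Each row of pairs holds the two row numbers of one matching pair, which covers the same four pairs as the nonzero entries of the same matrix in the original post. Grouping by key costs one pass over the rows, instead of comparing every row against every other row.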