thr3ads.net - R help - [R] Checking for duplicate rows in data frame efficiently [Jul 2010]

david hilton shanabrook

2010-Jul-13 22:42 UTC

[R] Checking for duplicate rows in data frame efficiently

I wrote something to check for duplicate rows in a data frame, but it is too
inefficient.  Is there a way to do this without the nested loops?

This code correctly flags the row pairs 1-7, 1-8, 2-9 and 7-8 as duplicates:
> m <- matrix(c(1,1,1,1,1, 2,2,2,2,2, 6,6,6,6,6, 3,3,3,3,3, 4,4,4,4,4,
+ 5,5,5,5,5, 1,1,1,1,1, 1,1,1,1,1, 2,2,2,2,2, 7,7,7,7,7), ncol=5, byrow=TRUE)
> df <- data.frame(m)
> df
   X1 X2 X3 X4 X5
1   1  1  1  1  1
2   2  2  2  2  2
3   6  6  6  6  6
4   3  3  3  3  3
5   4  4  4  4  4
6   5  5  5  5  5
7   1  1  1  1  1
8   1  1  1  1  1
9   2  2  2  2  2
10  7  7  7  7  7
>
> compareTwoRows <- function(row1, row2){
+ 	numCol <- 5
+ 	logicalRow <- row1==row2
+ 	duplicate <- sum(logicalRow)==numCol
+ 	return(as.numeric(duplicate))}
>
> same <- matrix(0, byrow=TRUE, ncol=10,nrow=10)
>
> for (j in 1:9)
+ 	for (k in (j+1):10)
+ 		same[j,k] <- compareTwoRows(df[j,],df[k,])
>
> same
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    0    0    0    0    0    0    1    1    0     0
 [2,]    0    0    0    0    0    0    0    0    1     0
 [3,]    0    0    0    0    0    0    0    0    0     0
 [4,]    0    0    0    0    0    0    0    0    0     0
 [5,]    0    0    0    0    0    0    0    0    0     0
 [6,]    0    0    0    0    0    0    0    0    0     0
 [7,]    0    0    0    0    0    0    0    1    0     0
 [8,]    0    0    0    0    0    0    0    0    0     0
 [9,]    0    0    0    0    0    0    0    0    0     0
[10,]    0    0    0    0    0    0    0    0    0     0

Henrique Dallazuanna

2010-Jul-14 00:18 UTC

See ?duplicated
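
For readers who want to see what that help page leads to, here is a minimal sketch using the data from the original post. It uses only base R; the variable name dup is just illustrative.

```r
# Rebuild the data frame from the original post
m <- matrix(c(1,1,1,1,1, 2,2,2,2,2, 6,6,6,6,6, 3,3,3,3,3, 4,4,4,4,4,
              5,5,5,5,5, 1,1,1,1,1, 1,1,1,1,1, 2,2,2,2,2, 7,7,7,7,7),
            ncol = 5, byrow = TRUE)
df <- data.frame(m)

# TRUE where a row repeats an earlier row; first occurrences stay FALSE
dup <- duplicated(df)
which(dup)          # 7 8 9

# Index of the first duplicated row, or 0 if there is none
anyDuplicated(df)   # 7
```

Both calls work on whole rows at once, so no explicit loop over row pairs is needed.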


-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O

T.D. Rudolph

2010-Jul-14 00:44 UTC

Henrique is correct; duplicated(df) returns a logical vector with one TRUE or
FALSE per row.  TRUE marks a row that repeats an earlier row; the first
occurrence of each row stays FALSE.

df[duplicated(df),]  # the repeated rows (second and later occurrences)
df[!duplicated(df),] # the unique rows (first occurrences only)
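
The original question asked which rows duplicate which, and duplicated() alone does not report that. The following is one possible sketch, not the only approach. It assumes that pasting a row's values into a single string gives a distinct string for each distinct row; the names involved, key, groups and pairs are illustrative.

```r
# Rebuild the data frame from the original post
m <- matrix(c(1,1,1,1,1, 2,2,2,2,2, 6,6,6,6,6, 3,3,3,3,3, 4,4,4,4,4,
              5,5,5,5,5, 1,1,1,1,1, 1,1,1,1,1, 2,2,2,2,2, 7,7,7,7,7),
            ncol = 5, byrow = TRUE)
df <- data.frame(m)

# Every row that takes part in a duplication, first occurrences included
involved <- duplicated(df) | duplicated(df, fromLast = TRUE)
which(involved)   # 1 2 7 8 9

# Assumption: one pasted string per row identifies the row's contents
key <- do.call(paste, c(df, sep = "\r"))

# Row numbers grouped by identical content, keeping groups of 2 or more
groups <- split(seq_len(nrow(df)), key)
groups <- Filter(function(i) length(i) > 1, unname(groups))

# All (j, k) pairs with j < k inside each group, one pair per matrix row
pairs <- do.call(rbind, lapply(groups, function(i) t(combn(i, 2))))
pairs
```

On this data, pairs holds the four pairs 1-7, 1-8, 7-8 and 2-9, which are the non-zero entries of the same matrix in the original post.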