I need to identify items that are repeated in p$a with
different s or d entries on their rows, given that
"0" entries should not be considered in the
comparison. Here is an example:
1. Items 3 and 5 in p$a are repeated with different
entries of s or d, so they should be removed.
2. Item 2 is repeated, but with a 0 for s on row 2
and a 0 for d on row 6, hence item 2 should be
excluded from the comparison.
All items are factor levels and not necessarily numbers.
> p <- data.frame(a=c(1,2,3,4,5,2,3,5,3,5,3),
+                 s=c(0,0,0,2,4,3,2,4,0,0,4),
+                 d=c(0,1,1,1,3,0,5,11,0,0,0))
> for(i in 1:3) p[,i] <- factor(p[,i])
> p
   a s  d
1  1 0  0
2  2 0  1
3  3 0  1
4  4 2  1
5  5 4  3
6  2 3  0
7  3 2  5
8  5 4 11
9  3 0  0
10 5 0  0
11 3 4  0
Here is my best effort, but I don't like its efficiency
on large data frames. In fact, it is
ridiculously slow with 800,000 rows!
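# "0" marks an unknown entry that must not take part in the comparison
is.unk <- function(x) {x == "0"}
# items in a that occur with more than one distinct known s
p.tmp <- unique(p[,1:2])
p.tmp <- p.tmp[!is.unk(p.tmp[,1]) & !is.unk(p.tmp[,2]),]
dup.s <- p.tmp[duplicated(p.tmp[,1]), 1][,drop=TRUE]
# same check for d
p.tmp <- unique(p[,c(1,3)])
p.tmp <- p.tmp[!is.unk(p.tmp[,1]) & !is.unk(p.tmp[,2]),]
dup.d <- p.tmp[duplicated(p.tmp[,1]), 1][,drop=TRUE]
# items flagged by either check
dup.sd <- union(as.character(dup.d), as.character(dup.s))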
> row.names(p[is.element(p[,1],dup.sd),])
[1] "3"  "5"  "7"  "8"  "9"  "10" "11"
There must be a more efficient way, help please!
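One direction that might be worth timing is to count, for each item in a, the distinct known (non-"0") values of s and of d, and flag the items where either count exceeds one. The following is only a sketch of that idea under the rule stated in the example: the helper name n.known is hypothetical (not from any package), and it has not been timed on 800,000 rows.

```r
# Example data from the post; every column is a factor.
p <- data.frame(a = c(1,2,3,4,5,2,3,5,3,5,3),
                s = c(0,0,0,2,4,3,2,4,0,0,4),
                d = c(0,1,1,1,3,0,5,11,0,0,0))
for (i in 1:3) p[, i] <- factor(p[, i])

# Hypothetical helper: for each item in `a`, count the distinct known
# (non-"0") values of `x`. Rows where `a` or `x` is "0" are dropped first.
n.known <- function(a, x) {
  a <- as.character(a)
  x <- as.character(x)
  keep <- a != "0" & x != "0"
  tapply(x[keep], a[keep], function(v) length(unique(v)))
}

n.s <- n.known(p$a, p$s)
n.d <- n.known(p$a, p$d)

# An item conflicts if it has more than one known value of s or of d.
dup.sd <- union(names(n.s)[n.s > 1], names(n.d)[n.d > 1])

# Row names of every row whose item is flagged.
rows <- row.names(p)[as.character(p$a) %in% dup.sd]
rows
```

On the example data this flags items 3 and 5 and leaves item 2 alone, since item 2 has only one known s and one known d. Whether it beats the unique/duplicated version on large inputs would have to be measured.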
Thanks