thr3ads.net - R help - [R] Find "undirected" duplicates in a tibble [Aug 2021]

If this information is useful, please help other people find it:
Share via:

Greg Minshall

2021-Aug-20 19:39 UTC

[R] Find "undirected" duplicates in a tibble

Bert,
> The efficiency gains are due to vectorization and the use of more
> efficient primitives. None of this may matter of course, but it seemed
> worth mentioning.
thanks very much!  the varieties of code, and disparities of
performance, are truly wonderful.

Rui's point that what works better for small n is not necessarily what
will work better for large n is important to keep in [my] mind.

as a "so-far" summary, here are some timings.  the relevant code is
below.
----
my apply
   user  system elapsed 
  8.397   0.124   8.531 
Bert's !duplicated
   user  system elapsed 
  2.367   0.000   2.370 
Bert's x[,2]>x[,1]
   user  system elapsed 
  1.052   0.000   1.054 
my a.d.f(unique(cbind(do.call)))
   user  system elapsed 
  3.909   0.000   3.914 
Eric Berger's unique(...pmin...pmax)
   user  system elapsed 
  0.848   0.000   0.850 
Eric Berger's transmuting tibble...
   user  system elapsed 
  0.986   0.000   0.988 
Kimmo Elo's [OP] mutating paste
   user  system elapsed 
 52.079   0.000  52.136 
Rui Barradas' sort-based
   user  system elapsed 
 42.327   0.080  42.450 
----

cheers, Greg

----
n <- 1000
x <- expand.grid(Source = 1:n, Target = 1:n)

cat("my apply\n")
system.time({
 y <- apply(x, 1, function(y) return (c(A=min(y), B=max(y))))
 unique(t(y))})
#   user  system elapsed
#  5.075   0.034   5.109

cat("Bert's !duplicated\n")
system.time({
 x[!duplicated(cbind(do.call(pmin, x), do.call(pmax, x))), ]
 })
#   user  system elapsed
#  1.340   0.013   1.353

# Still more efficient and still returning a data frame is:
cat("Bert's x[,2]>x[,1]\n")
system.time({
 w <- x[, 2] > x[,1]
 x[w, ] <- x[w, 2:1]
 unique(x)})
#   user  system elapsed
#  0.693   0.011   0.703

cat("my a.d.f(unique(cbind(do.call)))\n")
system.time({
  as.data.frame(unique(cbind(A=do.call(pmin,x), B=do.call(pmax,x))))
})

cat("Eric Berger's unique(...pmin...pmax)\n")
system.time({
  unique(data.frame(V1=pmin(x$Source,x$Target), V2=pmax(x$Source,x$Target)))
})

cat("Eric Berger's transmuting tibble...\n")
require(dplyr)
xt<-tibble(x)
system.time({
  xt %>% transmute( a=pmin(Source,Target), b=pmax(Source,Target)) %>%
    unique() %>% rename(Source=a, Target=b)
})

cat("Kimmo Elo's [OP] mutating paste\n")
system.time({
  xt %>%
    mutate(pair=mapply(function(x,y)
      paste0(sort(c(x,y)),collapse="-"), Source, Target)) %>%
    distinct(pair,
             .keep_all = T) %>%
    mutate(Source=sapply(pair, function(x)
      unlist(strsplit(x, split="-"))[1]), Target=sapply(pair,
function(x)
        unlist(strsplit(x, split="-"))[2])) %>%
    select(-pair)
})

cat("Rui Barradas' sort-based\n")
system.time({
  apply(x, 1, sort) |> t() |> unique()
})

Bert Gunter

2021-Aug-20 22:42 UTC

head link

[R] Find "undirected" duplicates in a tibble

Thanks, Greg.

Turns out that there's an even faster alternative. Note that the OP
asked whether one could include in the result the counts of each
unordered pair, which I assume could be either 2 or 1. This can be
done easily using table(), and it's quite a bit faster for my 1
million pair example. Herewith the details, which I'll define as
functions for convenience.

## my earlier attempt using unique():
 g <- function(x) {
   w <- x[,2] > x[, 1]
   x[w,] <- x[w, 2:1]
   unique(x)
}

## present version using table():
f <- function(x){
   w <- x[,2] > x[,1]
   x[w, ] <- x[w, 2:1]
   x$counts <- as.vector(table(x)) ## drop the dim
   x[x$counts>0, ]
}
>  y <- expand.grid(source =1:4, target = 1:3)
> g(y)   source target
1       1      1
2       2      1
3       3      1
4       4      1
6       2      2
7       3      2
8       4      2
11      3      3
12      4      3> f(y)   source target counts
1       1      1      1
2       2      1      2
3       3      1      2
4       4      1      1
6       2      2      1
7       3      2      2
8       4      2      1
11      3      3      1
12      4      3      1

## Timing:> y <- expand.grid(sample.int(1000), sample.int(1000))
##> system.time(g(y))   user  system elapsed
  0.896   0.027   0.924
##> system.time(f(y))   user  system elapsed
  0.142   0.009   0.151

And, yes, I was surprised by this, too.

Again, it may not matter, but it is interesting.
Your mileage may vary, of course.

Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )

Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Fri, Aug 20, 2021 at 12:39 PM Greg Minshall <minshall at umich.edu>
wrote:>
> Bert,
>
> > The efficiency gains are due to vectorization and the use of more
> > efficient primitives. None of this may matter of course, but it seemed
> > worth mentioning.
>
> thanks very much!  the varieties of code, and disparities of
> performance, are truly wonderful.
>
> Rui's point that what works better for small n is not necessarily what
> will work better for large n is important to keep in [my] mind.
>
> as a "so-far" summary, here are some timings.  the relevant code
is below.
> ----
> my apply
>    user  system elapsed
>   8.397   0.124   8.531
> Bert's !duplicated
>    user  system elapsed
>   2.367   0.000   2.370
> Bert's x[,2]>x[,1]
>    user  system elapsed
>   1.052   0.000   1.054
> my a.d.f(unique(cbind(do.call)))
>    user  system elapsed
>   3.909   0.000   3.914
> Eric Berger's unique(...pmin...pmax)
>    user  system elapsed
>   0.848   0.000   0.850
> Eric Berger's transmuting tibble...
>    user  system elapsed
>   0.986   0.000   0.988
> Kimmo Elo's [OP] mutating paste
>    user  system elapsed
>  52.079   0.000  52.136
> Rui Barradas' sort-based
>    user  system elapsed
>  42.327   0.080  42.450
> ----
>
> cheers, Greg
>
> ----
> n <- 1000
> x <- expand.grid(Source = 1:n, Target = 1:n)
>
> cat("my apply\n")
> system.time({
>  y <- apply(x, 1, function(y) return (c(A=min(y), B=max(y))))
>  unique(t(y))})
> #   user  system elapsed
> #  5.075   0.034   5.109
>
> cat("Bert's !duplicated\n")
> system.time({
>  x[!duplicated(cbind(do.call(pmin, x), do.call(pmax, x))), ]
>  })
> #   user  system elapsed
> #  1.340   0.013   1.353
>
> # Still more efficient and still returning a data frame is:
> cat("Bert's x[,2]>x[,1]\n")
> system.time({
>  w <- x[, 2] > x[,1]
>  x[w, ] <- x[w, 2:1]
>  unique(x)})
> #   user  system elapsed
> #  0.693   0.011   0.703
>
> cat("my a.d.f(unique(cbind(do.call)))\n")
> system.time({
>   as.data.frame(unique(cbind(A=do.call(pmin,x), B=do.call(pmax,x))))
> })
>
> cat("Eric Berger's unique(...pmin...pmax)\n")
> system.time({
>   unique(data.frame(V1=pmin(x$Source,x$Target),
V2=pmax(x$Source,x$Target)))
> })
>
> cat("Eric Berger's transmuting tibble...\n")
> require(dplyr)
> xt<-tibble(x)
> system.time({
>   xt %>% transmute( a=pmin(Source,Target), b=pmax(Source,Target)) %>%
>     unique() %>% rename(Source=a, Target=b)
> })
>
> cat("Kimmo Elo's [OP] mutating paste\n")
> system.time({
>   xt %>%
>     mutate(pair=mapply(function(x,y)
>       paste0(sort(c(x,y)),collapse="-"), Source, Target)) %>%
>     distinct(pair,
>              .keep_all = T) %>%
>     mutate(Source=sapply(pair, function(x)
>       unlist(strsplit(x, split="-"))[1]), Target=sapply(pair,
function(x)
>         unlist(strsplit(x, split="-"))[2])) %>%
>     select(-pair)
> })
>
> cat("Rui Barradas' sort-based\n")
> system.time({
>   apply(x, 1, sort) |> t() |> unique()
> })

R help - Aug 2021 - Find "undirected" duplicates in a tibble

[R] Find "undirected" duplicates in a tibble

[R] Find "undirected" duplicates in a tibble