thr3ads.net - R devel - [Rd] Faster sorting algorithm... [Mar 2021]

If this information is useful, please help other people find it:
Share via:

Radford Neal

2021-Mar-17 00:36 UTC

[Rd] Faster sorting algorithm...

Those interested in faster sorting may want to look at the merge sort
implemented in pqR (see pqR-project.org).  It's often used as the
default, because it is stable, and does different collations, while
being faster than shell sort (except for small vectors).

Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2, 
compiled identically:

-----------------------------
pqR-2020-07-23 in C locale:
> set.seed(1)
> N <- 1000000
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))   user  system elapsed 
  1.332   0.000   1.334 > print(system.time (or <- order(x,method="radix")))   user  system elapsed 
  0.092   0.004   0.096 > print(system.time (om <- order(x,method="merge")))   user  system elapsed 
  0.363   0.000   0.363 > print(identical(os,or))
[1] TRUE> print(identical(os,om))
[1] TRUE> 
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 1 2> print(order(x,method="radix"))
[1] 1 2> print(order(x,method="merge"))[1] 1 2

-----------------------------
R-4.0.2 in C locale:
> set.seed(1)
> N <- 1000000
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))   user  system elapsed 
  2.381   0.004   2.387 > print(system.time (or <- order(x,method="radix")))   user  system elapsed 
  0.138   0.000   0.137 > #print(system.time (om <- order(x,method="merge")))
> print(identical(os,or))
[1] TRUE> #print(identical(os,om))
> 
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 1 2> print(order(x,method="radix"))
[1] 1 2> #print(order(x,method="merge"))
------------------------------------
pqR-2020-07-23 in fr_CA.utf8 locale:
> set.seed(1)
> N <- 1000000
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))utilisateur     syst?me      ?coul? 
      2.960       0.000       2.962 > print(system.time (or <- order(x,method="radix")))utilisateur     syst?me      ?coul? 
      0.083       0.008       0.092 > print(system.time (om <- order(x,method="merge")))utilisateur     syst?me      ?coul? 
      1.143       0.000       1.142 > print(identical(os,or))
[1] TRUE> print(identical(os,om))
[1] TRUE> 
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 2 1> print(order(x,method="radix"))
[1] 1 2> print(order(x,method="merge"))[1] 2 1

------------------------------------
R-4.0.2 in fr_CA.utf8 locale:
> set.seed(1)
> N <- 1000000
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))utilisateur     syst?me      ?coul? 
      4.222       0.016       4.239 > print(system.time (or <- order(x,method="radix")))utilisateur     syst?me      ?coul? 
      0.136       0.000       0.137 > #print(system.time (om <- order(x,method="merge")))
> print(identical(os,or))
[1] TRUE> #print(identical(os,om))
> 
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 2 1> print(order(x,method="radix"))
[1] 1 2> #print(order(x,method="merge"))

pqR is faster in all the tests, but more relevant to this discussion
is that the "merge" method is substantially faster than
"shell" for
these character vectors with a million elements, while retaining the
stability and collation properties of "shell" (whereas
"radix" only
does C collation).

It would probably not be too hard to take the merge sort code from pqR
and use it in R core's implementation.

The merge sort in pqR doesn't exploit parallelism at the moment, but
merge sort is potentially quite parallelizable (though I think the
storage allocation strategy I use would have to be modified).

Regards,

   Radford Neal

Morgan Morgan

2021-Mar-17 09:23 UTC

head link

[Rd] Faster sorting algorithm...

Thank you Neal. This is interesting. I will have a look at pqR.
Indeed radix only does C collation, I believe that is why it is not the
default choice for character ordering and sorting.
Not sure but I believe it can help address the following bugzilla item:
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17400

On the same topic of collation, there is an experimental sorting function
"psort" in package kit that might help address this issue.
> library(kit)
Attaching kit 0.0.7 (OPENMP enabled using 1 thread)> x <-
c("b","A","B","a","\xe4")
> Encoding(x) <- "latin1"
> identical(psort(x, c.locale=FALSE), sort(x))
[1] TRUE> identical(psort(x, c.locale=TRUE), sort(x, method="radix"))[1] TRUE

Coming back to the topic of fsort, I have just finished the implementation
for double, integer, factor and logical.
The implementation takes into account NA, Inf.. values. Values can be
sorted in a decreasing order or increasing order.
Comparing benchmark with the current implementation in data.table, it is
currently over 30% faster.
There might bugs but I am sure performance can be further improved as I did
not really try hard.
If there is interest in both the implementation and cross community
sharing, please let know....

Best regards,
Morgan

On Wed, 17 Mar 2021, 00:37 Radford Neal, <radford at cs.toronto.edu>
wrote:
> Those interested in faster sorting may want to look at the merge sort
> implemented in pqR (see pqR-project.org).  It's often used as the
> default, because it is stable, and does different collations, while
> being faster than shell sort (except for small vectors).
>
> Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2,
> compiled identically:
>
> -----------------------------
> pqR-2020-07-23 in C locale:
>
> > set.seed(1)
> > N <- 1000000
> > x <- as.character (sample(N,N,replace=TRUE))
> > print(system.time (os <- order(x,method="shell")))
>    user  system elapsed
>   1.332   0.000   1.334
> > print(system.time (or <- order(x,method="radix")))
>    user  system elapsed
>   0.092   0.004   0.096
> > print(system.time (om <- order(x,method="merge")))
>    user  system elapsed
>   0.363   0.000   0.363
> > print(identical(os,or))
> [1] TRUE
> > print(identical(os,om))
> [1] TRUE
> >
> > x <- c("a","~")
> > print(order(x,method="shell"))
> [1] 1 2
> > print(order(x,method="radix"))
> [1] 1 2
> > print(order(x,method="merge"))
> [1] 1 2
>
> -----------------------------
> R-4.0.2 in C locale:
>
> > set.seed(1)
> > N <- 1000000
> > x <- as.character (sample(N,N,replace=TRUE))
> > print(system.time (os <- order(x,method="shell")))
>    user  system elapsed
>   2.381   0.004   2.387
> > print(system.time (or <- order(x,method="radix")))
>    user  system elapsed
>   0.138   0.000   0.137
> > #print(system.time (om <- order(x,method="merge")))
> > print(identical(os,or))
> [1] TRUE
> > #print(identical(os,om))
> >
> > x <- c("a","~")
> > print(order(x,method="shell"))
> [1] 1 2
> > print(order(x,method="radix"))
> [1] 1 2
> > #print(order(x,method="merge"))
>
> ------------------------------------
> pqR-2020-07-23 in fr_CA.utf8 locale:
>
> > set.seed(1)
> > N <- 1000000
> > x <- as.character (sample(N,N,replace=TRUE))
> > print(system.time (os <- order(x,method="shell")))
> utilisateur     syst?me      ?coul?
>       2.960       0.000       2.962
> > print(system.time (or <- order(x,method="radix")))
> utilisateur     syst?me      ?coul?
>       0.083       0.008       0.092
> > print(system.time (om <- order(x,method="merge")))
> utilisateur     syst?me      ?coul?
>       1.143       0.000       1.142
> > print(identical(os,or))
> [1] TRUE
> > print(identical(os,om))
> [1] TRUE
> >
> > x <- c("a","~")
> > print(order(x,method="shell"))
> [1] 2 1
> > print(order(x,method="radix"))
> [1] 1 2
> > print(order(x,method="merge"))
> [1] 2 1
>
> ------------------------------------
> R-4.0.2 in fr_CA.utf8 locale:
>
> > set.seed(1)
> > N <- 1000000
> > x <- as.character (sample(N,N,replace=TRUE))
> > print(system.time (os <- order(x,method="shell")))
> utilisateur     syst?me      ?coul?
>       4.222       0.016       4.239
> > print(system.time (or <- order(x,method="radix")))
> utilisateur     syst?me      ?coul?
>       0.136       0.000       0.137
> > #print(system.time (om <- order(x,method="merge")))
> > print(identical(os,or))
> [1] TRUE
> > #print(identical(os,om))
> >
> > x <- c("a","~")
> > print(order(x,method="shell"))
> [1] 2 1
> > print(order(x,method="radix"))
> [1] 1 2
> > #print(order(x,method="merge"))
>
>
> pqR is faster in all the tests, but more relevant to this discussion
> is that the "merge" method is substantially faster than
"shell" for
> these character vectors with a million elements, while retaining the
> stability and collation properties of "shell" (whereas
"radix" only
> does C collation).
>
> It would probably not be too hard to take the merge sort code from pqR
> and use it in R core's implementation.
>
> The merge sort in pqR doesn't exploit parallelism at the moment, but
> merge sort is potentially quite parallelizable (though I think the
> storage allocation strategy I use would have to be modified).
>
> Regards,
>
>    Radford Neal
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
	[[alternative HTML version deleted]]

R devel - Mar 2021 - Faster sorting algorithm...

[Rd] Faster sorting algorithm...

[Rd] Faster sorting algorithm...