As part of checking Martin Maechler's observations of sorting efficiency I have been putting the code under the microscope. Part of this involved comparing the performance with Splus. I turns out that the R code is much slower. A good deal of this is due to the interpreted front end to the R verion of sort. It is worth taking a look at the sort function and seeing if you can see what the problem is. (hint: turn on GC reporting with "gcinfo"). A good deal of the rest of the inefficiency is because the internal R code attempts to deal with missing observations. This is no longer needed because this is dealt with in the interpreted front end. An overall fix of the code should see us nearly match the performance of Splus. I say nearly because I think I will use heapsort instead of quicksort because I want to avoid worst-case performance rather than seek good average performance. I don't yet have an explanation of the behavior which Martin uncovered. Ross =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Martin Maechler
1997-Sep-05 08:32 UTC
R-alpha: Sorting Efficiency -- thoughts.. : if(!is.sorted(x)) x <- sort(x)
Since you are delving into the code anyway.... Here is a thought that I have been having for a long time: Wouldn't it make sense / be useful to have a function is.sorted which returns TRUE if all(diff(x) >= 0). --------- I actually have been using the function is.sorted <- function(x) (length(x) <= 1) || all(diff(x) >= 0) in my collection of S-plus enhancements for quite while now. I can do things as if(!is.sorted(x)) { o <- order(x) x <- x[o] y <- y[o] } which potentially saves computing time (over unconditional 'order'), since 1. is.sorted is O(n) whereas sort is O(n*log(n)) 2. for sorted x, the two assignments are superfluous even when order(x) was very fast (for SORTED x). TO be CPU-efficient, is.sorted probably should be written .Internal(.). On the other hand, sort(.) and order(.) could be designed such that they use a sorting algorithm which guarantees to be fast when things are already sorted. [maybe this is already the case, and one of your reasons against quicksort? ] This would make the 'if(..)' in if(!is.sorted(x)) x <- sort(x) superfluous, but the situation with order(.) above (which is NOT infrequent in statistics !!) would still profit from not doing 'x <- x[1:length(x)]' when x is sorted. -- Opinions? =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html Send "info", "help", or "[un]subscribe" (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-