Hi, I've got two suggestions how to speed up median() about 50%. For all iterative methods calling median() in the loops this has a major impact. The second suggestion will apply to other methods too. This is what the functions look like today:> medianfunction (x, na.rm = FALSE) { if (is.factor(x) || mode(x) != "numeric") stop("need numeric data") if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x))) return(NA) n <- length(x) if (n == 0) return(NA) half <- (n + 1)/2 if (n%%2 == 1) { sort(x, partial = half)[half] } else { sum(sort(x, partial = c(half, half + 1))[c(half, half + 1)])/2 } } <environment: namespace:stats> Suggestion 1: Replace the sort() calls with the .Internal(psort(x, partial)). This will avoid unnecessary overhead, especially an expensive second check for NAs using any(is.na(x)). Simple benchmarking with x <- rnorm(10e6) system.time(median(x))/system.time(median2(x)) where median2() is the function with the above replacements, gives about 20-25% speed up. Suggestion 2: Create a has.na(x) function to replace any(is.na(x)) that returns TRUE as soon as a NA value is detected. In the best case it returns after the first index with TRUE, in the worst case it returns after the last index N with FALSE. The cost for is.na(x) is always O(N), and any() in the best case O(1) and in the worst case O(N) (if any() is implemented as I hope). An has.na() function would be very useful elsewhere too. An poor mans alternative to (2), is to have a third alternative to 'na.rm', say, NA, which indicates that we know that there are no NAs in 'x'. The original median() is approx 50% slower (naive benchmarking) than a version with the above two improvements, if passing a large 'x' with no NAs; median2 <- function (x, na.rm = FALSE) { if (is.factor(x) || mode(x) != "numeric") stop("need numeric data") if (is.na(na.rm)) { } else if (na.rm) x <- x[!is.na(x)] else if (any(is.na(x))) return(NA) n <- length(x) if (n == 0) return(NA) half <- (n + 1)/2 if (n%%2 == 1) { .Internal(psort(x, half))[half] } else { sum(.Internal(psort(x, c(half, half + 1)))[c(half, half + 1)])/2 } } x <- rnorm(10e5) K <- 10 t0 <- system.time({ for (kk in 1:K) y <- median(x); }) print(t0) # [1] 1.82 0.14 1.98 NA NA t1 <- system.time({ for (kk in 1:K) y <- median2(x, na.rm=NA); }) print(t1) # [1] 1.25 0.06 1.34 NA NA print(t0/t1) # [1] 1.456000 2.333333 1.477612 NA NA BTW, without having checked the source code, it looks like is.na() is unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on a vector without NAs. On the other hand, is.na(sum(x)) becomes awfully slow if 'x' contains NAs. /Henrik
On Mon, 10 Apr 2006, Henrik Bengtsson wrote:> Hi, > > I've got two suggestions how to speed up median() about 50%. For all > iterative methods calling median() in the loops this has a major > impact. The second suggestion will apply to other methods too.I'm surprised this has a major impact -- in your example it takes much longer to generate the ten million numbers than to find the median.> Suggestion 1: > Replace the sort() calls with the .Internal(psort(x, partial)). This > will avoid unnecessary overhead, especially an expensive second check > for NAs using any(is.na(x)). Simple benchmarking with > > x <- rnorm(10e6) > system.time(median(x))/system.time(median2(x)) > > where median2() is the function with the above replacements, gives > about 20-25% speed up.There's something that seems a bit undesirable about having median() call the .Internal function for sort().> Suggestion 2: > Create a has.na(x) function to replace any(is.na(x)) that returns TRUE > as soon as a NA value is detected. In the best case it returns after > the first index with TRUE, in the worst case it returns after the last > index N with FALSE. The cost for is.na(x) is always O(N), and any() > in the best case O(1) and in the worst case O(N) (if any() is > implemented as I hope). An has.na() function would be very useful > elsewhere too.This sounds useful (though it has missed the deadline for 2.3.0). It won't help if the typical case is no missing values, as you suggest, but it will be faster when there are missing values.> BTW, without having checked the source code, it looks like is.na() is > unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on > a vector without NAs. On the other hand, is.na(sum(x)) becomes > awfully slow if 'x' contains NAs. >I don't think it is unnecessarily slow. It has to dispatch methods and it has to make sure that matrix structure is preserved. After that the code is just case REALSXP: for (i = 0; i < n; i++) LOGICAL(ans)[i] = ISNAN(REAL(x)[i]); break; and it's hard to see how that can be improved. It does suggest that a faster anyNA() function would have to not be generic. -thomas Thomas Lumley Assoc. Professor, Biostatistics tlumley at u.washington.edu University of Washington, Seattle
I've recently folded a new version of the Floyd-Rivest quantile algorithm for quantiles into my quantreg package. So it is easily available for comparative testing. On my G5 running last friday's R-devel, I get: Median Only 5 Quantiles n quantile kuantile quantile kuantile qsort 100 0.003 0.003 0.006 0.004 0.002 1000 0.002 0.002 0.002 0.002 0.002 10000 0.005 0.003 0.008 0.003 0.001 1e+05 0.022 0.010 0.035 0.012 0.017 1e+06 0.181 0.117 0.308 0.138 0.200 1e+07 1.853 0.762 3.180 1.003 2.287 # Small timing experiment to compare kuantile and quantile require(quantreg) set.seed(1446) ns <- 10^(2:7) R <- 10 T <- array(NA,c(R,length(ns),5)) eps <- 20 * .Machine$double.eps for(j in 1:length(ns)){ for(i in 1:R){ y <- rnorm(ns[j]) T[i,j,1] <- system.time(qy <- quantile(y,.5))[1] T[i,j,2] <- system.time(ky <- kuantile(y,.5))[1] stopifnot(abs(qy - ky) < eps) T[i,j,3] <- system.time(qy <- quantile(y))[1] T[i,j,4] <- system.time(ky <- kuantile(y))[1] stopifnot(abs(qy - ky) < eps) T[i,j,5] <- system.time(sort(y,method="quick"))[1] } } Tab <- apply(T,2:3,mean) dimnames(Tab) <- list(paste(ns),c(rep(c("quantile","kuantile"), 2),"qsort")) url: www.econ.uiuc.edu/~roger Roger Koenker email rkoenker at uiuc.edu Department of Economics vox: 217-333-4558 University of Illinois fax: 217-244-6678 Champaign, IL 61820