thr3ads.net - R devel - [Rd] Suggestions to speed up median() and has.na() [Apr 2006]

If this information is useful, please help other people find it:
Share via:

Henrik Bengtsson

2006-Apr-10 17:37 UTC

[Rd] Suggestions to speed up median() and has.na()

Hi,

I've got two suggestions how to speed up median() about 50%.  For all
iterative methods calling median() in the loops this has a major
impact.  The second suggestion will apply to other methods too.

This is what the functions look like today:
> medianfunction (x, na.rm = FALSE)
{
    if (is.factor(x) || mode(x) != "numeric")
        stop("need numeric data")
    if (na.rm)
        x <- x[!is.na(x)]
    else if (any(is.na(x)))
        return(NA)
    n <- length(x)
    if (n == 0)
        return(NA)
    half <- (n + 1)/2
    if (n%%2 == 1) {
        sort(x, partial = half)[half]
    }
    else {
        sum(sort(x, partial = c(half, half + 1))[c(half, half +
            1)])/2
    }
}
<environment: namespace:stats>

Suggestion 1:
Replace the sort() calls with the .Internal(psort(x, partial)).   This
will avoid unnecessary overhead, especially an expensive second check
for NAs using any(is.na(x)).  Simple benchmarking with

x <- rnorm(10e6)
system.time(median(x))/system.time(median2(x))

where median2() is the function with the above replacements, gives
about 20-25% speed up.

Suggestion 2:
Create a has.na(x) function to replace any(is.na(x)) that returns TRUE
as soon as a NA value is detected.  In the best case it returns after
the first index with TRUE, in the worst case it returns after the last
index N with FALSE.  The cost for is.na(x) is always O(N), and any()
in the best case O(1) and in the worst case O(N) (if any() is
implemented as I hope).  An has.na() function would be very useful
elsewhere too.

An poor mans alternative to (2), is to have a third alternative to
'na.rm', say, NA, which indicates that we know that there are no NAs
in 'x'.

The original median() is approx 50% slower (naive benchmarking) than a
version with the above two improvements, if passing a large 'x' with
no NAs;

median2 <- function (x, na.rm = FALSE) {
    if (is.factor(x) || mode(x) != "numeric")
        stop("need numeric data")

    if (is.na(na.rm)) {
    } else if (na.rm)
        x <- x[!is.na(x)]
    else if (any(is.na(x)))
        return(NA)

    n <- length(x)
    if (n == 0)
        return(NA)
    half <- (n + 1)/2
    if (n%%2 == 1) {
        .Internal(psort(x, half))[half]
    }
    else {
        sum(.Internal(psort(x, c(half, half + 1)))[c(half, half + 1)])/2
    }
}

x <- rnorm(10e5)
K <- 10
t0 <- system.time({
  for (kk in 1:K)
    y <- median(x);
})
print(t0)  # [1] 1.82 0.14 1.98   NA   NA
t1 <- system.time({
  for (kk in 1:K)
    y <- median2(x, na.rm=NA);
})
print(t1)  # [1] 1.25 0.06 1.34   NA   NA
print(t0/t1)  # [1] 1.456000 2.333333 1.477612       NA       NA

BTW, without having checked the source code, it looks like is.na() is
unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on
a vector without NAs.  On the other hand, is.na(sum(x)) becomes
awfully slow if 'x' contains NAs.

/Henrik

Thomas Lumley

2006-Apr-10 23:22 UTC

head link

[Rd] Suggestions to speed up median() and has.na()

On Mon, 10 Apr 2006, Henrik Bengtsson wrote:
> Hi,
>
> I've got two suggestions how to speed up median() about 50%.  For all
> iterative methods calling median() in the loops this has a major
> impact.  The second suggestion will apply to other methods too.
I'm surprised this has a major impact -- in your example it takes much 
longer to generate the ten million numbers than to find the median.
> Suggestion 1:
> Replace the sort() calls with the .Internal(psort(x, partial)).   This
> will avoid unnecessary overhead, especially an expensive second check
> for NAs using any(is.na(x)).  Simple benchmarking with
>
> x <- rnorm(10e6)
> system.time(median(x))/system.time(median2(x))
>
> where median2() is the function with the above replacements, gives
> about 20-25% speed up.
There's something that seems a bit undesirable about having median() call 
the .Internal function for sort().
> Suggestion 2:
> Create a has.na(x) function to replace any(is.na(x)) that returns TRUE
> as soon as a NA value is detected.  In the best case it returns after
> the first index with TRUE, in the worst case it returns after the last
> index N with FALSE.  The cost for is.na(x) is always O(N), and any()
> in the best case O(1) and in the worst case O(N) (if any() is
> implemented as I hope).  An has.na() function would be very useful
> elsewhere too.
This sounds useful (though it has missed the deadline for 2.3.0).

It won't help if the typical case is no missing values, as you suggest, 
but it will be faster when there are missing values.
> BTW, without having checked the source code, it looks like is.na() is
> unnecessarily slow; is.na(sum(x)) is much faster than any(is.na(x)) on
> a vector without NAs.  On the other hand, is.na(sum(x)) becomes
> awfully slow if 'x' contains NAs.
>
I don't think  it is unnecessarily slow.  It has to dispatch methods and 
it has to make sure that matrix structure is preserved.  After that the 
code is just

     case REALSXP:
         for (i = 0; i < n; i++)
             LOGICAL(ans)[i] = ISNAN(REAL(x)[i]);
         break;

and it's hard to see how that can be improved. It does suggest that a 
faster anyNA() function would have to not be generic.


 	-thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle

roger koenker

2006-Apr-11 12:51 UTC

head link

[Rd] Suggestions to speed up median() and has.na()

I've recently folded a new version of the Floyd-Rivest quantile  
algorithm
for quantiles into my quantreg package.  So it is easily available for
comparative testing.  On my G5 running last friday's  R-devel, I get:


              Median Only          5 Quantiles
  n          quantile kuantile quantile kuantile qsort
100         0.003    0.003     0.006    0.004 0.002
1000       0.002    0.002     0.002    0.002 0.002
10000     0.005    0.003     0.008    0.003 0.001
1e+05     0.022    0.010     0.035    0.012 0.017
1e+06     0.181    0.117     0.308    0.138 0.200
1e+07     1.853    0.762     3.180    1.003 2.287


# Small timing experiment to compare kuantile and quantile

require(quantreg)
set.seed(1446)

ns <- 10^(2:7)
R <- 10
T <- array(NA,c(R,length(ns),5))
eps <- 20 * .Machine$double.eps
for(j in 1:length(ns)){
     for(i in 1:R){
         y <- rnorm(ns[j])
         T[i,j,1] <- system.time(qy <- quantile(y,.5))[1]
         T[i,j,2] <- system.time(ky <- kuantile(y,.5))[1]
         stopifnot(abs(qy - ky) < eps)
         T[i,j,3] <- system.time(qy <- quantile(y))[1]
         T[i,j,4] <- system.time(ky <- kuantile(y))[1]
         stopifnot(abs(qy - ky) < eps)
         T[i,j,5] <- system.time(sort(y,method="quick"))[1]
         }
     }

Tab <- apply(T,2:3,mean)
dimnames(Tab) <-
list(paste(ns),c(rep(c("quantile","kuantile"),
2),"qsort"))

url:    www.econ.uiuc.edu/~roger            Roger Koenker
email    rkoenker at uiuc.edu            Department of Economics
vox:     217-333-4558                University of Illinois
fax:       217-244-6678                Champaign, IL 61820

Apparently Analagous Threads

Search for more maybe matching threads

R devel - Apr 2006 - Suggestions to speed up median() and has.na()

[Rd] Suggestions to speed up median() and has.na()

[Rd] Suggestions to speed up median() and has.na()

[Rd] Suggestions to speed up median() and has.na()

Apparently Analagous Threads