Hello, I am working on a simple non-parametric (Theil) regression function and and am following Hollander and Wolfe 1999 text. I would like some help making my function faster. I have compared with pre-packaged version from "MBLM", which isnt very fast either, but it appears mine is faster with N = 1000 (see results below). I plan on running this function repeatedly, and I generally have data lengths of ~ N = 6000 or more. # My function following Hollander and Wolfe text, Chapter 9 np.lm <-function(dat, X, Y, ...){ # Ch 9.2: Slope est. (X) for Thiel statistic combos <- combn(nrow(dat), 2) i.s <- combos[1,] j.s <- combos[2,] num <- vector("list", length=length(i.s)) dom <- vector("list", length=length(i.s)) for(i in 1:length(i.s)){ num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y] dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X] } X <- median( sort( do.call(c, num) / do.call(c, dom) ) ) # Ch 9.4: Intercept est. for Thiel statistic Intercept <- median(dat[,"Y"] - X*dat[,"X"]) out <- data.frame(Intercept, X) return(out) } # usage: np.lm(dat, X=1, Y=2) ################################################################ library("mblm") # I will compare to mblm() function X <- rnorm(1000) Y <- rnorm(1000) dat <- data.frame(X, Y) system.time(np.lm(dat, X=1, Y=2) ) user system elapsed 118.610 0.130 119.144 109.000 0.040 109.416 # ran it twice 86.190 0.100 86.589 # 3rd time system.time( mblm(Y~X, dat, repeated=F) ) user system elapsed 1509.200 87.670 1602.987 # not waiting on that to run again OK, mine appears to be way faster... oddly, mblm() seemed faster with smaller (n=100) datasets Can someone please tell me how they would improve the np.lm() function to make it quicker, or faster with some other tricks (parallel?.. Ive never done that.). Thank you ahead of time for any help. Brad Ubuntu 10.04 Intel i5 CPU 4*(650 @ 3.20GHz) 11.6 GiB memory R version 2.15.0 (2012-03-30) -- View this message in context: http://r.789695.n4.nabble.com/help-speeding-up-simple-Theil-regression-function-tp4646923.html Sent from the R help mailing list archive at Nabble.com.
Obviously sort() is not needed in the following line: X <- median( sort( do.call(c, num) / do.call(c, dom) ) ) Brad Schneid wrote> Hello, > > I am working on a simple non-parametric (Theil) regression function and > and am following Hollander and Wolfe 1999 text. I would like some help > making my function faster. I have compared with pre-packaged version from > "MBLM", which isnt very fast either, but it appears mine is faster with N > = 1000 (see results below). I plan on running this function repeatedly, > and I generally have data lengths of ~ N = 6000 or more. > > # My function following Hollander and Wolfe text, Chapter 9 > np.lm <-function(dat, X, Y, ...){ > # Ch 9.2: Slope est. (X) for Thiel statistic > combos <- combn(nrow(dat), 2) > i.s <- combos[1,] > j.s <- combos[2,] > num <- vector("list", length=length(i.s)) > dom <- vector("list", length=length(i.s)) > > for(i in 1:length(i.s)){ > num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y] > dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X] > } > > X <- median( sort( do.call(c, num) / do.call(c, dom) ) ) > # Ch 9.4: Intercept est. for Thiel statistic > Intercept <- median(dat[,"Y"] - X*dat[,"X"]) > out <- data.frame(Intercept, X) > return(out) > } # usage: np.lm(dat, X=1, Y=2) > ################################################################ > > library("mblm") # I will compare to mblm() function > > X <- rnorm(1000) > Y <- rnorm(1000) > dat <- data.frame(X, Y) > > system.time(np.lm(dat, X=1, Y=2) ) > user system elapsed > 118.610 0.130 119.144 > 109.000 0.040 109.416 # ran it twice > 86.190 0.100 86.589 # 3rd time > > system.time( mblm(Y~X, dat, repeated=F) ) > user system elapsed > 1509.200 87.670 1602.987 > # not waiting on that to run again > > OK, mine appears to be way faster... oddly, mblm() seemed faster with > smaller (n=100) datasets > > Can someone please tell me how they would improve the np.lm() function to > make it quicker, or faster with some other tricks (parallel?.. Ive never > done that.). > > Thank you ahead of time for any help. > Brad > > Ubuntu 10.04 > Intel i5 CPU 4*(650 @ 3.20GHz) > 11.6 GiB memory > R version 2.15.0 (2012-03-30)-- View this message in context: http://r.789695.n4.nabble.com/help-speeding-up-simple-Theil-regression-function-tp4646923p4646926.html Sent from the R help mailing list archive at Nabble.com.
Berend Hasselman
2012-Oct-21 18:59 UTC
[R] help speeding up simple Theil regression function
On 21-10-2012, at 20:06, Brad Schneid wrote:> Hello, > > I am working on a simple non-parametric (Theil) regression function and and > am following Hollander and Wolfe 1999 text. I would like some help making > my function faster. I have compared with pre-packaged version from "MBLM", > which isnt very fast either, but it appears mine is faster with N = 1000 > (see results below). I plan on running this function repeatedly, and I > generally have data lengths of ~ N = 6000 or more. > > # My function following Hollander and Wolfe text, Chapter 9 > np.lm <-function(dat, X, Y, ...){ > # Ch 9.2: Slope est. (X) for Thiel statistic > combos <- combn(nrow(dat), 2) > i.s <- combos[1,] > j.s <- combos[2,] > num <- vector("list", length=length(i.s)) > dom <- vector("list", length=length(i.s)) > > for(i in 1:length(i.s)){ > num[[i]] <- dat[j.s[i],Y] - dat[i.s[i],Y] > dom[[i]] <- dat[j.s[i],X] - dat[i.s[i],X] > } > > X <- median( sort( do.call(c, num) / do.call(c, dom) ) ) > # Ch 9.4: Intercept est. for Thiel statistic > Intercept <- median(dat[,"Y"] - X*dat[,"X"]) > out <- data.frame(Intercept, X) > return(out) > } # usage: np.lm(dat, X=1, Y=2) > ################################################################ > > library("mblm") # I will compare to mblm() function > > X <- rnorm(1000) > Y <- rnorm(1000) > dat <- data.frame(X, Y) > > system.time(np.lm(dat, X=1, Y=2) ) > user system elapsed > 118.610 0.130 119.144 > 109.000 0.040 109.416 # ran it twice > 86.190 0.100 86.589 # 3rd timeAlternative function without your i loop (it isn't needed and can be vectorized): np.lm.alt <-function(dat, X, Y, ...){ # Ch 9.2: Slope est. (X) for Thiel statistic ==> (Pedantic comment: it is Theil (swap the i and e) combos <- combn(nrow(dat), 2) i.s <- combos[1,] j.s <- combos[2,] Y.num <- dat[j.s,Y] - dat[i.s,Y] X.dom <- dat[j.s,X] - dat[i.s,X] X <- median( Y.num / X.dom) # Ch 9.4: Intercept est. for Thiel statistic ==> (Pedantic comment: it is Theil (swap the i and e) Intercept <- median(dat[,"Y"] - X*dat[,"X"]) out <- data.frame(Intercept, X) return(out) } # usage: np.lm(dat, X=1, Y=2) Try the compiler package on you original function: library(compiler) np.lm.c <- cmpfun(np.lm) Test speed and correct results: X <- rnorm(500) Y <- rnorm(500) dat <- data.frame(X, Y) system.time(npout.c <- np.lm.c(dat, X=1, Y=2) ) system.time(npout.1 <- np.lm(dat, X=1, Y=2) ) system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) ) identical(npout.1,npout.c) identical(npout.1,npout.a) Results:> system.time(npout.c <- np.lm.c(dat, X=1, Y=2) )user system elapsed 21.442 0.066 21.517> system.time(npout.1 <- np.lm(dat, X=1, Y=2) )user system elapsed 21.068 0.073 21.161> system.time(npout.a <- np.lm.alt(dat, X=1, Y=2) )user system elapsed 0.303 0.010 0.313> identical(npout.1,npout.c)[1] TRUE> identical(npout.1,npout.a)[1] TRUE You may try and test this with larger data lengths. Berend