I need to order a long vector of integers with rather few unique values. This is very slow:> x <- sample(rep(c(1:10), 50000)) > system.time(ord <- order(x))[1] 189.18 0.09 190.48 0.00 0.00 But with no ties> y <- sample(500000) > system.time(ord1 <- order(y))[1] 1.18 0.00 1.18 0.00 0.00 it is very fast! This gave me the following idea: Since I don't care about keeping the order within tied values, why not add some small disturbance to x, and indeed,> unix.time(ord2 <- order(x + runif(length(x), -0.1, 0.1)))[1] 1.66 0.00 1.66 0.00 0.00> identical(x[ord], x[ord2])[1] TRUE it works! Is there an obvious (=better) solution to this problem that I have overlooked? In any case, I think that the problem with order and many ties is worth mentioning in the help page. For the record: R-1.7.0, RH9 G?ran --- G?ran Brostr?m tel: +46 90 786 5223 Department of Statistics fax: +46 90 786 6614 Ume? University http://www.stat.umu.se/egna/gb/ SE-90187 Ume?, Sweden e-mail: gb at stat.umu.se
On Sat, 7 Jun 2003, G?ran Brostr?m wrote:> > I need to order a long vector of integers with rather few unique values. > This is very slow: > > > x <- sample(rep(c(1:10), 50000)) > > system.time(ord <- order(x)) > [1] 189.18 0.09 190.48 0.00 0.00 > > But with no ties > > > y <- sample(500000) > > system.time(ord1 <- order(y)) > [1] 1.18 0.00 1.18 0.00 0.00 > > it is very fast! > This gave me the following idea: Since I don't care about keeping the > order within tied values, why not add some small disturbance to x, > and indeed, > > > unix.time(ord2 <- order(x + runif(length(x), -0.1, 0.1))) > [1] 1.66 0.00 1.66 0.00 0.00An even better way is> system.time(ord3 <- order(x + seq(0, 0.9, length = length(x))))[1] 1.32 0.05 1.37 0.00 0.00 Faster, but more important; it keeps the original ordering for tied values. Thanks to James Holtman. G?ran [...]
On Sat, 7 Jun 2003, [ISO-8859-1] Göran Broström wrote:> > I need to order a long vector of integers with rather few unique values. > This is very slow:I think the culprit is src/main/sort.c: orderVector1 /* Shell sort isn't stable, but it proves to be somewhat faster to run a final insertion sort to re-order runs of ties when comparison is cheap. */ This also explains:> aa<-sample(rep(1:10,50000)) > system.time( order(aa, 1:length(aa)))[1] 3.67 0.01 3.68 0.00 0.00> system.time( order(aa))^C Timing stopped at: 49.33 0.01 49.34 0 0 which is perhaps the simplest work-around :). -thomas