thr3ads.net - R help - [R] Simple Lookup... why so slow [Aug 2004]

Dieter Menne

2004-Aug-06 12:42 UTC

[R] Simple Lookup... why so slow

Dear List,

At 32 degrees Celsius in the office, I was too lazy to figure out
the correct xapplytion for a simple lookup problem
and regressed to well-known C style, only to see my
computer hang forever doing 10000 indexed offset calculations.
Boiled down, the problem is shown below; it needs a few milliseconds
in C. Looking at the timing results for n=2000 and n=4000,
this is not linear in time, so something I don't understand
must be going on.

And, just as an aside: why is $-indexing so much faster (!)
than numeric indexing?

Dieter

(all on Windows, latest R-Version)
----

# Generate Data set
StartDay = matrix(as.integer(runif(80)*20),nrow=4)
n=4000
PatDay = data.frame(Day = as.integer(runif(n)*20)+50,
                       Pat= as.integer(runif(n)*20)+1,
                       Treat = as.integer(runif(n)*4)+1,
                       DayOff=NA) # reserve output space
# Correct for days offset
ti= system.time(
  for (i in 1:n)
    PatDay$DayOff[i] = PatDay$Day[i]-StartDay[PatDay$Treat[i],PatDay$Pat[i]]
  )
cat("$Style index",n,ti[3],"\n");
# n= 2000 3 seconds
# n= 4000 15 seconds

# I first believed using numeric indexes could be faster...
ti= system.time(
  for (i in 1:n)
    PatDay[i,4] = PatDay[i,1]-StartDay[PatDay[i,3],PatDay[i,2]]
  )
cat("Numeric index", n,ti[3],"\n");
# n=2000 12 seconds
# n=4000 53 seconds

Dieter Menne

2004-Aug-06 14:06 UTC

[R] Re: Simple Lookup... why so slow

OK, found it. Things are really speedy when you first store the result in a
separate vector and add that vector to the data frame afterwards.

Assuming that copying is involved, this would explain why my first
approach was so much slower, but I don't understand why the time goes up more
than linearly with n.

Dieter

--

# Generate Data set
StartDay = matrix(as.integer(runif(80)*20),nrow=4)
n=8000
PatDay = data.frame(Day = as.integer(runif(n)*20)+50,
                       Pat= as.integer(runif(n)*20)+1,
                       Treat = as.integer(runif(n)*4)+1
                       )
DayOff = rep(NA,n)
# Correct for days offset
ti= system.time(
  for (i in 1:n)
# bad
#    PatDay$DayOff[i] = PatDay$Day[i]-StartDay[PatDay$Treat[i],PatDay$Pat[i]]
# good
    DayOff[i] = PatDay$Day[i]-StartDay[PatDay$Treat[i],PatDay$Pat[i]]
  )
PatDay$DayOff = DayOff
cat("Separate Vector first",n,ti[3],"\n");
# n= 4000 0.43 seconds
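
A rough way to see why the time could grow faster than linearly with n. This is only a sketch under an assumption that nobody measured in this thread: that every `PatDay$DayOff[i] = ...` assignment copies a whole column of length n. Under that assumption the loop copies about n*n elements in total, so doubling n should roughly quadruple the run time. The helper name `copied_elements` is made up for illustration.

```r
# Hypothetical cost model (an assumption, not a measurement): each of the
# n loop iterations copies a column of length n, so n * n elements in total.
copied_elements <- function(n) n * n

# Ratio predicted by the model when n doubles from 2000 to 4000
predicted_ratio <- copied_elements(4000) / copied_elements(2000)

# Ratios of the timings reported in the first message of this thread
observed_dollar  <- 15 / 3    # $-style index:  3 s -> 15 s
observed_numeric <- 53 / 12   # numeric index: 12 s -> 53 s

print(c(predicted = predicted_ratio,
        dollar    = observed_dollar,
        numeric   = observed_numeric))
```

Both observed ratios are closer to 4 than to the 2 that linear scaling would give, which is at least consistent with the copying explanation.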

Adaikalavan Ramasamy

2004-Aug-06 14:45 UTC

[R] Simple Lookup... why so slow

The first 2 solutions are vastly slower than the last 3 simply because
they use the for() loop. The vectorised versions are definitely faster.

# Solution 1 : list extraction operator
aa <- rep(NA, n); bb <- rep(NA, n)

system.time( for (i in 1:n) {
  aa[i] <- PatDay$Day[i] - StartDay[PatDay$Treat[i], PatDay$Pat[i]] } )
[1] 0.33 0.00 0.33 0.00 0.00

# Solution 2 : numeric index with for loop
system.time( for (i in 1:n){ 
   bb[i] <-  PatDay[i,1]-StartDay[PatDay[i,3],PatDay[i,2]] } )
[1] 15.43  0.12 17.76  0.00  0.00


# Solution 3 : Vectorised operation with numeric index
system.time( cc <- PatDay[ , 1] - StartDay[ as.matrix(PatDay[, 3:2]) ] )
[1] 0.01 0.00 0.01 0.00 0.00

# Solution 4 : Vectorised operation with named index
system.time( dd <- PatDay[ , "Day"] - StartDay[ as.matrix(PatDay[,c("Treat", "Pat")]) ] )
[1] 0.01 0.00 0.01 0.00 0.00

# Solution 5 : Vectorised operation with list extractor
system.time( ee <- PatDay$Day - StartDay[ cbind(PatDay$Treat,PatDay$Pat) ] )
[1] 0 0 0 0 0
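
What makes solutions 3 to 5 work is indexing a matrix with a two-column matrix: each row of the index is read as one (row, column) pair. A minimal sketch on a tiny made-up table (the values below are invented for illustration and are not the thread's data):

```r
# Small lookup table: 2 treatments (rows) x 3 patients (columns)
StartDay <- matrix(c(10L, 20L, 11L, 21L, 12L, 22L), nrow = 2)
#      [,1] [,2] [,3]
# [1,]   10   11   12
# [2,]   20   21   22

Treat <- c(1L, 2L, 2L)
Pat   <- c(3L, 1L, 2L)

# Each row of the index matrix is one (row, column) pair
idx    <- cbind(Treat, Pat)
picked <- StartDay[idx]          # elements (1,3), (2,1), (2,2)

# Without cbind, the two vectors select a whole sub-matrix instead
block  <- StartDay[Treat, Pat]   # 3 x 3 matrix, not 3 values

print(picked)
print(dim(block))
```

With the two index vectors passed separately, R returns every combination of the listed rows and columns, which is why the `cbind` (or `as.matrix`) step matters for an element-by-element lookup.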


The timings above are not precise enough to say which of the vectorised
operations is fastest, so I tried the same thing with n=400,000. The
last 3 solutions gave the following timings:

Solution 3 : [1] 1.67 0.21 1.89 0.00 0.00
Solution 4 : [1] 2.55 0.21 2.77 0.00 0.00
Solution 5 : [1] 0.25 0.03 0.28 0.00 0.00

However, when I redefined PatDay as a matrix, I got for n=400,000:

Solution 3 : [1] 0.48 0.04 0.51 0.00 0.00
Solution 4 : [1] 0.26 0.04 0.31 0.00 0.00


Just to make sure all the answers are the same, try this:

cor( cbind(aa, bb, cc, dd) )
   aa bb cc dd
aa  1  1  1  1
bb  1  1  1  1
cc  1  1  1  1
dd  1  1  1  1

or the slow way : all.equal(aa, bb); all.equal(aa, cc); ...
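
A correlation of 1 only shows that the results are linearly related, so the `all.equal` route is the stricter check. A minimal sketch of it, regenerating a small data set in the same way as the original post (the seed and n = 100 are arbitrary choices made here, not values from the thread):

```r
set.seed(1)   # arbitrary seed, only so the sketch is reproducible
n <- 100
StartDay <- matrix(as.integer(runif(80) * 20), nrow = 4)
PatDay <- data.frame(Day   = as.integer(runif(n) * 20) + 50,
                     Pat   = as.integer(runif(n) * 20) + 1,
                     Treat = as.integer(runif(n) * 4) + 1)

# Solution 1: loop writing into a preallocated vector
aa <- numeric(n)
for (i in seq_len(n)) {
  aa[i] <- PatDay$Day[i] - StartDay[PatDay$Treat[i], PatDay$Pat[i]]
}

# Solutions 3 and 5: vectorised lookups with a two-column index matrix
cc <- PatDay[, 1] - StartDay[as.matrix(PatDay[, 3:2])]
ee <- PatDay$Day - StartDay[cbind(PatDay$Treat, PatDay$Pat)]

# TRUE only if every vectorised result matches the loop result
same <- all(vapply(list(cc, ee),
                   function(x) isTRUE(all.equal(aa, x)),
                   logical(1)))
print(same)
```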

Regards, Adai



Dieter Menne

2004-Aug-06 15:10 UTC

[R] Simple Lookup... why so slow

Adaikalavan,

Thanks for your fantastic summary. Solution 5 is the version I was looking
for, but I had left out the cbind, in good old C fashion.
> # Solution 5 : Vectorised operation with list extractor
> system.time( ee <- PatDay$Day - StartDay[ cbind(PatDay$Treat,PatDay$Pat) ] )

Dieter
