Thomas Terhoeven-Urselmans
2009-Jan-14 07:32 UTC
[R] Vectorization of three embedded loops
Dear R-programmer, I wrote an adapted implementation of the Kennard-Stone algorithm for sample selection of multivariate data (R 2.7.1 under MacBook Pro, Processor 2.2 GHz Intel Core 2 Duo, Memory 2 GB 667 MHZ DDR2 SDRAM). I used for the heart of the script three embedded loops. This makes it especially for huge datasets very slow. For a datamatrix of 1853*1853 and the selection of 556 samples needed computation time of more than 24 hours. I did some research on vecotrization, but I could not figure out how to do it better/faster. Which ways are there to replace the time consuming loops? Here are some information: # val.n<-24; # start.b<-matrix(nrow=1812, ncol=20); # val is a vector of the rownames of 22 in an earlier step chosen extrem samples; # euc<-<-matrix(nrow=1853, ncol=1853); [contains the Euclidean distance calculations] The following calculation of the system.time was for the selection of two samples: system.time(KEN.STO(val.n,start.b,val.start,euc)) user system elapsed 25.294 13.262 38.927 The function: KEN.STO<-function(val.n,start.b,val,euc){ for(k in 1:val.n){ sum.dist<-c(); for(i in 1:length(start.b[,1])){ sum<-c(); for(j in 1:length(val)){ sum[j]<-euc[rownames(start.b)[i],val[j]] } sum.dist[i]<-min(sum); } bla<-rownames(start.b)[which(sum.dist==max(sum.dist))] val<-c(val,bla[1]); start.b<-start.b[-(which(match(rownames(start.b),val[length(val)])! ="NA")),]; if(length(val)>=val.n)break; } return(val); } Regards, Thomas Dr. Thomas Terhoeven-Urselmans Post-Doc Fellow Soil infrared spectroscopy World Agroforestry Center (ICRAF) [[alternative HTML version deleted]]
You are definitely in Circle 2 of the R Inferno. Growing objects is suboptimal, although your objects are small so this probably isn't taking too much time. There is no need for the inner-most loop: sum.dist[i] <- min(euc[rownames(start.b)[i],val] ) Maybe I'm blind, but I don't see where 'k' comes in from the outer-most loop. Patrick Burns patrick at burns-stat.com +44 (0)20 8525 0696 http://www.burns-stat.com (home of "The R Inferno" and "A Guide for the Unwilling S User") Thomas Terhoeven-Urselmans wrote:> Dear R-programmer, > > I wrote an adapted implementation of the Kennard-Stone algorithm for > sample selection of multivariate data (R 2.7.1 under MacBook Pro, > Processor 2.2 GHz Intel Core 2 Duo, Memory 2 GB 667 MHZ DDR2 SDRAM). > I used for the heart of the script three embedded loops. This makes it > especially for huge datasets very slow. For a datamatrix of 1853*1853 > and the selection of 556 samples needed computation time of more than > 24 hours. > I did some research on vecotrization, but I could not figure out how > to do it better/faster. Which ways are there to replace the time > consuming loops? > > Here are some information: > > # val.n<-24; > # start.b<-matrix(nrow=1812, ncol=20); > # val is a vector of the rownames of 22 in an earlier step chosen > extrem samples; > # euc<-<-matrix(nrow=1853, ncol=1853); [contains the Euclidean > distance calculations] > > The following calculation of the system.time was for the selection of > two samples: > system.time(KEN.STO(val.n,start.b,val.start,euc)) > user system elapsed > 25.294 13.262 38.927 > > The function: > > KEN.STO<-function(val.n,start.b,val,euc){ > > for(k in 1:val.n){ > sum.dist<-c(); > for(i in 1:length(start.b[,1])){ > sum<-c(); > for(j in 1:length(val)){ > sum[j]<-euc[rownames(start.b)[i],val[j]] > } > sum.dist[i]<-min(sum); > } > bla<-rownames(start.b)[which(sum.dist==max(sum.dist))] > val<-c(val,bla[1]); > start.b<-start.b[-(which(match(rownames(start.b),val[length(val)])! > ="NA")),]; > if(length(val)>=val.n)break; > } > return(val); > } > > Regards, > > Thomas > > Dr. Thomas Terhoeven-Urselmans > Post-Doc Fellow > Soil infrared spectroscopy > World Agroforestry Center (ICRAF) > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > > >
Hello, I believe that your bottleneck lies at this piece of code: sum<-c(); for(j in 1:length(val)){ sum[j]<-euc[rownames(start.b)[i],val[j]] } In order to speed up your code, there are two alternatives: 1) Try to reorder the euc matrix so that the sum vector corresponds to (part of) a row or column of euc. 2) For each i value, create a matrix with the coordinates corresponding to ( rownames(start.b)[i], val[j] ) and index the matrix by this matrix in order to create sum. This will be easiest if you can reorder euc in a way that accessing its elements will be easy (and then you would be back into (1)). Creating a variable sum as c() and increasing its size in a loop is one of the easiest ways to uselessly burn your CPU. Best regards, Carlos J. Gil Bellosta http://www.datanalytics.com On Wed, 2009-01-14 at 10:32 +0300, Thomas Terhoeven-Urselmans wrote:> Dear R-programmer, > > I wrote an adapted implementation of the Kennard-Stone algorithm for > sample selection of multivariate data (R 2.7.1 under MacBook Pro, > Processor 2.2 GHz Intel Core 2 Duo, Memory 2 GB 667 MHZ DDR2 SDRAM). > I used for the heart of the script three embedded loops. This makes it > especially for huge datasets very slow. For a datamatrix of 1853*1853 > and the selection of 556 samples needed computation time of more than > 24 hours. > I did some research on vecotrization, but I could not figure out how > to do it better/faster. Which ways are there to replace the time > consuming loops? > > Here are some information: > > # val.n<-24; > # start.b<-matrix(nrow=1812, ncol=20); > # val is a vector of the rownames of 22 in an earlier step chosen > extrem samples; > # euc<-<-matrix(nrow=1853, ncol=1853); [contains the Euclidean > distance calculations] > > The following calculation of the system.time was for the selection of > two samples: > system.time(KEN.STO(val.n,start.b,val.start,euc)) > user system elapsed > 25.294 13.262 38.927 > > The function: > > KEN.STO<-function(val.n,start.b,val,euc){ > > for(k in 1:val.n){ > sum.dist<-c(); > for(i in 1:length(start.b[,1])){ > sum<-c(); > for(j in 1:length(val)){ > sum[j]<-euc[rownames(start.b)[i],val[j]] > } > sum.dist[i]<-min(sum); > } > bla<-rownames(start.b)[which(sum.dist==max(sum.dist))] > val<-c(val,bla[1]); > start.b<-start.b[-(which(match(rownames(start.b),val[length(val)])! > ="NA")),]; > if(length(val)>=val.n)break; > } > return(val); > } > > Regards, > > Thomas > > Dr. Thomas Terhoeven-Urselmans > Post-Doc Fellow > Soil infrared spectroscopy > World Agroforestry Center (ICRAF) > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.
Possibly Parallel Threads
- Problems loading RWeka and rJava under R 2.10.1
- problems with method ken.sto in package soil.spec: subscript out of bounds
- Least-square support vector machines regression!
- another missing link in febootstrap; failing tests for libguestfs
- times family unavailable in postscript device (Ubuntu Linux)