thr3ads.net - R help - [R] Vectorization of three embedded loops [Jan 2009]

Thomas Terhoeven-Urselmans

2009-Jan-14 07:32 UTC

[R] Vectorization of three embedded loops

Dear R programmers,

I wrote an adapted implementation of the Kennard-Stone algorithm for
sample selection from multivariate data (R 2.7.1 on a MacBook Pro,
2.2 GHz Intel Core 2 Duo, 2 GB 667 MHz DDR2 SDRAM).
The heart of the script consists of three embedded (nested) loops, which
makes it very slow, especially for large datasets. For a data matrix of
1853*1853 and a selection of 556 samples, the computation took more than
24 hours.
I did some research on vectorization, but I could not figure out how
to make it better or faster. What ways are there to replace the
time-consuming loops?

Here is some information:

# val.n<-24;
# start.b<-matrix(nrow=1812, ncol=20);
# val is a vector of the row names of 22 extreme samples chosen in an earlier step;
# euc<-matrix(nrow=1853, ncol=1853); [contains the Euclidean distance calculations]

The following system.time() result is for the selection of
two samples:
system.time(KEN.STO(val.n,start.b,val.start,euc))
    user  system elapsed
  25.294  13.262  38.927

The function:

KEN.STO<-function(val.n,start.b,val,euc){

for(k in 1:val.n){
sum.dist<-c();
for(i in 1:length(start.b[,1])){
	sum<-c();
	for(j in 1:length(val)){
		sum[j]<-euc[rownames(start.b)[i],val[j]]
		}
		sum.dist[i]<-min(sum);
	}
bla<-rownames(start.b)[which(sum.dist==max(sum.dist))]
val<-c(val,bla[1]);
start.b<-start.b[-(which(match(rownames(start.b),val[length(val)])!="NA")),];
if(length(val)>=val.n)break;
}
return(val);
}

Regards,

Thomas

Dr. Thomas Terhoeven-Urselmans
Post-Doc Fellow
Soil infrared spectroscopy
World Agroforestry Center (ICRAF) 

Patrick Burns

2009-Jan-14 09:52 UTC


You are definitely in Circle 2 of the R Inferno.
Growing objects is suboptimal, although your
objects are small so this probably isn't taking
too much time.

There is no need for the inner-most loop:

  sum.dist[i] <- min(euc[rownames(start.b)[i],val] )
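
A minimal sketch of this replacement on toy data, assuming euc is a distance matrix with row and column names. The names pts and cand and the sample labels are made up for illustration; cand plays the role of rownames(start.b). The last line goes one step further than the suggestion above and also drops the loop over i.

```r
# Toy distance matrix with dimnames (hypothetical data, not from the post)
set.seed(1)
pts <- matrix(rnorm(12), nrow = 6,
              dimnames = list(paste0("s", 1:6), NULL))
euc <- as.matrix(dist(pts))

val  <- c("s1", "s2")               # samples already selected
cand <- c("s3", "s4", "s5", "s6")   # stands in for rownames(start.b)

# Two nested loops, as in the original function
loop.min <- numeric(length(cand))
for (i in seq_along(cand)) {
  d <- numeric(length(val))
  for (j in seq_along(val)) d[j] <- euc[cand[i], val[j]]
  loop.min[i] <- min(d)
}

# Inner loop replaced: index a row by the whole vector val
vec.min <- numeric(length(cand))
for (i in seq_along(cand)) vec.min[i] <- min(euc[cand[i], val])

# Both loops replaced: take row minima of the sub-matrix
all.min <- apply(euc[cand, val, drop = FALSE], 1, min)
```

All three results hold the same values; all.min additionally carries the candidate names.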

Maybe I'm blind, but I don't see where 'k' comes
in from the outer-most loop.
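
Since 'k' is never used in the loop body, the outer loop only repeats the selection step until val is long enough. One possible way to write that is sketched below. This is not the original function: ken.sto.sketch and cand are hypothetical names, candidates are passed as a character vector of row names rather than as the matrix start.b, and the running minimum is updated with pmin() instead of being recomputed from scratch, which is a swapped-in technique.

```r
# Sketch only: 'ken.sto.sketch' and 'cand' are hypothetical names.
ken.sto.sketch <- function(val.n, cand, val, euc) {
  # Minimum distance from each candidate to the samples already selected
  min.dist <- apply(euc[cand, val, drop = FALSE], 1, min)
  while (length(val) < val.n && length(cand) > 0) {
    pick <- which.max(min.dist)   # first maximum, like bla[1] in the post
    new  <- cand[pick]
    val  <- c(val, new)
    cand <- cand[-pick]
    min.dist <- min.dist[-pick]
    # Only the newly added sample can lower a candidate's minimum
    min.dist <- pmin(min.dist, euc[cand, new])
  }
  val
}

# Usage on toy data (made up for illustration)
set.seed(42)
pts <- matrix(rnorm(40), nrow = 20,
              dimnames = list(paste0("s", 1:20), NULL))
euc <- as.matrix(dist(pts))
val.start <- c("s1", "s2")
cand <- setdiff(rownames(euc), val.start)
selected <- ken.sto.sketch(6, cand, val.start, euc)
```

Each pass then touches one column of euc instead of every selected column, so the work per selected sample no longer grows with length(val).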


Patrick Burns
patrick at burns-stat.com
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of "The R Inferno" and "A Guide for the Unwilling S
User")



Carlos J. Gil Bellosta

2009-Jan-14 09:58 UTC


Hello,

I believe that your bottleneck lies at this piece of code:

sum<-c();
for(j in 1:length(val)){
	sum[j]<-euc[rownames(start.b)[i],val[j]]
}

In order to speed up your code, there are two alternatives:

1) Try to reorder the euc matrix so that the sum vector corresponds to
(part of) a row or column of euc.

2) For each i value, create a matrix with the coordinates corresponding
to ( rownames(start.b)[i], val[j] ) and index the matrix by this matrix
in order to create sum. This will be easiest if you can reorder euc in a
way that accessing its elements will be easy (and then you would be back
into (1)).
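
Both alternatives can be sketched on toy data as follows. The data and the names pts, row.name, and idx are assumptions made for illustration; row.name stands in for rownames(start.b)[i]. Names are converted to integer positions with match() before building the index matrix.

```r
# Toy distance matrix with dimnames (hypothetical data)
set.seed(2)
pts <- matrix(rnorm(10), nrow = 5,
              dimnames = list(paste0("s", 1:5), NULL))
euc <- as.matrix(dist(pts))

val      <- c("s1", "s4")   # samples already selected
row.name <- "s3"            # stands in for rownames(start.b)[i]

# Alternative 2: a two-column matrix of (row, column) positions;
# cbind() recycles the single row position against all columns
idx <- cbind(match(row.name, rownames(euc)),
             match(val, colnames(euc)))
by.matrix <- euc[idx]

# Alternative 1: the same values are simply part of one row of euc
by.row <- euc[row.name, val]
```

Here by.matrix and by.row contain the same distances; by.row keeps the column names.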

Creating a variable sum as c() and increasing its size in a loop is one
of the easiest ways to uselessly burn your CPU.
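
To illustrate the point, here is a small sketch comparing a vector grown inside a loop with a preallocated one and with a fully vectorized expression. The function names grow, prealloc, and vectorized are hypothetical, and the timings depend on the R version and machine, so none are quoted.

```r
# Grows the result one element at a time, as 'sum' does in the post
grow <- function(n) {
  x <- c()
  for (i in seq_len(n)) x[i] <- i^2
  x
}

# Allocates the full result once, then fills it in
prealloc <- function(n) {
  x <- numeric(n)
  for (i in seq_len(n)) x[i] <- i^2
  x
}

# No explicit loop at all
vectorized <- function(n) seq_len(n)^2

n <- 1e4
t.grow <- system.time(r1 <- grow(n))
t.pre  <- system.time(r2 <- prealloc(n))
t.vec  <- system.time(r3 <- vectorized(n))
```

All three return the same vector, so the timings in t.grow, t.pre, and t.vec can be compared directly.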

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com


