Purna chander
2012-Oct-08 07:25 UTC
[R] Any better way of optimizing time for calculating distances in the mentioned scenario??
Dear All,

I'm dealing with a case where the 'manhattan' distance of each of 100 vectors is calculated from 10000 other vectors. To achieve this, I tested the following 4 scenarios:

1) Scenario 1:

    > x <- read.table("query.vec")
    > v <- read.table("query.vec2")
    > d <- matrix(nrow=nrow(v), ncol=nrow(x))
    > for (i in 1:nrow(v)) {
    +   d[i,] <- sapply(1:nrow(x), function(z) {dist(rbind(v[i,], x[z,]), method="manhattan")})
    + }
    > print(d[1,1:10])

Time taken for running the code:

    real 1m33.088s
    user 1m32.287s
    sys  0m0.036s

2) Scenario 2:

    > x <- read.table("query.vec")
    > v <- read.table("query.vec2")
    > v <- as.matrix(v)
    > d <- matrix(nrow=nrow(v), ncol=nrow(x))
    > for (i in 1:nrow(v)) {
    +   tmp_m <- matrix(rep(v[i,], nrow(x)), nrow=nrow(x), byrow=TRUE)
    +   d[i,] <- rowSums(abs(tmp_m - x))
    + }
    > print(d[1,1:10])

Time taken for running the code:

    real 0m0.882s
    user 0m0.854s
    sys  0m0.025s

3) Scenario 3:

    > x <- read.table("query.vec")
    > v <- read.table("query.vec2")
    > v <- as.matrix(v)
    > d <- matrix(nrow=nrow(v), ncol=nrow(x))
    > for (i in 1:nrow(v)) {
    +   d[i,] <- sapply(1:nrow(x), function(z) {dist(rbind(v[i,], x[z,]), method="manhattan")})
    + }
    > print(d[1,1:10])

Time taken for running the code:

    real 1m3.817s
    user 1m3.543s
    sys  0m0.031s

4) Scenario 4:

    > x <- read.table("query.vec")
    > v <- read.table("query.vec2")
    > v <- as.matrix(v)
    > d <- dist(rbind(v, x), method="manhattan")
    > m <- as.matrix(d)
    > m2 <- m[1:nrow(v), (nrow(v)+1):(nrow(v)+nrow(x))]  # rows of x follow the rows of v in rbind(v, x)
    > print(m2[1,1:10])

Time taken for running the code:

    real 0m0.445s
    user 0m0.401s
    sys  0m0.041s

Queries:

1) Though scenario 4 is the fastest, it failed when matrix 'v' had a larger number of rows. An error occurred while converting the distance object 'd' to a matrix 'm'. For example:

    > m <- as.matrix(d)

resulted in the error: "Error: cannot allocate vector of size 922.7 MB". So, how can a larger dist object be converted into a matrix, or how can the allocation size be increased?

2) I observed that the 'dist()' function calculates the distances between all pairs of vectors present in a given matrix or data frame. Is it not possible to calculate the distances of specific vectors from the other vectors in a matrix using 'dist()'? For example, if a matrix 'x' has 20 rows, is it possible to use 'dist()' to calculate only the distances of the 1st row vector from the other 19 vectors?

3) Any other ideas to optimize the problem I'm facing?

Regards,
Purnachander
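
One possible direction for queries 2 and 3 is to compute only the v-to-x distances directly, so that the full dist object over rbind(v, x) is never built or converted. The sketch below is not from the original post: the function name cross_manhattan is hypothetical, the data are small synthetic stand-ins for query.vec2 and query.vec, and it assumes both tables are purely numeric with the same number of columns.

```r
# Manhattan distances between every row of v (queries) and every row of x.
# Only nrow(v) x nrow(x) matrices are allocated, never the full pairwise
# dist object over rbind(v, x).
# NOTE: cross_manhattan is a hypothetical helper name used for illustration.
cross_manhattan <- function(v, x) {
  v <- as.matrix(v)
  x <- as.matrix(x)
  stopifnot(ncol(v) == ncol(x))
  d <- matrix(0, nrow = nrow(v), ncol = nrow(x))
  # Accumulate one coordinate at a time: outer() returns the
  # nrow(v) x nrow(x) table of differences v[i, k] - x[j, k].
  for (k in seq_len(ncol(v))) {
    d <- d + abs(outer(v[, k], x[, k], "-"))
  }
  d
}

# Synthetic data standing in for the real input files (assumption).
set.seed(1)
v <- matrix(runif(5 * 4), nrow = 5)    # 5 query vectors, 4 coordinates
x <- matrix(runif(20 * 4), nrow = 20)  # 20 reference vectors

d <- cross_manhattan(v, x)

# Cross-check against the scenario 4 approach on this small example.
m <- as.matrix(dist(rbind(v, x), method = "manhattan"))
ref <- m[1:nrow(v), (nrow(v) + 1):(nrow(v) + nrow(x))]
print(isTRUE(all.equal(d, unname(ref))))
```

For the single-row case in query 2, the same helper could be called as cross_manhattan(x[1, , drop = FALSE], x[-1, ]), which returns a 1-row matrix holding the distances from the 1st row to the remaining rows. The loop runs over columns rather than rows, so the number of R-level iterations depends on the vector length, not on how many vectors there are.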