Hello,

I am a relatively new user of R. I have written a basic function to calculate the Gower similarity function. I was motivated to do so partly as an exercise in learning R, and partly because the existing option (vegdist in the vegan package) does not accept missing values.

I think I have succeeded - my function gives me the correct values. However, now that I'm starting to use it with real data, I realise it's very slow. It takes more than 45 minutes on my Windows 98 machine (R 2.0.1 Patched (2005-03-29)) with a 185x32 matrix with ca 100 missing values. If anyone can suggest ways to speed up my function I would appreciate it. I suspect having a pair of nested for loops is the problem, but I couldn't figure out how to get rid of them.

The function is:

### Gower Similarity Matrix ###

sGow <- function(mat) {

  OBJ <- nrow(mat)        # number of objects
  MATDESC <- ncol(mat)    # number of descriptors
  # descriptor ranges
  MRANGE <- apply(mat, 2, max, na.rm = TRUE) - apply(mat, 2, min, na.rm = TRUE)
  DESCRIPT <- 1:MATDESC   # descriptor index vector
  smat <- matrix(1, nrow = OBJ, ncol = OBJ)   # 'empty' similarity matrix

  for (i in 1:OBJ) {
    for (j in i:OBJ) {

      ## calculate index vector of non-NA descriptors between objects i and j
      descvect <- intersect(
        setdiff(DESCRIPT, DESCRIPT[is.na(mat[i, DESCRIPT])]),
        setdiff(DESCRIPT, DESCRIPT[is.na(mat[j, DESCRIPT])]))

      descnum <- length(descvect)   # number of valid descr for i~j comparison

      partialsim <- (1 - abs(mat[i, descvect] - mat[j, descvect]) / MRANGE[descvect])

      smat[i, j] <- smat[j, i] <- sum(partialsim) / descnum
    }
  }
  smat
}

Thank-you for your time,

Tyler

--
Tyler Smith
PhD Candidate
Plant Science Department
McGill University
tyler.smith at mail.mcgill.ca
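One possible way to get rid of the pair of nested loops, offered as a minimal sketch rather than a tested replacement: loop over the descriptors (32 columns) instead of over pairs of objects, and let outer() compute all pairwise differences for one descriptor at a time. The name sGowFast is hypothetical, and the sketch assumes a purely numeric matrix. Unlike sGow, it treats a descriptor with zero range as missing instead of returning NaN for every pair.

```r
## Sketch: Gower similarity with pairwise deletion of NAs,
## looping over descriptors (columns) rather than pairs of objects.
sGowFast <- function(mat) {
  mat <- as.matrix(mat)
  n <- nrow(mat)
  ## descriptor ranges, ignoring missing values
  rng <- apply(mat, 2, max, na.rm = TRUE) - apply(mat, 2, min, na.rm = TRUE)
  num <- matrix(0, n, n)   # running sum of partial similarities
  cnt <- matrix(0, n, n)   # number of descriptors valid for each pair
  for (k in seq_len(ncol(mat))) {
    x <- mat[, k]
    ## partial similarity for descriptor k; NA wherever either value is NA
    ## (NaN, also counted as missing, if the descriptor has zero range)
    s <- 1 - abs(outer(x, x, "-")) / rng[k]
    ok <- !is.na(s)
    s[!ok] <- 0
    num <- num + s
    cnt <- cnt + ok
  }
  num / cnt   # NaN for pairs with no descriptor in common
}
```

The work inside the loop is done on whole n x n matrices, so the number of interpreted R steps grows with the number of descriptors instead of with the square of the number of objects.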
On 18 Apr 2005, at 19:10, Tyler Smith wrote:

> Hello,
>
> I am a relatively new user of R. I have written a basic function to
> calculate the Gower similarity function. I was motivated to do so
> partly as an exercise in learning R, and partly because the existing
> option (vegdist in the vegan package) does not accept missing values.
>

Speed is the reason to use C instead of R. It should be easy, almost trivial, to modify the vegdist.c so that it handles missing values. I guess this handling means ignoring the value pair if one of the values is missing -- which is not so gentle to the metric properties so dear to Gower. Package vegan is designed for ecological community data which generally do not have missing values (except in environmental data), but contributions are welcome.

> I think I have succeeded - my function gives me the correct values.
> However, now that I'm starting to use it with real data, I realise
> it's very slow. It takes more than 45 minutes on my Windows 98 machine
> (R 2.0.1 Patched (2005-03-29)) with a 185x32 matrix with ca 100
> missing values. If anyone can suggest ways to speed up my function I
> would appreciate it. I suspect having a pair of nested for loops is
> the problem, but I couldn't figure out how to get rid of them.

cheers, jari oksanen
--
Jari Oksanen, Oulu, Finland
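The remark about metric properties can be illustrated with a small made-up data set (the four objects A to D below are invented for this purpose, and gower_pair is a hypothetical helper, not part of vegan). With pairwise deletion, an object that has a missing value can be at distance zero from two objects that are far from each other, which violates the triangle inequality:

```r
## Gower dissimilarity between two objects, ignoring any descriptor
## for which either value is missing (pairwise deletion).
gower_pair <- function(a, b, rng) {
  ok <- !is.na(a) & !is.na(b)
  sum(abs(a[ok] - b[ok]) / rng[ok]) / sum(ok)
}

## Invented example: two descriptors, object B lacks the first one.
x <- rbind(A = c(0, 0),
           B = c(NA, 0),
           C = c(1, 0),
           D = c(0, 1))
rng <- apply(x, 2, max, na.rm = TRUE) - apply(x, 2, min, na.rm = TRUE)

dAB <- gower_pair(x["A", ], x["B", ], rng)   # only descriptor 2 is compared
dBC <- gower_pair(x["B", ], x["C", ], rng)   # only descriptor 2 is compared
dAC <- gower_pair(x["A", ], x["C", ], rng)   # both descriptors are compared

## A and C differ, yet both coincide with B on the one shared descriptor,
## so d(A, C) exceeds d(A, B) + d(B, C).
dAC > dAB + dBC
```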
>>>>> "Tyler" == Tyler Smith <tyler.smith at mail.mcgill.ca>
>>>>>     on Mon, 18 Apr 2005 12:10:34 -0400 writes:

    Tyler> Hello, I am a relatively new user of R. I have
    Tyler> written a basic function to calculate the Gower
    Tyler> similarity function. I was motivated to do so partly
    Tyler> as an exercise in learning R, and partly because the
    Tyler> existing option (vegdist in the vegan package) does
    Tyler> not accept missing values.

I don't know what exactly you want. The function daisy() in the recommended package "cluster" has always worked with missing values and IIRC, the book "Kaufman & Rousseeuw" {which I have not at hand here at home}, clearly mentions Gower's origin of their distance measure definition.

Martin Maechler,
maintainer of cluster package, ETH Zurich

    Tyler> I think I have succeeded - my function gives me the
    Tyler> correct values. However, now that I'm starting to use
    Tyler> it with real data, I realise it's very slow. It takes
    Tyler> more than 45 minutes on my Windows 98 machine (R
    Tyler> 2.0.1 Patched (2005-03-29)) with a 185x32 matrix with
    Tyler> ca 100 missing values. If anyone can suggest ways to
    Tyler> speed up my function I would appreciate it. I suspect
    Tyler> having a pair of nested for loops is the problem, but
    Tyler> I couldn't figure out how to get rid of them.
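For reference, a minimal sketch of how daisy() might be used to obtain a Gower similarity matrix. It assumes a current version of the cluster package, in which daisy() accepts metric = "gower" (this argument value may not have been available in the version current at the time of this thread), and a purely numeric matrix. The wrapper name gower_sim_daisy is hypothetical.

```r
library(cluster)

## Sketch: Gower similarity as 1 minus the Gower dissimilarity from daisy().
## daisy() range-standardises each variable and skips missing values
## pair by pair, so the result should correspond to a similarity
## computed with pairwise deletion.
gower_sim_daisy <- function(mat) {
  d <- daisy(mat, metric = "gower")
  1 - as.matrix(d)
}

## Small made-up example with two missing values
set.seed(42)
m <- matrix(runif(30), nrow = 6, ncol = 5)
m[1, 2] <- NA
m[4, 5] <- NA
round(gower_sim_daisy(m), 3)
```

daisy() does its pairwise work in compiled code, so it should also avoid the speed problem of the nested R loops.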
    Tyler> The function is:

    Tyler> ### Gower Similarity Matrix ###

    Tyler> sGow <- function (mat){
    Tyler>   OBJ <- nrow(mat) #number of objects
    Tyler>   MATDESC <- ncol (mat) #number of descriptors
    Tyler>   MRANGE <- apply (mat,2,max, na.rm=T)-apply (mat,2,min,na.rm=T) #descr ranges
    Tyler>   DESCRIPT <- 1:MATDESC #descriptor index vector
    Tyler>   smat <- matrix(1, nrow = OBJ, ncol = OBJ) #'empty' similarity matrix
    Tyler>   for (i in 1:OBJ){
    Tyler>     for (j in i:OBJ){
    Tyler>       ##calculate index vector of non-NA descriptors between objects i and j
    Tyler>       descvect <- intersect (setdiff (DESCRIPT, DESCRIPT[is.na(mat[i,DESCRIPT])]),
    Tyler>                              setdiff (DESCRIPT, DESCRIPT[is.na (mat[j,DESCRIPT])]))
    Tyler>       descnum <- length(descvect) # number of valid descr for i~j comparison
    Tyler>       partialsim <- (1- abs(mat[i,descvect]-mat[j,descvect])/MRANGE[descvect])
    Tyler>       smat[i,j] <- smat[j,i] <- sum (partialsim) / descnum
    Tyler>     }
    Tyler>   }
    Tyler>   smat
    Tyler> }

    Tyler> Thank-you for your time,

    Tyler> Tyler

    Tyler> --
    Tyler> Tyler Smith
    Tyler> PhD Candidate
    Tyler> Plant Science Department
    Tyler> McGill University
    Tyler> tyler.smith at mail.mcgill.ca