Hi,
  I have a data.frame with 294 columns and 211 rows. I am calculating
correlations between all pairs of columns (excluding column 1), and based
on these correlation values I delete one column from any pair that shows
an R^2 greater than a cutoff value. (Rather than directly delete the
column, all I do is store the column number and do the deletion later.)

The code I am using is:

  ndesc <- length(names(data));
  for (i in 2:(ndesc-1)) {
    for (j in (i+1):ndesc) {

      if (i %in% drop || j %in% drop) next;

      r2 <- cor(data[,i], data[,j]);
      r2 <- r2*r2;

      if (r2 >= r2cut) {
        rnd <- abs(rnorm(1));
        if (rnd < 0.5) { drop <- c(drop,i); }
        else { drop <- c(drop,j); }
      }
    }
  }

drop is a vector that contains column numbers that can be skipped;
data is the data.frame.

For the data.frame mentioned above (294 columns, 211 rows) the
calculation takes more than 7 minutes (at which point I Ctrl-C'ed the
calculation). The machine is a 1GHz Duron with 1GB RAM.

The output of version is:

  platform  i686-pc-linux-gnu
  arch      i686
  os        linux-gnu
  system    i686, linux-gnu
  status
  major     1
  minor     7.1
  year      2003
  month     06
  day       16
  language  R

I'm not too sure why it takes *so* long (a similar calculation I had
done in Python using list operations also took forever), but is there
any trick that could be used to make this run faster, or is this type
of runtime to be expected?

Thanks,
-------------------------------------------------------------------
Rajarshi Guha <rxg218 at psu.edu> <http://jijo.cjb.net>
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
-------------------------------------------------------------------
A red sign on the door of a physics professor:
'If this sign is blue, you're going too fast.'
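P.S. In case anyone wants to reproduce the timing, random data of the
same shape should do. This is just a self-contained harness around the
loop above; the cutoff 0.8 and the seed are arbitrary examples:

  set.seed(1)
  data <- as.data.frame(matrix(rnorm(211 * 294), nrow = 211))
  r2cut <- 0.8          # arbitrary example cutoff
  drop <- c()           # columns flagged for later deletion
  system.time({
    ndesc <- length(names(data))
    for (i in 2:(ndesc-1)) {
      for (j in (i+1):ndesc) {
        if (i %in% drop || j %in% drop) next
        r2 <- cor(data[,i], data[,j])^2
        if (r2 >= r2cut) {
          rnd <- abs(rnorm(1))
          if (rnd < 0.5) drop <- c(drop, i) else drop <- c(drop, j)
        }
      }
    }
  })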
Have you tried computing the full correlation matrix with "cor" and then
selecting variables to retain or drop from the resulting matrix? R uses
vectorized arithmetic for operations like "cor"; by comparison, "for"
loops are quite inefficient, requiring extra overhead for memory
management and validity checking.

Alternatively, have you considered vectorizing the inner loop, something
like the following:

  ndesc <- dim(data)[2]
  Keep <- rep(TRUE, ndesc)
  for (i in 2:(ndesc-1)) {
    if (any(K.i <- Keep[(i+1):ndesc])) {
      cor.i <- cor(data[,i], data[,((i+1):ndesc)[K.i]])
      ... <your selection criteria applied to Keep>
    }
  }

Obviously, I haven't tested this specific code, but I hope it is
adequate to illustrate the technique. It might even be faster than
either of the other options discussed.

hope this helps.
spencer graves
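P.S. Untested, but the selection step might be filled in along the
following lines, keeping the coin-flip rule from the original post (a
50/50 choice via runif(1); the helper names j.cand and r2.i are only
illustrative):

  ndesc <- dim(data)[2]
  Keep <- rep(TRUE, ndesc)
  for (i in 2:(ndesc-1)) {
    if (Keep[i] && any(K.i <- Keep[(i+1):ndesc])) {
      j.cand <- ((i+1):ndesc)[K.i]            # columns still in play
      r2.i <- as.vector(cor(data[,i], data[,j.cand]))^2
      for (j in j.cand[r2.i >= r2cut]) {      # partners over the cutoff
        if (!Keep[i]) break                   # i itself was just dropped
        if (runif(1) < 0.5) Keep[i] <- FALSE else Keep[j] <- FALSE
      }
    }
  }
  drop <- which(!Keep)                        # column 1 is never flagged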
Adaikalavan RAMASAMY
2003-Nov-21 05:54 UTC
[R] speeding up a pairwise correlation calculation
You probably want to use runif() instead of rnorm() for an equal
probability of selecting between i and j -- P( abs(rnorm(1)) < 0.5 ) is
about 0.38, not 0.5.

Your algorithm is of order n^2 [ at most choose(293, 2) = 42778 pairs ],
so it should not be too slow. But two for() loops are inefficient in R;
something like this would be fairly fast in C.

What is your aim in trying to do this? Your algorithm is similar to
hclust() - which has nice graphical support - but hclust() merges the
two nearest neighbours into a new cluster instead of removing one of
the neighbours. By removing columns at an early stage you are losing
information.

The alternative would be to use hclust() with a
similarity/dissimilarity cutoff to create groups. Then from each group
you can either choose the average profile or randomly select one column
to represent the group.

--
Adaikalavan Ramasamy
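P.S. A rough, untested sketch of the hclust() route; the choice of
1 - abs(correlation) as the dissimilarity and complete linkage are just
one possibility:

  ## columns with high |correlation| become "close" points
  d <- as.dist(1 - abs(cor(data[, -1])))
  hc <- hclust(d, method = "complete")

  ## with complete linkage, cutting at height h guarantees every pair
  ## within a group has dissimilarity <= h, i.e. |r| >= sqrt(r2cut)
  grp <- cutree(hc, h = 1 - sqrt(r2cut))

  ## keep one randomly chosen column per group
  pick <- tapply(seq_along(grp), grp, function(ix) ix[sample(length(ix), 1)])
  reduced <- data[, c(1, pick + 1)]   # + 1 because grp indexes data[, -1]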
My guess is that the objective is to delete correlated variables before
doing some sort of modeling... This is what I would do (untested):

  rcut <- sqrt(r2cut)
  cormat <- cor(data[, 2:ncol(data)])
  ## positions of entries larger than the cutoff
  bad.idx <- which(abs(cormat) > rcut, arr.ind=TRUE)
  ## keep each pair once (row < col, i.e. the upper triangle)
  bad.idx <- bad.idx[bad.idx[,1] < bad.idx[,2], , drop=FALSE]
  ## randomly pick one or the other:
  drop.idx <- ifelse(runif(nrow(bad.idx)) > .5, bad.idx[,1], bad.idx[,2])

HTH,
Andy
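P.S. Two small caveats if you use the above (again untested): drop.idx
indexes cormat, which was computed on data[, -1], so it is off by one
relative to the original data.frame, and it may contain duplicates. The
actual deletion would be something like:

  drop.col <- unique(drop.idx) + 1    # shift past column 1, de-duplicate
  if (length(drop.col) > 0) data <- data[, -drop.col]

Also note this one-shot version judges every pair, including pairs whose
members are already flagged, so it will typically drop a few more
columns than the sequential loop does.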