Juan Pablo Lewinger
2007-May-25 06:04 UTC
[R] Speeding up resampling of rows from a large matrix
I'm trying to: Resample with replacement pairs of distinct rows from a 120 x 65,000 matrix H of 0's and 1's. For each resampled pair sum the resulting 2 x 65,000 matrix by column: 0 1 0 1 ... + 0 0 1 1 ... _______ = 0 1 1 2 ... For each column accumulate the number of 0's, 1's and 2's over the resamples to obtain a 3 x 65,000 matrix G. For those interested in the background, H is a matrix of haplotypes, each pair of haplotypes forms a genotype, and each column corresponds to a SNP. I'm using resampling to compute the null distribution of the maximum over correlated SNPs of a simple statistic. The code: #------------------------------------------------------------------------------- nSNPs <- 1000 H <- matrix(sample(0:1, 120*nSNPs , replace=T), nrow=120) G <- matrix(0, nrow=3, ncol=nSNPs) # Keep in mind that the real H is 120 x 65000 nResamples <- 3000 pair <- replicate(nResamples, sample(1:120, 2)) gen <- function(x){g <- sum(x); c(g==0, g==1, g==2)} for (i in 1:nResamples){ G <- G + apply(H[pair[,i],], 2, gen) } #------------------------------------------------------------------------------- The problem is that the loop takes about 80 mins to complete and I need to repeat the whole thing 10,000 times, which would then take over a year and a half! Is there a way to speed this up so that the full 10,000 iterations take a reasonable amount of time (say a week)? My machine has an Intel Xeon 3.40GHz CPU with 1GB of RAM > sessionInfo() R version 2.5.0 (2007-04-23) i386-pc-mingw32 I would greatly appreciate any help. Juan Pablo Lewinger Department of Preventive Medicine Keck School of Medicine University of Southern California
Bill.Venables at csiro.au
2007-May-25 08:11 UTC
[R] Speeding up resampling of rows from a large matrix
Here is a possibility. The only catch is that if a pair of rows is selected twice you will get the results in a block, not scattered at random throughout the columns of G. I can't see that as a problem. ### --- start code excerpt --- nSNPs <- 1000 H <- matrix(sample(0:1, 120*nSNPs , replace=T), nrow=120) # G <- matrix(0, nrow=3, ncol=nSNPs) # Keep in mind that the real H is 120 x 65000 ij <- as.matrix(subset(expand.grid(i = 1:120, j = 1:120), i < j)) nResamples <- 3000 sel <- sample(1:nrow(ij), nResamples, rep = TRUE) repf <- table(sel) # replication factors ij <- ij[as.numeric(names(repf)), ] # distinct choice made G <- matrix(0, nrow = 3, ncol = nrow(ij)) # for now for(j in 1:ncol(G)) G[,j] <- rowSums(outer(0:2, colSums(H[ij[j, ], ]), "==")) G <- G[, rep(1:ncol(G), repf)] # bulk up the result # _ # _ # _ # _pair <- replicate(nResamples, sample(1:120, 2)) # _ # _gen <- function(x){g <- sum(x); c(g==0, g==1, g==2)} # _ # _for (i in 1:nResamples){ # _ G <- G + apply(H[pair[,i],], 2, gen) # _} ### --- end of code excerpt --- I did a timing on my machine which is a middle-of-the range windows monstrosity...> system.time({+ + nSNPs <- 1000 + H <- matrix(sample(0:1, 120*nSNPs , replace=T), nrow=120) + + # G <- matrix(0, nrow=3, ncol=nSNPs) + + # Keep in mind that the real H is 120 x 65000 + + ij <- as.matrix(subset(expand.grid(i = 1:120, j = 1:120), i < j)) + + nResamples <- 3000 + sel <- sample(1:nrow(ij), nResamples, rep = TRUE) + repf <- table(sel) # replication factors + ij <- ij[as.numeric(names(repf)), ] # distinct choice made + + G <- matrix(0, nrow = 3, ncol = nrow(ij)) # for now + + for(j in 1:ncol(G)) + G[,j] <- rowSums(outer(0:2, colSums(H[ij[j, ], ]), "==")) + + G <- G[, rep(1:ncol(G), repf)] # bulk up the result + + # _ + # _ + # _ + # _pair <- replicate(nResamples, sample(1:120, 2)) + # _ + # _gen <- function(x){g <- sum(x); c(g==0, g==1, g==2)} + # _ + # _for (i in 1:nResamples){ + # _ G <- G + apply(H[pair[,i],], 2, gen) + # _} + # _#---------------------------------------------------------------------- --------- + # _ + }) user system elapsed 0.97 0.00 0.99 Less than a second. Somewhat of an improvement on the 80 minutes, I reckon. This will increase, of course when you step the size of the H matrix up from 1000 to 65000 columns Bill Venables CSIRO Laboratories PO Box 120, Cleveland, 4163 AUSTRALIA Office Phone (email preferred): +61 7 3826 7251 Fax (if absolutely necessary): +61 7 3826 7304 Mobile: (I don't have one!) Home Phone: +61 7 3286 7700 mailto:Bill.Venables at csiro.au http://www.cmis.csiro.au/bill.venables/ -----Original Message----- From: r-help-bounces at stat.math.ethz.ch [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Juan Pablo Lewinger Sent: Friday, 25 May 2007 4:04 PM To: r-help at stat.math.ethz.ch Subject: [R] Speeding up resampling of rows from a large matrix I'm trying to: Resample with replacement pairs of distinct rows from a 120 x 65,000 matrix H of 0's and 1's. For each resampled pair sum the resulting 2 x 65,000 matrix by column: 0 1 0 1 ... + 0 0 1 1 ... _______ = 0 1 1 2 ... For each column accumulate the number of 0's, 1's and 2's over the resamples to obtain a 3 x 65,000 matrix G. For those interested in the background, H is a matrix of haplotypes, each pair of haplotypes forms a genotype, and each column corresponds to a SNP. I'm using resampling to compute the null distribution of the maximum over correlated SNPs of a simple statistic. The code: #----------------------------------------------------------------------- -------- nSNPs <- 1000 H <- matrix(sample(0:1, 120*nSNPs , replace=T), nrow=120) G <- matrix(0, nrow=3, ncol=nSNPs) # Keep in mind that the real H is 120 x 65000 nResamples <- 3000 pair <- replicate(nResamples, sample(1:120, 2)) gen <- function(x){g <- sum(x); c(g==0, g==1, g==2)} for (i in 1:nResamples){ G <- G + apply(H[pair[,i],], 2, gen) } #----------------------------------------------------------------------- -------- The problem is that the loop takes about 80 mins to complete and I need to repeat the whole thing 10,000 times, which would then take over a year and a half! Is there a way to speed this up so that the full 10,000 iterations take a reasonable amount of time (say a week)? My machine has an Intel Xeon 3.40GHz CPU with 1GB of RAM > sessionInfo() R version 2.5.0 (2007-04-23) i386-pc-mingw32 I would greatly appreciate any help. Juan Pablo Lewinger Department of Preventive Medicine Keck School of Medicine University of Southern California ______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.