Hi,

I have a 900,000,000 x 9,000 matrix where I need to calculate the correlation between all pairs of columns along the smaller dimension, producing a 9,000 x 9,000 correlation matrix. The matrix is too big to be loaded into R, so it is saved as a binary file. To access the data in the file I use mmap and some API functions (to get all values in one row, one column, or one particular value). I'm looking for advice on how to calculate the correlation matrix. Right now my approach is to do something similar to this (toy code):

corr.matrix <- matrix(NA_real_, ncol = 9000, nrow = 9000)

for (i in 1:8999) {
  for (j in (i + 1):9000) {
    # i1 = ... index of item i, looked up in a second file
    # i2 = ... index of item j
    g1 <- api$getCol(i1)
    g2 <- api$getCol(i2)
    corr.matrix[i, j] <- cor(g1, g2)
  }
}

This will work, but will take forever. Any advice on how this can be done more efficiently? I'm running on a 2.6.18 Linux system, with R version 2.11.1.

Thanks!
Peter Langfelder
2012-Mar-03 01:36 UTC
[R] Correlation of huge matrix saved as binary file
I don't think you can speed it up by a whole lot... but you can try a few things, especially if you don't have missing data in the matrix (which you probably don't). The main question is what takes most of the time: the api calls or the cor() call? If it's cor(), here's what you can try (a sketch follows below):

1. Pre-standardize the entire input matrix, i.e. scale each column to mean = 0 and sum of squares = 1. Save the standardized matrix (or make sure it is available to the api). Since your matrix only has 9,000 columns, this should not take extremely long.

2. Instead of calculating correlations, simply calculate sum(g1*g2): if g1 and g2 are standardized as above, the correlation equals sum(g1*g2).

3. Instead of calculating the correlations one by one, calculate them in small blocks (if you have enough memory and you run a 64-bit R). With 900M rows you will only be able to fit about a 900M x 2 block into an R object, but if you have two such standardized matrices loaded as g1 and g2, you can get their (2 x 2) correlation matrix as t(g1) %*% g2. This 2 x 2 matrix can then be used to fill in the appropriate entries of the result matrix.

4. Use one of the multi-threading packages (multicore comes to mind, but there are others) to parallelize your code. If you have 8 available cores, you can expect a nearly 8x speedup.

All in all, this will probably still take forever, but it should be one or two orders of magnitude faster than your current code :)

HTH,

Peter
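[A minimal sketch of how points 1-3 could be combined, assuming the columns have already been standardized to mean 0 and sum of squares 1, and that api$getCol(i) returns column i as a numeric vector. The block.size value and the omission of the index lookup in the second file are illustrative simplifications, not part of the original post.]

corr.matrix <- matrix(NA_real_, 9000, 9000)
block.size  <- 2                                   # pick so that two blocks fit in memory
col.blocks  <- split(1:9000, ceiling((1:9000) / block.size))

for (bi in seq_along(col.blocks)) {
  cols.i <- col.blocks[[bi]]
  g1 <- sapply(cols.i, api$getCol)                 # 900M x block.size matrix
  for (bj in bi:length(col.blocks)) {
    cols.j <- col.blocks[[bj]]
    g2 <- sapply(cols.j, api$getCol)
    # for standardized columns, t(g1) %*% g2 is the block of correlations
    corr.matrix[cols.i, cols.j] <- crossprod(g1, g2)
  }
}

[Point 4 would amount to handing the outer loop over bi to mclapply() from the multicore package and combining the per-block results afterwards, since each block of the result can be computed independently.]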
On Sat, Mar 3, 2012 at 2:36 PM, Peter Langfelder <peter.langfelder at gmail.com> wrote:
> 3. Instead of calculating the correlations one by one, calculate them
> in small blocks [...]

Or split it the other way. Compute the covariance contributions of all 9,000 variables on, say, 50k observations at a time and store them. Repeat 18,000 times, then add up the contributions and scale to a correlation matrix.

   -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
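[A rough sketch of this row-chunk approach, assuming a hypothetical accessor api$getRows(first, last) that returns the requested rows as a (chunk x 9000) numeric matrix; the accessor name and chunk size are illustrative, and the actual mmap-based access code will differ.]

n.row <- 9e8
n.col <- 9000
chunk <- 5e4                              # pick a chunk size that fits in memory
sums  <- numeric(n.col)                   # running column sums
cross <- matrix(0, n.col, n.col)          # running sum of cross-products

for (start in seq(1, n.row, by = chunk)) {
  x <- api$getRows(start, min(start + chunk - 1, n.row))   # chunk x 9000
  sums  <- sums + colSums(x)
  cross <- cross + crossprod(x)           # accumulate t(x) %*% x
}

# convert accumulated sums into covariances, then scale to correlations
covmat      <- (cross - tcrossprod(sums) / n.row) / (n.row - 1)
corr.matrix <- cov2cor(covmat)

[Only one chunk has to be in memory at a time, and the passes over the chunks are independent, so they could also be spread over several cores and the partial sums added up at the end.]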