Hi,

I have a 900,000,000 x 9,000 matrix where I need to calculate the correlation between all pairs of columns along the smaller dimension, producing a 9,000 x 9,000 correlation matrix. The matrix is too big to be loaded into R, so it is saved as a binary file. To access the data in the file I use mmap and some API functions (to get all values in one row, one column, or one particular value). I'm looking for advice on how to calculate the correlation matrix. Right now my approach is to do something similar to this (toy code):

corr.matrix <- matrix(NA_real_, ncol = 9000, nrow = 9000)

for (i in 1:8999) {
  for (j in (i + 1):9000) {
    # i1 = ... index of item i, looked up in a second file
    # i2 = ... index of item j
    g1 <- api$getCol(i1)
    g2 <- api$getCol(i2)
    corr.matrix[i, j] <- cor(g1, g2)
  }
}

This will work, but will take forever. Any advice on how this can be done more efficiently? I'm running on a 2.6.18 Linux system, with R version 2.11.1.

Thanks!
Peter Langfelder
2012-Mar-03 01:36 UTC
[R] Correlation of huge matrix saved as binary file
I don't think you can speed it up by a whole lot... but you can try a few things, especially if you don't have missing data in the matrix (which you probably don't). The main question is what takes most of the time: the api calls or the cor() call? If it's cor(), here's what you can try (a sketch follows below):

1. Pre-standardize the entire input matrix, i.e. scale each column to mean = 0 and sum of squares = 1. Save the standardized matrix (or make sure it is available to the api). Since your matrix only has 9,000 columns, this should not take extremely long.

2. Instead of calculating correlations, simply calculate sum(g1*g2): if g1 and g2 are standardized as above, the correlation equals sum(g1*g2).

3. Instead of calculating the correlations one by one, calculate them in small blocks (if you have enough memory and you run a 64-bit R). With 900M rows you will only be able to fit about a 900M x 2 block into an R object, but if you have two such standardized matrices loaded as g1 and g2, you can get their (2 x 2) correlation matrix as t(g1) %*% g2. This 2 x 2 matrix can then be used to fill in the appropriate entries of the result matrix.

4. Use one of the multi-threading packages (multicore comes to mind, but there are others) to parallelize your code. If you have 8 available cores, you can expect a nearly 8x speedup.

All in all, this will probably still take forever, but it should be one or two orders of magnitude faster than your current code :)

HTH,

Peter
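[A minimal sketch of how points 1-3 could be combined, assuming the columns have already been standardized to mean 0 and sum of squares 1, and that api$getCol(i) returns column i as a numeric vector. The block.size value and the omission of the index lookup in the second file are illustrative simplifications, not part of the original post.]

corr.matrix <- matrix(NA_real_, 9000, 9000)
block.size  <- 2                                   # pick so that two blocks fit in memory
col.blocks  <- split(1:9000, ceiling((1:9000) / block.size))

for (bi in seq_along(col.blocks)) {
  cols.i <- col.blocks[[bi]]
  g1 <- sapply(cols.i, api$getCol)                 # 900M x block.size matrix
  for (bj in bi:length(col.blocks)) {
    cols.j <- col.blocks[[bj]]
    g2 <- sapply(cols.j, api$getCol)
    # for standardized columns, t(g1) %*% g2 is the block of correlations
    corr.matrix[cols.i, cols.j] <- crossprod(g1, g2)
  }
}

[Point 4 would amount to handing the outer loop over bi to mclapply() from the multicore package and combining the per-block results afterwards, since each block of the result can be computed independently.]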
On Sat, Mar 3, 2012 at 2:36 PM, Peter Langfelder <peter.langfelder at gmail.com> wrote:
> 3. Instead of calculating the correlations one by one, calculate them
> in small blocks [...]

Or split it the other way. Compute the covariance contributions of all 9,000 variables on, say, 50k observations at a time and store them. Repeat 18,000 times, then add up the contributions and scale to a correlation matrix.

   -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
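[A rough sketch of this row-chunk approach, assuming a hypothetical accessor api$getRows(first, last) that returns the requested rows as a (chunk x 9000) numeric matrix; the accessor name and chunk size are illustrative, and the actual mmap-based access code will differ.]

n.row <- 9e8
n.col <- 9000
chunk <- 5e4                              # pick a chunk size that fits in memory
sums  <- numeric(n.col)                   # running column sums
cross <- matrix(0, n.col, n.col)          # running sum of cross-products

for (start in seq(1, n.row, by = chunk)) {
  x <- api$getRows(start, min(start + chunk - 1, n.row))   # chunk x 9000
  sums  <- sums + colSums(x)
  cross <- cross + crossprod(x)           # accumulate t(x) %*% x
}

# convert accumulated sums into covariances, then scale to correlations
covmat      <- (cross - tcrossprod(sums) / n.row) / (n.row - 1)
corr.matrix <- cov2cor(covmat)

[Only one chunk has to be in memory at a time, and the passes over the chunks are independent, so they could also be spread over several cores and the partial sums added up at the end.]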