Hi All:

I am running prcomp on a very large array, roughly [500000, 3650]. The array itself is 16GB. I am running on a Unix machine and am running 'top' at the same time, and I am quite surprised to see that the application memory usage is 76GB. I have 'tol' set very high (0.8) so that it should only pull out a few components. I am surprised at this memory usage because prcomp uses the SVD if I am not mistaken, and when I take guesses at the size of the SVD matrices they shouldn't be that large. While I can fit this in memory, for a variety of reasons I would like to reduce the memory footprint. Two questions:

1. I am running with 'center=FALSE' and 'scale=TRUE'. Would I save memory if I scaled the data first myself, saved the result, cleared out the workspace, read the scaled data back in, and then did the prcomp call? Basically, are the intermediate calculations for scaling kept in memory after use?

2. I don't know how prcomp's memory usage compares to a direct call to 'svd', which lets me explicitly set how many singular vectors to compute (I only need the first five at most). prcomp is convenient because it does a lot of the other work for me.

-Roy Mendelssohn
NOAA/NMFS Environmental Research Division
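For concreteness, here is a minimal sketch of the workflow in question 1 (the file name and object names are made up, and whether this actually lowers the peak footprint depends on what prcomp keeps around internally):

## Question 1 as a sketch: scale once, save, restart with a clean workspace,
## then run prcomp on the pre-scaled data (file name is hypothetical).
Xs <- scale(X, center = FALSE, scale = TRUE)   # the same scaling prcomp would do
saveRDS(Xs, "x_scaled.rds")
rm(X, Xs); gc()                                # free the original and the scaled copy

Xs <- readRDS("x_scaled.rds")                  # ideally in a fresh R session
pc <- prcomp(Xs, center = FALSE, scale. = FALSE, tol = 0.8)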
>>>>> Roy Mendelssohn - NOAA Federal <roy.mendelssohn at noaa.gov>
>>>>>     on Tue, 22 Mar 2016 07:42:10 -0700 writes:

  > [...]
  > 2. I don't know how prcomp's memory usage compares to a direct call to 'svd', which lets me explicitly set how many singular vectors to compute (I only need the first five at most). prcomp is convenient because it does a lot of the other work for me.

For your example, where p := ncol(x) is 3650 but you only want the first 5 PCs, it would be *considerably* more efficient to use svd(..., nv = 5) directly.

So I would take stats:::prcomp.default and modify it correspondingly.

This seems such a useful idea in general that I am considering updating the function in R with a new optional 'rank.' argument, which you would set to 5 in your case.

Scrutinizing R's underlying svd() code, however, I now see that there are typically still two other [n x p] matrices created (one in R's La.svd(), one in C code) ... which I think should be unnecessary in this case ... but that would really be another topic (for R-devel, not R-help).

Martin
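For concreteness, a minimal sketch of the svd()-based route (k = 5 and the object names are only for illustration; it mimics prcomp's internal steps rather than reproducing them exactly):

## PCA via svd(), keeping only the first k right singular vectors;
## the scaling matches the original prcomp(X, center = FALSE, scale. = TRUE) call.
k  <- 5
Xs <- scale(X, center = FALSE, scale = TRUE)
s  <- svd(Xs, nu = 0, nv = k)            # all singular values, only k right vectors
sdev     <- s$d / sqrt(nrow(Xs) - 1)     # standard deviations of *all* components
rotation <- s$v                          # [p x k] loadings
scores   <- Xs %*% rotation              # [n x k] principal component scores
## note: scale() still makes a full copy of X, so peak memory is at least ~2x the data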
> On Mar 22, 2016, at 10:00 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>
> [...]
> For your example, where p := ncol(x) is 3650 but you only want
> the first 5 PCs, it would be *considerably* more efficient to
> use svd(..., nv = 5) directly.

Thanks. It is easy enough to recode using svd, and I think I will. It gives me a little more control over what the algorithm does.

-Roy
Following on from the R-help thread of March 22 on "Memory usage in
prcomp" (https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html),
I've started looking into adding an optional 'rank.' argument to
prcomp, allowing one to get only a few PCs more efficiently instead of
the full p PCs, say when p = 1000 and you know you only want 5 PCs.

As was mentioned there, we already have an optional 'tol' argument
which allows one *not* to keep all PCs.

When I use that, say
C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov.matrix and its root
all.equal(S, crossprod(C))
set.seed(17)
X <- matrix(rnorm(32000), 1000, 32)
Z <- X %*% C ## ==> cov(Z) ~= C'C = S
all.equal(cov(Z), S, tol = 0.08)
pZ <- prcomp(Z, tol = 0.1)
summary(pZ) # only ~14 PCs (out of 32)
For the last line, the summary.prcomp(.) call, I get:
> summary(pZ) # only ~14 PCs (out of 32)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
                           PC9    PC10    PC11    PC12    PC13   PC14
Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
which computes the *proportions* as if there were only 14 PCs in
total (but there were 32 originally).
I would think that the summary should or could in addition show
the usual "proportion of variance explained" like result which
does involve all 32 variances or std.dev.s ... which are
returned from the svd() anyway, even in the case when I use my
new 'rank.' argument which only returns a "few" PCs instead of
all.
Would you think the current summary() output is good enough or
rather misleading?
I think I would want to see (possibly in addition) proportions
with respect to the full variance and not just to the variance
of those few components selected.
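For illustration, a sketch of the kind of proportions I mean, using the example objects from above (the total variance here is just the sum of the column variances of Z, i.e. the sum of all 32 eigenvalues):

## proportions relative to the *full* variance of Z, not just the 14 kept PCs
totVar   <- sum(apply(Z, 2, var))        # equals the sum of all 32 eigenvalues
propFull <- pZ$sdev^2 / totVar           # pZ$sdev has length 14 here (tol = 0.1)
rbind(Proportion = propFull, Cumulative = cumsum(propFull))
## the cumulative proportion now ends somewhat below 1, showing how much
## variance the dropped 18 components still account for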
Opinions?
Martin Maechler
ETH Zurich