Hi All:

I am running prcomp on a very large array, roughly [500000, 3650]. The array itself is 16GB. I am running on a Unix machine, watching 'top' at the same time, and am quite surprised to see that the application's memory usage is 76GB. I have 'tol' set very high (0.8) so that it should only pull out a few components. I am surprised at this memory usage because prcomp uses the SVD if I am not mistaken, and when I take guesses at the sizes of the SVD matrices they shouldn't be that large. While I can fit this in memory, for a variety of reasons I would like to reduce the memory footprint. Some questions:

1. I am running with 'center = FALSE' and 'scale = TRUE'. Would I save memory if I scaled the data myself first, saved the result, cleared out the workspace, read the scaled data back in, and then did the prcomp call? Basically, are the intermediate calculations for scaling kept in memory after use?

2. I don't know how prcomp's memory usage compares to a direct call to 'svd', which allows me to set explicitly how many singular vectors to compute (I only need the first five at most). prcomp is convenient because it does a lot of the other work for me.

-Roy Mendelssohn
NOAA/NMFS Environmental Research Division, Southwest Fisheries Science Center
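[Editorial note: a minimal sketch of the pre-scaling workflow described in question 1, assuming the data can be round-tripped through saveRDS()/readRDS(); the file names are hypothetical and not from the original post.]

    ## Sketch only: pre-scale, save, clear the workspace, reload, then run prcomp.
    ## With center = FALSE, scale() divides each column by its root mean square,
    ## which is what prcomp(center = FALSE, scale. = TRUE) does internally.
    x  <- readRDS("x_raw.rds")                      # hypothetical input file
    xs <- scale(x, center = FALSE, scale = TRUE)
    saveRDS(xs, "x_scaled.rds")
    rm(x, xs); gc()                                 # drop the intermediates
    xs <- readRDS("x_scaled.rds")
    p  <- prcomp(xs, center = FALSE, scale. = FALSE, tol = 0.8)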
>>>>> Roy Mendelssohn - NOAA Federal <roy.mendelssohn at noaa.gov>
>>>>>     on Tue, 22 Mar 2016 07:42:10 -0700 writes:

 > [Roy's question, quoted in full above]

For your example, where p := ncol(x) is 3650 but you only want the first 5 PCs, it would be *considerably* more efficient to use svd(..., nv = 5) directly.

So I would take stats:::prcomp.default and modify it correspondingly.

This seems such a useful idea in general that I am considering updating the function in R with a new optional 'rank.' argument, which you'd set to 5 in your case.

Scrutinizing R's underlying svd() code, however, I now see that there are typically still two other [n x p] matrices created (one in R's La.svd(), one in C code) ... which I think should be unnecessary in this case ... but that would really be another topic (for R-devel, not R-help).

Martin
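[Editorial note: a minimal sketch of the svd()-based route described above, using only base R; the helper name few_pcs and the input name bigX are made up for illustration.]

    ## Sketch: first k principal components via svd(), mimicking
    ## prcomp(x, center = FALSE, scale. = TRUE) but keeping only k
    ## right singular vectors and skipping the large U matrix (nu = 0).
    few_pcs <- function(x, k = 5) {
        xs <- scale(x, center = FALSE, scale = TRUE)
        s  <- svd(xs, nu = 0, nv = k)            # d: all singular values; v: p x k
        list(sdev     = s$d / sqrt(nrow(xs) - 1),  # as prcomp() reports them
             rotation = s$v,                       # p x k loadings
             x        = xs %*% s$v)                # n x k scores
    }

    pc <- few_pcs(bigX, k = 5)   # 'bigX' stands in for the 500000 x 3650 matrix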
> On Mar 22, 2016, at 10:00 AM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>
> [Martin's reply, quoted in full above]

Thanks. It is easy enough to recode using svd(), and I think I will. It gives me a little more control over what the algorithm does.

-Roy
Following up on the R-help thread of March 22, "Memory usage in prcomp", I've started looking into adding an optional 'rank.' argument to prcomp, to allow getting only a few PCs more efficiently instead of all p of them, say when p = 1000 and you know you only want 5 PCs.
(https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html)

As was mentioned there, we already have an optional 'tol' argument which allows one *not* to choose all PCs. When I do that, say

    C <- chol(S <- toeplitz(.9 ^ (0:31)))  # Cov.matrix and its root
    all.equal(S, crossprod(C))
    set.seed(17)
    X <- matrix(rnorm(32000), 1000, 32)
    Z <- X %*% C   ## ==> cov(Z) ~= C'C = S
    all.equal(cov(Z), S, tol = 0.08)
    pZ <- prcomp(Z, tol = 0.1)
    summary(pZ)    # only ~14 PCs (out of 32)

I get, for the last line, the summary.prcomp(.) call:

> summary(pZ)  # only ~14 PCs (out of 32)
Importance of components:
                          PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
                           PC9    PC10    PC11    PC12    PC13   PC14
Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000

which computes the *proportions* as if there were only 14 PCs in total (but there were 32 originally).

I would think that the summary should, or at least could, in addition show the usual "proportion of variance explained" figures, which involve all 32 variances or standard deviations ... and these are returned by the svd() anyway, even when using my new 'rank.' argument, which only returns a "few" PCs instead of all of them.

Would you say the current summary() output is good enough, or is it rather misleading? I think I would want to see (possibly in addition) proportions with respect to the full variance, and not just with respect to the variance of the few components selected.

Opinions?

Martin Maechler
ETH Zurich
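[Editorial note: a minimal sketch, not part of the original post, of the "with respect to the full variance" proportions described above, assuming the Z and pZ objects from the example.]

    ## Proportions relative to the *total* variance of Z, not only to the
    ## variance of the ~14 retained components. With the default
    ## center = TRUE, scale. = FALSE, the total variance is simply the sum
    ## of the column variances of Z (equivalently, sum of all 32 sdev^2).
    total.var <- sum(apply(Z, 2, var))
    prop <- pZ$sdev^2 / total.var
    rbind("Proportion of Variance" = round(prop, 5),
          "Cumulative Proportion"  = round(cumsum(prop), 5))
    ## The cumulative proportion now tops out below 1, reflecting the
    ## variance carried by the PCs that 'tol' dropped.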