Dear Sirs:

Please pardon me, I am very new to R; I have been using MATLAB. I was wondering if R would allow me to do principal components analysis on a very large dataset. Specifically, our dataset has 68800 variables and around 6000 observations. MATLAB gives "out of memory" errors. I have also tried doing princomp in pieces, but this does not seem to quite work for our approach.

Anything that might help would be much appreciated, as would hearing from anyone who has had experience doing this in R.

Thank you
Misha

--
View this message in context: http://www.nabble.com/Principle-components-analysis-on-a-large-dataset-tp25072510p25072510.html
Sent from the R help mailing list archive at Nabble.com.
Moshe Olshansky
2009-Aug-21 01:13 UTC
[R] Principle components analysis on a large dataset
Hi Misha,

Since PCA is a linear procedure and you have only 6000 observations, you do not need all 68800 variables: any 6000 of them whose resulting 6000x6000 matrix is non-singular will do. You can choose these 6000 variables (columns) randomly, hoping that the resulting matrix is non-singular (and checking for this). Alternatively, you can try something like choosing one "nice" column, then choosing the second one which is the most orthogonal to the first (a kind of Gram-Schmidt), then the third one which is the most orthogonal to the first two, and so on (I am not sure how much round-off may be a problem; try doing this in higher precision if you can). Note that you do not need to load the entire 6000x68800 matrix into memory: you can load several thousand columns at a time, process them, and discard them.

Either way, you will end up with a 6000x6000 matrix, i.e. 36,000,000 entries, which fits into memory, and you can perform the usual PCA on that matrix.

Good luck!

Moshe.

P.S. I am curious to see what other people think.
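In R, the chunked random-selection version of Moshe's idea might look like the sketch below. It is only a sketch: the binary file "X.bin", its column-major layout of doubles, and the chunk size are all assumptions, and in practice you would redraw the columns (or fall back to the greedy scheme) whenever the rank check fails.

n     <- 6000        # observations (rows)
p     <- 68800       # variables (columns)
chunk <- 2000        # columns held in memory at once

keep <- sort(sample.int(p, n))      # candidate columns, drawn at random
X    <- matrix(0, n, n)             # the reduced 6000 x 6000 matrix
con  <- file("X.bin", "rb")         # hypothetical column-major file of doubles
filled <- 0
for (start in seq(1, p, by = chunk)) {
  k     <- min(chunk, p - start + 1)
  block <- matrix(readBin(con, "double", n * k), nrow = n)
  cols  <- keep[keep >= start & keep < start + k]
  if (length(cols)) {
    X[, filled + seq_along(cols)] <- block[, cols - start + 1]
    filled <- filled + length(cols)
  }
}
close(con)

stopifnot(qr(X)$rank == n)  # the non-singularity check; redraw on failure
pc <- prcomp(X)             # the usual PCA on the reduced 6000 x 6000 matrix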
Prof. John C Nash
2009-Aug-21 14:44 UTC
[R] Principle components analysis on a large dataset
The essential issue is that the matrix you need to manipulate is very large. This is not a new problem; about a year ago I exchanged ideas with the Rff package developers (things have been on the back burner since, due to recession woes and illness issues). Those ideas were based on some very small codes from my 1979 book "Compact numerical methods for computers", which contains a code that takes a matrix row-wise from a file, builds a triangular decomposition along with a list of orthogonal transformations, and then does an SVD of the result. Your problem would work on the transpose.

This is a whole lot different from how R users generally work, so there are lots of interfacing and similar issues. There are also likely more efficient computational methods than the one I used -- but I was working in 1974 on an HP9830 desk calculator with the matrix on punched cards when I developed it. It is a short code that can be written in a fairly vectorized way in R alone, which may make the human/computer trade-off favourable, depending on how many times you need to run such problems.

The main point, however, is that you need some sort of "out of core" (how dated that sounds!) method, which is and will remain an issue for systems like R that work on objects in memory.

I'm willing to kibbitz on such work, but it would go best if there are 3-4 folk involved to bring different skills to the table.

John Nash
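A rough R rendering of the row-at-a-time scheme Nash describes (not his 1979 code, just the same shape): stream blocks of rows of the transposed 68800 x 6000 matrix from a file, keep only a triangular factor of at most 6000 x 6000 via repeated QR, then take the SVD of that factor. The file name, its layout, the block size, and the omission of column centring are all assumptions.

p   <- 6000                  # columns after transposing the data
blk <- 1000                  # rows per block; assumed to divide 68800 evenly
R   <- matrix(0, 0, p)       # accumulated triangular factor
con <- file("Xt.bin", "rb")  # hypothetical file holding t(X) row by row
repeat {
  v <- readBin(con, "double", blk * p)
  if (!length(v)) break
  A <- rbind(R, matrix(v, ncol = p, byrow = TRUE))
  R <- qr.R(qr(A))           # re-triangularize; never larger than p x p
}
close(con)
sv <- svd(R, nu = 0)         # singular values and right singular vectors

If the data were centred beforehand, sv$d gives the singular values of the full matrix and sv$v its right singular vectors, i.e. the left singular vectors of the original 6000 x 68800 matrix, from which the principal component scores follow as the columns of sv$v scaled by sv$d.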
Hi Moshe,

Your idea sounds reasonable to me. It seems analogous to a system of linear equations with more unknowns than equations: there are several solutions, so there is no single "exact" PCA solution.

My plan (* = dot product):

1. Pick the first "nice" vector to be the longest, i.e. the x1 for which x1 * x1 is maximal.
2. For all other vectors x2, compute (x2 * x1)^2 / (x1 * x1) and pick the minimum as my second vector.
3. For all remaining vectors x3, compute (x3 * x1)^2 / (x1 * x1) + (x3 * x2)^2 / (x2 * x2) and pick the minimum as my third vector.
4. And so on until we have 6000 vectors.
5. Perform PCA on the resulting 6000x6000 matrix.

What do you think?
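For what it is worth, an in-memory R sketch of this greedy selection, with one small tightening: it keeps an orthonormal basis Q of the columns chosen so far (true Gram-Schmidt), so the quantity minimized at each step is the exact squared projection onto their span rather than the pairwise sum written in steps 2-3. The function name is made up, and the assumption that X fits in memory is mine; a chunked variant would follow Moshe's load-process-discard pattern.

greedy_select <- function(X, m) {
  norms  <- colSums(X^2)
  chosen <- which.max(norms)            # step 1: the longest column
  Q <- X[, chosen, drop = FALSE] / sqrt(norms[chosen])
  for (i in 2:m) {
    proj <- colSums(crossprod(Q, X)^2)  # squared projection onto span(Q)
    proj[chosen] <- Inf                 # never re-pick a chosen column
    nxt <- which.min(proj)              # steps 2-3: the most orthogonal next
    chosen <- c(chosen, nxt)
    r <- X[, nxt] - Q %*% crossprod(Q, X[, nxt])  # Gram-Schmidt residual
    Q <- cbind(Q, r / sqrt(sum(r^2)))
  }
  chosen
}
# usage: cols <- greedy_select(X, 6000); pc <- prcomp(X[, cols])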