Dear Sirs:

Please pardon me, I am very new to R; I have been using MATLAB. I was wondering if R would allow me to do principal components analysis on a very large dataset. Specifically, our dataset has 68800 variables and around 6000 observations. MATLAB gives "out of memory" errors. I have also tried doing princomp in pieces, but this does not seem to quite work for our approach.

Anything that might help would be much appreciated, as would hearing from anyone who has had experience doing this in R.

Thank you
Misha

--
View this message in context: http://www.nabble.com/Principle-components-analysis-on-a-large-dataset-tp25072510p25072510.html
Sent from the R help mailing list archive at Nabble.com.
Moshe Olshansky
2009-Aug-21 01:13 UTC
[R] Principle components analysis on a large dataset
Hi Misha,

Since PCA is a linear procedure and you have only 6000 observations, you do not need all 68800 variables: any 6000 of them whose resulting 6000x6000 matrix is non-singular will do. You can choose these 6000 variables (columns) randomly, hoping that the resulting matrix is non-singular (and checking for this). Alternatively, you can try something like choosing one "nice" column, then choosing the second one which is the most orthogonal to the first (a kind of Gram-Schmidt), then the third one which is the most orthogonal to the first two, and so on (I am not sure how much round-off may be a problem; try doing this in higher precision if you can). Note that you do not need to load the entire 6000x68800 matrix into memory: you can load several thousand columns at a time, process them, and discard them.

Either way, you will end up with a 6000x6000 matrix, i.e. 36,000,000 entries, which fits into memory, and you can perform the usual PCA on that matrix.

Good luck!

Moshe.

P.S. I am curious to see what other people think.
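In R, the chunked random-selection version of Moshe's idea might look like the sketch below. It is only a sketch: the binary file "X.bin", its column-major layout of doubles, and the chunk size are all assumptions, and in practice you would redraw the columns (or fall back to the greedy scheme) whenever the rank check fails.

n     <- 6000        # observations (rows)
p     <- 68800       # variables (columns)
chunk <- 2000        # columns held in memory at once

keep <- sort(sample.int(p, n))      # candidate columns, drawn at random
X    <- matrix(0, n, n)             # the reduced 6000 x 6000 matrix
con  <- file("X.bin", "rb")         # hypothetical column-major file of doubles
filled <- 0
for (start in seq(1, p, by = chunk)) {
  k     <- min(chunk, p - start + 1)
  block <- matrix(readBin(con, "double", n * k), nrow = n)
  cols  <- keep[keep >= start & keep < start + k]
  if (length(cols)) {
    X[, filled + seq_along(cols)] <- block[, cols - start + 1]
    filled <- filled + length(cols)
  }
}
close(con)

stopifnot(qr(X)$rank == n)  # the non-singularity check; redraw on failure
pc <- prcomp(X)             # the usual PCA on the reduced 6000 x 6000 matrix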
Prof. John C Nash
2009-Aug-21 14:44 UTC
[R] Principle components analysis on a large dataset
The essential issue is that the matrix you need to manipulate is very large. This is not a new problem; about a year ago I exchanged ideas with the Rff package developers (things have been on the back burner since, due to recession woes and illness issues). Those ideas were based on some very small codes from my 1979 book "Compact numerical methods for computers", which contains a code that takes a matrix row-wise from a file, builds a triangular decomposition along with a list of orthogonal transformations, and then does an SVD of the result. Your problem would work on the transpose.

This is a whole lot different from how R users generally work, so there are lots of interfacing and similar issues. There are also likely more efficient computational methods than the one I used -- but I was working in 1974 on an HP9830 desk calculator with the matrix on punched cards when I developed it. It is a short code that can be written in a fairly vectorized way in R alone, which may make the human/computer trade-off favourable, depending on how many times you need to run such problems.

The main point, however, is that you need some sort of "out of core" (how dated that sounds!) method, which is and will remain an issue for systems like R that work on objects in memory.

I'm willing to kibbitz on such work, but it would go best if there are 3-4 folk involved to bring different skills to the table.

John Nash
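A rough R rendering of the row-at-a-time scheme Nash describes (not his 1979 code, just the same shape): stream blocks of rows of the transposed 68800 x 6000 matrix from a file, keep only a triangular factor of at most 6000 x 6000 via repeated QR, then take the SVD of that factor. The file name, its layout, the block size, and the omission of column centring are all assumptions.

p   <- 6000                  # columns after transposing the data
blk <- 1000                  # rows per block; assumed to divide 68800 evenly
R   <- matrix(0, 0, p)       # accumulated triangular factor
con <- file("Xt.bin", "rb")  # hypothetical file holding t(X) row by row
repeat {
  v <- readBin(con, "double", blk * p)
  if (!length(v)) break
  A <- rbind(R, matrix(v, ncol = p, byrow = TRUE))
  R <- qr.R(qr(A))           # re-triangularize; never larger than p x p
}
close(con)
sv <- svd(R, nu = 0)         # singular values and right singular vectors

If the data were centred beforehand, sv$d gives the singular values of the full matrix and sv$v its right singular vectors, i.e. the left singular vectors of the original 6000 x 68800 matrix, from which the principal component scores follow as the columns of sv$v scaled by sv$d.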
Hi Moshe,

Your idea sounds reasonable to me. It seems analogous to a system of linear equations with more unknowns than equations: there are several solutions, so there is no single "exact" PCA solution.

My plan (* = dot product):

1. Pick the first "nice" vector to be the longest, i.e. the x1 for which x1 * x1 is maximal.
2. For all other vectors x2, compute (x2 * x1)^2 / (x1 * x1) and pick the minimum as my second vector.
3. For all remaining vectors x3, compute (x3 * x1)^2 / (x1 * x1) + (x3 * x2)^2 / (x2 * x2) and pick the minimum as my third vector.
4. And so on until we have 6000 vectors.
5. Perform PCA on the resulting 6000x6000 matrix.

What do you think?
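For what it is worth, an in-memory R sketch of this greedy selection, with one small tightening: it keeps an orthonormal basis Q of the columns chosen so far (true Gram-Schmidt), so the quantity minimized at each step is the exact squared projection onto their span rather than the pairwise sum written in steps 2-3. The function name is made up, and the assumption that X fits in memory is mine; a chunked variant would follow Moshe's load-process-discard pattern.

greedy_select <- function(X, m) {
  norms  <- colSums(X^2)
  chosen <- which.max(norms)            # step 1: the longest column
  Q <- X[, chosen, drop = FALSE] / sqrt(norms[chosen])
  for (i in 2:m) {
    proj <- colSums(crossprod(Q, X)^2)  # squared projection onto span(Q)
    proj[chosen] <- Inf                 # never re-pick a chosen column
    nxt <- which.min(proj)              # steps 2-3: the most orthogonal next
    chosen <- c(chosen, nxt)
    r <- X[, nxt] - Q %*% crossprod(Q, X[, nxt])  # Gram-Schmidt residual
    Q <- cbind(Q, r / sqrt(sum(r^2)))
  }
  chosen
}
# usage: cols <- greedy_select(X, 6000); pc <- prcomp(X[, cols])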