On 25 Mar 2016, at 10:08, Jari Oksanen <jari.oksanen at oulu.fi> wrote:

> On 25 Mar 2016, at 10:41 am, peter dalgaard <pdalgd at gmail.com> wrote:
>
> As I see it, the display showing the first p << n PCs adding up to 100% of the variance is plainly wrong.
>
> I suspect it comes about via a mental short-circuit: if we try to control p using a tolerance, then that amounts to saying that the remaining PCs are effectively zero-variance, but that is (usually) not the intention at all.
>
> The common case is that the remainder terms have a roughly _constant_, small-ish variance and are interpreted as noise. Of course the magnitude of the noise is important information.

But then you should use Factor Analysis, which has that concept of "noise" (unlike PCA).

Cheers, Jari Oksanen

>> On 25 Mar 2016, at 00:02, Steve Bronder <sbronder at stevebronder.com> wrote:
>>
>> I agree with Kasper, this is a 'big' issue. Does your method of taking only n PCs reduce the load on memory?
>>
>> The new addition to the summary looks like a good idea, but "Proportion of Variance" as you describe it may be confusing to new users. Am I correct in saying that Proportion of Variance describes the amount of variance with respect to the number of components the user chooses to show? So if I choose only one, it will explain 100% of the variance? If that is the case, I think showing a "Total Proportion of Variance" is important.
>>
>> Regards,
>>
>> Steve Bronder
>> Website: stevebronder.com
>> Phone: 412-719-1282
>> Email: sbronder at stevebronder.com
>>
>> On Thu, Mar 24, 2016 at 2:58 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
>>
>>> Martin, I fully agree. This becomes an issue when you have big matrices.
>>>
>>> (Note that there are awesome methods for actually computing only a small number of PCs (unlike your code, which uses svd and hence computes all of them); these are available in various CRAN packages.)
>>>
>>> Best,
>>> Kasper
>>>
>>> On Thu, Mar 24, 2016 at 1:09 PM, Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>>>
>>>> Following on from the R-help thread of March 22 on "Memory usage in prcomp",
>>>> I've started looking into adding an optional 'rank.' argument to prcomp,
>>>> allowing one to get only a few PCs more efficiently instead of the full
>>>> p PCs, say when p = 1000 and you know you only want 5 PCs.
>>>>
>>>> (https://stat.ethz.ch/pipermail/r-help/2016-March/437228.html)
>>>>
>>>> As was mentioned, we already have an optional 'tol' argument which allows
>>>> one *not* to choose all PCs. When I do that, say
>>>>
>>>> C <- chol(S <- toeplitz(.9 ^ (0:31))) # Cov. matrix and its root
>>>> all.equal(S, crossprod(C))
>>>> set.seed(17)
>>>> X <- matrix(rnorm(32000), 1000, 32)
>>>> Z <- X %*% C ## ==> cov(Z) ~= C'C = S
>>>> all.equal(cov(Z), S, tol = 0.08)
>>>> pZ <- prcomp(Z, tol = 0.1)
>>>> summary(pZ) # only ~14 PCs (out of 32)
>>>>
>>>> I get for the last line, the summary.prcomp(.) call:
>>>>
>>>> Importance of components:
>>>>                          PC1    PC2    PC3    PC4     PC5     PC6     PC7     PC8
>>>> Standard deviation     3.6415 2.7178 1.8447 1.3943 1.10207 0.90922 0.76951 0.67490
>>>> Proportion of Variance 0.4352 0.2424 0.1117 0.0638 0.03986 0.02713 0.01943 0.01495
>>>> Cumulative Proportion  0.4352 0.6775 0.7892 0.8530 0.89288 0.92001 0.93944 0.95439
>>>>                            PC9    PC10    PC11    PC12    PC13   PC14
>>>> Standard deviation     0.60833 0.51638 0.49048 0.44452 0.40326 0.3904
>>>> Proportion of Variance 0.01214 0.00875 0.00789 0.00648 0.00534 0.0050
>>>> Cumulative Proportion  0.96653 0.97528 0.98318 0.98966 0.99500 1.0000
>>>>
>>>> which computes the *proportions* as if there were only 14 PCs in total
>>>> (but there were 32 originally).
>>>>
>>>> I would think that the summary should, or could in addition, show the
>>>> usual "proportion of variance explained" result, which involves all 32
>>>> variances or std.dev.s ... these are returned from the svd() anyway,
>>>> even when I use my new 'rank.' argument, which returns only a "few" PCs
>>>> instead of all of them.
>>>>
>>>> Would you consider the current summary() output good enough, or rather
>>>> misleading?
>>>>
>>>> I think I would want to see (possibly in addition) proportions with
>>>> respect to the full variance, and not just to the variance of those few
>>>> components selected.
>>>>
>>>> Opinions?
>>>>
>>>> Martin Maechler
>>>> ETH Zurich
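As a sketch of the alternative display Martin asks about (an editorial illustration only, not the actual prcomp/summary code), the proportions can be computed against the total variance of all 32 components, continuing his example above; the closing comment hints at Kasper's point about partial methods, with irlba named purely as an assumed example of such a CRAN package:

v    <- prcomp(Z)$sdev^2      # variances of all 32 PCs
prop <- v / sum(v)            # proportions w.r.t. the *full* variance
round(cumsum(prop)[1:14], 5)  # the first 14 PCs now cumulate to less than 1

## Truncated-SVD packages on CRAN can return just the first few PCs
## without computing all of them, e.g. (assuming irlba is installed):
## irlba::irlba(scale(Z, scale = FALSE), nv = 5)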
On 25 Mar 2016, at 11:45 am, peter dalgaard <pdalgd at gmail.com> wrote:

> On 25 Mar 2016, at 10:08, Jari Oksanen <jari.oksanen at oulu.fi> wrote:
>
> But then you should use Factor Analysis, which has that concept of "noise" (unlike PCA).

Actually, FA has a slightly different concept of noise. PCA can be interpreted as a purely technical operation, but also as an FA variant with the same variance for all components.

Specifically, FA is

    Sigma = L L' + Psi

with Psi a diagonal matrix. If Psi = sigma^2 I, then L can be determined (up to rotation) as the first p components of PCA. (This is used in ML algorithms for FA, since it allows one to concentrate the likelihood into a function of Psi.)

Methods like PC regression are not very specific about the model, but the underlying line of thought is that PCs with small variances are "uninformative", so you can make do with only the first handful of regressors. I tend to interpret "uninformative" as "noise-like" in these contexts.

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk   Priv: PDalgd at gmail.com
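To make the PCA/FA link concrete (an editorial sketch, not from the thread, with hypothetical loadings L): adding isotropic noise sigma^2 I to L L' shifts every eigenvalue by sigma^2 but leaves the leading eigenvectors, and hence the first components, unchanged:

set.seed(1)
p <- 6; k <- 2
L <- matrix(rnorm(p * k), p, k)   # hypothetical loadings, rank k
sigma2 <- 0.5
e0 <- eigen(tcrossprod(L), symmetric = TRUE)                     # L L'
e1 <- eigen(tcrossprod(L) + sigma2 * diag(p), symmetric = TRUE)  # L L' + sigma^2 I
all.equal(e1$values, e0$values + sigma2)   # every eigenvalue shifted by sigma^2
all.equal(abs(e1$vectors[, 1:k]),          # leading eigenvectors agree
          abs(e0$vectors[, 1:k]))          # (up to sign)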
> On 25 Mar 2016, at 11:45 am, peter dalgaard <pdalgd at gmail.com> wrote:
>
> Actually, FA has a slightly different concept of noise. PCA can be interpreted as a purely technical operation, but also as an FA variant with the same variance for all components.

If I remember correctly, we took a correlation matrix, replaced the diagonal elements with variable "communalities" < 1 estimated by some trick, then chucked that matrix into PCA and called the result FA. A more advanced way was to do this iteratively: take the first few axes of PCA/FA, calculate the diagonal elements from them, and re-feed those into PCA. It was done like that because algorithms and computers were not strong enough for real FA. Now they are, and I think it would be better to treat PCA like PCA, at least in the default output of the standard stats::summary function. So summary should show the proportion of total variance (for people who think this is a cool thing to know) instead of a proportion of an unspecified part of the variance.

Cheers, Jari Oksanen
(who now switches to listening to today's Passion instead of continuing with PCA)
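For readers who have not met the historical procedure Jari describes, here is a rough editorial sketch of the iterated principal-factor method; the squared-multiple-correlation starting values are an assumption on my part, one classical choice for the "trick" he mentions:

## R is a correlation matrix, k the number of factors to extract
principal_factor <- function(R, k, iter = 50) {
  h2 <- 1 - 1 / diag(solve(R))   # assumed start: squared multiple correlations
  for (i in seq_len(iter)) {
    Rh <- R
    diag(Rh) <- h2               # communalities replace the unit diagonal
    e <- eigen(Rh, symmetric = TRUE)
    L <- e$vectors[, 1:k, drop = FALSE] %*%
      diag(sqrt(pmax(e$values[1:k], 0)), k)
    h2 <- rowSums(L^2)           # re-feed the updated communalities
  }
  L                              # unrotated loadings
}
## e.g. principal_factor(cor(Z), k = 2) with Z from Martin's example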