thr3ads.net - R help - [R] highly biased PCA data? [Nov 2004]

Dan Bolser

2004-Nov-04 17:40 UTC

[R] highly biased PCA data?

Hello, supposing that I have two or three clear categories for my data,
let's say pet preference across fish, cat, dog. Let's say most people rate
their preference as being mostly one of the categories.

I want to do PCA on the data to see three 'groups' of people, one group
for fish, one for cat and one for dog. I would like to see the odd person
who likes both or all three in the (appropriate) middle of the other main
groups.

Will my data be affected by the fact that I have interviewed 1000 dog
owners, 100 cat owners and 10 fish owners? (assuming that each scale of
preference has an equal range). 

Cheers,
dan.

Berton Gunter

2004-Nov-04 18:08 UTC

[R] highly biased PCA data?

Dan:


1) There is no guarantee that PCA will show separate groups, of course, as
that is not its purpose, although it is frequently a side effect.

2) If you were to use a classification method of some sort (discriminant
analysis, neural nets, SVMs, model-based classification, ...), my
understanding is that yes, severely unbalanced group membership would
indeed affect results. A guess is that Bayesian or other methods that
could explicitly model the prior membership probabilities would do
better. To make it clear why, suppose that there was a 99.9% preference
for "dog" and .05% for each of the others. Then your dataset would have
almost no information on how covariates could distinguish the classes,
and the best classifier would be to call everything a "dog" no matter
what values the covariates had.
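
A minimal sketch of the arithmetic behind point 2, in base R. The sample
size of 10000 is an arbitrary assumption chosen so that the 99.9% / .05% /
.05% split gives whole numbers; nothing here comes from real data.

```r
# Simulated labels with the proportions from the example above:
# 99.9% "dog", 0.05% "cat", 0.05% "fish" (n = 10000 is an assumption).
n <- 10000
labels <- factor(rep(c("dog", "cat", "fish"), times = c(9990, 5, 5)),
                 levels = c("dog", "cat", "fish"))

# A "classifier" that ignores every covariate and always answers "dog".
pred <- factor(rep("dog", n), levels = levels(labels))

# Overall accuracy, and the fraction of each true class that is recovered.
accuracy <- mean(pred == labels)
per_class <- tapply(pred == labels, labels, mean)

print(accuracy)
print(per_class)
```

The overall accuracy is 0.999 even though no cat or fish owner is ever
identified, which is why overall accuracy alone is a poor guide when the
groups are this unbalanced.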

I presume experts will have more and better to say about this.

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
 
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box
 
 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
>

Gabor Grothendieck

2004-Nov-04 18:33 UTC

[R] highly biased PCA data?

This is not PCA, but randomForest has facilities for handling
classification problems where the number of points per class varies
widely.  See the help for randomForest, and the sampsize= argument in
particular.  Also see R News 2/3 and
http://www.stat.berkeley.edu/users/chenchao/666.pdf

Liaw, Andy

2004-Nov-05 00:53 UTC

[R] highly biased PCA data?

I am no expert on these sorts of matters, but that has never stopped me from
tossing in my $0.02...

As Gabor and Bert hinted, this is what I would try:

Run randomForest on the data, using sampsize=c(10, 10, 10) and
importance=TRUE, for example.  Then take the few most important variables
with respect to each class and maybe do PCA on those to see if you can see
separation.
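
A rough sketch of that recipe, assuming the randomForest package is
installed. The data are simulated: the group sizes follow the original
question, but the six rating variables (three informative, three noise),
their names, and the choice of two top variables per class are all
assumptions made for illustration.

```r
library(randomForest)

set.seed(1)
n <- c(dog = 1000, cat = 100, fish = 10)      # group sizes from the question
owner <- factor(rep(names(n), times = n), levels = names(n))

# Hypothetical ratings: the first three variables carry the group signal
# (rows of 'centre' are in the order dog, cat, fish); the last three are noise.
centre <- rbind(c(4, 1, 1), c(1, 4, 1), c(1, 1, 4))
signal <- centre[as.integer(owner), ] + matrix(rnorm(sum(n) * 3), ncol = 3)
noise  <- matrix(rnorm(sum(n) * 3), ncol = 3)
x <- data.frame(signal, noise)
names(x) <- c("likes_dog", "likes_cat", "likes_fish", "n1", "n2", "n3")

# For classification, a vector sampsize draws that many cases per class
# for each tree, so the 1000 dog owners cannot swamp the 10 fish owners.
rf <- randomForest(x, owner, sampsize = c(10, 10, 10), importance = TRUE)

# Per-class importance: take the two top-ranked variables for each class.
imp <- importance(rf)
top <- unique(unlist(lapply(levels(owner), function(k) {
  names(sort(imp[, k], decreasing = TRUE))[1:2]
})))

# PCA on the selected variables; colour by group to look for separation.
pc <- prcomp(x[, top], scale. = TRUE)
plot(pc$x[, 1:2], col = as.integer(owner), pch = 20)
```

Whether the groups actually separate in the first two components depends
on the data; the plot is only a way to look.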

HTH,
Andy
> From: Dan Bolser
> 
> Sounds interesting. Thanks very much for the input. Just out of
> curiosity, given that I can make my data more uniform (less biased),
> how could I best generate a 2d plot to encapsulate the clusters (and
> inter-cluster relationships)?
> 
> Actually I am thinking of a 2d density.

John Maindonald

2004-Nov-05 23:36 UTC

[R] highly biased PCA data?

I'd suggest you start by using lda() or qda() from MASS, the
benefits being that

(a) if the frequencies in the sample do not reflect the frequencies
in the target population, you can set 'prior' to mirror the target
frequencies.  The issue is, perhaps: is your odd person odd in
a 1000 dog : 100 cat : 10 fish owner population, or odd, e.g., in
a 1000:1000:50 population?  You can also vary the prior to see
what the effect is.  If, however, you set a large prior probability for
a group that is poorly represented, results will be 'noisy'.  Note
the use of 'classwt' for the prior probabilities for randomForest().

(b) You can plot second versus first discriminant function scores,
to get a direct graphical representation of results.
Other discrimination techniques may have to use an ordination
technique or even lda() or qda() on a >2 dimensional representation
of results, in order to get a scatterplot.
[cf MDSplot() for randomForest()]
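
A minimal sketch of (a) and (b) with lda() from MASS, on simulated data.
The group sizes follow the original question; the three rating variables,
their names, and the equal-prior alternative are assumptions chosen only
to show the mechanics.

```r
library(MASS)  # lda(); MASS ships with R as a recommended package

set.seed(1)
n <- c(dog = 1000, cat = 100, fish = 10)      # group sizes from the question
owner <- factor(rep(names(n), times = n), levels = names(n))

# Hypothetical ratings; rows of 'centre' are in the order dog, cat, fish.
centre <- rbind(c(4, 1, 1), c(1, 4, 1), c(1, 1, 4))
ratings <- centre[as.integer(owner), ] + matrix(rnorm(sum(n) * 3), ncol = 3)
pets <- data.frame(owner, ratings)
names(pets) <- c("owner", "likes_dog", "likes_cat", "likes_fish")

# (a) By default the prior is the sample proportions (1000:100:10).
fit_sample <- lda(owner ~ ., data = pets)

# Alternative: equal priors, as if the target population were balanced.
# The prior is given in the order of the factor levels.
fit_equal <- lda(owner ~ ., data = pets, prior = c(1, 1, 1) / 3)

# Cross-tabulate the two sets of predictions to see what the prior changes.
print(table(sample_prior = predict(fit_sample)$class,
            equal_prior  = predict(fit_equal)$class))

# (b) Second versus first discriminant score, coloured by true group.
scores <- predict(fit_equal)$x
plot(scores[, "LD1"], scores[, "LD2"], col = as.integer(owner), pch = 20,
     xlab = "LD1", ylab = "LD2")
```

With three groups there are at most two discriminant functions, so the
scatterplot shows the whole discriminant space. Trying other priors in
place of the equal one is a quick way to see how sensitive the
classification of the "odd" cases is.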

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

