Hello,

Suppose I have two or three clear categories for my data, let's say pet preference across fish, cat and dog, and that most people rate their preference as being mostly one of the categories.

I want to do PCA on the data to see three 'groups' of people: one group for fish, one for cat and one for dog. I would like to see the odd person who likes two or all three in the (appropriate) middle of the other main groups.

Will my results be affected by the fact that I have interviewed 1000 dog owners, 100 cat owners and 10 fish owners (assuming that each scale of preference has an equal range)?

Cheers,
dan.
Dan:

1) There is no guarantee that PCA will show separate groups, of course, as that is not its purpose, although it is frequently a side effect.

2) If you were to use a classification method of some sort (discriminant analysis, neural nets, SVMs, model-based classification, ...), my understanding is that yes, severely unbalanced group membership would indeed affect results. A guess is that Bayesian or other methods that could explicitly model the prior membership probabilities would do better. To make it clear why, suppose that there was a 99.9% preference for "dog" and 0.05% for each of the others. Then your dataset would have almost no information on how covariates could distinguish the classes, and the best classifier would be to call everything a "dog" no matter what values the covariates had.

I presume experts will have more and better to say about this.

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Dan Bolser
> Sent: Thursday, November 04, 2004 9:41 AM
> To: R mailing list
> Subject: [R] highly biased PCA data?
>
> Will my data be affected by the fact that I have interviewed 1000 dog
> owners, 100 cat owners and 10 fish owners? (assuming that each scale of
> preference has an equal range).
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
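Bert's 99.9% example can be checked with a few lines of base R. This is only a minimal sketch: the counts below are made up to match the 99.9% / 0.05% / 0.05% split (they are not data from this thread), and the "classifier" is simply a rule that ignores every covariate.

```r
# Hypothetical class counts matching the 99.9% / 0.05% / 0.05% example.
n <- 20000
truth <- factor(rep(c("dog", "cat", "fish"), times = c(19980, 10, 10)),
                levels = c("dog", "cat", "fish"))

# A "classifier" that ignores all covariates and always answers "dog".
pred <- factor(rep("dog", n), levels = levels(truth))

# Overall accuracy looks excellent ...
accuracy <- mean(pred == truth)

# ... but the per-class hit rate shows that cat and fish are never found.
recall <- tapply(pred == truth, truth, mean)

print(accuracy)
print(recall)
```

Overall accuracy alone hides the problem; the per-class rates are the numbers to watch when group sizes are this unequal.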
Dan Bolser <dmb <at> mrc-dunn.cam.ac.uk> writes:

: Will my data be affected by the fact that I have interviewed 1000 dog
: owners, 100 cat owners and 10 fish owners? (assuming that each scale of
: preference has an equal range).

This is not PCA, but randomForest has facilities for handling classification problems where the number of points per class varies widely. See the help for randomForest and, in particular, the sampsize= argument. Also see R News 2/3 and http://www.stat.berkeley.edu/users/chenchao/666.pdf
I am no expert on this sort of matter, but that has never stopped me from tossing in my $0.02...

As Gabor and Bert hinted, this is what I would try: run randomForest on the data, using sampsize=c(10, 10, 10) and importance=TRUE, for example. Then take the few most important variables with respect to each class and maybe do PCA on those to see if you can see separation.

HTH,
Andy

> From: Dan Bolser
>
> On Thu, 4 Nov 2004, Berton Gunter wrote:
>
> >I presume experts will have more and better to say about this.
>
> Sounds interesting. Thanks very much for the input. Just out of curiosity,
> given that I can make my data more uniform (less biased), how could I best
> generate a 2d plot to encapsulate the clusters (and inter-cluster
> relationships)?
>
> Actually I am thinking of a 2d density.
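Andy's recipe might look roughly like the sketch below. It assumes the randomForest package is installed, and everything about the data is invented for illustration: the simulated scores, the two noise columns, and the choice of the top two variables per class are assumptions, not part of the original suggestion.

```r
library(randomForest)

set.seed(1)

# Simulated stand-in for the survey: 1000 dog, 100 cat, 10 fish owners.
n <- c(dog = 1000, cat = 100, fish = 10)
owner <- factor(rep(names(n), times = n), levels = names(n))

# Owners score their own pet around 7 and the others around 3 (assumed).
centre <- function(target) ifelse(owner == target, 7, 3)
prefs <- data.frame(
  dog    = rnorm(length(owner), centre("dog")),
  cat    = rnorm(length(owner), centre("cat")),
  fish   = rnorm(length(owner), centre("fish")),
  noise1 = rnorm(length(owner)),
  noise2 = rnorm(length(owner))
)

# Draw 10 cases per class for every tree, so no class dominates a tree.
rf <- randomForest(x = prefs, y = owner,
                   strata = owner, sampsize = c(10, 10, 10),
                   importance = TRUE)

# With importance = TRUE there is one importance column per class.
imp <- importance(rf)[, levels(owner)]

# Keep the two most important variables for each class (arbitrary cut-off).
top <- unique(unlist(lapply(levels(owner), function(cl) {
  names(sort(imp[, cl], decreasing = TRUE))[1:2]
})))

# PCA on the selected variables only, then look for separation.
pca <- prcomp(prefs[, top], scale. = TRUE)
plot(pca$x[, 1:2], col = as.integer(owner), pch = 20)
```

Note that the PCA step itself still uses all 1110 rows, so the dog owners will still dominate the variance; only the variable selection has been balanced.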
I'd suggest you start by using lda() or qda() from MASS. The benefits are:

(a) If the frequencies in the sample do not reflect the frequencies in the target population, you can set 'prior' to mirror the target frequencies. The issue is, perhaps: is your odd person odd in a 1000 dog : 100 cat : 10 fish owner population, or odd, e.g., in a 1000:1000:50 population? You can also vary the prior to see what the effect is. If, however, you set a large prior probability for a group that is poorly represented, results will be 'noisy'. Note the use of 'classwt' for the prior probabilities in randomForest().

(b) You can plot second versus first discriminant function scores to get a direct graphical representation of results. Other discrimination techniques may have to use an ordination technique, or even lda() or qda() on a >2 dimensional representation of results, in order to get a scatterplot [cf. MDSplot() for randomForest()].

John Maindonald
email: john.maindonald at anu.edu.au
phone: +61 2 (6125)3473
fax: +61 2 (6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 5 Nov 2004, at 10:18 PM, r-help-request at stat.math.ethz.ch wrote:

> From: Berton Gunter <gunter.berton at gene.com>
> Date: 5 November 2004 5:08:38 AM
> To: "'Dan Bolser'" <dmb at mrc-dunn.cam.ac.uk>, "'R-help'"
> <r-help at stat.math.ethz.ch>
> Subject: RE: [R] highly biased PCA data?
>
> 2) If you were to use a classification method of some sort (discriminant
> analysis, neural nets, SVMs, model-based classification, ...), my
> understanding is that yes, severely unbalanced group membership
> would indeed affect results.
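Points (a) and (b) of John's message can be sketched with lda() as follows. The data are simulated purely for illustration (the scores and their means are assumptions); the 1000:1000:50 prior is the alternative population mentioned in the message.

```r
library(MASS)

set.seed(1)

# Simulated stand-in for the survey: 1000 dog, 100 cat, 10 fish owners.
n <- c(dog = 1000, cat = 100, fish = 10)
owner <- factor(rep(names(n), times = n), levels = names(n))

# Owners score their own pet around 7 and the others around 3 (assumed).
centre <- function(target) ifelse(owner == target, 7, 3)
survey <- data.frame(
  owner = owner,
  dog   = rnorm(length(owner), centre("dog")),
  cat   = rnorm(length(owner), centre("cat")),
  fish  = rnorm(length(owner), centre("fish"))
)

# (a) Default prior: the sample proportions, i.e. 1000:100:10.
fit_sample <- lda(owner ~ dog + cat + fish, data = survey)

# Prior mirroring a 1000:1000:50 target population.
# The order must match levels(survey$owner): dog, cat, fish.
target_prior <- c(1000, 1000, 50) / 2050
fit_target <- lda(owner ~ dog + cat + fish, data = survey,
                  prior = target_prior)

# How many assignments change when only the prior changes?
print(table(sample_prior = predict(fit_sample)$class,
            target_prior = predict(fit_target)$class))

# (b) Second versus first discriminant function scores.
scores <- predict(fit_target)$x
plot(scores[, "LD1"], scores[, "LD2"],
     col = as.integer(survey$owner), pch = 20,
     xlab = "LD1", ylab = "LD2")
```

Rerunning with other values of `target_prior` is one way to "vary the prior to see what the effect is"; with only 10 fish owners, a large fish prior should be expected to give unstable results.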