abanero
2010-May-26 13:45 UTC
[R] cluster analysis and supervised classification: an alternative to knn1?
Hi,

I have 1,000 observations with 10 attributes (of different types: numeric, dichotomous, categorical, etc.) and a measure M.

I need to cluster these observations so that I can assign a new observation (with the same 10 attributes but without the measure) to a cluster.

For the new observation, I want to calculate a measure as the average of the measures M of the observations in the assigned cluster.

I would use cluster analysis (the "Clara" algorithm?) and then "knn1" (in package class) to assign the new observation to a cluster.

The problem is that I'm not able to use "knn1" because some of the attributes are categorical.

Do you know of something like "knn1" that works with categorical variables too? Do you have any suggestions?
Joris Meys
2010-May-26 16:07 UTC
[R] cluster analysis and supervised classification: an alternative to knn1?
Not a direct answer, but from your description it looks like you are better off with supervised classification algorithms instead of unsupervised clustering. See the randomForest package, for example. Alternatively, you can try a logistic regression or a multinomial regression approach, but these are parametric methods and place requirements on the data. randomForest is completely non-parametric.

Cheers
Joris
--
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
Joris.Meys@Ugent.be
Ulrich Bodenhofer
2010-May-27 10:46 UTC
[R] cluster analysis and supervised classification: an alternative to knn1?
abanero wrote:
> Do you know something like "knn1" that works with categorical variables too?
> Do you have any suggestion?

There are surely plenty of clustering algorithms around that, unlike KNN, do not require a vector space structure on the inputs. I think agglomerative clustering would solve the problem, as would kernel-based clustering (assuming that you have a positive semi-definite measure of the similarity of two samples). Probably the simplest way is Affinity Propagation (http://www.psi.toronto.edu/index.php?q=affinity%20propagation; see the CRAN package "apcluster", which I have co-developed). All you need is a way of measuring the similarity of samples, which is straightforward for numerical variables, for categorical variables, and for mixtures of both (the choice of the similarity measures and how to aggregate the different variables is left to you, of course). Your final "classification" task can be accomplished simply by assigning the new sample to the cluster whose exemplar is most similar.

Joris Meys wrote:
> Not a direct answer, but from your description it looks like you are better
> off with supervised classification algorithms instead of unsupervised
> clustering.

If you say that this is a purely supervised task that can be solved without clustering, I disagree. abanero does not mention any class labels, so it seems to me that it is indeed necessary to do unsupervised clustering first. However, I agree that the second task of assigning new samples to clusters/classes/whatever can also be solved by almost any supervised technique if the samples are first labeled according to their cluster membership.

Cheers,
Ulrich
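The assignment step described above (give the new sample to the cluster whose exemplar is most similar, then average M within that cluster) might be sketched in R roughly as follows. This is only an illustration under several assumptions that are not part of the message: the similarity is taken as 1 minus the Gower dissimilarity from daisy() in the cluster package, the medoids found by pam() stand in for exemplars (this is k-medoids, not Affinity Propagation), k = 4 is arbitrary, and the data and column names are made up.

```r
library(cluster)  # daisy() and pam()

set.seed(1)
n <- 60

# Hypothetical training data: numeric, dichotomous and categorical attributes
train <- data.frame(
  age    = round(runif(n, 20, 70)),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE)),
  region = factor(sample(c("north", "centre", "south"), n, replace = TRUE))
)
M <- rnorm(n, mean = 100, sd = 15)  # hypothetical measure, known for training rows only

# Gower dissimilarity copes with mixed variable types
d  <- daisy(train, metric = "gower")
cl <- pam(d, k = 4)           # k-medoids on the dissimilarities; k is an arbitrary choice
medoid_rows <- cl$id.med      # row numbers of the medoids ("exemplars" in this sketch)

# New observation with the same attributes (and the same factor levels) but no M
new_obs <- data.frame(
  age    = 45,
  smoker = factor("yes",   levels = levels(train$smoker)),
  region = factor("south", levels = levels(train$region))
)

# Recompute the dissimilarities with the new row appended, so that it is
# scaled together with the training data, then compare it with the medoids only
d_all <- as.matrix(daisy(rbind(train, new_obs), metric = "gower"))
sim_to_medoids <- 1 - d_all[n + 1, medoid_rows]

assigned <- unname(which.max(sim_to_medoids))  # cluster whose medoid is most similar
M_hat <- mean(M[cl$clustering == assigned])    # average M within that cluster
M_hat
```

The same last three steps would apply to exemplars obtained from any other clustering method, as long as the row indices of the exemplars and the cluster membership of each training row are available.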
Joris Meys
2010-May-27 12:21 UTC
[R] cluster analysis and supervised classification: an alternative to knn1?
I'm confusing myself :-)

randomForest cannot handle character vectors as predictors (which is why I found out, to my surprise, that a categorical variable could not be used in the function). It can handle categorical variables as predictors IF they are supplied as factors. And of course it can handle a categorical variable as the response.

I hope I'm not going to add more mistakes; it's been enough for the day...

Cheers
Joris

On Thu, May 27, 2010 at 2:08 PM, <Steve_Friedman@nps.gov> wrote:
> Joris,
>
> I've been following this thread for a few days as I am beginning to use
> randomForest in my work. I am confused by your last email.
>
> What do you mean that randomForest does not handle categorical variables?
>
> It can be used in either regression or classification analysis. Do you
> mean that categorical predictors are not suitable? Certainly they are as
> the response.
> Would you be so kind as to clarify what you were suggesting?
>
> Thanks,
>
> Steve Friedman Ph. D.
> Spatial Statistical Analyst
> Everglades and Dry Tortugas National Park
> 950 N Krome Ave (3rd Floor)
> Homestead, Florida 33034
>
> Steve_Friedman@nps.gov
> Office (305) 224 - 4282
> Fax (305) 224 - 4147
>
> Joris Meys <jorismeys@gmail.com>
> Sent by: r-help-bounces@r-project.org
> 05/27/2010 07:56 AM
> To: abanero <gdevitis@xtel.it>
> cc: r-help@r-project.org
> Subject: Re: [R] cluster analysis and supervised classification: an alternative to knn1?
>
> Hi Abanero,
>
> first, I have to correct myself. Knn1 is a supervised learning algorithm, so
> my comment wasn't completely correct. In any case, if you want to do a
> clustering prior to a supervised classification, the function daisy() can
> handle any kind of variable. The resulting distance matrix can be used with
> a number of different methods.
>
> And you're right, randomForest doesn't handle categorical variables either.
> So I haven't been of great help here...
>
> Cheers
> Joris
>
> On Thu, May 27, 2010 at 1:25 PM, abanero <gdevitis@xtel.it> wrote:
> > Hi,
> >
> > thank you Joris and Ulrich for your answers.
> >
> > Joris Meys wrote:
> > > see the library randomForest for example
> >
> > I'm trying to find an example of randomForest with categorical variables
> > but I haven't found anything. Do you know of any example with both
> > categorical and numerical variables? Anyway, I don't have any class labels
> > yet. How could I find clusters with randomForest?
> >
> > Ulrich wrote:
> > > Probably the simplest way is Affinity Propagation [...] All you need is a
> > > way of measuring the similarity of samples which is straightforward both
> > > for numerical and categorical variables.
> >
> > I had a look at the documentation of the package apcluster. That's
> > interesting, but do you have any example using it with both categorical
> > and numerical variables? I'd like to test it with a large dataset.
> >
> > Thanks a lot!
> > Cheers
> >
> > Giuseppe
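To illustrate the correction above (categorical predictors work in randomForest once they are stored as factors), here is a small sketch in R. The data, the column names and the choice to predict the measure M directly by regression, rather than going through clusters, are assumptions made for the example and are not taken from the thread.

```r
library(randomForest)

set.seed(42)
n <- 200

# Hypothetical data; the categorical columns start out as character vectors
train <- data.frame(
  age    = runif(n, 20, 70),
  smoker = sample(c("no", "yes"), n, replace = TRUE),
  region = sample(c("north", "centre", "south"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
train$M <- 50 + 0.5 * train$age + 10 * (train$smoker == "yes") + rnorm(n)

# Convert every character column to a factor before fitting
is_chr <- vapply(train, is.character, logical(1))
train[is_chr] <- lapply(train[is_chr], factor)

# Regression forest for M using numeric and factor predictors together
rf <- randomForest(M ~ ., data = train, ntree = 200)

# A new observation must use the same factor levels as the training data
new_obs <- data.frame(
  age    = 45,
  smoker = factor("yes",   levels = levels(train$smoker)),
  region = factor("south", levels = levels(train$region))
)
M_hat <- predict(rf, newdata = new_obs)
M_hat
```

If the clusters themselves are needed, the same pattern should work for classification: use a factor of cluster labels as the response instead of M.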