thr3ads.net - R help - [R] cluster analysis and supervised classification: an alternative to knn1? [May 2010]


abanero

2010-May-26 13:45 UTC

[R] cluster analysis and supervised classification: an alternative to knn1?

Hi,
I have 1,000 observations with 10 attributes (of different types: numeric,
dichotomous, categorical, etc.) and a measure M.

I need to cluster these observations in order to assign a new observation
(with the same 10 attributes but not the measure) to a cluster.

For the new observation, I want to calculate a measure as the average of the
measures M of the observations in the assigned cluster.

I would use cluster analysis (the "Clara" algorithm?) and then "knn1" (in
package class) to assign the new observation to a cluster.

The problem is: I'm not able to use "knn1" because some of the attributes are
categorical.

Do you know of something like "knn1" that works with categorical variables
too? Do you have any suggestions?

-- 
View this message in context:
http://r.789695.n4.nabble.com/cluster-analysis-and-supervised-classification-an-alternative-to-knn1-tp2231656p2231656.html
Sent from the R help mailing list archive at Nabble.com.

Joris Meys

2010-May-26 16:07 UTC

Not a direct answer, but from your description it looks like you are better
off with supervised classification algorithms instead of unsupervised
clustering. See the randomForest package, for example. Alternatively, you can
try a logistic regression or a multinomial regression approach, but these
are parametric methods and put requirements on the data. randomForest is
completely non-parametric.

Cheers
Joris


-- 
Joris Meys
Statistical Consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

Coupure Links 653
B-9000 Gent

tel : +32 9 264 59 87
Joris.Meys@Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php


Ulrich Bodenhofer

2010-May-27 10:46 UTC

abanero wrote:
> Do you know something like "knn1" that works with categorical variables
> too?
> Do you have any suggestion?

There are surely plenty of clustering algorithms around that do not require
a vector space structure on the inputs (as KNN does). I think
agglomerative clustering would solve the problem, as would kernel-based
clustering (assuming that you have a positive semi-definite measure
of the similarity of two samples). Probably the simplest way is Affinity
Propagation (http://www.psi.toronto.edu/index.php?q=affinity%20propagation;
see the CRAN package "apcluster", which I have co-developed). All you need is
a way of measuring the similarity of samples, which is straightforward both
for numerical and categorical variables, as well as for mixtures of both (the
choice of the similarity measures and how to aggregate the different
variables is left to you, of course). Your final "classification" task can
be accomplished simply by assigning the new sample to the cluster whose
exemplar is most similar.
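
To make the "assign to the cluster whose exemplar is most similar" idea concrete, here is a minimal sketch. It is not Affinity Propagation: it swaps in pam() from package cluster, whose medoids play the role of exemplars, together with the Gower dissimilarity from daisy() as one possible mixed-type similarity measure. The data, the column names, and k = 4 are made up for illustration.

```r
library(cluster)  # daisy() and pam(); a recommended package shipped with R

set.seed(1)
n <- 200
# Hypothetical training data: mixed-type attributes plus the measure M
train <- data.frame(
  age    = round(runif(n, 18, 80)),
  income = rlnorm(n, meanlog = 10, sdlog = 0.5),
  smoker = factor(sample(c("no", "yes"), n, replace = TRUE)),
  region = factor(sample(c("north", "centre", "south"), n, replace = TRUE))
)
M <- rnorm(n, mean = 50, sd = 10)

# Gower dissimilarity handles numeric and factor columns together
d   <- daisy(train, metric = "gower")
fit <- pam(d, k = 4)  # k = 4 is an arbitrary choice; medoids act as exemplars

# A new observation without M; factor levels must match the training data
new_obs <- data.frame(
  age    = 35,
  income = 30000,
  smoker = factor("yes",   levels = levels(train$smoker)),
  region = factor("south", levels = levels(train$region))
)

# Recompute Gower on training data plus the new row so that the numeric
# columns are scaled by (almost) the same ranges as in the clustering step
all_d  <- as.matrix(daisy(rbind(train, new_obs), metric = "gower"))
to_med <- all_d[n + 1, fit$id.med]      # dissimilarity to each medoid

cluster_new <- unname(which.min(to_med))         # nearest medoid = cluster
pred_M      <- mean(M[fit$clustering == cluster_new])
pred_M
```

Recomputing the full dissimilarity matrix is affordable for about 1,000 rows; for much larger data one would compute only the dissimilarities to the medoids.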

Joris Meys wrote:
> Not a direct answer, but from your description it looks like you are
> better off with supervised classification algorithms instead of
> unsupervised clustering.

If you say that this is a purely supervised task that can be solved without
clustering, I disagree. abanero does not mention any class labels, so it
seems to me that it is indeed necessary to do unsupervised clustering first.
However, I agree that the second task of assigning new samples to
clusters/classes/whatever can also be solved by almost any supervised
technique if samples are first labeled according to their cluster membership.
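
The two-step idea (cluster first, then train a classifier on the cluster labels) could look roughly like the sketch below. The classifier here is a classification tree from package rpart, chosen only because it ships with R and accepts factors directly; any other supervised method could take its place. The data, the column names, and k = 3 are assumptions for illustration.

```r
library(cluster)  # daisy(), pam()
library(rpart)    # classification tree, used as a stand-in classifier

set.seed(2)
n <- 300
# Hypothetical mixed-type data and measure M
train <- data.frame(
  x1 = rnorm(n),
  x2 = runif(n),
  g  = factor(sample(c("a", "b", "c"), n, replace = TRUE))
)
M <- rnorm(n, mean = 100, sd = 15)

# Step 1 (unsupervised): cluster on a mixed-type dissimilarity
cl <- pam(daisy(train, metric = "gower"), k = 3)$clustering

# Step 2 (supervised): treat cluster membership as the class label
labelled <- cbind(train, cluster = factor(cl))
tree     <- rpart(cluster ~ ., data = labelled, method = "class")

# Assign a new observation and average M within the predicted cluster
new_obs <- data.frame(
  x1 = 0.2,
  x2 = 0.7,
  g  = factor("b", levels = levels(train$g))
)
pred_cluster <- as.integer(as.character(
  predict(tree, newdata = new_obs, type = "class")
))
pred_M <- mean(M[cl == pred_cluster])
pred_M
```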

Cheers, Ulrich

Joris Meys

2010-May-27 12:21 UTC

I'm confusing myself :-)

randomForest cannot handle character vectors as predictors. (Which is why I,
to my surprise, found out that a categorical variable could not be used in
the function). It can handle categorical variables as predictors IF they are
put in as a factor.

Obviously it handles categorical variables as a response variable.
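
A small sketch of the point about factors, assuming the CRAN package randomForest is installed. The data and column names are invented; the only step that matters is converting the character column with factor() before fitting.

```r
library(randomForest)  # assumes the CRAN package randomForest is installed

set.seed(3)
n <- 150
# Hypothetical data: one numeric and one categorical predictor,
# the latter deliberately stored as a character vector
dat <- data.frame(
  size   = rnorm(n),
  colour = sample(c("red", "green", "blue"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
dat$y <- factor(ifelse(dat$colour == "red" | dat$size > 1, "A", "B"))

# Convert the character column to a factor before fitting
dat$colour <- factor(dat$colour)

rf <- randomForest(y ~ size + colour, data = dat, ntree = 200)

# New data must carry the same factor levels as the training data
new_obs <- data.frame(
  size   = 0.3,
  colour = factor("red", levels = levels(dat$colour))
)
pred <- predict(rf, newdata = new_obs)
pred
```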

I hope I'm not going to add any more mistakes; it's been enough for the
day...
Cheers
Joris

On Thu, May 27, 2010 at 2:08 PM, <Steve_Friedman@nps.gov> wrote:
> Joris,
>
> I've been following this thread for a few days as I am beginning to use
> randomForest in my work.  I am confused by your last email.
>
> What do you mean that randomForest does not handle categorical variables ?
>
> It can be used in either regression or classification analysis.  Do you
> mean that categorical predictors are not suitable? Certainly they are as
> the response.
> Would you be so kind, and clarify what you were suggesting.
>
> Thanks,
>
> Steve Friedman Ph. D.
> Spatial Statistical Analyst
> Everglades and Dry Tortugas National Park
> 950 N Krome Ave (3rd Floor)
> Homestead, Florida 33034
>
> Steve_Friedman@nps.gov
> Office (305) 224 - 4282
> Fax     (305) 224 - 4147
>
>
>
> From: Joris Meys <jorismeys@gmail.com>
> Sent by: r-help-bounces@r-project.org
> To: abanero <gdevitis@xtel.it>
> cc: r-help@r-project.org
> Date: 05/27/2010 07:56 AM
> Subject: Re: [R] cluster analysis and supervised classification: an
> alternative to knn1?
>
> Hi Abanero,
>
> first, I have to correct myself. Knn1 is a supervised learning algorithm, so
> my comment wasn't completely correct. In any case, if you want to do a
> clustering prior to a supervised classification, the function daisy() can
> handle any kind of variable. The resulting distance matrix can be used with
> a number of different methods.
>
> And you're right, randomForest doesn't handle categorical variables either.
> So I haven't been of great help here...
> Cheers
> Joris
>
> On Thu, May 27, 2010 at 1:25 PM, abanero <gdevitis@xtel.it> wrote:
>
> >
> > Hi,
> >
> > thank you Joris and Ulrich for your answers.
> >
> > Joris Meys wrote:
> >
> > >see the library randomForest for example
> >
> >
> > I'm trying to find some example in randomForest with categorical variables
> > but I haven't found anything. Do you know any example with both categorical
> > and numerical variables? Anyway, I don't have any class labels yet. How
> > could I find clusters with randomForest?
> >
> >
> > Ulrich wrote:
> >
> > >Probably the simplest way is Affinity Propagation[...] All you need is a
> > >way of measuring the similarity of samples which is straightforward both
> > >for numerical and categorical variables.
> >
> > I had a look at the documentation of the package apcluster. That's
> > interesting, but do you have any example using it with both categorical and
> > numerical variables? I'd like to test it with a large dataset.
> >
> > Thanks a lot!
> > Cheers
> >
> > Giuseppe
> >

