thr3ads.net - R help - [R] kmeans cluster analysis. How do I (1) determine probability of cluster membership (2) determine cluster membership for a new subject [Oct 2012]


John Sorkin

2012-Oct-02 15:35 UTC

[R] kmeans cluster analysis. How do I (1) determine probability of cluster membership (2) determine cluster membership for a new subject

Windows XP
R 2.15
 
I am running a cluster analysis in which I ask for three clusters (see code
below). The analysis nicely tells me which cluster each subject in my
input dataset belongs to. I would like two pieces of information:
(1) for every subject in my input dataset, what is the probability of the
subject belonging to each of the three clusters?
(2) given a new subject, someone who was not in my original dataset, how can I
determine their cluster assignment?
Thanks,
John
 
# K-Means Cluster Analysis
jclusters <- 3
fit       <- kmeans(datascaled, jclusters) # 3 cluster solution
 
and fit$cluster tells me which cluster each observation in my input dataset
belongs to (output truncated for brevity):
 
> fit$cluster
  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17 . . . .
  1   1   1   1   3   1   1   1   1   2   1   2   1   1   1   1   1 . . . .
 
How do I get the probability of being in cluster 1, cluster 2, and cluster 3 for a
given subject, e.g. datascaled[1,]? How do I get the cluster assignment for a new
subject?
Thanks,
John
John David Sorkin M.D., Ph.D.
Chief, Biostatistics and Informatics
University of Maryland School of Medicine Division of Gerontology
Baltimore VA Medical Center
10 North Greene Street
GRECC (BT/18/GR)
Baltimore, MD 21201-1524
(Phone) 410-605-7119
(Fax) 410-605-7913 (Please call phone number above prior to faxing)
Confidentiality Statement:
This email message, including any attachments, is for the sole use of the
intended recipient(s) and may contain confidential and privileged information. 
Any unauthorized use, disclosure or distribution is prohibited.  If you are not
the intended recipient, please contact the sender by reply email and destroy all
copies of the original message.

Ranjan Maitra

2012-Oct-02 17:59 UTC


John,

On Tue, 2 Oct 2012 11:35:12 -0400 John Sorkin
<jsorkin at grecc.umaryland.edu> wrote:
> Windows XP
> R 2.15
>
> I am running a cluster analysis in which I ask for three clusters (see code
> below). The analysis nicely tells me what cluster each of the subjects in my
> input dataset belongs to. I would like two pieces of information
> (1) for every subject in my input data set, what is the probability of the
> subject belonging to each of the three cluster

K-means provides hard clustering: whichever cluster has the closest mean
gets the assignment, so there is no membership probability.

> (2) given a new subject, someone who was not in my original dataset, how
> can I determine their cluster assignment?

Look at the distance between the subject and each of the cluster means: the
subject is assigned to the cluster whose mean is closest.
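
The nearest-mean rule for a new subject can be sketched as follows. This is
only an illustration under stated assumptions: the data in "raw" are simulated,
"datascaled" is assumed to have been produced by scale(), and assign_cluster()
is a hypothetical helper name, not part of kmeans.

```r
# Simulated stand-in for the poster's data (invented for illustration).
set.seed(42)
raw <- rbind(matrix(rnorm(60, mean = 0),  ncol = 2),
             matrix(rnorm(60, mean = 5),  ncol = 2),
             matrix(rnorm(60, mean = 10), ncol = 2))
datascaled <- scale(raw)   # assumption: the training data were scaled this way
fit <- kmeans(datascaled, centers = 3, nstart = 25)

# Hypothetical helper: assign one new subject (given on the raw scale)
# to the cluster whose mean is closest.
assign_cluster <- function(new_raw, fit, scaled) {
  # Apply the same centering and scaling that scale() stored for the training data.
  z <- (new_raw - attr(scaled, "scaled:center")) / attr(scaled, "scaled:scale")
  # Squared Euclidean distance from the new subject to each cluster mean.
  d2 <- rowSums(sweep(fit$centers, 2, z)^2)
  unname(which.min(d2))
}

new_subject <- c(5.2, 4.7)
new_cluster <- assign_cluster(new_subject, fit, datascaled)
new_cluster
```

The important design point is that the new subject must be put on the same
scale as the data used to fit the clusters before distances are compared.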

If you are looking for probabilistic clustering (under Gaussian
mixture model assumptions), you could use model-based clustering: one R
package is mclust.
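
A minimal sketch of the model-based route, assuming the mclust package is
installed and using simulated data in place of the poster's "datascaled". The
names mfit, probs, and new_scaled are made up for this example.

```r
library(mclust)  # assumption: mclust is installed

# Simulated stand-in for the poster's scaled data (invented for illustration).
set.seed(7)
datascaled <- scale(rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
                          matrix(rnorm(100, mean = 5),  ncol = 2),
                          matrix(rnorm(100, mean = 10), ncol = 2)))

# Fit a Gaussian mixture with three components.
mfit <- Mclust(datascaled, G = 3)

# One row per subject, one column per cluster: estimated membership probabilities.
probs <- mfit$z
probs[1, ]                 # probabilities for the first subject
hard <- mfit$classification  # most probable cluster for each subject

# A new subject, already on the scaled scale (hypothetical values).
new_scaled <- matrix(c(0.1, -0.2), nrow = 1)
pred <- predict(mfit, newdata = new_scaled)
pred$z               # membership probabilities for the new subject
pred$classification  # most probable cluster for the new subject
```

These probabilities are only as trustworthy as the Gaussian mixture assumption,
and the mixture clusters need not coincide with the kmeans clusters.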

Btw, note that kmeans is very sensitive to initialization (as is
mclust): at the very least, you may want to try several random starts
by setting the "nstart" argument of kmeans to a large number.
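
As a rough illustration of the "nstart" advice, the sketch below compares one
random start with fifty on simulated data. The data and the value 50 are
arbitrary choices for the example, not recommendations from this thread.

```r
# Simulated, moderately overlapping clusters (invented for illustration).
set.seed(123)
x <- scale(rbind(matrix(rnorm(100, mean = 0), ncol = 2),
                 matrix(rnorm(100, mean = 3), ncol = 2),
                 matrix(rnorm(100, mean = 6), ncol = 2)))

# Same seed for both calls, so both begin from the same first random start.
set.seed(1)
single <- kmeans(x, centers = 3, nstart = 1)
set.seed(1)
multi <- kmeans(x, centers = 3, nstart = 50)

# kmeans keeps the start with the smallest total within-cluster sum of squares.
c(single = single$tot.withinss, multi = multi$tot.withinss)
```

A smaller tot.withinss means a tighter solution. With several starts, kmeans
returns the best solution it found, which makes the result less dependent on
one unlucky initialization.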

HTH,
Ranjan


John Sorkin

2012-Oct-02 18:32 UTC


Ranjan,
Thank you for your help. What eludes me is how one computes the distance from
each cluster for each subject. For my first subject, datascaled[1,], I have
tried the following:

v1 <- sum(fit$centers[1,]*datascaled[1,])
v2 <- sum(fit$centers[2,]*datascaled[1,])
v3 <- sum(fit$centers[3,]*datascaled[1,])

hoping that max(v1,v2,v3) would reproduce the group assignment, i.e. simply
assign the subject to the group that gives the largest value, but it does not.
How is the distance to the three clusters computed for each subject?
Thanks,
John 


Ranjan Maitra

2012-Oct-02 18:52 UTC


On Tue, 2 Oct 2012 14:32:12 -0400 John Sorkin
<jsorkin at grecc.umaryland.edu> wrote:
> Ranjan,
> Thank you for your help. What eludes me is how one computes the distance
> from each cluster for each subject. For my first subject, datascaled[1,], I
> have tried to use the following:
> v1 <- sum(fit$centers[1,]*datascaled[1,])
> v2 <- sum(fit$centers[2,]*datascaled[1,])
> v3 <- sum(fit$centers[3,]*datascaled[1,])
> hoping the max(v1,v2,v3) would reproduce the group assignment, i.e. simply
> assign the subject to the group that gives the largest value, but it does not.
> How is the distance to the three clusters computed for each subject?
> Thanks,
> John

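Well, it should be:

# squared Euclidean distance from subject 1 to each of the 3 cluster means
v <- numeric(3)
for (i in 1:3) 
   v[i] <- sum((fit$centers[i, ] - datascaled[1, ])^2)

# index of the closest cluster mean
which.min(v)

should provide the cluster assignment.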

Btw, there is a better, more efficient and automated way to do this,
i.e. avoid the loop using matrices and arrays and apply, but I have not
bothered with that here. 
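
One possible loop-free version is sketched below on simulated data. sq_dist()
is a hypothetical helper name, and this is only one of several ways to
vectorize the calculation.

```r
# Simulated stand-in for the poster's scaled data (invented for illustration).
set.seed(1)
datascaled <- scale(rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
                          matrix(rnorm(100, mean = 6),  ncol = 2),
                          matrix(rnorm(100, mean = 12), ncol = 2)))
fit <- kmeans(datascaled, centers = 3, nstart = 25, iter.max = 50)

# Hypothetical helper: n x k matrix of squared Euclidean distances
# between every row of x and every row of centers.
sq_dist <- function(x, centers) {
  x <- as.matrix(x)
  # ||x - c||^2 = ||x||^2 + ||c||^2 - 2 * (x . c), for all pairs at once
  outer(rowSums(x^2), rowSums(centers^2), "+") - 2 * x %*% t(centers)
}

d2 <- sq_dist(datascaled, fit$centers)

# For each subject, the column (cluster) with the smallest distance.
assigned <- unname(apply(d2, 1, which.min))

# Compare with the assignments stored by kmeans.
all(assigned == fit$cluster)
```

The expansion in sq_dist() also shows why taking the largest dot product alone
does not reproduce the assignment: it leaves out the squared length of each
cluster mean.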

Ranjan

-- 
Important Notice: This mailbox is ignored: e-mails are set to be
deleted on receipt. For those needing to send personal or professional
e-mail, please use appropriate addresses.


John Sorkin

2012-Oct-02 18:56 UTC


Thank you!
I just wanted to know how one goes from the values returned by kmeans to a
distance metric. You have shown me that it is simply the squared distance from
the centers! Thanks again.
John

