thr3ads.net - R help - [R] kmeans and incomplete distance matrix concern [Aug 2006]


Ffenics

2006-Aug-07 14:38 UTC

[R] kmeans and incomplete distance matrix concern

Hi there,
I have been using R to perform kmeans on a dataset. The data is read in using
read.table and then a matrix (x) is created, i.e.:

  mat <- matrix(0, nlevels(DF$V1), nlevels(DF$V2),
                dimnames = list(levels(DF$V1), levels(DF$V2)))
  mat[cbind(DF$V1, DF$V2)] <- DF$V3

This matrix is then used to create a distance matrix (y) with dist() before
performing the kmeans clustering.

My query is this: not all the data for the initial matrix (x) exists, so
the matrix is not fully populated; empty cells are filled with '0's.

Could someone please tell me how this may affect the result from the dist()
command? A '0' in a distance matrix means that the two variables are
identical, doesn't it? But I don't want things clustered together simply
because there was no information.

Is this a problem, and are there ways to circumvent it? Thanks
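
To make the concern concrete, here is a minimal sketch of how the fill value changes what dist() returns. The toy DF below is hypothetical (only the names DF, V1, V2, V3 come from the post). It relies on the documented behaviour of dist(): NA cells are left out of each pairwise computation and the sum is scaled up for the columns that remain, whereas a 0 is treated as a real measurement. This applies to dist() only, not to kmeans().

```r
# Hypothetical long-format data; combination ("c", "q") was never observed
DF <- data.frame(
  V1 = factor(c("a", "a", "b", "b", "c")),
  V2 = factor(c("p", "q", "p", "q", "p")),
  V3 = c(1, 2, 1, 5, 2)
)

# Fill with NA instead of 0 so "no information" stays distinguishable
mat_na <- matrix(NA_real_, nlevels(DF$V1), nlevels(DF$V2),
                 dimnames = list(levels(DF$V1), levels(DF$V2)))
mat_na[cbind(DF$V1, DF$V2)] <- DF$V3

# The zero-filled version, as in the original post
mat_0 <- mat_na
mat_0[is.na(mat_0)] <- 0

d_na <- as.matrix(dist(mat_na))  # missing cell skipped, sum rescaled
d_0  <- as.matrix(dist(mat_0))   # missing cell counted as a true 0

d_na["a", "c"]  # uses column p only
d_0["a", "c"]   # also uses the artificial 0 in column q
```

In this toy case the zero fill pushes row c further from a and b than the observed column alone would suggest, so the fill value is not neutral either way.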


Christian Hennig

2006-Aug-07 14:46 UTC


First of all, kmeans doesn't work on distance matrices.

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche

Ffenics

2006-Aug-07 15:43 UTC


I still don't quite understand. I thought the kmeans algorithm went something
like this:

Iterate until stable:
  1. Determine the centroid coordinates
  2. Determine the distance of each object to the centroids
  3. Group the objects based on minimum distance

So why do I not want a distance matrix?
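
The steps listed above can be sketched in a few lines of R, which may help show where the distances come from. This is only an illustration of Lloyd-style iterations under simple assumptions, not the code inside stats::kmeans; the function name lloyd_sketch and its arguments are made up for this example. The distances are from each object to the current centroids, and the centroids are recomputed as column means of the raw data, which is why the raw data matrix is needed rather than a pairwise distance matrix between objects.

```r
# Illustrative only: a bare-bones Lloyd-style loop (hypothetical helper)
lloyd_sketch <- function(x, centers, iter.max = 10) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  k <- nrow(centers)
  cluster <- integer(nrow(x))
  for (iter in seq_len(iter.max)) {
    # squared Euclidean distance from every row of x to every centroid
    d2 <- sapply(seq_len(k),
                 function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # assign each object to its nearest centroid
    new_cluster <- max.col(-d2, ties.method = "first")
    if (identical(new_cluster, cluster)) break  # assignments are stable
    cluster <- new_cluster
    # recompute each centroid as the mean of its members (needs raw data)
    for (j in seq_len(k)) {
      members <- x[cluster == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = cluster, centers = centers)
}

# Hypothetical data: 5 cases (rows), 4 variables (columns)
x <- rbind(c(78, 45, 34, 45), c(97, 23, 67, 12), c(9, 56, 12, 67),
           c(19, 67, 23, 90), c(34, 12, 78, 56))
fit <- lloyd_sketch(x, centers = x[c(1, 3), ])
fit$cluster
```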



Christian Hennig <chrish@stats.ucl.ac.uk> wrote:
> On Mon, 7 Aug 2006, Ffenics wrote:
>> Well then I don't understand, because everything I have read so far
>> suggests that you use the dist() function to create a matrix based on the
>> Euclidean distance and then the kmeans() function.
>
> kmeans requires a data matrix where cases are rows and variables are
> columns. (If you understand what kmeans does, you should know why: means
> can't be computed from distances.)
>
> I'm not sure about the NA behaviour. I guess NAs produce an error? (Try it
> out!)
> Anyway, I'd think about casewise deletion or imputation if I had to run
> kmeans on data with missing values.
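
A rough sketch of the two options mentioned above, casewise deletion and imputation, assuming the missing cells are stored as NA rather than 0. The data are hypothetical, and column-mean imputation is used only because it is the simplest choice, not because the thread recommends it.

```r
set.seed(1)  # kmeans picks random starting centres when given only a count

# Hypothetical data matrix with missing cells coded as NA
x <- rbind(c(1, 2), c(1.2, NA), c(8, 9),
           c(8.5, 9.5), c(NA, 9.2), c(0.8, 2.1))

# Option 1: casewise deletion, keep only rows with no missing values
x_complete <- x[complete.cases(x), , drop = FALSE]
fit_del <- kmeans(x_complete, centers = 2)

# Option 2: simple imputation, replace each NA by its column mean
x_imp <- x
for (j in seq_len(ncol(x_imp))) {
  miss <- is.na(x_imp[, j])
  x_imp[miss, j] <- mean(x_imp[, j], na.rm = TRUE)
}
fit_imp <- kmeans(x_imp, centers = 2)
```

Deletion discards whole cases, while mean imputation pulls incomplete cases toward the centre of the data; which trade-off is acceptable depends on how much is missing.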

Ffenics

2006-Aug-07 15:55 UTC


Thanks. I had a look at that and it says:

  Partitioning Clustering: Function kmeans() from package stats provides
  several algorithms for computing partitions with respect to Euclidean
  distance.

Hence why I am using a Euclidean distance matrix. Why is this incorrect?

Gabor Grothendieck <ggrothendieck@gmail.com> wrote:
There are many clustering functions in R and R packages and some
take distance objects whereas others do not.  You likely read about
hclust or some different clustering function.  See ?kmeans for the
kmeans function and also look at the CRAN Task View on clustering for
other clustering functions:

  http://cran.r-project.org/src/contrib/Views/

 

Gabor Grothendieck

2006-Aug-07 15:58 UTC


?kmeans says the following.  Note that x is a matrix of ***data***.
Also look at the examples at the end of the help page if it's still
not clear.

Usage:

     kmeans(x, centers, iter.max = 10, nstart = 1,
            algorithm = c("Hartigan-Wong", "Lloyd",
                          "Forgy", "MacQueen"))

Arguments:

       x: A numeric matrix of data, or an object that can be coerced to
          such a matrix (such as a numeric vector or a data frame with
          all numeric columns).
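
As a small illustration of the distinction drawn in this thread, the sketch below passes the data matrix itself to kmeans() and a dist() object to hclust(), which is one of the functions that does take distances. The data are hypothetical.

```r
set.seed(1)  # kmeans uses random starting centres when given only a count

# Hypothetical data: 6 cases (rows) measured on 2 variables (columns)
x <- rbind(c(1, 2), c(1.2, 1.8), c(0.8, 2.1),
           c(8, 9), c(8.5, 9.5), c(7.9, 9.2))

# kmeans takes the data matrix itself
km <- kmeans(x, centers = 2)
km$cluster

# hclust is an example of a function that takes a distance object
hc <- hclust(dist(x))
hc_groups <- cutree(hc, k = 2)
hc_groups
```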




Ffenics

2006-Aug-07 16:29 UTC


Thanks everyone for their help so far. I'm very appreciative of the fact
that people have pointed out that I was heading in the wrong direction.
I would be most grateful if someone could look over the following simple
example for me and tell me if this is how to do it.
I'm assuming by data matrix you mean the 'raw data' organised as a
matrix.

Data (not Euclidean distance) matrix:

> DF
  V1 V2 V3 V4
1 78 45 34 45
2 97 23 67 12
3  9 56 12 67
4 19 67 23 90
5 34 12 78 56

and then

> clusters.kmeans <- kmeans(DF, 2)

if I want 2 clusters, for example.

Am I also right in thinking that I can specify the 'centroids' around which
I want the clustering to be done?
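
On the last question: per ?kmeans, the centers argument can be either the number of clusters or a set of initial cluster centres, one row per centre. A minimal sketch using the data shown above follows; picking rows 1 and 3 as starting centres is an arbitrary choice for illustration. Note that these are starting values only, and the algorithm moves them as it iterates.

```r
# The example data from the post, rebuilt as a data frame
DF <- data.frame(V1 = c(78, 97, 9, 19, 34),
                 V2 = c(45, 23, 56, 67, 12),
                 V3 = c(34, 67, 12, 23, 78),
                 V4 = c(45, 12, 67, 90, 56))

# Initial centres: one row per cluster, same columns as the data
# (rows 1 and 3 are an arbitrary illustrative choice)
init <- as.matrix(DF[c(1, 3), ])

clusters.kmeans <- kmeans(DF, centers = init)
clusters.kmeans$cluster  # cluster membership of each case
clusters.kmeans$centers  # final centres, generally not equal to init
```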

