Dear All, I hope I am not asking a FAQ. I am dealing with a problem of graph theory [connected components in a non-directed graph] and I do not want to rediscover the wheel. I saw a large number of R packages dealing for instance with the k-means method or hierarchical clustering for spatially distributed data and I am basically facing a similar problem. I am given a set of data which are the positions of particles in 3 dimensions; I define two particles A and B to be directly connected if their Euclidean distance is below a certain threshold d. If A and B are directly connected and B and C are directly connected, then A,B and C are connected components (physically it means that they are members of the same cluster). All my N particles then split into k disjointed clusters, each with a certain number of connected components, and this is what I want to investigate. I do not know a priori how many clusters I have (this is my problem with e.g. k-means since k is an output for me); the only input is the set of 3-dimensional particle positions and a threshold distance. The algorithm/package I am looking should return the number of clusters and the composition of each cluster, e.g. the fact that the second cluster is made up of particles {R,T,L}. Consider for instance: # a 2-dimensional example x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)) colnames(x) <- c("x", "y") How can I then find out how many connected components I have when my threshold distance is d=0.5? Many thanks Lorenzo
Lorenzo, why can't you actually generate the graph to find the connection components? With the 'igraph' package this is something like: g <- graph.adjacency( DIST < 0.5, mode="undirected" ) g <- simplify(g) no.clusters(g) assuming you have your distance matrix in 'DIST'. If N is too big then you don't really want to create the distance matrix, but use a more sophisticated approach to find the points that are close to each other than your threshold; then create the graph. So why using clustering algorithms? Btw. from your vector of points you can create the distance matrix by using the 'outer' function. Btw2. your algorithm for graph generation is similar to geometric random graphs, see function 'grg.game' in 'igraph'. Gabor On Mon, Jan 07, 2008 at 03:26:57PM +0100, Lorenzo Isella wrote:> Dear All, > I hope I am not asking a FAQ. I am dealing with a problem of graph > theory [connected components in a non-directed graph] and I do not > want to rediscover the wheel. > I saw a large number of R packages dealing for instance with the > k-means method or hierarchical clustering for spatially distributed > data and I am basically facing a similar problem. > I am given a set of data which are the positions of particles in 3 > dimensions; I define two particles A and B to be directly connected if > their Euclidean distance is below a certain threshold d. If A and B > are directly connected and B and C are directly connected, then A,B > and C are connected components (physically it means that they are > members of the same cluster). > All my N particles then split into k disjointed clusters, each with a > certain number of connected components, and this is what I want to > investigate. > I do not know a priori how many clusters I have (this is my problem > with e.g. k-means since k is an output for me); the only input is the > set of 3-dimensional particle positions and a threshold distance. > The algorithm/package I am looking should return the number of > clusters and the composition of each cluster, e.g. the fact that the > second cluster is made up of particles {R,T,L}. > Consider for instance: > > # a 2-dimensional example > x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), > matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)) > colnames(x) <- c("x", "y") > > How can I then find out how many connected components I have when my > threshold distance is d=0.5? > > Many thanks > > Lorenzo > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.-- Csardi Gabor <csardi at rmki.kfki.hu> MTA RMKI, ELTE TTK
Dear Lorenzo, if I understand your posting correctly, this is exactly what Single Linkage clustering does if you cut the dendrogram tree at your threshold distance. Therefore you can use hclust with method="single" (which produces the full dendrogram; you have to generate the Euclidean distances first) and then you feed the output into cutree with h=0.5 (your threshold), which produces a partition by cutting the dendrogram at the height 0.5. Hope this helps, Christian On Mon, 7 Jan 2008, Lorenzo Isella wrote:> Dear All, > I hope I am not asking a FAQ. I am dealing with a problem of graph > theory [connected components in a non-directed graph] and I do not > want to rediscover the wheel. > I saw a large number of R packages dealing for instance with the > k-means method or hierarchical clustering for spatially distributed > data and I am basically facing a similar problem. > I am given a set of data which are the positions of particles in 3 > dimensions; I define two particles A and B to be directly connected if > their Euclidean distance is below a certain threshold d. If A and B > are directly connected and B and C are directly connected, then A,B > and C are connected components (physically it means that they are > members of the same cluster). > All my N particles then split into k disjointed clusters, each with a > certain number of connected components, and this is what I want to > investigate. > I do not know a priori how many clusters I have (this is my problem > with e.g. k-means since k is an output for me); the only input is the > set of 3-dimensional particle positions and a threshold distance. > The algorithm/package I am looking should return the number of > clusters and the composition of each cluster, e.g. the fact that the > second cluster is made up of particles {R,T,L}. > Consider for instance: > > # a 2-dimensional example > x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2), > matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)) > colnames(x) <- c("x", "y") > > How can I then find out how many connected components I have when my > threshold distance is d=0.5? > > Many thanks > > Lorenzo > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >*** --- *** Christian Hennig University College London, Department of Statistical Science Gower St., London WC1E 6BT, phone +44 207 679 1698 chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche