t.peter.Mueller at gmx.net
2010-Jan-11 12:19 UTC
[R] K-means recluster data with given cluster centers
Dear R users,

I have several large data sets, and additional data sets will be created over time. I want to cluster all of the data in a similar or identical way with the k-means algorithm.

With the first data set I find my cluster centers and save them to a file [1]. This first data set is huge, so it is guaranteed that the cluster centers will converge.

Afterwards I load the saved cluster centers and use k-means to cluster all other data sets with the same cluster centers [2].

I tried this, but in the reclustering step I now get the following error message:
"Error: empty cluster: try a better set of initial centers"

It should not be a problem that one of the clusters is empty (contains no data points). This can happen because the new data sets can be smaller. What am I doing wrong? Is there another way to cluster new data in the same way as the old data sets?

Thanks
Peter


1: R code to find the cluster centers and save them to a file

#---INITIAL CLUSTERING TO FIND CLUSTER CENTERS
# LOAD LIB
library(cluster)

# LOAD DATA
data_unclean <- read.table("dataset1.dat")
data.matrix <- as.matrix(data_unclean)

# CLUSTER
Nclust <- 100   # number of cluster centers
Imax <- 200     # maximum number of iterations for convergence of the clustering
set.seed(100)   # set seed of the random number generator
init <- sample(dim(data.matrix)[1], Nclust)   # row indices of the initial Nclust prototypes
km <- kmeans(data.matrix, centers=data.matrix[init,], iter.max=Imax)

# WRITE OUT CLUSTER CENTERS
km$centers   # print cluster centers (columns: dimension components; rows: clusters)
km$size      # print number of data points in each cluster
clusterCenters <- km$centers
save(file="clusterCenters.RData", list='clusterCenters')   # example
write.table(km$centers, file = "clusterCenters.dat", sep = ",", col.names= FALSE, row.names= FALSE)


2: R code to recluster new data

#---RECLUSTER NEW DATA WITH GIVEN CLUSTER CENTERS
# LOAD LIB, SET PARAMETERS
library(cluster)
loopStart <- 0
loopEnd <- 10

# LOAD CLUSTER CENTERS
load("clusterCenters.RData")   # restores the object clusterCenters

# LOOP OVER DATA SETS AND RECLUSTER THEM
for(ii in loopStart:loopEnd){
  # DEFINE FILENAMES (with the "." that matches "dataset1.dat" above)
  #print(paste("test",ii,sep=""))
  filenameInput <- paste("dataset",ii,".dat",sep="")
  filenameOutput <- paste("dataset",ii,".datClusters",sep="")
  print(filenameInput)
  print(filenameOutput)

  # LOAD DATA
  data_unclean <- read.table(filenameInput)
  data.matrix <- as.matrix(data_unclean)

  # RECLUSTER DATA
  kmRecluster <- kmeans(data.matrix, centers=clusterCenters, iter.max=1)
  print(kmRecluster$size)   # explicit print(): values are not auto-printed inside a loop

  # WRITE OUT CLUSTER ASSIGNMENTS FOR EACH DATA SET
  write.table(kmRecluster$cluster, file = filenameOutput, sep = ",", col.names= FALSE, row.names= FALSE)
}
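The error described above can be reproduced on toy data, which may help when testing workarounds. The following is only a sketch under assumptions: the data and centers are made up, the third center is deliberately placed far from every point, and kmeans is called with its default algorithm.

```r
# Toy reproduction (hypothetical data, not the poster's files)
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)                  # 20 points scattered around the origin
centers <- rbind(c(-1, 0), c(1, 0), c(50, 50))    # third center is far from every point

# Capture the error message instead of stopping the script
msg <- tryCatch({
  kmeans(x, centers = centers, iter.max = 1)
  "no error"
}, error = function(e) conditionMessage(e))

print(msg)
```

If the message mentions an empty cluster, the call failed because no point was closest to the third center, which is the same situation as a small new data set that does not populate all 100 saved centers.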
Christian Hennig
2010-Jan-11 12:46 UTC
[R] K-means recluster data with given cluster centers
It is a bit of a nuisance that kmeans returns an error if there is an empty cluster. It should not be too difficult to do without the kmeans function for what you call "reclustering": you could write your own function that assigns every point of the new data to the closest initial center. That should be relatively easy and does the same thing, if I understand correctly what you want.

I won't comment on whether what you are attempting makes sense, which depends entirely on the aim of your analysis (and on what you mean by "cluster in the same way"). An alternative, though, could be to cluster the initial data with mclustBIC in the mclust package and to use the resulting clusters as training data in mclustDA.

Cheers,
Christian

On Mon, 11 Jan 2010, t.peter.Mueller at gmx.net wrote:

> K-means recluster data with given cluster centers
>
> Dear R user,
>
> I have several large data sets. Over time additional new data sets will be created.
> I want to cluster all the data in a similar/ identical way with the k-means algorithm.
>
> With the first data set I will find my cluster centers and save the cluster centers to a file [1].
> This first data set is huge, it is guarantied that cluster centers will converge.
>
> Afterwards I load my cluster centers and cluster via k-means all other datasets with the same cluster centers [2].
>
> I tried this but now I'm getting in the reclustering step following error message:
> "Error: empty cluster: try a better set of initial centers"
>
> That one of the clusters is empty (has no datapoint) should not be a
> problem. This can happen because the new data sets can be smaller. What
> am I doing wrong? Is there a other way to cluster new data in the same
> way like the old datasets?
>
> Thanks
> Peter
>
> [...]

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
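
The suggestion in the reply above, assigning every point of the new data to the closest saved center without calling kmeans, could be sketched roughly as follows. This is a minimal illustration and not code from the thread: the helper name assign_to_centers and the toy data are assumptions, and Euclidean distance is assumed because that is what k-means uses.

```r
# Hypothetical helper (not part of any package): for each row of `x`,
# return the index of the nearest row of `centers` (Euclidean distance).
assign_to_centers <- function(x, centers) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  stopifnot(ncol(x) == ncol(centers))
  # Squared distances via ||x_i - c_j||^2 = ||x_i||^2 + ||c_j||^2 - 2 * x_i . c_j
  d2 <- outer(rowSums(x^2), rowSums(centers^2), "+") - 2 * tcrossprod(x, centers)
  unname(apply(d2, 1, which.min))
}

# Toy usage: the third center attracts no points, which is not an error here
centers <- rbind(c(0, 0), c(10, 10), c(100, 100))
newdata <- rbind(c(0.5, -0.2), c(9, 11), c(1, 1), c(10.5, 9.5))

cl <- assign_to_centers(newdata, centers)
sizes <- tabulate(cl, nbins = nrow(centers))   # cluster sizes, zeros included
print(cl)
print(sizes)
```

In the reclustering loop from the original post, cl would take the place of kmRecluster$cluster and sizes the place of kmRecluster$size. The distance matrix has one row per data point and one column per center, so very large data sets may need to be processed in chunks.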