Danny Heuman
2004-Jul-13 19:58 UTC
[R] Calculating sum of squares deviation between 2 similar matrices
Hi all,

I've got clusters and would like to match individual records to each cluster based on a sum of squares deviation. For each cluster and individual, I've got 50 variables to use (measured in the same way).

Matrix 1 is individuals and is 25000x50. Matrix 2 is the cluster centroids and is 100x50. The same variables are found in each matrix in the same order.

For each row of matrix 1, I'd like to calculate its 'distance' to every row of matrix 2 and get a ranking of matrix 2's rows (row IDs 1 to 100) sorted by distance.

I tried using the rdist and dist functions, but they return true (Euclidean) distances and all I want is the sum of squares deviation across the 50 variables. I don't know how to program that calculation and do it efficiently. Because of the size of the data I'm not sure that apply would work well here, which is why I was using a for loop.

The (highly inefficient) code I was using is below if that helps at all. I give you permission to laugh if you want. I'm not remotely close to a programmer.

Are there any suggestions from the general readership? I'm using R 1.9.0 on Windows XP with 1GB of RAM.

Thanks for your attention,

Danny

-------------------------------------------

#Calculate Euclidean distances between two sets of matrices.

library(foreign)
library(fields)

#centroid is small file with 100x50
centroid <- as.data.frame(read.spss("C:\\centroid.sav"))

#in_data is 25000x50
in_data <- as.data.frame(read.spss("C:\\in_vars.sav"))

#loop through the in_data records, calculate distances to the 100 centroids,
#sort the distances in ascending order and write out the centroid # and
#distance for all 100.
for(i in 1:nrow(in_data))
{
  #first column is the centroid #. columns 2 through 51 have data.
  aa <- as.matrix(centroid[,2:51])

  #first column is a unique identifier. columns 2 through 51 have data.
  bb <- as.matrix(in_data[i,2:51])

  #stack the in_data row on top of the 100 centroids and calculate Euclidean distance.
  cc <- rdist(rbind(bb,aa))

  #take first row of distance matrix, skipping the in_data row itself -
  #this is the distance of the in_data row to all 100 centroids.
  dd <- as.matrix(cc[1,2:(nrow(aa)+1)])

  #sort dd on distance and attach the centroid number.
  ee <- c(t(cbind(sort.list(dd), sort(dd))))

  #write sorted distances to file: one centroid #/distance pair per centroid.
  write(ee, file="C:\\cluster_distances.txt", ncol=2*length(dd), append=TRUE)
}
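
One possible way to get the sum of squares deviations without a loop over the 25000 records is sketched below. It is only an illustration under stated assumptions, not a tested solution for the real files: `x`, `centers`, and `ss_dev` are hypothetical names, and small random matrices stand in for the SPSS data. The sketch relies on the identity sum((x - c)^2) = sum(x^2) + sum(c^2) - 2*sum(x*c), which lets a single matrix product produce every individual-to-centroid deviation at once. Note also that the sum of squares deviation is just the squared Euclidean distance, so ranking centroids by either quantity gives the same order.

```r
# Hypothetical small stand-ins for the real data (25000x50 and 100x50).
set.seed(1)
x       <- matrix(rnorm(6 * 4), nrow = 6)  # individuals, one per row
centers <- matrix(rnorm(3 * 4), nrow = 3)  # cluster centroids, one per row

# Sum of squared deviations between every row of x and every row of centers,
# using sum((x - c)^2) = sum(x^2) + sum(c^2) - 2 * sum(x * c).
ss_dev <- function(x, centers) {
  d <- outer(rowSums(x^2), rowSums(centers^2), "+") - 2 * tcrossprod(x, centers)
  pmax(d, 0)  # clamp tiny negative values caused by floating-point rounding
}

ss <- ss_dev(x, centers)  # nrow(x) by nrow(centers) matrix

# For each individual: centroid row numbers ordered nearest first,
# and the matching deviations in ascending order.
rank_ids  <- t(apply(ss, 1, order))
rank_dist <- t(apply(ss, 1, sort))
```

With the real dimensions the result would be a 25000x100 matrix of doubles, roughly 20 MB, so it should fit comfortably in 1GB of RAM. The two ranking matrices could then be written out in one call rather than appended to the file one row at a time.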