Dear forum members,
It seems that there is no straightforward way to calculate the R2 of a cluster
solution in R, so I would like to know whether I am calculating an R2-like
statistic correctly for a given clustering solution. I have several cluster
solutions for the same data set, and I would like to know which one gives the
highest R2. My data (5 variables) are scaled to mean 0 and standard deviation
1. These are the commands I used to calculate R2 for one cluster solution:
# total sum of squares over all (scaled) variables
SSTot <- (nrow(grid40km.datascale) - 1) * sum(apply(grid40km.datascale, 2, var))
SStot_grid40km <- NULL
for (i in 1:22) # there are 22 clusters
{
  # grid40km.data is assumed to hold the same scaled values as grid40km.datascale
  data_group <- subset(grid40km.data, grid40km.cluster == i,
                       select = c(X1, X2, X3, X4, X5))
  # SS over all variables for a given cluster
  SSgroup <- (nrow(data_group) - 1) * sum(apply(data_group, 2, var))
  SStot_grid40km <- append(SStot_grid40km, SSgroup, after = length(SStot_grid40km))
}
# within SS (??) as the sum of SS over all clusters
ssw_grid40km <- sum(SStot_grid40km)
ssbetween_grid40km <- SSTot - ssw_grid40km
RSQ_grid40km2 <- ssbetween_grid40km / SSTot # R-square
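As a cross-check, the same idea (R2 = 1 - within SS / total SS) can be sketched more compactly. This is only a minimal sketch under assumptions: cluster_r2 is a hypothetical helper name, not a function from any package, and the built-in iris data with kmeans() stand in for my own data and clustering. Because kmeans() reports betweenss and totss itself, its ratio gives something to compare against.

```r
# Hypothetical helper (not from any package): R2 = 1 - SSW / SST
cluster_r2 <- function(x, cluster) {
  x <- as.matrix(x)
  # Total SS: squared deviations from the overall column means
  sst <- sum(scale(x, center = TRUE, scale = FALSE)^2)
  # Within SS: squared deviations from each cluster's own column means
  ssw <- sum(sapply(split(seq_len(nrow(x)), cluster), function(idx) {
    sum(scale(x[idx, , drop = FALSE], center = TRUE, scale = FALSE)^2)
  }))
  1 - ssw / sst
}

# Toy data: iris and kmeans() are stand-ins, not the grid40km data
set.seed(1)
x  <- scale(iris[, 1:4])
km <- kmeans(x, centers = 3)
r2 <- cluster_r2(x, km$cluster)

# Compare with the ratio that kmeans() reports itself
all.equal(r2, km$betweenss / km$totss)
```

Summing squared deviations directly, rather than using (n - 1) * var, also avoids NA values when a cluster contains a single observation.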
Am I right? Does this correspond to SAS's R2?
Many thanks,
Yan
Ressources Naturelles Canada
Service Canadien des Forêts - Centre de Foresterie des Laurentides
1055, rue du PEPS
CP 10380, Succ. Ste-Foy
Québec, QC, G1V 4C7
Tel. : +001 418 649-6859
Fax : +001 418 648-5849
email : Yan.Boulanger@nrcan.gc.ca