Hi all,

I am interested in performing a cluster analysis on ecological data from forests in Pennsylvania. I would like to develop definitions for forest types (red maple forests, upland oak forests, etc.; AH, AR in the attached table) based on measured attributes in each forest type. To do this, I would like to 'draw clusters' around forest types based on information from the various tree species (red maple, red oak, etc.; 837, 832 in the attached table) occurring in those forests.

Each row of data contains mean values for a particular species occurring within a forest type at a particular site. In other words, if we monitored 10 sites in red maple forests, we would have only 10 rows of data for the tree species 'red maple', even though we measured 100 trees.

I have used classification trees to examine these data, which I like because of their predictive abilities for later 'unknown' datasets. However, my concern is that the mean species attributes (columns Diameter:Avgnumtrees in the attached table) are associated with (nested within?) the tree species (column Treespecies in the attached table): they are not independent attributes, but are directly tied to the species listed in that row.

My question is: what is the best way to conduct a clustering (I have also tried hclust, cclust and flexclust) or CART model with this sort of nested data? Also, what is the preferable method for predicting a new dataset once these clusters or CART models have been developed?

Any help would be greatly appreciated.

Kind regards,
Scott

--------------------------------------------------------------------------------
Scott L. Bearer, Ph.D.
Forest Ecologist
sbearer at tnc.org
(570) 321-9092 (Office)
(570) 321-9096 (Fax)
(570) 460-0778 (Mobile)
The Nature Conservancy in Pennsylvania
Community Arts Center
220 West Fourth Street, 3rd Floor
Williamsport, PA 17701
nature.org
PS: Due to r-help email size restrictions, I cannot post the table. Please let me know if you would like me to forward an example to you.
My apologies for the cross-posting.
Kind regards,
Scott

> head(data_hal_dom, 15)
ForestType  COMMON_NAME     [unlabeled]  BasalArea  TreesperAcre  DeadperAcre  VolumeperAcre  BiomassperAcre  AverageDiameter  STDERRDIAM  AVGHT  STDERRHT  AVGNUMTREES
AH          blackoak                 50       31.5          25.1           NA          950.9           47955             15.1         1.1   86.8      15.2            4
AH          chestnutoak              50       11.2            12           NA          231.9         16713.8             13.1         0.3     55       4.2            2
AH          northern oak             50       45.3          37.6           NA         1319.7         82508.2             14.7         0.9   81.5         7            6
AH          redmaple                 50       51.9          66.2           NA         1564.4         60960.9               12         0.2   70.3       2.5            3
AH          redpine                  50        8.8           9.3           NA          189.4          8106.9             13.2           0     42         0            1
AH          scarletoak               50       41.2          27.9           NA           1211         67645.6             16.3         1.5   80.3      12.4            3
AH          whiteoak                 50       10.4           9.2           NA          264.1         15738.6             14.4         0.3   73.3         0          1.3
AR          northern oak             50       47.2          30.1           12         1506.4           93490             16.9         0.9   84.2      10.7            5
AR          paperbirch               50        7.5             6           NA          243.7            9637             15.1           0     77         0            1
AR          redmaple                 50        7.1             6            6          226.7          9102.2             14.6           0     75         0            1
AR          sweetbirch               50        4.7             6           NA          146.3          6676.2               12           0   75.5         0            1
AR          whiteash                 50        6.8             6           NA          261.5          9474.5             14.4           0    106         0            1
AR          yellow-poplar            50       23.8          18.1           NA          962.1         28302.8             15.3         2.1   99.3       6.8            3
AR          easternhemlock           70       16.6             6           NA          512.6         17125.8             22.5           0     94         0            1
AR          northern oak             70       16.2             6           12          583.4         38060.4             22.2           0    110         0            1
Hello,

I believe this is an easy scripting problem, but one I am stumbling on.

I have a "known" vector of 3 colors, of length 10:

known <- c("red", "blue", "red", "red", "yellow", "blue", "yellow", "blue", "blue", "yellow")

and a model output vector:

modelout <- c("red", "red", "red", "blue", "yellow", "blue", "blue", "red", "blue", "yellow")

I would like to determine the proportion (in)correctly identified for each color. In other words:

% correct "red"
% correct "blue"
% correct "yellow"

How would I code this (assuming the actual dataset is more complex)?

Any help would be much appreciated.

Thank you,
Scott
Try something like:

> mytable <- table(known, modelout)
> prop.table(mytable, 1)

Also look at ?addmargins and the CrossTable function in the gmodels package.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch On Behalf Of Scott Bearer
> Sent: Thursday, July 12, 2007 11:32 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] calculating percent error from 2 vectors
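As a minimal sketch of how the row proportions above answer the original question: with `prop.table(mytable, 1)` each row sums to 1, so the diagonal gives the proportion of each known color that was classified correctly. The names `rowprops`, `pct_correct` and `pct_incorrect` are illustrative only, and the sketch assumes both vectors contain the same set of colors, so that the table is square with rows and columns in the same order.

```r
known    <- c("red", "blue", "red", "red", "yellow",
              "blue", "yellow", "blue", "blue", "yellow")
modelout <- c("red", "red", "red", "blue", "yellow",
              "blue", "blue", "red", "blue", "yellow")

# Rows are the known colors, columns are the model output.
mytable <- table(known, modelout)

# Row proportions: each row sums to 1.
rowprops <- prop.table(mytable, 1)

# Assumption: the table is square with identical row and column labels,
# so the diagonal holds the correctly classified share of each known color.
pct_correct   <- 100 * diag(rowprops)
pct_incorrect <- 100 - pct_correct

round(pct_correct, 1)
round(pct_incorrect, 1)
```

For these two vectors, blue is identified correctly half of the time and red and yellow two times out of three.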
On 12-Jul-07 17:32:03, Scott Bearer wrote:
> I would like to determine the proportion (in)correctly identified for
> each color.
>
> How would I code this (assuming the actual dataset is more complex)?

For your example:

> tbl<-table(known,modelout)
> tbl
        modelout
known    blue red yellow
  blue      2   2      0
  red       1   2      0
  yellow    1   0      2
> dim(tbl)
[1] 3 3
> for(i in (1:dim(tbl)[1])){print(sum(tbl[i,-i])/sum(tbl[i,]))}
[1] 0.5
[1] 0.3333333
[1] 0.3333333

and you can modify the "print" command to produce a desired format, e.g. using rownames(tbl)[i] for the successive colour names.

Hoping this helps (as a start),
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 12-Jul-07 Time: 20:15:34
------------------------------ XFMail ------------------------------
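The loop above relies on the table being square, with row i and column i referring to the same color. With a more complex dataset that may not hold, for instance if the model never predicts one of the colors. A hedged sketch of one way to guard against this is to give both vectors a shared set of factor levels before tabulating; `lev` and `err` are illustrative names, not part of the original posts.

```r
known    <- c("red", "blue", "red", "red", "yellow",
              "blue", "yellow", "blue", "blue", "yellow")
modelout <- c("red", "red", "red", "blue", "yellow",
              "blue", "blue", "red", "blue", "yellow")

# One shared set of levels keeps the table square and aligned,
# even if a color is missing from one of the two vectors.
lev <- sort(unique(c(known, modelout)))
tbl <- table(known    = factor(known,    levels = lev),
             modelout = factor(modelout, levels = lev))

# Misclassified count per known color = row total minus the diagonal,
# divided by the row total to give the error proportion.
err <- (rowSums(tbl) - diag(tbl)) / rowSums(tbl)

round(err, 3)
```

This reproduces the 0.5, 0.333, 0.333 printed by the loop, with the colour names attached. A color that occurs only in the model output has a row total of zero, so its entry comes out as NaN rather than as an error rate.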