thr3ads.net - R help - [R] Clustering nested data [Jul 2007]

Scott Bearer

2007-Jul-06 16:43 UTC

[R] Clustering nested data

Hi all,

I am interested in performing a cluster analysis on ecological data from
forests in Pennsylvania.  I would like to develop definitions for forest
types (red maple forests, upland oak forests, etc.; AH, AR in the attached
table) based on measured attributes in each forest type.  To do this, I
would like to 'draw clusters' around forest types based on information from
the various tree species (red maple, red oak, etc.; 837, 832 in the attached
table) occurring in those forests.  Each row of data holds the mean values
for one species occurring within one forest type at one site.  In other
words, if we monitored 10 sites in red maple forests, we would have only
10 rows of data for the tree species 'red maple', even though we measured
100 trees.

I have used classification trees to examine these data, which I like because
of their predictive ability for later 'unknown' datasets.  However, my
concern is that the mean species attributes (columns Diameter:Avgnumtrees in
the attached table) are not independent attributes: they are directly
associated with (nested within?) the tree species listed in that row (column
Treespecies in the attached table).

My question is: what is the best way to conduct a clustering (I have also
tried hclust, cclust and flexclust) or build a CART model with this sort of
nested data?  Also, what is the preferable method for predicting a new
dataset once these clusters or CART models have been developed?

Any help would be greatly appreciated.

Kind regards,
Scott


--------------------------------------------------------------------------------

      Scott L. Bearer, Ph.D.
      Forest Ecologist
      The Nature Conservancy in Pennsylvania

      sbearer at tnc.org
      (570) 321-9092 (Office)
      (570) 321-9096 (Fax)
      (570) 460-0778 (Mobile)

      Community Arts Center
      220 West Fourth Street, 3rd Floor
      Williamsport, PA  17701

      nature.org
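
One possible way to handle the nesting described in this post, offered as a
sketch rather than a confirmed answer from the thread: reshape the data so
that each site is a single row with one column per species, then cluster the
sites. The Site column and every value below are invented for illustration
(no site identifier appears in the thread), and only BasalArea is used to
keep the example short.

```r
# Toy long-format data: one row per species per site.
# 'Site' is a hypothetical column and all values are made up.
long <- data.frame(
  Site        = c("s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4"),
  ForestType  = c("AH", "AH", "AH", "AH", "AR", "AR", "AR", "AR"),
  COMMON_NAME = c("redmaple", "blackoak", "redmaple", "blackoak",
                  "redmaple", "whiteash", "redmaple", "whiteash"),
  BasalArea   = c(50, 30, 48, 33, 7, 25, 9, 22),
  stringsAsFactors = FALSE
)

# One row per site, one column per species.
wide <- with(long, tapply(BasalArea, list(Site, COMMON_NAME), sum))
wide[is.na(wide)] <- 0   # species not recorded at a site gets 0

# Hierarchical clustering of sites on the standardized species columns.
hc     <- hclust(dist(scale(wide)), method = "average")
groups <- cutree(hc, k = 2)

# Compare the unsupervised groups with the field-assigned forest types.
site_type <- long$ForestType[match(rownames(wide), long$Site)]
print(table(cluster = groups, forest_type = site_type))
```

With the species moved into the columns, the attributes no longer depend on
a species label stored in the same row. Whether treating absent species as
0 is appropriate depends on the attribute; it is reasonable for basal area
but not for something like mean diameter.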

Scott Bearer

2007-Jul-09 15:38 UTC

[R] Clustering nested data

PS-Due to r-help email size restrictions, I cannot post the table.  Please
let me know if you would like me to forward an example to you.


Scott Bearer

2007-Jul-09 18:06 UTC

[R] Clustering nested data

My apologies for cross-postings

> head(data_hal_dom, 15)
ForestType  COMMON_NAME     [unnamed]  BasalArea  TreesperAcre  DeadperAcre  VolumeperAcre  BiomassperAcre  AverageDiameter  STDERRDIAM  AVGHT  STDERRHT  AVGNUMTREES
AH          blackoak        50         31.5       25.1          NA           950.9          47955           15.1             1.1         86.8   15.2      4
AH          chestnutoak     50         11.2       12            NA           231.9          16713.8         13.1             0.3         55     4.2       2
AH          northern oak    50         45.3       37.6          NA           1319.7         82508.2         14.7             0.9         81.5   7         6
AH          redmaple        50         51.9       66.2          NA           1564.4         60960.9         12               0.2         70.3   2.5       3
AH          redpine         50         8.8        9.3           NA           189.4          8106.9          13.2             0           42     0         1
AH          scarletoak      50         41.2       27.9          NA           1211           67645.6         16.3             1.5         80.3   12.4      3
AH          whiteoak        50         10.4       9.2           NA           264.1          15738.6         14.4             0.3         73.3   0         1.3
AR          northern oak    50         47.2       30.1          12           1506.4         93490           16.9             0.9         84.2   10.7      5
AR          paperbirch      50         7.5        6             NA           243.7          9637            15.1             0           77     0         1
AR          redmaple        50         7.1        6             6            226.7          9102.2          14.6             0           75     0         1
AR          sweetbirch      50         4.7        6             NA           146.3          6676.2          12               0           75.5   0         1
AR          whiteash        50         6.8        6             NA           261.5          9474.5          14.4             0           106    0         1
AR          yellow-poplar   50         23.8       18.1          NA           962.1          28302.8         15.3             2.1         99.3   6.8       3
AR          easternhemlock  70         16.6       6             NA           512.6          17125.8         22.5             0           94     0         1
AR          northern oak    70         16.2       6             12           583.4          38060.4         22.2             0           110    0         1
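
For the CART side of the question, here is a minimal sketch of fitting a
classification tree and predicting forest type for new sites. It assumes the
rpart package is installed and that the data have already been arranged with
one row per site. The column names redmaple_BA and whiteash_BA and all
values are hypothetical; they are not taken from the table above.

```r
library(rpart)

# Hypothetical site-level training data: 20 sites per forest type,
# with basal area of two species as predictors (values are invented).
sites <- data.frame(
  ForestType  = factor(rep(c("AH", "AR"), each = 20)),
  redmaple_BA = c(seq(40, 59, length.out = 20), seq(4, 13, length.out = 20)),
  whiteash_BA = c(seq(0, 4, length.out = 20),   seq(12, 28, length.out = 20))
)

# Fit a classification tree for forest type.
fit <- rpart(ForestType ~ redmaple_BA + whiteash_BA,
             data = sites, method = "class")

# New, unlabeled sites must carry the same predictor columns.
new_sites <- data.frame(redmaple_BA = c(47, 10),
                        whiteash_BA = c(1, 18))

# type = "class" returns the predicted forest type for each new site.
pred <- predict(fit, newdata = new_sites, type = "class")
print(pred)
```

The same predict() call with type = "prob" returns class probabilities
instead of labels, which can help flag new sites that do not fit any
existing forest type well.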

Scott Bearer

2007-Jul-12 17:32 UTC

[R] calculating percent error from 2 vectors

Hello,

I believe this is an easy scripting problem, but one I am stumbling on.

I have a "known" vector of 3 colors with length 10:
known <- c("red", "blue", "red", "red", "yellow",
           "blue", "yellow", "blue", "blue", "yellow")

and a model output vector:
modelout <- c("red", "red", "red", "blue", "yellow",
              "blue", "blue", "red", "blue", "yellow")

I would like to determine the proportion (in)correctly identified for each
color.  In other words:
% correct "red"
% correct "blue"
% correct "yellow"

How would I code this (assuming the actual dataset is more complex)?

Any help would be much appreciated.

Thank you,
Scott

Greg Snow

2007-Jul-12 18:53 UTC

[R] calculating percent error from 2 vectors

Try something like:
> mytable <- table(known, modelout)
> prop.table( mytable, 1 )
Also look at ?addmargins and the CrossTable function in the gmodels
package.

Hope this helps,

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at intermountainmail.org
(801) 408-8111
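
The suggestion above can be sketched with the vectors from the question,
using base R only. The names row_props and pct_correct are introduced here
for illustration. With row proportions, the diagonal of the table holds the
share of each known color that the model identified correctly.

```r
known    <- c("red", "blue", "red", "red", "yellow",
              "blue", "yellow", "blue", "blue", "yellow")
modelout <- c("red", "red", "red", "blue", "yellow",
              "blue", "blue", "red", "blue", "yellow")

# Rows are the known classes, columns are the model's predictions.
mytable <- table(known, modelout)

# Row proportions: each row sums to 1.
row_props <- prop.table(mytable, 1)

# Percent correct per known color (the diagonal of the row proportions).
pct_correct <- 100 * diag(row_props)
print(round(pct_correct, 1))

# Counts with row and column totals appended.
print(addmargins(mytable))
```

For these vectors the result is 50% for blue and about 66.7% for red and
yellow. Taking the diagonal assumes the table is square, with the same
colors in the same order in rows and columns.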
 
 
(Ted Harding)

2007-Jul-12 19:15 UTC

[R] calculating percent error from 2 vectors

On 12-Jul-07 17:32:03, Scott Bearer wrote:
> Hello,
> 
> I believe this is an easy scripting problem, but one I am stumbling on.
> 
> I have a "known" vector of 3 colors with length 10:
> known<-c("red", "blue", "red", "red", "yellow", "blue", "yellow",
> "blue", "blue", "yellow")
> 
> and a model output vector:
> modelout<-c("red", "red", "red", "blue", "yellow", "blue", "blue",
> "red", "blue", "yellow")
> 
> I would like to determine the proportion (in)correctly identified for
> each color.  In other words:
> % correct "red"
> % correct "blue"
> % correct "yellow"
> 
> How would I code this (assuming the actual dataset is more complex)?

For your example:

> tbl<-table(known,modelout)
> tbl
        modelout
known    blue red yellow
  blue   2    2   0     
  red    1    2   0     
  yellow 1    0   2     
> dim(tbl)
[1] 3 3
> for(i in (1:dim(tbl)[1])){print(sum(tbl[i,-i])/sum(tbl[i,]))}
[1] 0.5
[1] 0.3333333
[1] 0.3333333

and you can modify the "print" command to produce a desired format,
e.g. using rownames(tbl)[i] for the successive colour names.

Hoping this helps (as a start),
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 12-Jul-07                                       Time: 20:15:34
------------------------------ XFMail ------------------------------
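
The loop above relies on the rows and columns of the table listing the same
colours in the same order, which may not hold if the model never predicts
one of the known classes. A vectorized variant that forces shared factor
levels and returns a named result might look like the sketch below. The
names lev and err are introduced here for illustration and are not part of
the original reply.

```r
known    <- c("red", "blue", "red", "red", "yellow",
              "blue", "yellow", "blue", "blue", "yellow")
modelout <- c("red", "red", "red", "blue", "yellow",
              "blue", "blue", "red", "blue", "yellow")

# Use one common set of levels so the table is always square,
# even when a class is missing from one of the two vectors.
lev <- sort(unique(c(known, modelout)))
tbl <- table(known    = factor(known,    levels = lev),
             modelout = factor(modelout, levels = lev))

# Per-class error rate: 1 minus (correct count / row total).
err <- 1 - diag(tbl) / rowSums(tbl)
names(err) <- lev
print(round(err, 3))
```

This reproduces the error rates printed by the loop (0.5 for blue, about
0.333 for red and yellow). A class that occurs only in modelout has a row
total of zero, so its error rate comes out as NaN.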
