thr3ads.net - R help - [R] rpart minimum sample size [Feb 2007]


Amy Uhrin

2007-Feb-27 15:13 UTC

[R] rpart minimum sample size

Is there an optimal / minimum sample size for attempting to construct a 
classification tree using /rpart/?

I have 27 seagrass disturbance sites (boat groundings) that have been 
monitored for a number of years.  The monitoring protocol for each site 
is identical.  From the monitoring data, I am able to determine the 
level of recovery that each site has experienced.  Recovery is our 
categorical dependent variable with values of none, low, medium, high 
which are based upon percent seagrass regrowth into the injury over 
time.  I wish to be able to predict the level of recovery of future 
vessel grounding sites based upon a number of categorical / continuous 
predictor variables used here including (but not limited to) such 
parameters as:  sediment grain size, wave exposure, original size 
(volume) of the injury, injury age, injury location.

When I run /rpart/, the data is split into only two terminal nodes based 
solely upon values of the original volume of each injury.  No other 
predictor variables are considered, even though I have included about 
six of them in the model.  When I remove volume from the model the same 
thing happens but with injury area - two terminal nodes are formed based 
upon area values and no other variables appear.  I was hoping that this 
was a programming issue, me being a newbie and all, but I really think 
I've got the code right.  Now I am beginning to wonder if my N is too 
small for this method?

-- 
Amy V. Uhrin, Research Ecologist

NOAA, National Ocean Service
Center for Coastal Fisheries and Habitat Research
101 Pivers Island Road
Beaufort, NC 28516
(252) 728-8778
(252) 728-8784 (fax)
Amy.Uhrin@noaa.gov

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 \!/ \!/   <:}}}}}><   \!/ \!/  >^<**>^<  \!/ \!/ 



Wensui Liu

2007-Feb-27 15:44 UTC


Amy,
without looking at your actual code, I would suggest you take a
look at rpart.control()


-- 
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)

Frank E Harrell Jr

2007-Feb-27 16:08 UTC


In my experience N needs to be around 20,000 to get both good accuracy 
and replicability of patterns if the number of potential predictors is 
not tiny.  In general, the R^2 from rpart is not competitive with that 
from an intelligently fitted regression model.  It's just a difficult 
problem, when relying on a single tree (hence the popularity of random 
forests, bagging, boosting).

Frank
-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University

Terry Therneau

2007-Feb-28 14:59 UTC


Look at rpart.control.  Rpart has two "advisory" parameters that control
the tree size at the smallest nodes:
	minsplit (default 20): a node with fewer than this many subjects will
	not be worth splitting
	
	minbucket (default 7): don't create any final nodes with <7
	observations
	
As I said, these are advisory, and reflect that these final splits are usually
not worthwhile.  They lead to a little faster run time, but mostly to a less
complex plotted model.
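
As a rough illustration of how these two parameters interact with a sample
of 27, the sketch below fits a tree with the defaults and again with relaxed
limits.  The data are simulated: the column names (volume, exposure, age,
recovery) and the rule that generates recovery are assumptions made up for
the example, not the seagrass data from the original question, and the
relaxed values (minsplit = 4, minbucket = 2) are arbitrary.

```r
library(rpart)

set.seed(1)
n <- 27  # same N as in the question; everything else here is simulated

# Hypothetical predictors standing in for the real site variables
sites <- data.frame(
  volume   = runif(n, 1, 100),
  exposure = runif(n),
  age      = sample(1:10, n, replace = TRUE)
)

# Made-up rule so that more than one predictor carries signal
sites$recovery <- factor(
  ifelse(sites$volume > 50,
         ifelse(sites$exposure > 0.5, "low", "none"),
         ifelse(sites$age > 5, "high", "medium"))
)

# Defaults: minsplit = 20, minbucket = 7
fit_default <- rpart(recovery ~ ., data = sites, method = "class")

# Relaxed limits: smaller nodes may be split, smaller leaves are allowed
fit_relaxed <- rpart(recovery ~ ., data = sites, method = "class",
                     control = rpart.control(minsplit = 4, minbucket = 2))

# Count terminal nodes in a fitted tree
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

cat("leaves with defaults:      ", n_leaves(fit_default), "\n")
cat("leaves with relaxed limits:", n_leaves(fit_relaxed), "\n")
```

With 27 observations and the defaults, any child of the root holds at most
20 cases, so a second split is only possible in the one case where a child
has exactly 20.  That is one plausible reason a tree on a data set this
small stops at two terminal nodes.  Relaxing the limits allows a deeper
tree, but the extra splits rest on very few observations.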

  I am not nearly as pessimistic as Frank Harrell ("need 20,000 observations").
Rpart often gives a good model -- one that predicts the outcome, and I find
the intermediate steps that it takes informative.  However, there are often many
trees with similar predictive ability, but a very different "look" in terms
of splitpoints and variables.  Saying that any given rpart model is THE best
is perilous.
	Terry T.
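
One way to see the point about many trees with a different "look" is to
refit the tree on bootstrap resamples and tabulate which variable is chosen
for the first split.  This is only a sketch on simulated data: the columns
(volume, area, exposure, recovery), the correlation between volume and area,
and the 200 resamples are all assumptions chosen for illustration.

```r
library(rpart)

set.seed(42)
n <- 27

# Simulated sites; area is deliberately correlated with volume
sites <- data.frame(
  volume   = runif(n, 1, 100),
  exposure = runif(n)
)
sites$area     <- sites$volume * runif(n, 0.8, 1.2)
sites$recovery <- factor(ifelse(sites$volume > 50, "low", "high"))

# Variable used at the root; "<leaf>" means the tree did not split at all
root_variable <- function(fit) as.character(fit$frame$var[1])

# Refit on 200 bootstrap resamples (xval = 0 skips cross-validation)
roots <- replicate(200, {
  idx <- sample(n, replace = TRUE)
  fit <- rpart(recovery ~ ., data = sites[idx, ], method = "class",
               control = rpart.control(xval = 0))
  root_variable(fit)
})

# How often each variable was chosen for the first split
print(table(roots))
```

If two correlated predictors trade places at the root across resamples, the
single tree fitted to the full data should be read as one of several
comparable candidates and not as the definitive structure.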
