Is there an optimal / minimum sample size for attempting to construct a classification tree using /rpart/?

I have 27 seagrass disturbance sites (boat groundings) that have been monitored for a number of years. The monitoring protocol for each site is identical. From the monitoring data, I am able to determine the level of recovery that each site has experienced. Recovery is our categorical dependent variable with values of none, low, medium, high which are based upon percent seagrass regrowth into the injury over time. I wish to be able to predict the level of recovery of future vessel grounding sites based upon a number of categorical / continuous predictor variables used here including (but not limited to) such parameters as: sediment grain size, wave exposure, original size (volume) of the injury, injury age, injury location.

When I run /rpart/, the data is split into only two terminal nodes based solely upon values of the original volume of each injury. No other predictor variables are considered, even though I have included about six of them in the model. When I remove volume from the model the same thing happens but with injury area - two terminal nodes are formed based upon area values and no other variables appear. I was hoping that this was a programming issue, me being a newbie and all, but I really think I've got the code right. Now I am beginning to wonder if my N is too small for this method?

--
Amy V. Uhrin, Research Ecologist
NOAA, National Ocean Service
Center for Coastal Fisheries and Habitat Research
101 Pivers Island Road
Beaufort, NC 28516
(252) 728-8778
(252) 728-8784 (fax)
Amy.Uhrin@noaa.gov
Amy, without looking at your actual code, I would suggest you take a look at rpart.control().

On 2/27/07, Amy Uhrin <amy.uhrin at noaa.gov> wrote:
> Is there an optimal / minimum sample size for attempting to construct a
> classification tree using /rpart/?
> [...]

--
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)
Amy Uhrin wrote:
> [...]
> Now I am beginning to wonder if my N is too small for this method?

In my experience N needs to be around 20,000 to get both good accuracy and replicability of patterns if the number of potential predictors is not tiny. In general, the R^2 from rpart is not competitive with that from an intelligently fitted regression model. It's just a difficult problem when relying on a single tree (hence the popularity of random forests, bagging, boosting).

Frank

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
Look at rpart.control. Rpart has two "advisory" parameters that control the tree size at the smallest nodes:

  minsplit (default 20): a node with fewer than this many subjects will not be worth splitting
  minbucket (default 7): don't create any final nodes with <7 observations

As I said, these are advisory, and reflect that these final splits are usually not worthwhile. They lead to a slightly faster run time, but mostly to a less complex plotted model.

I am not nearly as pessimistic as Frank Harrell ("need 20,000 observations"). Rpart often gives a good model, one that predicts the outcome, and I find the intermediate steps that it takes informative. However, there are often many trees with similar predictive ability but a very different "look" in terms of split points and variables. Saying that any given rpart model is THE best is perilous.

Terry T.
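
A minimal sketch of how those two defaults play out with N = 27. The data frame below is simulated stand-in data (the column names volume, exposure, age, grain and the way recovery is generated are all made up for illustration, not the original poster's data). With 27 rows, minsplit = 20 and minbucket = 7, the root can be split once, and a child can be split again only if it holds exactly 20 observations, so the default tree can have at most three terminal nodes. Loosening the controls lets rpart consider further splits, though whether those splits are trustworthy with so few sites is a separate question.

```r
library(rpart)

set.seed(1)

# Hypothetical stand-in for the 27 grounding sites (simulated, not real data)
n <- 27
sites <- data.frame(
  volume   = rlnorm(n, meanlog = 2),
  exposure = runif(n),
  age      = sample(1:10, n, replace = TRUE),
  grain    = factor(sample(c("fine", "medium", "coarse"), n, replace = TRUE))
)

# Made-up recovery classes driven by volume and exposure, cut at quartiles
score <- log(sites$volume) + 2 * sites$exposure + rnorm(n, sd = 0.3)
sites$recovery <- cut(score, breaks = quantile(score, 0:4 / 4),
                      include.lowest = TRUE,
                      labels = c("none", "low", "medium", "high"))

# Default controls: minsplit = 20, minbucket = 7
fit_default <- rpart(recovery ~ ., data = sites, method = "class")

# Loosened controls so that small nodes may still be split
fit_small <- rpart(recovery ~ ., data = sites, method = "class",
                   control = rpart.control(minsplit = 6, minbucket = 3))

# Helpers: number of terminal nodes and the observation count in each
n_leaves   <- function(fit) sum(fit$frame$var == "<leaf>")
leaf_sizes <- function(fit) fit$frame$n[fit$frame$var == "<leaf>"]

n_leaves(fit_default)
n_leaves(fit_small)
leaf_sizes(fit_small)

# Cross-validated error by tree size, useful for judging the extra splits
printcp(fit_small)
```

The xerror column printed by printcp() gives a rough check on whether the additional splits improve prediction or merely fit noise; with 27 sites the cross-validation estimates themselves will be quite variable.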