On Mon, 10 Mar 2003, Ko-Kang Kevin Wang wrote:
> Hi,
>
> This may actually be a theoretical question.
>
> When I tried to do the following:
>
> ##########################################################
> > colnames(rating.adclms)
> >  [1] "usage"    "mileage"  "sex"      "excess"   "ncd"
> >  [6] "primage"  "minage"   "drivers"  "district" "cargroup"
> > [11] "car.age"  "adclms"   "days"
> > rating.r1 <- rpart(adclms ~ ., data = rating.adclms,
> + method = "class")
> > rating.r1
> n= 140602
>
> node), split, n, loss, yval, (yprob)
> * denotes terminal node
>
> 1) root 140602 3792 0 (9.730303e-01 2.506365e-02 1.834967e-03 7.112274e-05) *
> ##########################################################
>
> Should I set the costs in rpart()? I'm kind of surprised to see it only
> return 1 node for the tree.

Why are you surprised? One class has 97% of the examples, so it may be
impossible to find a single split that makes a worthwhile improvement
(the default threshold is 1%) in classification. You probably want to
lower cp (an argument to rpart.control).
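
As a sketch (untested against the poster's data, so the variable names are
taken on trust from the quoted output): lowering cp lets rpart attempt
splits whose improvement falls below the default 0.01, and printcp then
shows the cp table for pruning back.

```r
library(rpart)

## Lower the complexity parameter so small-improvement splits are tried;
## 0.0001 is an illustrative value, not a recommendation.
rating.r1 <- rpart(adclms ~ ., data = rating.adclms, method = "class",
                   control = rpart.control(cp = 0.0001))

printcp(rating.r1)  # inspect the cp table, then prune() to a sensible cp
```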
You could use losses, but I would use weighted sub-sampling of the
training set. See my 1996 book on Pattern Recognition and Neural Networks
for the theory and the practical details.
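
Both suggestions can be sketched roughly as follows. The loss matrix and
the sampling ratio below are illustrative assumptions, not values from the
book or the poster's problem; note the quoted yprob vector has four
entries, so adclms has four levels and a real loss matrix would be 4x4.

```r
library(rpart)

## (a) Unequal losses via parms: a 4x4 matrix with 0 on the diagonal,
##     where calling a claim "0" is made 10x as costly as the reverse
##     (purely illustrative numbers).
L <- matrix(1, 4, 4); diag(L) <- 0; L[2:4, 1] <- 10
fit.loss <- rpart(adclms ~ ., data = rating.adclms, method = "class",
                  parms = list(loss = L))

## (b) Weighted sub-sampling: keep every rare-class row, draw a random
##     subset of the majority class (here 5x the rare count, an assumed
##     ratio), and fit on the rebalanced sample.
rare  <- rating.adclms[rating.adclms$adclms != "0", ]
major <- rating.adclms[rating.adclms$adclms == "0", ]
sub   <- rbind(rare, major[sample(nrow(major), 5 * nrow(rare)), ])
fit.sub <- rpart(adclms ~ ., data = sub, method = "class")
```

Predictions from fit.sub would then need re-weighting back to the true
class priors, which is where the book's theory comes in.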
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595