Don Stierman asks:
> -----Original Message-----
> From: Don Stierman [mailto:dstierman at cableone.net]
> Sent: Sunday, April 28, 2002 4:41 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] rpart problem
>
> I am having problems with rpart and a particular dataset. I try the
> following code and get no significant results back from the script.
> However, if I delete the value of 79 in the last row and column of the dataset,
> I do get results.
Actually you got results in the first case as well; you just did not get the
results you were expecting. You have at least three outliers in your data
(both the values I = 79 and I = 81). Regression trees are fitted by least
squares and it is pretty well known that outlying observations can seriously
influence any least squares procedure. The tree model, in effect, shrugs
its shoulders and says the best tree has a single root node.
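
To get a feel for how much one observation can matter, the root deviances in the two printouts quoted further down can be compared directly. The sketch below uses only those printed figures plus the standard identity for how a sum of squares about the mean changes when one point is added; the variable names are made up for illustration.

```r
# Figures copied from the two rpart printouts quoted in this message
n_without    <- 702        # observations once the value 79 is dropped
dev_with     <- 30738.37   # root deviance, all 703 observations
dev_without  <- 26246.990  # root deviance, 79 removed
mean_without <- 11.93447   # root mean, 79 removed

# Adding one point x to n points with mean m increases the sum of
# squares about the mean by n / (n + 1) * (x - m)^2
added <- n_without / (n_without + 1) * (79 - mean_without)^2

added                    # increase implied by the identity
dev_with - dev_without   # increase read off the printouts
added / dev_with         # share of the full root deviance due to this one value
```

If the arithmetic is right, the two increases agree to within rounding, and the single value 79 accounts for roughly 15 percent of the root deviance of all 703 observations.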
> Is rpart really that dependent on a single value,
The problem, such as it is, is not with rpart but with your model. You
might try a transformation of y, for example, and see if you get anything
more useful to you. With a transformed response the least squares fitting
procedure might be more reasonable and productive.
For example:

    fit <- rpart(sqrt(I) ~ ., data)

Alternatively you might use log(I+1) as the response. Some judgment based on
the context is necessary and you do not give the context.
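
A minimal sketch of how the raw and transformed responses might be compared side by side. The poster's temp.txt is not reproduced here, so the data frame below is simulated: the column names A, C, D and I follow the printout, but the values, the planted outliers and the object names are assumptions made purely for illustration.

```r
library(rpart)

# Simulated stand-in for the poster's data; the values are made up
set.seed(1)
n <- 300
dat <- data.frame(A = runif(n, 20, 35),
                  C = runif(n, 0, 1),
                  D = runif(n, 10, 13))
dat$I <- rpois(n, lambda = 8 + 6 * (dat$C < 0.3))
dat$I[1:3] <- c(79, 81, 81)   # a few gross outliers, loosely as in the thread

# Same predictors, three versions of the response
fit_raw  <- rpart(I ~ A + C + D,          data = dat, method = "anova")
fit_sqrt <- rpart(sqrt(I) ~ A + C + D,    data = dat, method = "anova")
fit_log  <- rpart(log(I + 1) ~ A + C + D, data = dat, method = "anova")

# Number of terminal nodes each fit ends up with (1 means root only)
leaves <- sapply(list(raw = fit_raw, sqrt = fit_sqrt, log = fit_log),
                 function(f) sum(f$frame$var == "<leaf>"))
leaves
```

What the counts come out as depends entirely on the data, so nothing should be read into the simulated numbers; the point is only the mechanics of refitting with a transformed response.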
> or is there
> something else wrong? How do I fix this so that I will get results without
> having to change the dataset? Also, what is a good way to determine if a
> single value or a single column is causing problems in the rpart package?
Tree models are not a magic bullet. In particular they do not remove the
necessity to do some diagnostic checks afterwards. You might try looking at
qqnorm plots of the residuals, or plots of residuals vs fitted values, to see
whether the model achieves reasonable symmetry in the residuals, for example.
(For your example these all looked pretty dodgy even with a transformation, I
have to say!)
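
The checks just described might be sketched as follows. The built-in mtcars data set is used only as a stand-in so that the code runs on its own; the poster's data frame and formula would be substituted.

```r
library(rpart)

# mtcars is only a placeholder for the poster's data
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")

res    <- residuals(fit)  # observed minus fitted response
fitted <- predict(fit)    # fitted value (terminal node mean) per observation

op <- par(mfrow = c(1, 2))
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res)
plot(fitted, res, xlab = "Fitted value", ylab = "Residual",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
par(op)
```

Strong asymmetry in the Q-Q plot, or a few points sitting far from the rest in the second plot, would point to the kind of outlying observations discussed above.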
> Thanks, Don
>
> R script:
>
> library (rpart)
> data<-read.csv("C:\\temp.txt")
> fit <- rpart(data$I ~ ., data,method="anova")
> fit
>
> Results with value of 79:
>
> n= 703
>
> node), split, n, deviance, yval
> * denotes terminal node
>
> 1) root 703 30738.37 12.02987 *
>
> Results without value of 79:
>
> n=702 (1 observations deleted due to missing)
>
> node), split, n, deviance, yval
> * denotes terminal node
>
> 1) root 702 26246.990 11.93447
> 2) C>=0.325 615 19389.630 11.71382
> 4) D< 11.635 522 12578.270 11.47893 *
> 5) D>=11.635 93 6620.903 13.03226
> 10) D>=11.655 82 1849.561 12.07317 *
> 11) D< 11.655 11 4133.636 20.18182 *
> 3) C< 0.325 87 6615.747 13.49425
> 6) A< 28.68 69 1872.638 12.59420 *
> 7) A>=28.68 18 4472.944 16.94444 *
>
> << File: temp.txt >>