thr3ads.net - R help - [R] rpart v. lda classification. [Feb 2003]

Rolf Turner

2003-Feb-12 00:30 UTC

[R] rpart v. lda classification.

I've been groping my way through a classification/discrimination
problem, from a consulting client.  There are 26 observations, with 4
possible categories and 24 (!!!) potential predictor variables.

I tried using lda() on the first 7 predictor variables and got 24 of
the 26 observations correctly classified.  (Training and testing both
on the complete data set --- just to get started.)

I then tried rpart() for comparison and was somewhat surprised when
rpart() only managed to classify 14 of the 26 observations correctly.
(I got the same classification using just the first 7 predictors as I
did using all of the predictors.)

I would have thought that rpart(), being unconstrained by a parametric
model, would have a tendency to over-fit and therefore to appear to
do better than lda() when the test data and training data are the
same.

Am I being silly, or is there something weird going on?  I can
give more detail on what I actually did, if anyone is interested.

The data are pretty obviously nothing like Gaussian, so my
gut feeling is that rpart() should be much more appropriate than
lda().  And it does not seem surprising that with so few
observations to train with, the success rate should be low, even
when testing and training on the same data set.  What does
surprise me is that lda() gets such a high success rate.

Should I just put this down as a random occurrence of a low
prob. event?

				cheers,

					Rolf Turner
					rolf at math.unb.ca

P.S.  Using CV=TRUE in lda() I got only 16 of the 26 observations
correctly classified.
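
The gap between the 24-of-26 resubstitution count and the 16-of-26 count
with CV=TRUE can be reproduced in outline. The following is a minimal
sketch, not the poster's actual analysis: the data are simulated noise
(the client data are not available), and the names d, g, resub and loo
are made up for illustration. It assumes the MASS package is installed.

```r
library(MASS)  # provides lda()

set.seed(1)
n <- 26  # same number of observations as in the post
p <- 7   # same number of predictors passed to lda() in the post

# Simulated stand-in data (an assumption): noise predictors, 4 arbitrary classes.
d <- data.frame(g = factor(rep(1:4, length.out = n)),
                matrix(rnorm(n * p), n, p))

# Resubstitution: fit on all 26 rows, then predict those same rows.
fit <- lda(g ~ ., data = d)
resub <- sum(predict(fit, d)$class == d$g)

# Leave-one-out cross-validation: each row is classified by a fit
# that excluded it.
loo <- sum(lda(g ~ ., data = d, CV = TRUE)$class == d$g)

c(resubstitution = resub, leave_one_out = loo)
```

With 7 predictors and 4 classes, lda() estimates a fair number of
parameters from 26 rows, so the resubstitution count tends to be
optimistic; the leave-one-out count is the less flattering but more
honest of the two.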

ripley@stats.ox.ac.uk

2003-Feb-12 09:10 UTC

[R] rpart v. lda classification.

On Tue, 11 Feb 2003, Rolf Turner wrote:
> 
> I've been groping my way through a classification/discrimination
> problem, from a consulting client.  There are 26 observations, with 4
> possible categories and 24 (!!!) potential predictor variables.
> 
> I tried using lda() on the first 7 predictor variables and got 24 of
> the 26 observations correctly classified.  (Training and testing both
> on the complete data set --- just to get started.)
> 
> I then tried rpart() for comparison and was somewhat surprised when
> rpart() only managed to classify 14 of the 26 observations correctly.
> (I got the same classification using just the first 7 predictors as I
> did using all of the predictors.)
> 
> I would have thought that rpart(), being unconstrained by a parametric
> model, would have a tendency to over-fit and therefore to appear to
> do better than lda() when the test data and training data are the
> same.
> 
> Am I being silly, or is there something weird going on?  I can
> give more detail on what I actually did, if anyone is interested.

The first.  rpart is seriously constrained by having so few observations,
and its model is much more restricted than lda: axis-parallel splits only.
There is a similar example, with pictures, in MASS (on Cushings).

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  stats.ox.ac.uk/~ripley
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
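
One concrete way in which so few observations constrain rpart() is
through its default control settings: rpart.control() uses minsplit = 20
and minbucket = round(minsplit/3), so with 26 rows the root can be split
at most once, and a two-leaf tree can predict at most two of the four
classes. Whether the original poster used these defaults is not stated in
the thread, so the sketch below is only an illustration on simulated
noise data; the relaxed control values and the variable names are
assumptions chosen for the example.

```r
library(rpart)

set.seed(1)
n <- 26
# Simulated stand-in data (an assumption): 7 noise predictors, 4 classes.
d <- data.frame(g = factor(rep(1:4, length.out = n)),
                matrix(rnorm(n * 7), n, 7))

# Default control: minsplit = 20, minbucket = round(20 / 3) = 7.
# After one split no child node can hold 20 rows, so growth stops.
fit_default <- rpart(g ~ ., data = d, method = "class")
leaves_default <- sum(fit_default$frame$var == "<leaf>")

# Relaxed control (illustrative values): tiny nodes allowed and no
# complexity penalty, so the tree is free to over-fit.
fit_relaxed <- rpart(g ~ ., data = d, method = "class",
                     control = rpart.control(minsplit = 2,
                                             minbucket = 1,
                                             cp = 0))
leaves_relaxed <- sum(fit_relaxed$frame$var == "<leaf>")

# Resubstitution counts for both trees.
correct_default <- sum(predict(fit_default, d, type = "class") == d$g)
correct_relaxed <- sum(predict(fit_relaxed, d, type = "class") == d$g)

c(leaves_default = leaves_default, leaves_relaxed = leaves_relaxed,
  correct_default = correct_default, correct_relaxed = correct_relaxed)
```

Relaxing the controls only shows that the default tree is limited by its
stopping rules rather than by the data; it does not make the tree a
better classifier, and each split remains parallel to a coordinate axis.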
