Dear R users,

I'm trying to use rpart() to build a classification tree on a big dataset. The number of samples is n=100 and the number of variables is p=10000.

At first I stored all the data in a data.frame and got a "stack overflow" error; then I changed the data into a matrix and the problem disappeared. Now the trouble is with the predict() function: since each newdata is a long list with p=10000 elements, predict() doesn't recognize it and simply returns the fitted values at the training data (rather than predictions for the newdata).

Could anyone give me some suggestions on how to proceed? Thank you.

Regards,
Ji

Ji Zhu, Assistant Professor
Department of Statistics, University of Michigan
439 West Hall, 550 East University, Ann Arbor, MI 48109
(734) 936-2577 (O), (734) 763-4676 (F)
jizhu at umich.edu
Try something like this (suppose x is the matrix of predictors in the training set, and xtest is the same for the test set):

    my.rp <- rpart(y ~ x, ...)
    test.pred <- predict(my.rp, newdata=data.frame(x=I(xtest)))

Make sure the name of the variable in the data frame given to newdata matches the name of the variable in the original formula, in this case `x', a matrix.

HTH,
Andy
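For anyone who wants to try the matrix-in-formula approach end to end, here is a minimal sketch on simulated data. The sizes, the seed, and the rule that generates y are made up for illustration (p is kept small so it runs quickly), and it assumes the installed rpart accepts a matrix term in the formula as described above.

```r
library(rpart)

set.seed(1)
n <- 100   # training samples
p <- 50    # predictors (illustrative; the original problem has p = 10000)

# Simulated training data: y depends only on the first column of x
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- factor(ifelse(x[, 1] > 0, "a", "b"))

# The whole predictor matrix enters the formula as a single term named x
my.rp <- rpart(y ~ x, method = "class")

# Test predictors with the same number of columns as x
xtest <- matrix(rnorm(20 * p), nrow = 20, ncol = p)

# Wrap the matrix with I() so the data frame keeps it as ONE column called x,
# matching the variable name used in the formula
test.pred <- predict(my.rp, newdata = data.frame(x = I(xtest)), type = "class")
```

Without I(), data.frame() would split xtest into p separate columns, and none of them would be named x, so predict() could not match the formula term.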
That's not a sensible thing to do. Supply predict.rpart with a data frame that contains just the variables rpart selected. R does have limits, and attempting to use 10000 variables is hitting them. But surely any statistician is aware of the dangers of selecting from 10000 variables on just 100 observations?

On Fri, 7 Nov 2003, Ji Zhu wrote:
> [...]

-- 
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
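One possible reading of this advice, sketched on simulated data: look up which variables the fitted tree actually split on, refit on just those, and predict from a correspondingly small data frame. The refit step is my own assumption, added so that the formula only names columns present in newdata; the data, sizes, and column names (V1, V2, ...) are likewise made up for illustration.

```r
library(rpart)

set.seed(1)
n <- 100   # training samples
p <- 200   # predictors (illustrative; the original problem has p = 10000)

# Simulated training data frame with columns V1..V200 plus the response y
train <- as.data.frame(matrix(rnorm(n * p), nrow = n, ncol = p))
train$y <- factor(ifelse(train$V1 > 0, "a", "b"))

# Initial fit on all variables
fit <- rpart(y ~ ., data = train, method = "class")

# Variables used in splits: fit$frame$var marks terminal nodes as "<leaf>"
used <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")

# Refit with a formula that names only the selected variables
small.fit <- rpart(reformulate(used, response = "y"),
                   data = train[, c("y", used), drop = FALSE],
                   method = "class")

# New data needs only the selected columns
test <- as.data.frame(matrix(rnorm(20 * p), nrow = 20, ncol = p))
pred <- predict(small.fit, newdata = test[, used, drop = FALSE], type = "class")
```

This keeps the data frame passed to predict() small, but it does nothing about the selection problem raised above: with p much larger than n, the chosen splits may well be spurious.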