thr3ads.net - R help - [R] Half Million features Selection (Random Forest) [Jul 2004]


daisy

2004-Jul-02 19:31 UTC

[R] Half Million features Selection (Random Forest)

Hi,

I have about half a million binary features and would like to find a model to
estimate a continuous response. According to the inference, I can express the
predictors and response as a linear model (i.e. design matrix: a large sparse
0/1 matrix; response: a continuous number). Since it is not a classification
problem, someone suggested that I try random forest in R. However, the
randomForest help page points out: "For large data sets, especially those with
large number of variables, calling 'randomForest' via the formula interface is
not advised: There may be too much overhead in handling the formula." I also
tried it with 300 variables, and R either gave an error message or did not
respond. (OS: Windows XP; R: 1.9.0; RAM: 512MB) Is there any way to run
random forest on a dataset this big? Any suggestion is welcome! Many thanks!

Chihying
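
For readers following the help-page advice quoted above, here is a minimal sketch of calling randomForest through its matrix interface (x and y) instead of a formula. The data are simulated stand-ins, not the poster's dataset; the sizes, the object names and the ntree value are assumptions chosen only so the example runs quickly, and the fit is skipped if the randomForest package is not installed.

```r
# Sketch only: simulated data stand in for the poster's (unavailable) dataset.
set.seed(1)
n <- 200   # hypothetical number of cases
p <- 300   # the number of variables the poster reported trying

# Dense 0/1 design matrix with named columns.
x <- matrix(rbinom(n * p, 1, 0.1), nrow = n, ncol = p)
colnames(x) <- paste0("f", seq_len(p))

# Continuous response driven by the first five features plus noise.
y <- as.numeric(x[, 1:5] %*% c(2, -1, 1.5, 0.5, -2) + rnorm(n))

if (requireNamespace("randomForest", quietly = TRUE)) {
  # Passing x and y directly avoids the formula interface and its overhead.
  fit <- randomForest::randomForest(x = x, y = y, ntree = 100,
                                    importance = TRUE)
  # Permutation importance (type = 1), largest first.
  imp <- randomForest::importance(fit, type = 1)
  print(head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]))
}
```

Because y is numeric, randomForest fits a regression forest. This shows only the calling convention; it does not address whether half a million columns fit in memory.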



Prof Brian Ripley

2004-Jul-03 05:58 UTC


[R] Half Million features Selection (Random Forest)

How many cases do you have?  Since you apparently expect the dataset to be 
usable in R, you only have room to store a dataset with 200 cases or so 
(let alone space to analyse it).
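
The storage estimate above can be checked with simple arithmetic. This is a rough sketch that assumes a dense matrix and, optimistically, that all 512 MB of RAM is available to R; the feature count and RAM size come from the original post, and the variable names are illustrative.

```r
# Assumed figures: 5e5 features and 512 MB RAM come from the original post;
# treating all RAM as available to R is an optimistic simplification.
n_features <- 5e5
ram_bytes  <- 512 * 2^20

# Bytes per element of a dense R matrix: 8 for double, 4 for integer storage.
bytes_per_value <- c(double = 8, integer = 4)

# Largest number of cases whose dense matrix fits entirely in RAM.
max_cases <- floor(ram_bytes / (n_features * bytes_per_value))
print(max_cases)

# Size in MiB of a dense 200-case matrix under each storage mode.
size_mib_200 <- 200 * n_features * bytes_per_value / 2^20
print(round(size_mib_200, 1))
```

Under these assumptions the limit is 134 cases with double storage and 268 with integer storage, which brackets the "200 cases or so" figure. A sparse representation would change the numbers, but the sparsity of the data is not stated in the thread.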

Even selecting *one* variable is statistically nonsensical with fewer than
millions of cases (as otherwise the possibility of chance agreement of
predictors is too high -- and I don't know enough about your problem to 
do even a rough calculation with any confidence).
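
One way to read the chance-agreement point is that, with many candidate features and few cases, some feature will look related to the response purely by chance. The simulation below is a hedged illustration of that reading, not a calculation for the poster's problem: the sample size, the reduced feature count and the 0.5 feature frequency are all assumptions.

```r
set.seed(42)
n_cases    <- 200    # hypothetical sample size
n_features <- 5000   # far fewer than half a million, to keep the run quick

# Binary features and a response generated independently of each other.
x <- matrix(rbinom(n_cases * n_features, 1, 0.5), nrow = n_cases)
y <- rnorm(n_cases)

# Correlation of every feature with the response; all true values are zero.
r <- as.vector(cor(x, y))
max_abs_cor <- max(abs(r))
print(max_abs_cor)
```

Every feature here is pure noise, yet the largest absolute correlation is well away from zero, and it tends to grow as more features are searched.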

On Fri, 2 Jul 2004, daisy wrote:
> I have about half million binary features, and would like to find a
> model to estimate the continous response. According to the inference, I
> can express predictors and response by linear model. (ie. Design matrix:
> large sparse matrix with 0/1. Response: Continous number) Since it is
> not a classification problem, someone suggested me to try random forest
> in R. However, in the randomForest help page, it points out "For large
> data sets, especially those with large number of variables, calling
> 'randomForest' via the formula interface is not advised: There may be
> too much overhead in handling the formula." and I also gave a try on 300
> variables and R either gave me error message or no response. (OS:
> Windows XP; R:1.9.0 ; RAM:512MB) Is there any way to implement random
> forest on this big dataset? Any suggestion is welcome! Many thanks!
-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
