Hi,

I have about half a million binary features and would like to find a model to estimate a continuous response. From what I know of the problem, the relation between predictors and response can be expressed as a linear model (i.e. design matrix: a large sparse 0/1 matrix; response: a continuous number).

Since it is not a classification problem, someone suggested that I try random forest in R. However, the randomForest help page points out: "For large data sets, especially those with large number of variables, calling 'randomForest' via the formula interface is not advised: There may be too much overhead in handling the formula." I also tried it on 300 variables, and R either gave an error message or did not respond. (OS: Windows XP; R: 1.9.0; RAM: 512MB)

Is there any way to run random forest on a dataset this big? Any suggestion is welcome!

Many thanks!
Chihying
Prof Brian Ripley
2004-Jul-03 05:58 UTC
[R] Half Million features Selection (Random Forest)
How many cases do you have? Since you apparently expect the dataset to be usable in R, you only have room to store a dataset with 200 cases or so (let alone space to analyse it).

Even selecting *one* variable is statistically nonsensical with fewer than millions of cases (as otherwise the possibility of chance agreement of predictors is too high -- and I don't know enough about your problem to do even a rough calculation with any confidence).

On Fri, 2 Jul 2004, daisy wrote:

> I have about half million binary features, and would like to find a
> model to estimate the continuous response. According to the inference, I
> can express predictors and response by linear model. (ie. Design matrix:
> large sparse matrix with 0/1. Response: Continuous number) Since it is
> not a classification problem, someone suggested me to try random forest
> in R. However, in the randomForest help page, it points out "For large
> data sets, especially those with large number of variables, calling
> 'randomForest' via the formula interface is not advised: There may be
> too much overhead in handling the formula." and I also gave a try on 300
> variables and R either gave me error message or no response. (OS:
> Windows XP; R:1.9.0 ; RAM:512MB) Is there any way to implement random
> forest on this big dataset? Any suggestion is welcome! Many thanks!

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
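
The "200 cases or so" figure in the reply can be checked with rough arithmetic. The sketch below is in Python rather than R and is only an illustration under stated assumptions: the design matrix is stored densely (not as a sparse matrix), each entry takes 4 bytes (as R's integer and logical types do) or 8 bytes (R's double), and the whole 512 MB is available to the matrix. The names `bytes_per_case` and `max_cases` are made up for this example.

```python
# Rough memory estimate for a dense design matrix (illustrative assumptions only).

N_FEATURES = 500_000          # "about half a million" binary features, from the post
RAM_BYTES = 512 * 1024**2     # 512 MB of RAM, from the post

# Assumed bytes per stored entry: R keeps integer and logical values in 4 bytes
# and numeric (double) values in 8 bytes.
BYTES_PER_ENTRY = {"integer": 4, "double": 8}


def bytes_per_case(n_features: int, entry_bytes: int) -> int:
    """Memory needed for one row (case) of a dense matrix."""
    return n_features * entry_bytes


def max_cases(ram_bytes: int, n_features: int, entry_bytes: int) -> int:
    """Upper bound on the number of rows if the matrix used all of RAM."""
    return ram_bytes // bytes_per_case(n_features, entry_bytes)


for kind, size in BYTES_PER_ENTRY.items():
    row_mb = bytes_per_case(N_FEATURES, size) / 1024**2
    print(f"{kind}: {row_mb:.2f} MB per case, at most "
          f"{max_cases(RAM_BYTES, N_FEATURES, size)} cases in 512 MB")
```

Under these assumptions the bound is 268 cases with 4-byte entries and 134 with 8-byte entries, before allowing for R itself, the operating system, or any copies made during analysis, which is the same order of magnitude as the estimate in the reply.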