Mary Kindall
2013-Sep-13 18:41 UTC
[R] Which regression tree algorithm to use for large data?
I have a data frame with 2 million rows and approximately 200 columns/features; roughly 30-40% of the entries are blank. I am trying to find important features for a binary response variable. The predictors may be categorical or continuous.

I started by applying logistic regression, but with so many missing entries I feel this is not a good approach, since glm() discards every record that has any blank item. So I am now looking at tree-based algorithms (rpart or gbm), which are capable of handling missing data in a better way.

Since my data is too big for rpart or gbm, I decided to randomly sample 10,000 records from the original data, apply rpart to that sample, and keep building a pool of important variables. However, even these 10,000 records seem to be too much for the rpart algorithm. What can I do in this situation? Is there any switch I can use to make it fast, or is it impossible to apply rpart to my data?

I am using the following rpart command:

varimp <- rpart(fmla, data = tmpData, method = "class")$variable.importance

Thanks

--
-------------
Mary Kindall
Yorktown Heights, NY
USA
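A minimal sketch of the kind of speed-ups rpart.control() offers for a case like the one described above. The data frame, column names, and missingness pattern here are simulated stand-ins, not the poster's actual data; the control parameters (xval, maxsurrogate, cp, maxdepth) are real rpart arguments:

```r
library(rpart)

# Simulated stand-in for the real data: a binary response, one numeric
# predictor, one categorical predictor, with ~30% of x1 set to NA.
set.seed(1)
n <- 10000
x1 <- rnorm(n)
tmpData <- data.frame(
  y  = factor(ifelse(x1 + rnorm(n) > 0, "yes", "no")),
  x1 = x1,
  x2 = sample(letters[1:4], n, replace = TRUE)
)
tmpData$x1[sample(n, n * 0.3)] <- NA

# Speed-oriented control settings:
#   xval = 0       -- skip the 10-fold cross-validation, often the
#                     dominant cost; it is not needed if you only want
#                     variable.importance, not a pruned tree
#   maxsurrogate   -- fewer surrogate splits are cheaper to fit, but
#                     surrogates are how rpart routes missing values,
#                     so do not set this to 0 with this much missingness
#   cp, maxdepth   -- stop growing the tree earlier
ctrl <- rpart.control(xval = 0, maxsurrogate = 2, cp = 0.01, maxdepth = 10)

fit <- rpart(y ~ ., data = tmpData, method = "class", control = ctrl)
varimp <- fit$variable.importance
print(varimp)
```

One caveat on factors: rpart considers 2^(k-1) - 1 splits for an unordered factor with k levels, so a categorical predictor with many levels can by itself make fitting very slow.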