Mary Kindall
2013-Sep-13 18:41 UTC
[R] Which regression tree algorithm to use for large data?
I have a data frame with 2 million rows and approximately 200 columns/features; roughly 30-40% of the entries are blank. I am trying to find important features for a binary response variable. The predictors may be categorical or continuous.

I started by applying logistic regression, but with so many missing entries I feel this is not a good approach, since glm() discards every record that has any blank item. So I am now looking at tree-based algorithms (rpart or gbm), which are capable of handling missing data in a better way.

Since my data is too big for rpart or gbm, I decided to randomly sample 10,000 records from the original data, apply rpart to that sample, and keep building a pool of important variables. However, even these 10,000 records seem to be too much for the rpart algorithm. What can I do in this situation? Is there any switch I can use to make it fast, or is it impossible to apply rpart to my data?

I am using the following rpart command:

varimp <- rpart(fmla, data = tmpData, method = "class")$variable.importance

Thanks

--
-------------
Mary Kindall
Yorktown Heights, NY
USA
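A minimal sketch of the kind of speed-ups rpart.control() offers for a case like the one described above. The data frame, column names, and missingness pattern here are simulated stand-ins, not the poster's actual data; the control parameters (xval, maxsurrogate, cp, maxdepth) are real rpart arguments:

```r
library(rpart)

# Simulated stand-in for the real data: a binary response, one numeric
# predictor, one categorical predictor, with ~30% of x1 set to NA.
set.seed(1)
n <- 10000
x1 <- rnorm(n)
tmpData <- data.frame(
  y  = factor(ifelse(x1 + rnorm(n) > 0, "yes", "no")),
  x1 = x1,
  x2 = sample(letters[1:4], n, replace = TRUE)
)
tmpData$x1[sample(n, n * 0.3)] <- NA

# Speed-oriented control settings:
#   xval = 0       -- skip the 10-fold cross-validation, often the
#                     dominant cost; it is not needed if you only want
#                     variable.importance, not a pruned tree
#   maxsurrogate   -- fewer surrogate splits are cheaper to fit, but
#                     surrogates are how rpart routes missing values,
#                     so do not set this to 0 with this much missingness
#   cp, maxdepth   -- stop growing the tree earlier
ctrl <- rpart.control(xval = 0, maxsurrogate = 2, cp = 0.01, maxdepth = 10)

fit <- rpart(y ~ ., data = tmpData, method = "class", control = ctrl)
varimp <- fit$variable.importance
print(varimp)
```

One caveat on factors: rpart considers 2^(k-1) - 1 splits for an unordered factor with k levels, so a categorical predictor with many levels can by itself make fitting very slow.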