> I keep running into scalability limits with some R packages. This
> time, I am trying gbm to do adaboosting on my project. Initially I
> tried to grow trees by using rpart on a dataset with 200 variables and
> 30,000 observations. Now I am wondering whether I can apply adaboosting
> to it.
R in general seems to be particularly slow with a wide dataset (like your
200 variables) when using formula interfaces. I seem to remember that calls to
model.frame() are particularly slow. The gbm package also offers gbm.fit()
(which gbm() itself uses), which avoids model.frame(). It does not have a
formula interface and takes a little more work up front to organize the data,
but with gbm.fit() and a bit of patience you should be able to fit your model.
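For example, here is a minimal sketch of what a gbm.fit() call might look
like; the data frame mydata and the 0/1 outcome column y are hypothetical
placeholders for your own 200-variable, 30,000-row data, and the tuning
parameters are only illustrative:

    library(gbm)

    ## separate predictors from the outcome; no formula, no model.frame()
    x <- mydata[, setdiff(names(mydata), "y")]
    y <- mydata$y                      # 0/1 outcome for adaboost loss

    fit <- gbm.fit(x, y,
                   distribution = "adaboost",  # exponential (AdaBoost) loss
                   n.trees = 1000,
                   interaction.depth = 2,
                   shrinkage = 0.01,
                   bag.fraction = 0.5,
                   verbose = TRUE)

    ## pick the number of trees, e.g. by the out-of-bag estimate
    best.iter <- gbm.perf(fit, method = "OOB")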
> I am wondering if there is anyone who did a similar thing before and
> can provide some sample code. Also, any comments on the scalability
> and feasibility are welcome. Also, are there any limitations on the
> data, such as a categorical variable not being allowed more than
> 32 levels? Is there anything like that?
gbm() currently allows for categorical predictors with up to 256 levels.
There's no particular reason it is set to that value, and it could be
increased if needed. Just increase k_cMaxClasses in line 11 of
node_search.cpp, recompile, and install. If I get sufficient complaints
about the 256 barrier I'll change it.
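If you want to check ahead of time whether any of your factors would hit
that limit, a quick sketch like the following works (mydata is again a
hypothetical stand-in for your data frame):

    ## list any factor columns with more than 256 levels
    too.many <- sapply(mydata, function(v) is.factor(v) && nlevels(v) > 256)
    names(mydata)[too.many]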
Greg Ridgeway
RAND Statistics Group