The limitation comes from the way categorical splits are represented in the
code. For a categorical variable with k categories, a split is represented
by k binary digits, one per category (0 = right, 1 = left), so it takes k
bits to store each split on k categories. To save storage, these bits are
packed into a single 4-byte (32-bit) integer, hence the limit of 32 categories.
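
To make the packing concrete, here is a minimal sketch in Python (not the package's actual C/Fortran code). The names `pack_split`, `sends_left` and `MAX_CATEGORIES`, and the choice of bit i for category i, are assumptions made for illustration only.

```python
MAX_CATEGORIES = 32  # number of bits in one 4-byte integer


def pack_split(flags):
    """Pack per-category flags (1 = left, 0 = right) into one 32-bit word.

    Hypothetical helper: bit i of the result holds the flag for category i.
    """
    k = len(flags)
    if k > MAX_CATEGORIES:
        # This is the situation behind the 32-category error.
        raise ValueError(
            f"cannot pack {k} categories into {MAX_CATEGORIES} bits"
        )
    word = 0
    for i, flag in enumerate(flags):
        if flag:
            word |= 1 << i
    return word


def sends_left(word, category):
    """Return True if the 0-based category goes to the left child."""
    return (word >> category) & 1 == 1
```

With this layout, a factor with 41 levels would need 41 bits, which does not fit in a single 32-bit word.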
The current Fortran code (version 5.x) by Breiman and Cutler gets around
this limitation by storing the split in an integer array. While this lifts
the 32-category limit, it takes much more memory to store the splits. I'm
still trying to figure out a more memory-efficient way of storing the splits
without imposing the 32-category limit. If anyone has suggestions, I'm all
ears.
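
For a rough sense of the trade-off, the sketch below compares two layouts in Python. Both are assumptions for illustration: that the integer-array approach uses one 4-byte integer per category, and a hypothetical alternative that spreads the bits over as many 32-bit words as needed. Neither is taken from the Fortran or R package source, and the names are made up.

```python
BITS_PER_WORD = 32  # one 4-byte integer


def pack_split_words(flags):
    """Pack per-category flags (1 = left, 0 = right) into a list of 32-bit words.

    Hypothetical layout: category i lives in word i // 32, bit i % 32.
    """
    n_words = (len(flags) + BITS_PER_WORD - 1) // BITS_PER_WORD
    words = [0] * n_words
    for i, flag in enumerate(flags):
        if flag:
            words[i // BITS_PER_WORD] |= 1 << (i % BITS_PER_WORD)
    return words


def sends_left_words(words, category):
    """Return True if the 0-based category goes to the left child."""
    word = words[category // BITS_PER_WORD]
    return (word >> (category % BITS_PER_WORD)) & 1 == 1


def bytes_per_split(k):
    """Bytes per split for k categories under the two assumed layouts."""
    one_int_per_category = 4 * k
    packed_words = 4 * ((k + BITS_PER_WORD - 1) // BITS_PER_WORD)
    return one_int_per_category, packed_words
```

Under these assumptions, a 41-level factor would take 164 bytes per split with one integer per category and 8 bytes with two packed words. The catch is that the number of words then varies with the number of levels, so the split storage can no longer be a single integer per node.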
Best,
Andy
> From: Arne.Muller at sanofi-aventis.com
>
> Hello,
>
> I'm using the random forest package. One of my factors in the
> data set contains 41 levels (I can't code this as a numeric
> value - in terms of linear models this would be a random
> factor). The randomForest call comes back with an error
> telling me that the limit is 32 categories.
>
> Is there any reason for this particular limit? Maybe it's
> possible to recompile the module with a different cutoff?
>
> thanks a lot for your help,
> kind regards,
>
>
> Arne
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html