Melanie Vida
2005-Mar-22 23:14 UTC
[R] Error: Can not handle categorical predictors with more than 32 categories.
Hi All,

My question is in regards to an error generated when using randomForest in R. Is there a special way to format the data in order to avoid this error, or am I completely confused about what the error implies?

"Error in randomForest.default(m, y, ...) :
Can not handle categorical predictors with more than 32 categories."

This is generated from the command line:

> credit.rf <- randomForest(V16 ~ ., data=credit, mtry=2, importance = TRUE, do.trace=100)

The data set is the credit-screening data from the UCI repository,
ftp://ftp.ics.uci.edu/pub/machine-learning-databases/credit-screening/crx.data.
It consists of 690 samples and 16 attributes. The attribute information includes:

A1: b, a.
A2: continuous.
A3: continuous.
A4: u, y, l, t.
A5: g, p, gg.
A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
A7: v, h, bb, j, n, z, dd, ff, o.
A8: continuous.
A9: t, f.
A10: t, f.
A11: continuous.
A12: t, f.
A13: g, p, s.
A14: continuous.
A15: continuous.
A16: +, - (class attribute)

Has anyone tried randomForest in R on the credit-screening data set from the UCI repository?

Thanks in advance for any useful hints and tips,

Melanie
Uwe Ligges
2005-Mar-23 07:32 UTC
[R] Error: Can not handle categorical predictors with more than 32 categories.
Melanie Vida wrote:

> My question is in regards to an error generated when using randomForest
> in R. Is there a special way to format the data in order to avoid this
> error, or am I completely confused about what the error implies?
>
> "Error in randomForest.default(m, y, ...) :
> Can not handle categorical predictors with more than 32 categories."
[...]

For sure you forgot to set na.strings = "?" in read.table()...
Look at str(credit) to see that some numerics had been converted to factors for that reason.

Uwe Ligges
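A minimal sketch of the fix Uwe describes, using the same crx.data URL from the thread. With na.strings = "?", the "?" placeholders become NA, so V2 and V14 are read as numeric and never turn into high-cardinality factors. The na.action = na.omit argument is an assumption on my part, not from the thread: the re-read data contains genuine NAs, which randomForest's formula method rejects by default (na.action = na.fail).

library(randomForest)

# Treat "?" as missing so the continuous columns stay numeric
credit <- read.table(url("ftp://ftp.ics.uci.edu/pub/machine-learning-databases/credit-screening/crx.data"),
                     sep = ",", na.strings = "?")
str(credit)  # V2 and V14 should now be numeric, not factors

# na.action = na.omit drops incomplete rows (assumption, see above)
credit.rf <- randomForest(V16 ~ ., data = credit, mtry = 2,
                          importance = TRUE, do.trace = 100,
                          na.action = na.omit)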
Hi All,

I read the R newsletter, Volume 2/3, December 2002, page 18, and tried the example there, too. Then I used randomForest on a different data set from the UCI repository. The results for the "credit" data include two additional columns, "1" and "2", that the newsletter example did not generate from the fgl data set.

For the "credit" data, what does the output with the headings "1" and "2" imply for ntree=100...500 (below)? Does "1" refer to the actual data, "class 1", and "2" to a group of synthetic data, "class 2"? Did my random forest automatically default to unsupervised learning, create the synthetic class 2 data, and then classify the combined data? If so, which method did R use to generate the synthetic data? The newsletter states that there are two ways to generate synthetic data.

Further, tuning the parameters of these random forests would ideally optimize the OOB error rate, and whatever the column 1 and 2 error rates mean. I tried mtry = 2, 3 and 10, but that didn't change the errors much. Are these results reasonable, or should I try tuning different parameters for this special case?

ntree      OOB      1      2
  100:   20.72% 14.10% 28.99%
  200:   18.99% 13.58% 25.73%
  300:   19.71% 15.14% 25.41%
  400:   20.00% 14.10% 27.36%
  500:   19.13% 13.58% 26.06%

Call:
 randomForest(x = V16 ~ ., data = credit, mtry = 3, importance = TRUE, do.trace = 100)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of error rate: 19.86%
Confusion matrix:
    -   + class.error
-  326  57   0.1488251
+   80 227   0.2605863

Thanks in advance,

-Melanie

-------

# Read in the credit table
credit <- read.table(url("ftp://ftp.ics.uci.edu/pub/machine-learning-databases/credit-screening/crx.data"), sep = ",")
str(credit)
credit$V2 <- as.numeric(credit$V2)
credit$V14 <- as.numeric(credit$V14)
str(credit)
credit.rf <- randomForest(V16 ~ ., data = credit, mtry = 3, importance = TRUE, do.trace = 100)
print(credit.rf)
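One caveat about the script above, a known R behaviour rather than something raised in the thread: because crx.data was read without na.strings = "?", V2 and V14 arrive as factors, and as.numeric() on a factor returns the internal level codes, not the original values. A hedged sketch of two ways to recover the real numbers:

# Either re-read the file so "?" becomes NA and the columns are numeric...
credit <- read.table(url("ftp://ftp.ics.uci.edu/pub/machine-learning-databases/credit-screening/crx.data"),
                     sep = ",", na.strings = "?")

# ...or convert an already-read factor column via its character labels;
# plain as.numeric(credit$V2) would return factor level codes instead
credit$V2  <- as.numeric(as.character(credit$V2))
credit$V14 <- as.numeric(as.character(credit$V14))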