On Thu, 28 Jul 2011, seanstclair at verizon.net wrote:
>
> I am running the ctree function in R.
>
> My data has about 10 variables, many of which are categorical. Two of the
> categorical variables have many levels (one has 900 levels, another has
> 1,000 levels). As an example, one of these variables is disease code and is
> structured as A, B, C, ..., AA, AB, AC, ...
>
> Each time I've tried to run the ctree function with these two variables
> included in the data, the function never stops running. When I remove these
> two variables from the data and run without them, the function returns in
> about 3 seconds.
>
> Q: Is there a limit to the number of levels that a categorical variable can
> contain? Is there something else that I may be overlooking?

ctree() tries to split such a variable into two groups of levels, one for
the left and one for the right daughter node. There are 2^(k-1) - 1
possible groupings for an unordered categorical variable with k levels.
For k = 1000 this is far too many to evaluate in any feasible amount of
time.
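
To get a feel for the numbers, here is a small sketch in base R. The
function names are made up for illustration and are not part of any
package; the formulas are the ones given above.

```r
# Number of ways to split k unordered levels into two non-empty groups.
n_binary_splits <- function(k) 2^(k - 1) - 1

# For an ordered factor only the cut points between adjacent levels count.
n_ordered_splits <- function(k) k - 1

n_binary_splits(4)      # 7
n_binary_splits(10)     # 511
n_binary_splits(1000)   # on the order of 1e300
n_ordered_splits(1000)  # 999
```

Even at a billion candidate splits per second, the unordered case with
k = 1000 could not be searched exhaustively.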
You can try to break it down to a coarser classification of levels that is
still computable. Or, if the levels of the categorical variable have a
natural order, declare it as an ordered factor; then only k-1 splits are
possible, which is small enough.
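
Both options can be sketched in base R as follows. The data are simulated
and all names (d, disease, disease_coarse, disease_ord) are hypothetical;
keeping the 10 most frequent codes is an arbitrary choice, and a grouping
based on subject-matter knowledge will usually be better.

```r
set.seed(1)

# Simulated stand-in for a disease code with many levels.
codes <- sample(sprintf("D%03d", 1:200), size = 5000, replace = TRUE,
                prob = (200:1)^3)
d <- data.frame(disease = factor(codes))

# Option 1: keep the most frequent levels and lump the rest into "other".
keep <- names(sort(table(d$disease), decreasing = TRUE))[1:10]
coarse <- as.character(d$disease)
coarse[!(coarse %in% keep)] <- "other"
d$disease_coarse <- factor(coarse)
nlevels(d$disease_coarse)   # at most 11

# Option 2: only sensible if the levels really have a meaningful order.
d$disease_ord <- factor(d$disease,
                        levels = sort(levels(d$disease)),
                        ordered = TRUE)
is.ordered(d$disease_ord)   # TRUE
```

The coarsened or ordered variable would then replace the original one in
the model formula, e.g. ctree(y ~ disease_coarse, data = d), with y
standing in for your response.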
hth,
Z
>
> Thanks.
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>