Hi everyone,

I have a problem using rpart (R 2.0.1 under Unix).

I have a large matrix (9271x7); my response variable is numeric and all my predictor variables are categorical (from 3 to 8 levels).

Here is an example:

> mydata[1:5,]
                  distance group3 group4 group5 group6 group7 group8
pos_1    0.141836040224967      a      c      e      a      g      g
pos_501  0.153605961621317      a      a      a      a      g      g
pos_1001 0.152246705384699      a      c      e      a      g      g
pos_1501 0.145563737522463      a      c      e      a      g      g
pos_2001 0.143940027378837      a      c      e      e      g      g

When using rpart() as follows, the program runs for ages, and after a few hours R is abruptly killed:

library(rpart)
fit <- rpart(distance ~ ., data = mydata)

When I change the categorical variables into numeric values (e.g. a = 1, b = 2, c = 3, etc.), the program runs normally in a few seconds. But this is not what I want, because it splits my variables according to "group7 > 4.5" (continuous) and not "group7 = a,b,d,f" or "c,e,g" (discrete).

Here is the result:

> fit
n= 9271

node), split, n, deviance, yval
      * denotes terminal node

 1) root 9271 28.43239000 0.1768883
   2) group7>=4.5 5830 4.87272700 0.1534626
     4) group5< 5.5 5783 3.29538700 0.1520110
       8) group5>=4.5 3068 0.68517040 0.1412967 *
       9) group5< 4.5 2715 1.86003600 0.1641184 *
     5) group5>=5.5 47 0.06597044 0.3320614 *
   3) group7< 4.5 3441 14.93984000 0.2165781
     6) group5< 1.5 1461 1.00414700 0.1906630 *
     7) group5>=1.5 1980 12.23050000 0.2357002
      14) group6>=2.5 1659 2.95395700 0.2090232
        28) group3>=2.5 1315 1.65184200 0.1957505 *
        29) group3< 2.5 344 0.18490260 0.2597607 *
      15) group6< 2.5 321 1.99404400 0.3735729 *

When I create a small data frame like the example above, e.g.:

distance = rnorm(5,0.15,0.01)
group3 = c("a","a","a","a","a")
group4 = c("c","a","c","c","c")
group5 = c("e","a","e","e","e")
group6 = c("a","a","a","a","e")
smalldata = data.frame(cbind(distance,group3,group4,group5,group6))

the program runs normally in a few seconds.
Why does it work on the large dataset with only numeric values but not with categorical predictor variables? I have the impression that it also treats my response variable as categorical and therefore can't handle 9271 levels, which would be quite normal. Is there a way to solve this problem?

Thank you all for your time and help,

Jennifer Becq
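The suspicion about the response variable can be checked directly. The sketch below is only an illustration on simulated data, not the poster's actual dataset: the names bad, good and fixed are hypothetical, and it assumes the large data frame was built the same way as smalldata, i.e. with data.frame(cbind(...)). cbind() on a mix of numeric and character vectors returns a character matrix, so distance would reach data.frame() as text and, under the defaults of R versions before 4.0, become a factor. In that case rpart() would attempt a classification tree rather than a regression tree. On the real data, str(mydata) would show whether this is what happened.

```r
library(rpart)

set.seed(1)
n <- 200
distance <- rnorm(n, 0.15, 0.01)
group3 <- sample(c("a", "b", "c"), n, replace = TRUE)
group4 <- sample(c("a", "b", "c", "d"), n, replace = TRUE)

# cbind() coerces everything to character; stringsAsFactors = TRUE
# reproduces the data.frame() default of R versions before 4.0.
bad <- data.frame(cbind(distance, group3, group4), stringsAsFactors = TRUE)
class(bad$distance)   # "factor"

# Passing the vectors straight to data.frame() keeps distance numeric
# while the predictors are explicit factors.
good <- data.frame(distance,
                   group3 = factor(group3),
                   group4 = factor(group4))
class(good$distance)  # "numeric"

# An existing data frame can be repaired by converting the factor
# back through its labels (not its integer codes).
fixed <- bad
fixed$distance <- as.numeric(as.character(fixed$distance))

# With a numeric response, rpart() fits a regression ("anova") tree
# and splits the factors into groups of levels.
fit <- rpart(distance ~ ., data = good)
fit$method
```

If distance does turn out to be a factor in mydata, converting it as in fixed (or rebuilding the data frame without cbind()) is the first thing to try before reducing the number of levels in the predictors.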
jenniferbecq at free.fr wrote:

> Hi everyone,
>
> I have a problem using rpart (R 2.0.1 under Unix)
>
> Indeed, I have a large matrix (9271x7), my response variable is numeric and all
> my predictor variables are categorical (from 3 to 8 levels).

Your problem is the number of levels. You get a similar number of dummy variables and your problem becomes really huge.

Uwe Ligges

> Here is an example :
>
> > mydata[1:5,]
>
>                   distance group3 group4 group5 group6 group7 group8
> pos_1    0.141836040224967      a      c      e      a      g      g
> pos_501  0.153605961621317      a      a      a      a      g      g
> pos_1001 0.152246705384699      a      c      e      a      g      g
> pos_1501 0.145563737522463      a      c      e      a      g      g
> pos_2001 0.143940027378837      a      c      e      e      g      g
>
> When using rpart() as follows, the program runs for ages, and after a few hours,
> R is abruptly killed :
>
> library(rpart)
> fit <- rpart(distance ~ ., data = mydata)
>
> When I change the categorical variables into numeric values (e.g. a = 1, b = 2,
> c = 3, etc...), the program runs normally in a few seconds. But this is not
> what I want because it separates my variables according to "group7 > 4.5"
> (continuous) and not "group7 = a,b,d,f" or "c,e,g" (discrete).
>
> here is the result :
>
> > fit
>
> n= 9271
>
> node), split, n, deviance, yval
>       * denotes terminal node
>
>  1) root 9271 28.43239000 0.1768883
>    2) group7>=4.5 5830 4.87272700 0.1534626
>      4) group5< 5.5 5783 3.29538700 0.1520110
>        8) group5>=4.5 3068 0.68517040 0.1412967 *
>        9) group5< 4.5 2715 1.86003600 0.1641184 *
>      5) group5>=5.5 47 0.06597044 0.3320614 *
>    3) group7< 4.5 3441 14.93984000 0.2165781
>      6) group5< 1.5 1461 1.00414700 0.1906630 *
>      7) group5>=1.5 1980 12.23050000 0.2357002
>       14) group6>=2.5 1659 2.95395700 0.2090232
>         28) group3>=2.5 1315 1.65184200 0.1957505 *
>         29) group3< 2.5 344 0.18490260 0.2597607 *
>       15) group6< 2.5 321 1.99404400 0.3735729 *
>
> When I create a small dataframe such as the example above, e.g. :
>
> distance = rnorm(5,0.15,0.01)
> group3 = c("a","a","a","a","a")
> group4 = c("c","a","c","c","c")
> group5 = c("e","a","e","e","e")
> group6 = c("a","a","a","a","e")
> smalldata = data.frame(cbind(distance,group3,group4,group5,group6))
>
> The program runs normally in a few seconds.
>
> Why does it work using the large dataset with only numeric values but not with
> categorical predictor variables ?
>
> I have the impression that it considers my response variable also as a
> categorical variable and therefore it can't handle 9271 levels, which is quite
> normal. Is there a way to solve this problem ?
>
> I thank you all for your time and help,
>
> Jennifer Becq
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html