Hi,

I am trying to build a multi-class classification tree using rpart. I tested it on the fgl data from the MASS package and it works well.

However, when I use my own small sample, described below, the program seems to take forever. I am not sure whether it is simply slow or whether there is something wrong with my code or data manipulation.

Please advise!

The data are described by the output of str() below. The call to rpart is:

library(rpart)
test_tree <- rpart(x$V142 ~ ., data=x, parms=list(split='gini'), cp=0.01)

The response variable is $V142, with 3 levels.

Thanks for your suggestions!

Ed.

> str(x)
`data.frame': 500 obs. of 142 variables
 $ V1  : int 4 4 4 4 4 4 4 4 4 4 ...
 $ V2  : Factor w/ 8 levels "1","2","3","4",..: 4 4 4 5 5 7 6 4 5 4 ...
 $ V3  : num 0.00803 0.00111 0.00995 0.01032 0.01295 ...
 $ V4  : num -0.011034 -0.003711 0.003436 0.000968 -0.006914 ...
 $ V5  : num 0.00524 0.00563 0.00973 0.01285 0.02148 ...
 $ V6  : num 0.00633 0.00831 0.02750 0.03375 0.01254 ...
 $ V7  : num -0.00422 -0.00151 0.00214 -0.00101 0.00299 ...
 $ V8  : num 0.02224 0.01761 0.01359 0.01045 0.00592 ...
 $ V9  : num -0.00301 -0.00260 0.01338 0.01129 0.00604 ...
 $ V10 : num 0.00224 0.00303 0.02312 0.02414 0.02752 ...
 $ V11 : num 0.00857 0.01134 0.05062 0.05789 0.04007 ...
 $ V12 : num 0.00435 0.00983 0.05276 0.05688 0.04305 ...
 $ V13 : num 0.025627 0.000429 0.055087 0.088517 0.068946 ...
 $ V14 : num 0.21 0.15 0.34 0.31 0.36 ...
 $ V15 : int 157 81 39 40 40 40 38 72 105 31 ...
 $ V16 : int 21 238 236 259 253 253 258 258 259 246 ...
 $ V17 : int 82 81 39 40 40 40 38 72 105 31 ...
 $ V18 : int 21 15 7 129 14 129 129 6 9 110 ...
 $ V19 : int 60 45 39 40 40 40 38 59 63 31 ...
 $ V20 : int 21 15 7 14 14 12 53 6 9 62 ...
 $ V21 : num 0.953 0.893 0.913 0.843 0.872 ...
 $ V22 : num 1.19 1.08 1.03 1.04 1.11 ...
 $ V23 : num 0.953 0.893 0.913 0.898 0.872 ...
 $ V24 : num 0.955 0.893 0.913 0.898 0.872 ...
 $ V25 : num 1.013 0.979 0.985 0.998 0.994 ...
 $ V26 : num 0.972 0.940 0.913 0.909 0.895 ...
 $ V27 : num 0.999 0.979 0.985 0.998 0.994 ...
 $ V28 : num 0.979 0.959 0.928 0.940 0.926 ...
 $ V29 : num 0.999 0.979 0.985 0.998 0.994 ...
 $ V30 : num 0.989 0.969 0.962 0.976 0.951 ...
 $ V31 : num 0.992 0.973 0.971 0.980 0.973 ...
 $ V32 : num 0.992 0.975 0.977 0.989 0.980 ...
 $ V33 : num 0.999 0.979 0.985 0.998 0.994 ...
 $ V34 : num 0.585 0.633 0.878 1.355 0.880 ...
 $ V35 : num 1.40 1.18 1.55 1.99 1.62 ...
 $ V36 : num 0.906 1.156 1.661 1.981 1.372 ...
 $ V37 : num 0.375 0.881 1.445 1.603 0.605 ...
 $ V38 : num 0.462 1.016 1.448 1.766 0.718 ...
 $ V39 : num 0.424 0.509 1.322 1.754 0.566 ...
 $ V40 : num 0.341 0.514 1.393 1.786 0.546 ...
 $ V41 : num -0.0681 0.8196 1.2009 1.6561 2.7571 ...
 $ V42 : num -4.354 -1.388 0.761 -0.145 -3.107 ...
 $ V43 : num 0.478 0.341 0.501 0.983 0.511 ...
 $ V44 : num 0.341 0.274 0.504 1.108 0.447 ...
 $ V45 : num 0.440 0.196 0.631 1.076 0.535 ...
 $ V46 : num 0.873 0.326 0.933 1.354 0.977 ...
 $ V47 : num -0.383 -0.170 0.686 0.843 0.328 ...
 $ V48 : num 0.138 0.384 1.332 1.352 0.217 ...
 $ V49 : num -0.105 0.311 0.984 1.201 -0.196 ...
 $ V50 : num -0.118 0.215 0.942 1.173 -0.233 ...
 $ V51 : num -0.245 0.165 0.890 1.057 -0.354 ...
 $ V52 : num -1.568 -0.577 -0.399 -0.748 -1.883 ...
 $ V53 : num -1.530 -0.420 -0.264 -0.522 -1.430 ...
 $ V54 : num 0.331 0.264 0.324 0.574 0.308 ...
 $ V55 : num 0.426 0.497 1.209 1.296 0.901 ...
 $ V56 : num 0.149 0.282 1.028 0.888 0.277 ...
 $ V57 : num 0.384 0.430 1.039 1.387 0.541 ...
 $ V58 : num 0.334 0.420 1.033 1.348 0.524 ...
 $ V59 : num 0.780 0.866 0.792 1.296 2.664 ...
 $ V60 : num -8.25 -3.22 6.06 3.82 -2.95 ...
 $ V61 : Factor w/ 20 levels "1","2","3","4",..: 3 14 8 3 14 10 9 16 17 14 ...
 $ V62 : num 0.589 1.062 1.083 0.721 2.764 ...
 $ V63 : num 0.830 0.878 1.030 1.218 3.371 ...
 $ V64 : num 0.0477 0.0183 0.0195 0.0535 0.0230 ...
 $ V65 : num 1.01 1.04 1.04 1.05 1.07 ...
 $ V66 : num 1.00 1.00 1.01 1.01 1.01 ...
 $ V67 : num 1.01 1.02 1.04 1.05 1.06 ...
 $ V68 : num 0.865 1.181 0.797 0.863 1.584 ...
 $ V69 : num 0.955 1.105 0.876 0.953 1.434 ...
 $ V70 : num 0.769 0.974 1.036 0.935 1.809 ...
 $ V71 : num 0.665 1.150 0.826 0.807 2.866 ...
 $ V72 : num 1.001 0.999 0.997 0.999 NA ...
 $ V73 : num 0.998 0.000 NA 0.992 NA ...
 $ V74 : num 1462 1462 1462 1462 1462 ...
 $ V75 : num 1 1 1 1 1 ...
 $ V76 : num 1.00 1.00 1.00 1.00 1.00 ...
 $ V77 : num 1.00 1.00 1.00 1.00 1.00 ...
 $ V78 : num 1.00 1.00 1.00 1.00 1.00 ...
 $ V79 : num 1.17 4.25 0.37 0.60 NA ...
 $ V80 : num 0.4375 0.6296 0.0855 0.0411 4.3333 ...
 $ V81 : num 0.500 0.160 NA 0.167 NA ...
 $ V82 : num 0.0625 0.0000 0.4444 0.2740 1.0000 ...
 $ V83 : num 0.0714 0.0000 NA 1.1111 NA ...
 $ V84 : num 3 1 3 NA NA ...
 $ V85 : num 0.600 0.667 0.750 0.667 1.000 ...
 $ V86 : num 0.200 0.667 0.500 0.667 3.000 ...
 $ V87 : num 4.16 4.16 4.16 4.16 4.16 ...
 $ V88 : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ V89 : num 1 1 1 1 1 ...
 $ V90 : num 1.01 1.01 1.01 1.01 1.01 ...
 $ V91 : num 1.02 1.02 1.02 1.02 1.02 ...
 $ V92 : num 1.00 1.00 1.00 1.00 1.00 ...
 $ V93 : num 0.998 0.998 0.998 0.998 0.998 ...
 $ V94 : num 0.28851 -0.00130 0.16621 0.19513 0.15963 ...
 $ V95 : num 0.2804 -0.1910 0.1693 0.2661 0.0609 ...
 $ V96 : num 0.290 -0.238 0.233 0.287 0.147 ...
 $ V97 : num 0.4559 -0.4030 0.0401 0.0264 -0.0420 ...
 $ V98 : num -1.64 -1.58 -1.15 -1.90 -1.48 ...
 $ V99 : num -1.47 -1.47 -1.19 -2.30 -1.85 ...
 $ V100: num -1.350 -1.517 -0.362 -2.072 -1.323 ...
 $ V101: num -1.070 -0.450 -1.064 -1.175 -0.453 ...
 $ V102: num -1.038 -0.183 -0.948 -1.094 -0.355 ...
 $ V103: num -1.093 -0.215 -1.019 -1.205 -0.399 ...
 $ V104: num -0.9914 0.0897 -0.1980 -0.1433 -0.2038 ...
 $ V105: num -2.168 -0.535 -0.850 -1.161 -2.329 ...
 $ V106: num 1.00 1.00 1.00 1.00 1.00 ...
 $ V107: num 0.261 0.119 0.199 0.248 0.214 ...
 $ V108: num 0.2236 -0.0521 0.1689 0.2283 0.1619 ...
 $ V109: num 0.247 -0.127 0.128 0.198 0.159 ...
 $ V110: num 0.22516 -0.25404 -0.10006 -0.00692 0.00511 ...
 $ V111: num -0.994 -0.720 -0.796 -1.127 -0.684 ...
 $ V112: num -0.707 -0.188 -0.492 -1.176 -0.445 ...
 $ V113: num -0.535 0.150 -0.566 -0.864 -0.183 ...
 $ V114: num -0.636 -0.296 -0.812 -1.070 -0.365 ...
 $ V115: num -0.596 -0.263 -0.755 -0.964 -0.318 ...
 $ V116: num -0.6744 -0.0153 -0.0433 -0.0560 -0.2140 ...
 $ V117: num -1.195 -0.316 -0.298 -0.489 -1.175 ...
 $ V118: num 0.974 0.974 0.974 0.974 0.974 ...
 $ V119: num 0.98 0.98 0.98 0.98 0.98 ...
 $ V120: num 1.01 1.01 1.01 1.01 1.01 ...
 $ V121: num 0.00896 0.02651 -0.04939 0.00899 -0.08663 ...
 $ V122: num 0.0738 -0.1606 -0.1370 -0.1215 -0.2073 ...
 $ V123: num 0.0198 -0.3605 -0.2717 -0.1872 -0.3372 ...
 $ V124: num 0.16734 -0.41323 -0.10217 -0.05534 0.00103 ...
 $ V125: num -0.601 -0.845 -0.541 -0.669 -0.304 ...
 $ V126: num -0.803 -1.478 -1.113 -1.634 -1.208 ...
 $ V127: num -0.201 -1.387 -1.049 -1.007 -0.366 ...
 $ V128: num -0.0721 -0.3654 -1.3118 -1.6504 0.2691 ...
 $ V129: num -0.0444 -0.3489 -1.2789 -1.5970 0.2938 ...
 $ V130: num 2.52 1.28 1.79 2.53 4.80 ...
 $ V131: num 3.572 0.769 2.283 2.694 4.399 ...
 $ V132: num 0.379 0.295 0.352 0.541 0.373 ...
 $ V133: num 0.401 0.264 0.488 0.728 0.554 ...
 $ V134: num 0.859 0.214 0.572 0.801 0.683 ...
 $ V135: num 0.367 0.149 1.021 1.161 0.480 ...
 $ V136: num 0.2451 0.0938 0.7866 1.1074 0.2471 ...
 $ V137: num 0.290 0.357 0.933 1.231 0.353 ...
 $ V138: num 0.238 0.343 0.922 1.188 0.320 ...
 $ V139: num -0.00656 1.05492 0.84693 1.45898 2.93747 ...
 $ V140: num -9.64 -2.58 -2.16 -3.73 -9.33 ...
 $ V141: Factor w/ 88 levels "1001","1002",..: 59 59 59 59 59 59 55 78 7 73 ...
 $ V142: Factor w/ 3 levels "1","2","3": 3 3 3 3 3 3 3 3 3 3 ...
> From: WeiWei Shi
>
> However, when I used my small-sampled data as below, the program seems
> to take forever.

[snip]

> $ V141: Factor w/ 88 levels "1001","1002",..: 59 59 59 59 59
> 59 55 78 7 73 ...

I'd bet this is the problem. There are 2^(88-1) - 1 possible ways to split a factor with 88 levels into two groups. It will work on those splits till the cows come home...

I'd suggest getting rid of that variable, or collapsing its levels into something more reasonable. The CART book describes a heuristic shortcut that tests only n-1 splits for a factor with n levels, but I believe that only works for 2-class problems, if I'm not mistaken.

Andy
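To put a number on that estimate, here is a small sketch in base R. The helper names `n_binary_splits` and `levels_per_factor`, and the toy data frame `demo`, are made up for this illustration and are not part of rpart; the first simply evaluates the 2^(n-1) - 1 formula quoted above.

```r
# Hypothetical helper (not part of rpart): number of ways to divide
# n unordered factor levels into two non-empty groups.
n_binary_splits <- function(n_levels) 2^(n_levels - 1) - 1

n_binary_splits(3)    # 3: {a}|{b,c}, {b}|{a,c}, {c}|{a,b}
n_binary_splits(8)    # 127, the size of V2
n_binary_splits(20)   # 524287, the size of V61
n_binary_splits(88)   # roughly 1.5e26, the size of V141

# Hypothetical helper: levels per factor column of a data frame,
# largest first, to spot predictors like V141 before calling rpart().
levels_per_factor <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  sort(vapply(df[is_fac], nlevels, integer(1)), decreasing = TRUE)
}

# Toy data frame standing in for `x`.
demo <- data.frame(a = factor(c("x", "y", "z")),
                   b = 1:3,
                   c = factor(c("u", "v", "u")))
levels_per_factor(demo)
```

Running `levels_per_factor(x)` on the real data should put V141 at the top of the list. Separately, and as far as I can tell from how R expands `.` in a formula, writing the response as `x$V142 ~ .` leaves the column V142 among the predictors, because `.` only drops variables that are named on the left-hand side; `V142 ~ .` with `data = x` avoids that.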
WeiWei Shi
2005-Jan-25 20:59 UTC
Collapsing solution to the question discussed above: Re: [R] multi-class classification using rpart
Hi, All:

The variable encodes industries, such as computer science, electronics and so on, so its levels have no natural order.

My previous efforts indicate that grouping the levels according to domain knowledge decreases the accuracy. My current thought is to collapse them using some "distance" or "entropy" measure instead, since this is a classification problem. I am searching for papers that discuss this topic.

Does anyone have more ideas or pointers, such as papers?

Thanks.

Ed

On Tue, 25 Jan 2005 21:49:26 +0100, Uwe Ligges <ligges at statistik.uni-dortmund.de> wrote:
> WeiWei Shi wrote:
>
> > Hi, Andy:
> >
> > Thanks. It works after I removed the variable. I think I got a similar
> > problem when I used randomForest. And I am not sure if they were due
> > to the same reason.
> >
> > Practically and Unfortunately, that variable is very important to the
> > accuracy. I am wondering if there is another way besides collapsing
> > it. BTW, I remember you mentioned some alternative implementation to
> > randomForest (the author provided) to avoid the upper limit (32, if I
> > am correct) for the level of factor which can be used in the R
> > version's randomForest.
> >
> > Thanks for further assistance!
>
> So you *really* want it to be factor?! Thought it was a mistake not to
> have it numerical....
> Amazing! Maybe computers are sometimes even too fast these days.
>
> Uwe
>
> > Ed
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide! R-project.org/posting-guide.html
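One possible reading of the "distance" idea above, offered as a rough sketch and not as an established recipe from this thread: describe each level by the class proportions of the response within it, then cluster levels whose profiles are close. The data here are simulated, the names `industry` and `y` are placeholders for columns like V141 and V142, and the choices of Euclidean distance, complete linkage and k = 4 groups are arbitrary assumptions.

```r
# Toy data: a 12-level factor `industry` and a 3-class response `y`
# (placeholders, not the poster's data).
set.seed(1)
n <- 600
industry <- factor(sample(sprintf("I%02d", 1:12), n, replace = TRUE))
y <- factor(sample(1:3, n, replace = TRUE))

# Row i holds the class proportions of y within level i.
class_profile <- unclass(prop.table(table(industry, y), margin = 1))

# Group levels with similar class profiles (Euclidean distance,
# complete linkage); k = 4 groups is an arbitrary choice.
groups <- cutree(hclust(dist(class_profile)), k = 4)

# Map each observation's level to its group; this is the collapsed factor.
industry_collapsed <- factor(groups[as.character(industry)])

nlevels(industry_collapsed)
table(industry, industry_collapsed)
```

Because the grouping is derived from the response, it would presumably have to be built on the training data only and then applied unchanged to any test data; otherwise the accuracy estimate is likely to be optimistic.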
> From: Uwe Ligges
>
> WeiWei Shi wrote:
>
> > Hi, Andy:
> >
> > Thanks. It works after I removed the variable. I think I got a similar
> > problem when I used randomForest. And I am not sure if they were due
> > to the same reason.
> >
> > Practically and Unfortunately, that variable is very important to the
> > accuracy. I am wondering if there is another way besides collapsing
> > it. BTW, I remember you mentioned some alternative implementation to
> > randomForest (the author provided) to avoid the upper limit (32, if I
> > am correct) for the level of factor which can be used in the R
> > version's randomForest.
> >
> > Thanks for further assistance!
>
> So you *really* want it to be factor?! Thought it was a mistake not to
> have it numerical....
> Amazing! Maybe computers are sometimes even too fast these days.
>
> Uwe

[Uwe: Not sure if you meant to keep this off-list. If so, my most sincere apologies.]

Er... not really. Currently, randomForest (for classification) encodes a split on a categorical variable as the binary expansion of the set of levels that go to the left. Each such split is stored in a (4-byte) integer, hence the 32-level restriction. In the newer version of Breiman & Cutler's Fortran code, that restriction is removed by storing the entire indicator matrix (number of nodes by maximum number of levels, then by number of trees in the forest). The stand-alone Fortran code writes each tree to file as soon as it is grown, so it doesn't need to hold the entire forest in memory. The R version has no such luxury (if you can call it that).

For categorical variables with more than 10 categories, the new RF Fortran code randomly samples some number (say 512) of splits and picks the best among them. That's probably a good strategy for random forests, but may not be what one would do to grow a single tree.
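The two ideas above, storing the left-going levels as the bits of one number and sampling a limited number of random splits, can be sketched in base R as follows. This only illustrates the description in this message; it is not randomForest's or the Fortran code's actual implementation, and the function names `gini`, `encode_split` and `best_random_split` as well as the toy data are invented for the example.

```r
# Gini impurity of a vector of class labels (0 for an empty node).
gini <- function(y) {
  if (length(y) == 0) return(0)
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Encode "levels that go left" as one number: bit (k - 1) is set when
# level k goes left. With 32 bits available, at most 32 levels fit.
encode_split <- function(left_idx) sum(2^(left_idx - 1))

# Draw n_try random two-way partitions of the levels of factor x and
# keep the one with the lowest weighted Gini impurity.
best_random_split <- function(x, y, n_try = 512) {
  k <- nlevels(x)
  xi <- as.integer(x)
  best <- list(impurity = Inf, left = integer(0), code = NA_real_)
  for (i in seq_len(n_try)) {
    left_idx <- which(runif(k) < 0.5)
    # Skip draws that send every level to the same side.
    if (length(left_idx) == 0 || length(left_idx) == k) next
    go_left <- xi %in% left_idx
    imp <- (sum(go_left) * gini(y[go_left]) +
            sum(!go_left) * gini(y[!go_left])) / length(y)
    if (imp < best$impurity) {
      best <- list(impurity = imp, left = left_idx,
                   code = encode_split(left_idx))
    }
  }
  best
}

# Toy data: 15 levels, class "A" concentrated in the first five.
set.seed(1)
x <- factor(sample(letters[1:15], 500, replace = TRUE))
y <- factor(ifelse(x %in% letters[1:5], "A",
                   sample(c("B", "C"), 500, replace = TRUE)))

best <- best_random_split(x, y)
levels(x)[best$left]   # levels sent to the left child
best$code              # the same split written as a single number
```

With 15 levels the code stays well below 2^32; with 88 levels it would not, which is the restriction described above. The sampled split is only the best of the 512 tried, not necessarily the best overall.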
When growing a single tree on data containing categorical variables with a large number of categories, one should also be mindful that, because of the greedy nature of the algorithm, it will tend to split on variables with larger numbers of possible splits, even if those variables are less `informative'.

Andy
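A small simulation along these lines can be used to check that tendency. It is a sketch under stated assumptions: the response is pure noise, `x_num` and `x_fac` are invented predictors that are equally uninformative, and the sample size, the 10 levels and the 50 repetitions are arbitrary. The rpart arguments used (`method`, `control`, and `maxdepth`, `cp`, `xval` in `rpart.control`) are standard ones.

```r
library(rpart)

set.seed(42)
n_rep <- 50    # number of simulated data sets
n <- 300       # observations per data set
chosen <- character(n_rep)

for (r in seq_len(n_rep)) {
  d <- data.frame(
    # Response unrelated to either predictor.
    y     = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
    # Numeric noise: at most n - 1 candidate splits.
    x_num = rnorm(n),
    # 10-level noise factor: 2^9 - 1 = 511 candidate splits.
    x_fac = factor(sample(LETTERS[1:10], n, replace = TRUE))
  )
  # Allow a single split, with no complexity penalty and no cross-validation.
  fit <- rpart(y ~ x_num + x_fac, data = d, method = "class",
               control = rpart.control(maxdepth = 1, cp = 0, xval = 0))
  # Variable used at the root ("<leaf>" if no split was made).
  chosen[r] <- as.character(fit$frame$var[1])
}

table(chosen)
```

If the argument above holds, one would expect `x_fac` to be picked at the root more often than `x_num`, although neither carries any information about `y`.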