Carlos J. Gil Bellosta
2005-Dec-07 19:10 UTC
[R] Are minbucket and minsplit rpart options working as expected?
Dear r-list:

I am using rpart to build a tree on a dataset. First I obtain a perhaps too large tree:

> arbol.bsvg.02 <- rpart(formula, data = bsvg, subset = grp.entr, control = rpart.control(cp = 0.001))
> arbol.bsvg.02
n= 100000

node), split, n, loss, yval, (yprob)
      * denotes terminal node

  1) root 100000 6657 0 (0.93343000 0.06657000)
    2) meses_antiguedad_svg>=10.5 73899 3658 0 (0.95050001 0.04949999)
      4) eor_n1_gns< 1.5 63968 2807 0 (0.95611868 0.04388132)
        8) tarifa_gas=31,32,33,34 63842 2771 0 (0.95659597 0.04340403) *
        9) tarifa_gas=NO 126 36 0 (0.71428571 0.28571429)
          18) tipo_mercado=ESP,N/A 90 10 0 (0.88888889 0.11111111) *
          19) tipo_mercado=NE ,SAH,SAV 36 10 1 (0.27777778 0.72222222) *
      5) eor_n1_gns>=1.5 9931 851 0 (0.91430873 0.08569127)
        10) sn_calef>=0.5 8390 546 0 (0.93492253 0.06507747) *
        11) sn_calef< 0.5 1541 305 0 (0.80207657 0.19792343)
          22) tarifa_gas=31,NO 1134 141 0 (0.87566138 0.12433862) *
          23) tarifa_gas=32 407 164 0 (0.59705160 0.40294840)
            46) cons_gas_delta_1< 6997 196 51 0 (0.73979592 0.26020408) *
            47) cons_gas_delta_1>=6997 211 98 1 (0.46445498 0.53554502)
              94) meses_antiguedad_svg>=23.5 134 54 0 (0.59701493 0.40298507)
                188) altitud< 312 61 16 0 (0.73770492 0.26229508) *
                189) altitud>=312 73 35 1 (0.47945205 0.52054795)
                  378) back_office>=1.5 39 12 0 (0.69230769 0.30769231) *
                  379) back_office< 1.5 34 8 1 (0.23529412 0.76470588) *
              95) meses_antiguedad_svg< 23.5 77 18 1 (0.23376623 0.76623377) *
    3) meses_antiguedad_svg< 10.5 26101 2999 0 (0.88510019 0.11489981)
      6) sn_calef>=0.5 20129 1853 0 (0.90794376 0.09205624) *
      7) sn_calef< 0.5 5972 1146 0 (0.80810449 0.19189551)
        14) tarifa_gas=31 4406 664 0 (0.84929641 0.15070359) *
        15) tarifa_gas=32,NO 1566 482 0 (0.69220945 0.30779055)
          30) eor_n1_gns< 0.5 1168 306 0 (0.73801370 0.26198630) *
          31) eor_n1_gns>=0.5 398 176 0 (0.55778894 0.44221106)
            62) back_office>=0.5 148 35 0 (0.76351351 0.23648649) *
            63) back_office< 0.5 250 109 1 (0.43600000 0.56400000) *

So I decide not to consider branches with fewer than 1000
observations, 1% of the original number of observations. Therefore, following the rpart.control help page, I set minbucket=1000. However,

> arbol.bsvg.02
n= 100000

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 100000 6657 0 (0.9334300 0.0665700) *

and I get an "empty" tree, even though the original tree had branches with more than 1000 observations. Something similar happens if I set minsplit (or both minbucket and minsplit) to a similar value: I end up with the same root-only tree with no branches.

Am I misreading something? Can anybody shed some light on the correct usage of minbucket (and/or minsplit) for me?

Sincerely,

Carlos J. Gil Bellosta
http://www.datanalytics.com