I've been using the rpart package with R 1.3.0 for Windows to produce simple classification trees for some measurement data from paleontological specimens. Both the rpart documentation and the output confirm that the program produces splits on continuous data that leave "holes" in the data. It is probably of little practical importance, but is there a reason why the binary splits are constructed in the form (e.g.):

    x7 < 37
    x7 > 37

as opposed to the actual CART (tm) methodology of:

    x7 <= 37
    x7 > 37

It seems to me that if one were to use rpart to classify an unknown case where x7 = 37, the program wouldn't actually know which way to move the case.

I've read through the rpart technical report, the rpart user's manual, and the rpart help file, and I see this practice illustrated, but I don't find any explanation for this minor (and probably trivial) departure from the methodology illustrated in the CART program and in the Breiman et al. book.

====================
Dr. Marc R. Feldesman
Professor and Chairman
Anthropology Department
Portland State University
1721 SW Broadway
Portland, Oregon 97201
email: feldesmanm at pdx.edu
phone: 503-725-3081
fax: 503-725-3905
http://web.pdx.edu/~h1mf
PGP Key Available On Request
=====================
"Anyway, no drug, not even alcohol, causes the fundamental ills of society. If we're looking for the source of our troubles, we shouldn't test people for drugs, we should test them for stupidity, ignorance, greed and love of power." P.J. O'Rourke

Powered by Optiplochoerus and Windows 2000 (scary isn't it?)

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !) To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
I haven't looked that carefully at rpart, but in tree the potential splits are midpoints between actual data values. So if x7 had values of 36 and 38, but not 37, a valid split would be < 37 and > 37.

In reply to:
From: Marc Feldesman <feldesmanm at pdx.edu>
Sent by: owner-r-help at stat.math.ethz.ch
To: r-help at stat.math.ethz.ch
cc: therneau at mayo.edu
Date: 07/12/2001 09:02
Subject: [R] rpart puzzle

[original message snipped]
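To make the midpoint idea concrete, here is a minimal sketch in Python rather than R. It is not code from tree or rpart; the function name candidate_splits and the sample values are made up for illustration, and it only assumes what the reply says, namely that candidate cut points are midpoints between adjacent distinct data values.

```python
def candidate_splits(values):
    """Midpoints between consecutive distinct sorted values (illustrative only)."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


# Hypothetical x7 measurements: 36 and 38 occur, 37 does not.
x7 = [36, 38, 38, 41, 36]
splits = candidate_splits(x7)
print(splits)  # [37.0, 39.5]

# No training case sits exactly on a cut point, so "< 37" and "> 37"
# together cover every case the tree was grown on.
print(any(v in splits for v in x7))  # False
```

Under that assumption the "hole" at exactly 37 is empty in the training data, and it only matters when a new case happens to land on the cut point.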
I amend my previous observation. After constructing a very careful example, I find that rpart works exactly the opposite of CART. In the following split:

    x7 < 37   go left
    x7 > 37   go right

if x7 = 37 the case appears to go right. In other words, the split appears to be of the form:

    x7 < 37
    x7 >= 37

which is precisely the opposite of the form that CART(tm) uses.

Again, I'm not sure what practical difference this makes, except that when a case has a primary splitter value that falls in an (apparently) excluded part of the domain, the case goes with the "no" answer to the question. (This is, of course, obvious if typical 'short-circuit' evaluation is used: because the value fails the first test (x7 < 37), it must go with the alternative.) In CART, the case goes with the "yes" answer. I don't know what tree does, since I don't use it.

In my test example, rpart's behavior results in a misclassification. Had the test result gone the other way, the case would have been classified correctly. Walking the tree demonstrates this quite easily. Also, changing the value of 37 to 36.9999 produces the correct classification. (Now I *do* realize that I'm working with floating point numbers, and so "real" 37 may not truly equal "integer" 37, which may account for *this* anomaly.)

Did I have the misfortune to pull an "unknown" with a major primary splitter occupying an ambiguous part of the domain, or is this a more significant problem?
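The two routing conventions compared above can be put side by side in a few lines. This is only a sketch in Python of the behavior described in this message, not rpart's or CART's actual code; the function names route_strict and route_cart are invented for the illustration.

```python
def route_strict(x, cut):
    """Rule observed for rpart in this thread: x < cut goes left, anything else right."""
    return "left" if x < cut else "right"


def route_cart(x, cut):
    """Rule attributed to CART in this thread: x <= cut goes left, anything else right."""
    return "left" if x <= cut else "right"


# Only a value sitting exactly on the cut point is routed differently.
for x in (36.9999, 37, 37.0001):
    print(x, route_strict(x, 37), route_cart(x, 37))
```

With a cut at 37, the value 36.9999 goes left and 37.0001 goes right under both rules, while 37 itself goes right under the strict rule and left under the CART-style rule, which matches the change in classification reported when 37 was replaced by 36.9999.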