Carlos J. Gil Bellosta
2009-Jun-19 18:35 UTC
[R] Recursive partitioning algorithms in R vs. alia
Dear R-helpers,

I had a conversation with someone working in the "business intelligence" department of a major Spanish bank. They rely on recursive partitioning methods to rank customers according to certain criteria, and they use both SAS EM and Salford Systems' CART. I have used the rpart package in the past, but I could not offer any kind of feature comparison, as I have no access to an installation of either proprietary product.

Does anybody have experience with them? Is there any public benchmark available? Is there any very good --although purely technical-- reason to pay hefty software licences? How would the algorithms implemented in rpart compare to those in SAS and/or CART?

Best regards,

Carlos J. Gil Bellosta
http://www.datanalytics.com
"Carlos J. Gil Bellosta" <cgb at datanalytics.com> wrote> >I had a conversation with a guy working in a "business intelligence" >department at a major Spanish bank. They rely on recursive partitioning >methods to rank customers according to certain criteria. > >They use both SAS EM and Salford Systems' CART. I have used package R >part in the past, but I could not provide any kind of feature comparison >or the like as I have no access to any installation of the first two >proprietary products. > >Has anybody experience with them? Is there any public benchmark >available? Is there any very good --although solely technical-- reason >to pay hefty software licences? How would the algorithms implemented in >rpart compare to those in SAS and/or CART? > >Best regards, >Hi I've used CART and a few different R packages - tree, rpart, rparty. I can't comment on the algorithms - I'm not qualified to judge, and I think the ones in CART are proprietary. One big difference is that the output from CART is beautiful with minimal fuss. Presentation quality, multicolor, multipage tree diagrams with the default settings. Another was speed - I am not sure I was doing everything right in R, but for one problem I had that had about 500 variables, R was quite slow, and CART blitzed through it. Another big difference is the price. I got CART for a reasonable fee, as I was working at a university, but the commercial price is very high (well into the thousands of dollars, if I recall correctly). Peter Peter L. Flom, PhD Statistical Consultant www DOT peterflomconsulting DOT com
In terms of richness of features and the ability to handle large data sets (which is normal in banking), SAS EM should be on top of the others. However, it is not cheap. In terms of algorithms, the split procedure in SAS EM can do CHAID, CART and C4.5, if I remember correctly.

On Fri, Jun 19, 2009 at 2:35 PM, Carlos J. Gil Bellosta <cgb at datanalytics.com> wrote:
> [original post quoted above]

--
WenSui Liu
Blog: statcompute.spaces.live.com
Tough Times Never Last. But Tough People Do. - Robert Schuller
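As a small point of reference on the R side: rpart does not implement CHAID or C4.5, but for classification trees it does let you choose between the default Gini index and information-gain (entropy) splitting via the parms argument. A minimal sketch, using the kyphosis data that ships with rpart:

library(rpart)

## Default splitting criterion: Gini index
fit_gini <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class", parms = list(split = "gini"))

## Information-gain (entropy) splitting
fit_info <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class", parms = list(split = "information"))

## Compare the resulting trees
print(fit_gini)
print(fit_info)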
jude.ryan at ubs.com
2009-Jun-22 22:56 UTC
[R] Recursive partitioning algorithms in R vs. alia
I have used all three packages for decision trees (SAS/EM, CART, and R). As another user on the list commented, the algorithms CART uses are proprietary. I also understand that, because the algorithms are proprietary, the decision tree you get from SAS is based on a "slightly different" algorithm so as not to violate intellectual property rights.

When I first started using R (rpart) I benchmarked it, in terms of the results obtained for my particular problem at the time, against Salford Systems CART. R gave me an identical tree, with the splitting value differing only in the 2nd or 3rd decimal place, from what I recall. I did not have SAS/EM at that particular company and so could not benchmark it. Salford Systems CART does have additional splitting criteria such as "twoing", but, again, these may be of value only in certain types of problems. The splitting criteria found in R are good enough.

I do have SAS/EM right now but prefer R to SAS/EM, since R can be programmed and SAS/EM cannot. This may not be relevant for decision trees, but for neural networks, for example, if I want to build hundreds of neural networks (since there are no variable selection methods for neural networks) with different predictors and different numbers of neurons, I can do this easily in R but cannot do it in SAS/EM. SAS/EM does have a variable selection node, but it is independent of the neural network node, so, from what I understand, you have to select the variables first and then pass them to the neural network node.

In general, you get "prettier" output with CART and SAS/EM for trees. However, there are packages in R that can give you prettier output than rpart does. One GUI that you may want to explore, which works with R, is Rattle. It builds trees, neural networks, boosted models, and more, and you can see the generated R code as well.

In terms of handling large volumes of data, SAS/EM is probably the best. However, if you have a 64-bit operating system with lots of RAM, and use random sampling, R should suffice. It is debatable whether extra features like pretty output and variable importance are worth the huge cost of those products, unless you really need them. With R you can do what you want, and that is build a good tree. From what I have read, variable importance measures can be biased, as they are affected by factors such as multicollinearity and variables with many categories, so their usefulness is questionable (however, end users may love them).

SAS/EM is by far the most expensive product, and Salford Systems CART is pretty expensive as well. So, depending on your needs, R may be good enough or the best, because you can program it, and the latest methodologies will always be implemented in R first. For comparisons of the programming capabilities of SAS (macros) versus R, you may want to look at what Frank Harrell and Terry Therneau (who wrote rpart) have to say. Both are experts in SAS and R.

Hope this helps.

Jude

Carlos wrote:
> [original post quoted above]
___________________________________________
Jude Ryan
Director, Client Analytical Services
Strategy & Business Development
UBS Financial Services Inc.
1200 Harbor Boulevard, 4th Floor
Weehawken, NJ 07086-6791
Tel. 201-352-1935
Fax 201-272-2914
Email: jude.ryan at ubs.com
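To make the rpart workflow described above concrete, here is a minimal sketch of fitting a classification tree, inspecting the cross-validated complexity table, and pulling out the variable importance scores (which, per the caveats above, should be read with care). It again uses rpart's bundled kyphosis data as a stand-in for real customer data, and assumes a reasonably current version of rpart, whose fitted objects carry a variable.importance component:

library(rpart)

## Fit a classification tree
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class")

## Cross-validated complexity table, used to decide how far to prune
printcp(fit)

## Variable importance scores (a named numeric vector on the fitted object)
fit$variable.importance

## Base-graphics tree plot; packages such as rpart.plot, or the Rattle GUI
## mentioned above, produce prettier output than these defaults
plot(fit, uniform = TRUE)
text(fit, use.n = TRUE)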
A point of history: both the commercial CART program and the rpart() function are based on the book Classification and Regression Trees (Breiman, Friedman, Olshen, and Stone, 1984). As a reader/commentator on one of the early drafts, I got to know the material well.

CART started as a large Fortran program written by Jerry Friedman, which was the testing ground for the ideas in the book. I had the code at one time and made some modifications to it, but found it too frustrating to go very far with. Fortran is just too clumsy for a recursive task, and Jerry's ability to hold umpteen variables in his head at once is greater than mine -- the Fortran was a large monolithic block. Salford Systems acquired rights to that code; I don't know whether any of the original lines remain in their product. I had lots of conversations with their main programmer (15-20 years ago now) about methods for speeding it up, mainly an interesting problem in optimal indexing.

When rpart was first written, its output agreed with CART almost entirely. The only major difference was in surrogates: rpart picks the surrogate with the largest number of agreements, while CART picked the one with the greatest percentage agreement. This means that rpart favors variables with fewer missing values. Since that point in time both codes have evolved. I haven't had time to do important work on rpart in over a decade. It's not surprising that the graphics and display are behind the curve; what's more surprising is that it still endures.

Rpart is called "rpart" because the authors had trademarked the term "CART" for their program. It was the best alternative name that I could come up with at the time. I find it amusing that one consequence of that choice is that I now see "recursive partitioning" far more often than "CART" as the generic label for tree-based methods.

Terry T
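For readers who want to experiment with the surrogate behavior Terry describes, current versions of rpart expose it through the surrogatestyle argument of rpart.control(): 0 (the default) ranks surrogates by the total number of correct agreements, while 1 ranks them by percent agreement over non-missing values, which is closer to CART's rule. A minimal sketch; the artificially introduced missing values below are purely illustrative:

library(rpart)

## Copy the bundled kyphosis data and introduce some missing values in one
## predictor so that surrogate splits actually come into play
dat <- kyphosis
set.seed(2)
dat$Start[sample(nrow(dat), 15)] <- NA

## Default: surrogates ranked by total number of agreements
## (the rule that favors variables with fewer missing values)
fit_count <- rpart(Kyphosis ~ Age + Number + Start, data = dat,
                   method = "class",
                   control = rpart.control(surrogatestyle = 0))

## CART-style: surrogates ranked by percent agreement over non-missing values
fit_pct <- rpart(Kyphosis ~ Age + Number + Start, data = dat,
                 method = "class",
                 control = rpart.control(surrogatestyle = 1))

## summary() lists the competing and surrogate splits chosen at each node
summary(fit_count)
summary(fit_pct)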