apeshifter
2015-Sep-14 10:11 UTC
[R] Trees (and Forests) with packages 'party' vs. 'partykit': Different results
Dear all,

I'm currently exploring a dataset with the help of conditional inference trees (still very much a beginner with this technique and with logistic regression methods as a whole, t.b.h.), since they explained more variation in my dataset than a binary logistic regression with glm. I started out with the party package, but after a while I ran into the 'updated' partykit package and tried this out, too.

Now, the strange thing is that the two trees look quite different - actually even the very first split is different.

So I did some research and came across the 'forest' concept. However, it seems that the varImp function does not yet work in the partykit implementation, which raises the question of how I should evaluate the partykit forest: how can I find out whether the same variables are important in the forest as in my partykit tree? Is there some way to do this, or some other solution for this problem? I'd prefer to continue with the partykit implementation of ctree, since it allows more settings for the final plot, which I'd need to get the final (large) plot into a readable form.

Related to this project, I'd also like to give statistics for the overall model, e.g. overall significance, Nagelkerke's R², a C-value. After a 'regular' binary logistic regression, I would use the lrm function to get these values, but I am unsure whether it would be correct to also apply this method to my tree data.

Any help would be greatly appreciated!

-- Christopher

--
View this message in context: http://r.789695.n4.nabble.com/Trees-and-Forests-with-packages-party-vs-partykit-Different-results-tp4712214.html
Sent from the R help mailing list archive at Nabble.com.
Achim Zeileis
2015-Sep-14 14:52 UTC
[R] Trees (and Forests) with packages 'party' vs. 'partykit': Different results
Christopher,

thanks for your interest.

> I'm currently exploring a dataset with the help of conditional inference
> trees (still very much a beginner with this technique & log. reg.
> methods as a whole t.b.h.), since they explained more variation in my
> dataset than a binary logistic regression with glm. I started out with
> the party package, but after a while I ran into the 'updated'
> partykit package and tried this out, too.

If you want to use individual trees (as opposed to forests), then the "partykit" package is recommended because it contains much improved re-implementations of ctree() and mob() as well as the mob() convenience interfaces lmtree() and glmtree(). For forests see below.

> Now, the strange thing is that both trees look quite different -
> actually even the very first split is different.

This might be due to several partitioning variables being associated with tiny p-values in the root node. The re-implementation in partykit internally computes with log-p-values and hence should be numerically more stable. In the old implementation it could happen that, among several highly significant variables, the first was always chosen because the p-values were essentially indistinguishable for the computer.

If you think that this is not the problem, then please contact the package maintainer with a reproducible example. Except for bug fixes like the one above, the trees grown by partykit::ctree and party::ctree should be the same.

> So I did some research and came across the 'forest' concept. However, it
> seems that the varImp function does not yet work in the partykit
> implementation,

Correct. While the ctree() implementation in partykit is better than that in party, the same is _not_ true for cforest(). The new partykit::cforest is currently still a basic implementation which doesn't offer as many features as the party::cforest implementation. More work is needed, especially for variable importance measures and different kinds of predictions.

> which raises the question for me how I should evaluate the partykit
> forest - how can I find out whether the variables are important in the
> forest as in my partykit tree? Is there some way to do this or some
> other solution for this problem? I'd prefer to continue the partykit
> implementation of ctree, since it allows more settings for the final
> plot, which I'd need to get the final (large) plot into a readable form.
>
> Related to this project, I'd also like to give statistics for the overall
> model, e.g. overall significance, Nagelkerke's R², a C-value. After a
> 'regular' binary log. reg., I would use the lrm function to get these
> values, but I am unsure whether it would be correct to also apply this
> method to my tree data.

Overall significance is difficult because you have done model selection when growing the tree, so the usual tests would not account for that selection. As for pseudo R-squared or information criteria etc., it is relatively easy to compute these "by hand" based on the observed and fitted responses. An example for this is provided at:
http://stackoverflow.com/questions/29524670/how-to-find-the-the-deviance-of-an-as-party-object-converted-from-rpart-tree-in/29693223#29693223
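
To illustrate what computing such measures "by hand" might look like, here is a minimal sketch that is not taken from the linked answer. It assumes a binary 0/1 response y and fitted probabilities p; with a partykit tree on a two-level factor response, p could presumably be obtained as predict(tree, type = "prob")[, 2]. The helper names loglik_binom, nagelkerke_r2 and c_value, as well as the toy data, are hypothetical.

```r
# Illustrative sketch: Nagelkerke's R-squared and the C-value computed
# from observed 0/1 responses y and fitted probabilities p.

# Binomial log-likelihood; probabilities are clipped to avoid log(0),
# which would otherwise occur in pure terminal nodes.
loglik_binom <- function(y, p, eps = 1e-12) {
  p <- pmin(pmax(p, eps), 1 - eps)
  sum(y * log(p) + (1 - y) * log(1 - p))
}

# Nagelkerke's R-squared: Cox-Snell R-squared divided by its maximum.
nagelkerke_r2 <- function(y, p) {
  n <- length(y)
  ll1 <- loglik_binom(y, p)                 # fitted model
  ll0 <- loglik_binom(y, rep(mean(y), n))   # intercept-only model
  cox_snell <- 1 - exp(2 * (ll0 - ll1) / n)
  cox_snell / (1 - exp(2 * ll0 / n))
}

# C-value (area under the ROC curve) via the rank-sum formula;
# tied probabilities receive average ranks.
c_value <- function(y, p) {
  r <- rank(p)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical toy data: a tree with two terminal nodes whose fitted
# probabilities are 0.25 and 0.75.
y <- c(0, 0, 0, 1, 0, 1, 1, 1)
p <- c(0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75)

r2 <- nagelkerke_r2(y, p)
cv <- c_value(y, p)
print(c(nagelkerke_r2 = r2, c_value = cv))
```

Note that these are in-sample measures: because the tree was selected on the same data, they will tend to be optimistic, which is the same caveat as for overall significance.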
apeshifter
2015-Sep-21 13:27 UTC
[R] Trees (and Forests) with packages 'party' vs. 'partykit': Different results
Achim,

thank you very much for your help, this really cleared up a number of issues.

As for the differences in results between the party and partykit implementations of ctree, I guess that the situation is indeed as you assumed. Four out of five variables have p-values <2.2e-16. (However, it is not the first of these variables that is selected but the one in the second column.) I will just continue using the newer implementation.

-- Christopher

--
View this message in context: http://r.789695.n4.nabble.com/Trees-and-Forests-with-packages-party-vs-partykit-Different-results-tp4712214p4712539.html
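
The numerical point discussed in this thread, that several tiny p-values can become indistinguishable while their logarithms remain distinct, can be illustrated with a small base-R sketch. The z statistics and variable names below are hypothetical and not taken from the data in this thread, and the normal distribution is used only as a stand-in for the test statistics that ctree() actually computes.

```r
# Illustrative sketch: ordinary p-values vs. log-p-values for two
# hypothetical, very large test statistics.
z <- c(x1 = 40, x2 = 50)

p     <- pnorm(z, lower.tail = FALSE)                # ordinary p-values
log_p <- pnorm(z, lower.tail = FALSE, log.p = TRUE)  # log-p-values

print(p)                 # both underflow to exactly 0 in double precision
print(which.min(p))      # a tie, so the first variable is returned
print(log_p)             # finite and clearly different
print(which.min(log_p))  # the variable with the larger statistic

# The "<2.2e-16" shown in printed output is a display threshold
# (the machine epsilon), not the actual p-value:
print(.Machine$double.eps)
```

On the p-value scale the two variables are tied and the first one wins by default, whereas on the log scale the variable with the stronger association is selected. This is only meant to show the mechanism; whether it explains a particular pair of trees has to be checked on the actual data.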