Hello,

Is there an accepted way to convey, for regression trees, something akin to R-squared?

I'm developing regression trees for a continuous y variable and I'd like to say how well they are doing. In particular, I'm analyzing the results of a simulation model having highly non-linear behavior, and asking what characteristics of the inputs are related to a particular output measure. I've got a very large number of points: n=4000. I'm not able to do a model sensitivity analysis because of the large number of inputs and the model run time.

I've been googling around both on the archives and on the rest of the web for several hours, but I'm still having trouble getting a firm sense of the state of the art. Could someone help me to quickly understand what strategy, if any, is acceptable to say something like "The regression tree in Figure 3 captures 42% of the variance"? The target audience is readers who will be interested in the subsequent verbal explanation of the relationship, but only once they are comfortable that the tree really does capture something. I've run across methods to say how well a tree does relative to a set of trees on the same data, but that doesn't help much unless I'm sure the trees in question are really capturing the essence of the system.

I'm happy to be pointed to a web site or to a thread I may have missed that answers this exact question.

Thanks very much,

Jeff

------------------------------------------
Prof. Jeffrey Cardille
jeffrey.cardille@umontreal.ca

Département de Géographie, Université de Montréal
professeur adjoint / assistant professor
C.P. 6128, Succursale Centre-ville, Montréal, QC, H3C 3J7
Office: Salle 440, Pavillon Strathcona, 520, chemin de la Côte-Ste-Catherine, Montreal, QC H2V 2B8
Tel: (514) 343-8003
Web: http://www.geog.umontreal.ca/geog/cardille.htm
Availability calendar: http://jeffcardille.googlepages.com/udem
All--

Apologies if I have inadvertently posted this message twice; I just joined the list today, after trying to post once. I've seen similar postings but nothing that's an unequivocal answer. Any help would be greatly appreciated!

Thanks-
Jeff
Prof. Jeffrey Cardille wrote:
> Hello,
>
> Is there an accepted way to convey, for regression trees, something
> akin to R-squared?
> [...]

Ye (below) has a method to get a nearly unbiased estimate of R^2 from recursive partitioning. In his examples the result was similar to using the formula for adjusted R^2 with regression degrees of freedom equal to about 3n/4. You can also use something like 10-fold cross-validation repeated 20 times to get a fairly precise and unbiased estimate of R^2.

Frank

@ARTICLE{ye98mea,
  author  = {Ye, Jianming},
  year    = 1998,
  title   = {On measuring and correcting the effects of data mining and model selection},
  journal = JASA,
  volume  = 93,
  pages   = {120-131},
  annote  = {generalized degrees of freedom;GDF;effective degrees of freedom;data mining;model selection;model uncertainty;overfitting;nonparametric regression;CART;simulation setup}
}

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
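The repeated cross-validation idea mentioned above can be sketched roughly as follows. This is only an illustration under stated assumptions, not Ye's generalized-degrees-of-freedom method and not any particular package's implementation: the function names `fit_stump` and `repeated_cv_r2` are hypothetical, a one-split "stump" on a single predictor stands in for a full regression tree, and the data are simulated purely for the demonstration.

```python
import random

def fit_stump(x, y):
    """Fit a one-split regression tree (stand-in for a full tree) on one predictor."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    total = sum(v for _, v in pairs)
    best_gain, best = -1.0, None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal x values
        right_sum = total - left_sum
        # Minimizing the within-leaf SSE is equivalent to maximizing this quantity.
        gain = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if gain > best_gain:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_gain, best = gain, (cut, left_sum / i, right_sum / (n - i))
    if best is None:  # all x identical: predict the overall mean
        mean = total / n
        return lambda xi: mean
    cut, left_mean, right_mean = best
    return lambda xi: left_mean if xi < cut else right_mean

def repeated_cv_r2(x, y, k=10, repeats=20, seed=0):
    """Average out-of-fold R^2 over `repeats` random k-fold partitions."""
    rng = random.Random(seed)
    n = len(y)
    y_bar = sum(y) / n
    ss_tot = sum((v - y_bar) ** 2 for v in y)
    scores = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        pred = [0.0] * n
        for f in range(k):
            test = idx[f::k]  # every k-th shuffled index forms one fold
            held_out = set(test)
            train = [i for i in idx if i not in held_out]
            model = fit_stump([x[i] for i in train], [y[i] for i in train])
            for i in test:
                pred[i] = model(x[i])  # prediction from a model that never saw i
        ss_res = sum((y[i] - pred[i]) ** 2 for i in range(n))
        scores.append(1.0 - ss_res / ss_tot)
    return sum(scores) / len(scores)

# Simulated data (an assumption for illustration): a step function plus noise.
data_rng = random.Random(1)
x = [data_rng.random() for _ in range(400)]
y = [(1.0 if xi > 0.5 else 0.0) + data_rng.gauss(0.0, 0.5) for xi in x]

cv_r2 = repeated_cv_r2(x, y, k=10, repeats=20, seed=0)
print(f"repeated 10-fold CV R^2: {cv_r2:.3f}")
```

Because every prediction comes from a model that did not see the corresponding observation, the resulting figure is not inflated by the tree's own search over split points in the way an in-sample R^2 is.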
On Sat, 5 May 2007, Prof. Jeffrey Cardille wrote:

> Hello,
>
> Is there an accepted way to convey, for regression trees, something
> akin to R-squared?

Why not use R-squared itself for your purposes? Just get the fitted values from however you do the fit, and compute R-squared from the basic formula (the one which compares with an intercept only: all regression trees extend that model).

Now, R-squared has lots of problems of its own (to the extent that it is only mentioned as something to avoid in some statistical texts) and these are worse here as the number of parameters fitted is unquantifiable. But as a factual summary it does mean what you quote. Whether any model of comparable complexity would also explain 42% of the variance is a much harder question.

(Small anecdote: one of my first experiences of this was a psychologist who had funded a research project to relate personality/intelligence tests to 20-odd measurements on facial profiles by (stepwise) linear regression. My contribution was to point out that the R^2 produced was less for every one of the responses than one would expect on average for the same number of random unrelated regressors. To be systematically worse than such a straw man takes some achieving, and I have always suspected a bug in the fitting software.)

> [...]
> Could someone help me to quickly
> understand what strategy, if any, is acceptable to say something like
> "The regression tree in Figure 3 captures 42% of the variance"? The
> target audience is readers who will be interested in the subsequent
> verbal explanation of the relationship, but only once they are
> comfortable that the tree really does capture something.
> [...]
>
> Thanks very much,
>
> Jeff

--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
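The "basic formula" for R-squared described in this reply compares the residual sum of squares of the fitted values with that of an intercept-only model. A minimal sketch, assuming the observed responses and the tree's fitted values are already available as two equal-length lists (the function name `r_squared` and the toy numbers are illustrative, not taken from the thread):

```python
def r_squared(y, fitted):
    """1 - SS_res / SS_tot, where SS_tot is the fit of an intercept-only model."""
    n = len(y)
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Toy example: a two-leaf tree predicts the mean of each leaf.
y = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 1.5, 3.5, 3.5]
print(r_squared(y, fitted))  # → 0.8
```

As the reply cautions, this is a purely descriptive, in-sample summary; it says nothing about how much variance a tree of comparable complexity would appear to explain on unrelated inputs.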