Dear all,

I would like to perform a regression tree analysis on a dataset with multicollinear variables (as climate variables often are). The questions I am asking are:

1. Is there any particular statistical problem in using multicollinear variables in a regression tree?

2. Multicollinear variables should appear as alternate splits. Would it be more accurate to present these alternate splits in the results of the analysis, or to apply a variable selection or reduction procedure before the regression tree?

Thank you in advance,

Jean-Noel Candau
INRA - Unité de Recherches Forestières Méditerranéennes
Avenue A. Vivaldi
84000 AVIGNON
Tel: (33) 4 90 13 59 22
Fax: (33) 4 90 13 59 59
Bill.Venables@csiro.au
2004-Feb-09 11:32 UTC
[R] Recursive partitioning with multicollinear variables
No, for regression trees collinearity is a non-issue, because the procedure is not linear. Having variables that are linearly dependent (even exactly so) merely widens the scope of choice the algorithm has when making cuts.

I'm not sure what you mean by "multicollinear variables should appear as alternate splits". Do you mean that every second split should be in one variable of a particular set? Perhaps you mean "alternative" instead of "alternate"? In either case I think you are worrying over nothing. Just go ahead and do the tree-based model analysis and don't worry about it.

Here is a little picture that might clarify things. Suppose Latitude and Longitude are two variables on which the algorithm may choose to split. This means that splits in these geographical variables can only occur in a North-South or an East-West direction. Let's suppose you add two extra variables that are completely dependent on the first two, namely

    LatPlusLong  <- Latitude + Longitude
    LatMinusLong <- Latitude - Longitude

and now offer all four variables as potential split variables. Now the algorithm may split North-South, East-West, NorthEast-SouthWest or NorthWest-SouthEast. All you have done is increase the scope of choice for the algorithm to make splits. Not only does the linear dependence not matter, I'd argue it could be a pretty good thing.

One serious message to take from this as well, though, is to use regression trees for prediction. Don't read too much into the variables that the algorithm has chosen to use at any stage.

Bill Venables.
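Bill's Latitude/Longitude picture can be tried directly. A minimal sketch, assuming the rpart package (shipped with R) and simulated data: two exactly collinear derived variables are offered alongside Latitude and Longitude, the tree fits without complaint, and the extra variables simply widen the choice of split directions.

```r
library(rpart)

set.seed(1)
n <- 200
Latitude  <- runif(n, 43, 45)
Longitude <- runif(n, 4, 6)

## a response with a "diagonal" structure that Lat + Long captures directly
y <- ifelse(Latitude + Longitude > 48.5, 10, 0) + rnorm(n)

dat <- data.frame(y, Latitude, Longitude,
                  LatPlusLong  = Latitude + Longitude,   # exactly collinear
                  LatMinusLong = Latitude - Longitude)   # exactly collinear

fit <- rpart(y ~ ., data = dat)
fit  ## with this simulated signal the early splits tend to use LatPlusLong
```

Note that `summary(fit)` also lists competitor and surrogate splits at each node, which is where strongly correlated variables show up as near-equivalent alternatives.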
Frank E Harrell Jr
2004-Feb-09 11:53 UTC
[R] Recursive partitioning with multicollinear variables
On Mon, 9 Feb 2004 11:24:39 +0100 "Jean-Noel" <jean-noel.candau at avignon.inra.fr> wrote:

A more accurate and stable result would be obtained by performing a data reduction procedure that ignores the response variable. Combining collinear variables into an index is often better than arbitrarily choosing between them. Then use the indexes in a regression model, unless you have tens of thousands of observations for recursive partitioning, or are using bagging of trees or a related procedure to cancel out the instability in the tree-growing process [which unfortunately will often result in an average of trees that is more complex in appearance than a regression model].

Frank

---
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
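One common way to do the data reduction Frank describes is principal components on the predictors alone, which ignores the response by construction. A minimal sketch in base R, with made-up collinear "climate" variables (the variable names and the two-component cutoff are illustrative assumptions, not part of the original posts):

```r
set.seed(2)
n <- 200
temp   <- rnorm(n, 15, 3)
precip <- 800 - 30 * temp + rnorm(n, sd = 20)   # strongly collinear with temp
humid  <- 0.05 * precip + rnorm(n)              # collinear with both
y      <- 2 * temp + rnorm(n)

## data reduction that never looks at y: PCA on the scaled predictors
clim <- scale(cbind(temp, precip, humid))
pc   <- prcomp(clim)
summary(pc)          # inspect variance explained to decide how many PCs to keep

## use the leading indexes in an ordinary regression model
dat <- data.frame(y, pc$x[, 1:2])
fit <- lm(y ~ PC1 + PC2, data = dat)
summary(fit)
```

Because the components are computed without reference to `y`, the subsequent regression does not inherit the selection instability that collinear predictors cause in stepwise or tree-based variable choice.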