Andrew Park
2007-Feb-23 15:51 UTC
[R] Repeated measures in Classification and Regression Trees
Dear R members,

I have been trying to find out whether one can use multivariate regression trees (for example mvpart) to analyze repeated measures data. As a non-parametric technique, CART is insensitive to most of the assumptions of parametric regression, but repeated measures data raise the issue of the independence of several data points measured on the same subject, or from the same plot over time.

Any perspectives will be welcome,

Andy Park (Assistant Professor)
Centre for Forest Interdisciplinary Research (CFIR), Department of Biology, University of Winnipeg, 515 Portage Avenue, Winnipeg, Manitoba, R3B 2E9, Canada
Phone: (204) 786-9407
Bert Gunter
2007-Feb-23 16:46 UTC
[R] Repeated measures in Classification and Regression Trees
Andrew:

Good question! AFAIK most of the so-called "machine learning" machinery -- regression and classification trees, SVMs, neural nets, random forests, and other more chic methods (I make no attempt to keep up with all of them) -- ignores error structure; that is, it assumes the data are at least independent (not necessarily identically distributed). I don't think merely exchangeable is good enough either, though I may be wrong about this.

But I believe you have put your finger on a key issue: although all this "cool" methodology is usually not terribly concerned with inference (cross-validation and bootstrapping being the usual methodology rather than, say, asymptotics), one wonders how biased the estimators are when there are various correlations in the data. I suspect a lot, depending on the nature of the correlations and the methods.

I think the moral is: thermodynamics still rules -- there's no free lunch. You are just as likely to produce nonsense using all this "nonparametric" methodology as you are using parametric methods if you ignore the error structure of the data.

Incidentally, I should point out that George Box fulminated on this very issue about 50 years ago. In his statistics classes he always used to say that all the fuss (then) about using non-parametric rank-based methods (e.g. Mann-Whitney-Wilcoxon) rather than parametric t-statistics was silly, since the t-statistics were relatively insensitive to departures from normality anyway, and it was lack of independence, not exact normality, that was the key practical issue; both approaches were sensitive to that. He published several papers to this effect, of course.

Needless to say, I would welcome other -- especially better informed and contrary -- views on these issues, either on or off list.
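To make the worry about cross-validation concrete, here is a rough simulation sketch, not a tree fit and not anything from mvpart or rpart. Everything in it is an assumption chosen for illustration: the data are simulated with a random intercept per subject, the covariate `x` is constant within a subject and unrelated to the response, and a 1-nearest-neighbour predictor stands in for any flexible learner (a deeply grown tree could isolate subjects in a similar way). The function names `simulate`, `predict_1nn` and `cv_mse` are made up for this example.

```python
import random


def simulate(n_subjects=40, n_repeats=5, seed=1):
    """Return (subject, x, y) rows with a shared random intercept per subject."""
    rng = random.Random(seed)
    rows = []
    for subject in range(n_subjects):
        x = rng.random()                 # subject-level covariate, unrelated to y
        intercept = rng.gauss(0.0, 1.0)  # source of within-subject correlation
        for _ in range(n_repeats):
            rows.append((subject, x, intercept + rng.gauss(0.0, 0.1)))
    return rows


def predict_1nn(train, x):
    """Predict y from the training row whose x is closest (1-nearest-neighbour)."""
    return min(train, key=lambda row: abs(row[1] - x))[2]


def cv_mse(rows, fold_of, k=5):
    """k-fold cross-validated mean squared error; fold_of(index, row) assigns folds."""
    squared_errors = []
    for fold in range(k):
        train = [r for i, r in enumerate(rows) if fold_of(i, r) != fold]
        test = [r for i, r in enumerate(rows) if fold_of(i, r) == fold]
        squared_errors += [(predict_1nn(train, x) - y) ** 2 for _, x, y in test]
    return sum(squared_errors) / len(squared_errors)


rows = simulate()
record_mse = cv_mse(rows, lambda i, row: i % 5)        # repeats of a subject spread over folds
subject_mse = cv_mse(rows, lambda i, row: row[0] % 5)  # each subject kept in a single fold
print(f"record-wise CV MSE:  {record_mse:.3f}")
print(f"subject-wise CV MSE: {subject_mse:.3f}")
```

Under these assumptions the record-wise estimate should come out much smaller than the subject-wise one, because each held-out record has siblings from the same subject in the training folds, even though `x` carries no information about a new subject. Holding out whole subjects is one common way to get an error estimate that respects the dependence; it does not by itself fix how the model is fitted.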
Cheers, Bert Gunter Genentech Nonclinical Statistics South San Francisco, CA 94404 650-467-7374