Johannes Klene
2015-Dec-04 09:15 UTC
[R] Random forest regression: feedback on general approach and possible issues
Hi all,

I'd like to use random forest regression to say something about the importance of a set of genes (binary) for schizophrenia-related behavior (continuous measure). I am still reading up on this technique, but would already really appreciate any feedback on whether my approach is valid.

So, using the randomForest package, is it a good approach to enter a few dozen binary predictors to assess their importance (as a set, and individually) for a continuous measure with a sample size of ~1000 people?

More specific questions:

- I have an additional interest in interactions (though perhaps that is not the best word in this context). Does it make any sense to say something about the influence one predictor has over others by looking at the change in the estimated importance of the others when that predictor is removed from the model?
- I have a few siblings in the data, i.e. non-independence. Is this a problem, and if so, is there anything I can do about it?
- The few papers I have seen so far that use this technique in a similar situation do not include any 'standard' covariates such as age and gender. Should I?

Any and all feedback is greatly appreciated!!

Kind regards,
Johannes

p.s. I hope I've come to the right place despite this being a more general question. If not, please let me know of a forum that is better suited to it.
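For concreteness, the setup asked about above could be sketched roughly as follows. This is only an illustration on simulated data, not an answer to whether the approach is valid. The predictor names (g1 to g30), the effect sizes, the choice of 30 predictors, the dropped predictor (g3), and ntree = 500 are all assumptions made for the example. It ignores the sibling non-independence and leaves out covariates such as age and gender.

```r
# Sketch only: simulated data stand in for the real genotypes and phenotype.
library(randomForest)

set.seed(1)
n <- 1000   # sample size mentioned in the post
p <- 30     # "a few dozen" binary predictors (assumed value)

# Hypothetical binary gene indicators g1..g30
genes <- as.data.frame(matrix(rbinom(n * p, 1, 0.3), nrow = n))
names(genes) <- paste0("g", seq_len(p))

# Hypothetical continuous outcome: g1 and g2 have main effects,
# g3 and g4 only act together
behavior <- 1.0 * genes$g1 + 0.5 * genes$g2 +
  1.0 * genes$g3 * genes$g4 + rnorm(n)

# Fit the full model with permutation importance switched on
fit_full <- randomForest(x = genes, y = behavior,
                         ntree = 500, importance = TRUE)

# Importance of the set as a whole: out-of-bag pseudo R-squared
# after the last tree
oob_rsq <- tail(fit_full$rsq, 1)

# Importance of each predictor: scaled permutation importance (%IncMSE)
imp_full <- importance(fit_full, type = 1)[, 1]

# Refit without one predictor and compare the importance of the rest
drop_var <- "g3"
keep <- setdiff(names(genes), drop_var)
fit_drop <- randomForest(x = genes[, keep], y = behavior,
                         ntree = 500, importance = TRUE)
imp_drop <- importance(fit_drop, type = 1)[, 1]

# Change in importance of the remaining predictors after dropping g3
imp_change <- imp_drop[keep] - imp_full[keep]
head(sort(imp_change), 5)
```

Permutation importance varies from run to run, so any such comparison would presumably need to be repeated over several seeds before reading anything into the differences.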
Bert Gunter
2015-Dec-04 16:02 UTC
[R] Random forest regression: feedback on general approach and possible issues
I would suggest that you post instead on stats.stackexchange.com. This forum is mostly about R programming issues, not statistics (admittedly, the intersection is nonempty, but ...). That Stack Exchange forum is more about statistics. You might also consider a Bioconductor forum, as this appears to be a bioinformatics type of issue.

Cheers,
Bert

P.S. Both of these could be found with suitable internet searches. Don't neglect search engines for these types of queries. I have found them to be very helpful.

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
-- Clifford Stoll

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.