Damjan Krstajic
2010-Mar-06 00:39 UTC
[R] scientific (statistical) foundation for Y-RANDOMIZATION in regression analysis
Dear all,

I am a statistician doing research in QSAR, building regression models where the dependent variable is a numerical expression of some chemical activity and the input variables are chemical descriptors, e.g. molecular weight, number of carbon atoms, etc.

I am building regression models and I am confronted with a widely used technique called Y-RANDOMIZATION, for which I have difficulty finding references in the general statistical literature on regression analysis. I would be grateful if someone could point me to papers/literature in statistical regression analysis which give a scientific (statistical) foundation for using Y-RANDOMIZATION.

Y-RANDOMIZATION is a widely used technique in the QSAR community to ensure the robustness of a QSPR (regression) model. It is used after the "best" regression model is selected, to make sure that there are no chance correlations. Here is a short description. The dependent variable vector (Y-vector) is randomly shuffled and a new QSPR (regression) model is fitted using the original independent variable matrix. By repeating this a number of times, say 100 times, one gets a hundred R2 and q2 (leave-one-out cross-validation R2) values, one pair for each shuffled Y. It is expected that the resulting regression models should generally have low R2 and low q2 values. However, if the majority of the hundred regression models obtained in the Y-randomization have relatively high R2 and high q2, then it implies that an acceptable regression model cannot be obtained for the given data set by the current modelling method.

I cannot find any references to Y-randomization or Y-scrambling anywhere in the literature outside chemometrics/QSAR. Any links or references would be much appreciated.

Thanks in advance.
DK
----------------------------------------------
Damjan Krstajic
Director
Research Centre for Cheminformatics
Belgrade, Serbia
----------------------------------------------
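The procedure described in the post above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, the model is ordinary least squares on three made-up descriptors, q2 is taken as 1 - PRESS/TSS, and the helper names r2_q2() and y_randomization() are hypothetical, not functions from any package.

```r
# Minimal sketch of Y-randomization with ordinary least squares.
# Simulated data; r2_q2() and y_randomization() are hypothetical helpers.
set.seed(1)

n <- 40
X <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
y <- 1 + 2 * X$x1 - X$x2 + rnorm(n, sd = 0.5)

# R2 and leave-one-out q2 for a linear model of y on all columns of X
r2_q2 <- function(y, X) {
  fit <- lm(y ~ ., data = data.frame(y = y, X))
  # For lm, leave-one-out residuals follow from the hat values: e_i / (1 - h_ii)
  loo_res <- residuals(fit) / (1 - hatvalues(fit))
  tss <- sum((y - mean(y))^2)
  c(R2 = summary(fit)$r.squared,
    q2 = 1 - sum(loo_res^2) / tss)
}

# Shuffle y, refit on the original X, and collect R2 and q2 each time
y_randomization <- function(y, X, n_perm = 100) {
  t(replicate(n_perm, r2_q2(sample(y), X)))
}

observed <- r2_q2(y, X)
shuffled <- y_randomization(y, X, n_perm = 100)

print(observed)
print(colMeans(shuffled))
```

Comparing `observed` with the rows of `shuffled` (for example with `hist(shuffled[, "R2"])`) shows how far the real model sits from what shuffled responses produce. Note that q2 defined this way can be negative.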
Greg Snow
2010-Mar-06 05:51 UTC
[R] scientific (statistical) foundation for Y-RANDOMIZATION in regression analysis
In the stats literature these are more often called permutation tests. Looking up that term should give you some results (if not, I have some references, but they are at work and I am not, I could probably get them for you on Monday if you have not found anything before then).

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111
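Read as a permutation test, the shuffled fits give a reference distribution from which a p-value can be computed. The sketch below assumes simulated data, R2 of a two-predictor least squares fit as the test statistic, and a hypothetical helper perm_p_value(); it is one common way to set this up, not the only one.

```r
# Minimal sketch: permutation p-value for the R2 of a linear model.
# Simulated data; perm_p_value() is a hypothetical helper, not a package function.
set.seed(42)

n  <- 30
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 0.8 * x1 + rnorm(n)

# Test statistic: R2 of the least squares fit of a response on x1 and x2
r2 <- function(y) summary(lm(y ~ x1 + x2))$r.squared

perm_p_value <- function(y, stat, n_perm = 999) {
  observed <- stat(y)
  null_stats <- replicate(n_perm, stat(sample(y)))
  # Count the observed value as one member of the reference set,
  # so the p-value can never be exactly zero.
  p <- (1 + sum(null_stats >= observed)) / (n_perm + 1)
  list(observed = observed, null = null_stats, p_value = p)
}

res <- perm_p_value(y, r2)
print(res$observed)
print(res$p_value)
```

The smallest p-value this can return is 1 / (n_perm + 1), so the number of permutations limits how much evidence the test can express.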
Liaw, Andy
2010-Mar-08 14:44 UTC
[R] scientific (statistical) foundation for Y-RANDOMIZATION in regression analysis
That sounds like a particular form of permutation test. If the "scrambling" is replaced by sampling with replacement (i.e., some data points can be sampled more than once while others can be left out), that's the simple (or nonparametric) bootstrap.

The goal is to generate the distribution of the statistic of interest (R^2 or q^2) under the null hypothesis that there's no relationship between the activity (or property) and the structure. To make the "test" valid, one needs to ensure that the entire model building process is carried through for all of the sampled data, including feature selections, etc.

Andy
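The point about carrying the entire model-building process through each shuffle can be illustrated with a small simulation. The sketch below makes several assumptions: the data are pure noise, the descriptor pool has 300 columns of which 4 are kept, selection is by absolute correlation with y (a deliberately simple stand-in for whatever selection method is really used), and select_and_fit() is a hypothetical helper.

```r
# Minimal sketch: the permutation must repeat the whole model-building process,
# here a deliberately simple descriptor selection followed by least squares.
# Simulated pure-noise data; select_and_fit() is a hypothetical helper.
set.seed(7)

n <- 50    # compounds
p <- 300   # descriptor pool
k <- 4     # descriptors kept in the final model
X <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)          # no real relationship between y and X

# Pick the k descriptors most correlated with y, then fit and return R2
select_and_fit <- function(y, X, k) {
  keep <- order(abs(cor(X, y)), decreasing = TRUE)[seq_len(k)]
  fit <- lm(y ~ X[, keep, drop = FALSE])
  list(keep = keep, R2 = summary(fit)$r.squared)
}

observed <- select_and_fit(y, X, k)

# Invalid null: descriptors chosen once on the real y, only the fit is repeated
null_fixed <- replicate(200, {
  ys <- sample(y)
  summary(lm(ys ~ X[, observed$keep, drop = FALSE]))$r.squared
})

# Valid null: selection and fit are both redone for every shuffled y
null_full <- replicate(200, select_and_fit(sample(y), X, k)$R2)

p_fixed <- (1 + sum(null_fixed >= observed$R2)) / (length(null_fixed) + 1)
p_full  <- (1 + sum(null_full  >= observed$R2)) / (length(null_full)  + 1)

print(c(observed = observed$R2,
        mean_null_fixed = mean(null_fixed),
        mean_null_full  = mean(null_full)))
print(c(p_fixed = p_fixed, p_full = p_full))
```

Because y is noise here, any apparent fit comes from the selection step alone. Shuffling only the final fit ignores that step and makes the model look better than chance; redoing the selection inside the loop puts the observed R2 on the same footing as the shuffled ones.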
tauQSAR
2011-Jul-11 19:31 UTC
[R] scientific (statistical) foundation for Y-RANDOMIZATION in regression analysis
Hello,

I'm also working on a QSAR validation, and would like to confirm that my multiple least squares regression based on only 4 x-variables out of a pool of 300 x-variables is significant. I want to apply the y-randomization or y-scrambling permutation protocol using R, but I have not been able to find any examples that are similar to my problem. Could someone clarify whether this type of validation protocol could be done with the R 'bootstrap' function, or 'onetPermutation' or 'sample'?

I'm not a statistician and I'm having a hard time deciphering the code and explanations in the manual, so help would be greatly appreciated!

Thanks!