Christoph Lehmann
2004-Sep-12 12:12 UTC
[R] Variable Importance in pls: R or B? (and in glpls?)
Dear R-users, dear Ron,

I use pls from the pls.pcr package for classification. Since I need to know which variables are most influential on the classification performance, which criterion should I look at:

a) B, the array of regression coefficients for a certain model (that is, a certain number of latent variables) (and: squared or absolute values?)

OR

b) the weight matrix RR (or R in the De Jong publication; in Ding & Gentleman this is the P matrix and called 'loadings')? (and again: squared or absolute values?)

And what about glpls (glpls1a)? Shall I look at the 'coefficients' (regression coefficients)?

Thanks for clarification,
Christoph
Ron Wehrens
2004-Sep-13 08:26 UTC
[R] Re: Variable Importance in pls: R or B? (and in glpls?)
On Sunday 12 September 2004 14:12, Christoph Lehmann wrote:
> I use pls from the pls.pcr package for classification. Since I need to
> know which variables are most influential onto the classification
> performance, what criteria shall I look at:
>
> a) B, the array of regression coefficients for a certain model (means a
> certain number of latent variables) (and: squared or absolute values?)

The regression coefficients give the most direct information on which variables influence the classification, although you must be careful with the interpretation if the variables are correlated. So it is the absolute magnitude that is important; why would you look at the squared values?

> OR
>
> b) the weight matrix RR (or R in the De Jong publication; in Ding &
> Gentleman this is the P Matrix and called 'loadings')? (and again:
> squared or absolute values?)

The object that is returned contains X and Y loadings (which are _not_ equal to the RR matrix, btw); these are mainly used for interpretation. The regression coefficients give information on your complete model; the loadings on individual components of the model.

Ron

--
Ron Wehrens
Institute for Molecules and Materials, Analytical Chemistry
Radboud University
Toernooiveld 1, 6525 ED Nijmegen, The Netherlands
Email: R.Wehrens at science.ru.nl
http://www.science.ru.nl/cac
Tel: +31 24 365 2053
Fax: +31 24 365 2653
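A minimal sketch of the distinction drawn above, written in plain Python rather than R so that it is self-contained. It assumes a single response (PLS1) fitted with the textbook NIPALS recursion; it is not the algorithm used inside pls.pcr, and the function name `pls1_fit` and the returned keys are hypothetical. It shows that the regression coefficients B collapse all components into one vector per model, while the weights and loadings are kept per component.

```python
import math

def pls1_fit(X, y, n_components):
    """Illustrative PLS1 fit by NIPALS on column-centred data (a sketch only)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xa = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # deflated copy of X
    yc = [v - y_mean for v in y]

    W, P, R, q = [], [], [], []
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance between deflated X and y.
        w = [sum(Xa[i][j] * yc[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        # Scores, X loadings and y loading for this component.
        t = [sum(Xa[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        p_a = [sum(Xa[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q_a = sum(yc[i] * t[i] for i in range(n)) / tt
        # r maps the ORIGINAL centred X to the scores:
        # r_a = w_a - sum_{j<a} (p_j . w_a) r_j
        r = w[:]
        for p_j, r_j in zip(P, R):
            c = sum(p_j[k] * w[k] for k in range(p))
            r = [r[k] - c * r_j[k] for k in range(p)]
        # Remove the part of X explained by this component.
        Xa = [[Xa[i][j] - t[i] * p_a[j] for j in range(p)] for i in range(n)]
        W.append(w)
        P.append(p_a)
        R.append(r)
        q.append(q_a)

    # One coefficient per variable for the complete model: B = sum_a q_a * r_a.
    B = [sum(q[a] * R[a][j] for a in range(len(q))) for j in range(p)]
    return {"coefficients": B, "weights": W, "loadings": P, "rotations": R}

# Toy data (made up for illustration): y is an exact linear function of X.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]]
y = [2.0 * a - 1.0 * b + 3.0 for a, b in X]
fit = pls1_fit(X, y, n_components=2)

# Rank variables by absolute coefficient, largest first.
ranking = sorted(range(len(fit["coefficients"])),
                 key=lambda j: -abs(fit["coefficients"][j]))
```

Comparing absolute coefficients in this way assumes the variables are on comparable scales, and the caveat about correlated variables still applies: with strongly correlated columns the coefficients can be shared between them in ways that do not reflect any one variable's influence.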
Berton Gunter
2004-Sep-13 16:13 UTC
[R] Variable Importance in pls: R or B? (and in glpls?)
Christoph:

I noted that there were not a great number of people leaping to reply. One reason, I suspect, is that there's really NO GOOD ANSWER to this question. First, there is a huge literature on this -- it's related to variable selection in regression and shrinkage estimates, but, in general, parsimonious model building; second, as Ron Wehrens already noted, when variables are correlated -- which could have as much to do with the vagaries of the sampling as to real physical causality -- the whole notion of "variable importance" is problematic.

Fact is, **any** attempt to rank the contributions of particular variables to PLS components from undesigned data (the usual case) is fraught with hazard. For that reason, it is perhaps best to view pls as merely a way of developing a good predictor, not as a way to uncover causal relationships. I know this is often unsatisfying to scientists trying to build parsimonious mechanistic models (= physical theories), especially as there is quite often little likelihood that the data are representative of any underlying population and therefore capable of predicting anything, but it is the statistical reality.

For a more informed, more interesting, and more eloquent discussion of these and related issues, you might look up Leo Breiman's writings on his web site and his way of trying to assess "variable importance" in his Random Forest methodology, which is available in the package randomForest on CRAN. (I make no claim about the effectiveness of this approach -- only that it is clearly a different way of approaching the issue that clearly reveals the dilemmas).

Cheers,

--
Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA

"The business of the statistician is to catalyze the scientific learning process." - George E. P. Box
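A rough sketch of the permutation idea behind the "variable importance" mentioned above, in plain Python so that it is self-contained. This is a simplified illustration and not the randomForest implementation: Breiman's measure is computed on out-of-bag samples within each tree, whereas this version shuffles one column of a fixed data set and records the drop in accuracy of an arbitrary classifier. The function name `permutation_importance`, its parameters, and the toy data are all hypothetical.

```python
import random

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column of X is shuffled in turn."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == label for r, label in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, breaking its link with the labels.
            column = [row[j] for row in X]
            rng.shuffle(column)
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (made up): the label depends on column 0 only; column 1 is irrelevant.
X = [[(-1.0) ** i * (i + 1), float(i % 3)] for i in range(20)]
y = [1 if row[0] > 0 else 0 for row in X]

def predict(row):
    # Stand-in classifier; in practice this would be the fitted model.
    return 1 if row[0] > 0 else 0

imp = permutation_importance(predict, X, y)
```

Because the stand-in classifier ignores column 1, shuffling that column cannot change any prediction, so its importance is zero. The same caveat raised in the thread applies here: with correlated predictors, shuffling one variable can understate its contribution because the others still carry much of the same information.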