Corrado
2008-Dec-11 11:46 UTC
[R] Principal Component Analysis - Selecting components? + right choice?
Dear R gurus,

I have some climatic data for a region of the world. They are monthly averages for 1950-2000 of precipitation (12 months), minimum temperature (12 months) and maximum temperature (12 months). I have scaled them to 2 km x 2 km cells, and I have around 75,000 cells.

I need to feed them into a statistical model as covariates, to use them to predict a response variable.

The climatic data are obviously correlated: precipitation for January is correlated with precipitation for February and so on. Even precipitation and temperature are heavily correlated. I did some correlation analysis and they are all strongly correlated.

I thought of running PCA on them, in order to reduce the number of covariates I feed into the model.

I ran the PCA using prcomp, quite successfully. Now I need a criterion to select the right number of PCs (that is: is it 1, 2, 3, 4?).

What criterion would you suggest?

At the moment, I am using a criterion based on a threshold, but that is highly subjective, even if there are some rules of thumb (Jolliffe, Principal Component Analysis, II Edition, Springer Verlag, 2002).

Could you suggest something more rigorous?

By the way, do you think I would have been better off using something different from PCA?

Best,
-- Corrado Topi Global Climate Change & Biodiversity Indicators Area 18, Department of Biology University of York, York, YO10 5YW, UK Phone: + 44 (0) 1904 328645, E-mail: ct529 at york.ac.uk
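For readers following along, the threshold rule described in the question can be sketched roughly as below. This is only an illustration on simulated data: the matrix X, the two latent drivers, and the 90% cut-off are assumptions made for the example, not values from the original post.

```r
set.seed(1)
n <- 500

# Hypothetical data: 12 correlated "monthly" variables driven by 2 latent factors
latent   <- matrix(rnorm(n * 2), ncol = 2)
loadings <- matrix(runif(2 * 12, 0.5, 1.5), nrow = 2)
X <- latent %*% loadings + matrix(rnorm(n * 12, sd = 0.3), ncol = 12)
colnames(X) <- paste0("v", 1:12)

# PCA on centred, standardised variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance per component and its running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Assumed threshold: keep the smallest k reaching 90% of total variance
threshold <- 0.90
k <- which(cum_var >= threshold)[1]

# Scores on the retained components, usable as covariates
scores <- pca$x[, seq_len(k), drop = FALSE]
print(round(cum_var, 3))
print(k)
```

The choice of threshold is exactly the subjective part the question is about; changing it changes k.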
Stéphane Dray
2008-Dec-11 12:30 UTC
[R] Principal Component Analysis - Selecting components? + right choice?
You can have a look at:

S. Dray. On the number of principal components: A test of dimensionality based on measurements of similarity between matrices. Computational Statistics and Data Analysis, 52:2228-2237, 2008.

The method is implemented in the testdim function of the ade4 package.

Cheers.
-- Stéphane DRAY (dray at biomserv.univ-lyon1.fr) Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - Lyon I 43, Bd du 11 Novembre 1918, 69622 Villeurbanne Cedex, France Tel: 33 4 72 43 27 57 Fax: 33 4 72 43 13 88 http://biomserv.univ-lyon1.fr/~dray/
S Ellison
2008-Dec-11 14:36 UTC
[R] Principal Component Analysis - Selecting components? + right choice?
If you're intending to create a model using PCs as predictors, select the PCs based on whether they contribute significantly to the model fit.

In chemometrics (multivariate stats in chemistry, among other things), if we're expecting 3 or 4 PCs to be useful in a principal component regression, we'd generally start with at least the first half-dozen or so and let the model fit sort them out.

The reason for not preselecting too rigorously early on is that there's no guarantee at all that the first couple of PCs are good predictors for what you're interested in. They're properties of the predictor set, not of the response set.

Mind you, there used to be something of a gap between chemometrics and proper statistics; I'm sure chemometricians used to do things with data that would turn a statistician pale.

You could also look for a PLS model, which (if I recall correctly) actually uses the response data to select the latent variables used for prediction.

S
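The "start with half a dozen PCs and let the model fit sort them out" suggestion might look something like the following. It is a minimal sketch on simulated data, using a plain lm as the model; the data, the response that depends only on PC3, and the 0.05 cut-off are all assumptions chosen to illustrate the point that a later PC can be the useful predictor.

```r
set.seed(42)
n <- 300

# Hypothetical predictors: 10 correlated variables from 3 latent factors
latent <- matrix(rnorm(n * 3), ncol = 3)
W <- matrix(rnorm(3 * 10), nrow = 3)
X <- latent %*% W + matrix(rnorm(n * 10, sd = 0.5), ncol = 10)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Start from the first six PCs rather than preselecting one or two
pcs <- as.data.frame(pca$x[, 1:6])

# Assumed response: driven by PC3 only, plus noise
y <- 2 * pcs$PC3 + rnorm(n)
dat <- data.frame(y = y, pcs)

# Principal component regression via ordinary least squares
fit <- lm(y ~ ., data = dat)

# p-values of the PC coefficients (intercept dropped)
pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
keep <- names(pvals)[pvals < 0.05]
print(round(pvals, 4))
print(keep)
```

Had only the first two PCs been retained on variance grounds, the one component that actually predicts y here would have been discarded before the model ever saw it.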