Dear all,

(Mac OS X 10.4.11, R 2.6.0)

I have a quantitative dataset with a lot of NAs in it; so many that it is not possible to delete all rows with NAs, nor to delete all variables with NAs. Is there a function for principal component analysis that can deal with so many NAs?

Thanks in advance,
Birgit

Birgit Lemcke
Institut für Systematische Botanik
Zollikerstrasse 107
CH-8008 Zürich
Switzerland
Ph: +41 (0)44 634 8351
birgit.lemcke at systbot.uzh.ch
Dear Birgit,

You need to think about why you have that many NAs. With vegetation data, it is very common to have only a few species present at a site. So how would you record the abundance of a species that is absent: NA or 0 (zero)? One could argue that it needs to be NA, because you can't measure the abundance of a species that is absent. Others could argue that a missing species has, by definition, zero abundance. In my opinion it is best to use 0 (zero) for absent species and NA for species that are present but whose abundance was not recorded.

HTH,

Thierry

----------------------------------------------------------------------------
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics, methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium
tel. + 32 54/436 185
Thierry.Onkelinx at inbo.be
www.inbo.be

Do not put your faith in what statistics say until you have carefully considered what they do not say. ~William W. Watt
A statistical analysis, properly conducted, is a delicate dissection of uncertainties, a surgery of suppositions. ~M.J.Moroney

-----Original message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On behalf of Birgit Lemcke
Sent: Friday 23 November 2007 16:43
To: R Hilfe
Subject: [R] PCA with NA

Dear all,

(Mac OS X 10.4.11, R 2.6.0)

I have a quantitative dataset with a lot of NAs in it; so many that it is not possible to delete all rows with NAs, nor to delete all variables with NAs. Is there a function for principal component analysis that can deal with so many NAs?
Thanks in advance,
Birgit

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
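
Thierry's distinction between "absent" and "present but not measured" can be made explicit in the data before any analysis. The following is only a minimal sketch: the site and species names, the abundance values, and the separate `present` matrix are all hypothetical and stand in for whatever presence records a real survey provides.

```r
# Hypothetical survey: one row per site, one column per species.
# NA means "no abundance value was recorded" for either reason.
abund <- matrix(c(12, NA,  3,
                  NA,  5, NA,
                   7, NA, NA),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("site", 1:3),
                                c("sp_a", "sp_b", "sp_c")))

# Hypothetical presence records: TRUE = species was observed at the site.
present <- matrix(c(TRUE,  FALSE, TRUE,
                    FALSE, TRUE,  TRUE,
                    TRUE,  FALSE, FALSE),
                  nrow = 3, byrow = TRUE,
                  dimnames = dimnames(abund))

# Absent species get abundance 0; present-but-unmeasured cells stay NA.
recoded <- abund
recoded[!present] <- 0

recoded
sum(is.na(recoded))   # number of genuinely missing abundances left
```

After this recoding, only cells where the species was present but its abundance is unknown remain NA, so the amount of truly missing data may be much smaller than the raw table suggests.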
Hi,

in centred PCA, missing data should be replaced by the mean of available data.
Let X be your analyzed matrix (variables in columns).

##
X <- matrix(runif(300), ncol = 10)   # 30 observations, 10 variables
idx <- sample(1:nrow(X), 5)          # pick 5 rows at random
X[idx, ] <- NA                       # make these rows entirely missing
sum(is.na(X))                        # 5 rows x 10 columns
[1] 50

library(ade4)
dudi.pca(X, center = TRUE, scale = FALSE)
Error in dudi.pca(X, center = TRUE, scale = FALSE) : na entries in table
##

Now we replace missing values:

##
# Replace the NAs of a vector by the mean of its observed values.
f1 <- function(vec) {
    m <- mean(vec, na.rm = TRUE)
    vec[is.na(vec)] <- m
    return(vec)
}

Y <- apply(X, 2, f1)   # column-wise mean imputation

pcaY <- dudi.pca(Y, center = TRUE, scale = FALSE, nf = 2, scannf = FALSE)

s.label(pcaY$li)                               # plot the row scores
sunflowerplot(pcaY$li[idx, 1:2], add = TRUE)   # highlight the imputed rows
##

All rows with missing values are placed at the non-informative point, i.e. at the origin.

Regards,

Thibaut.

--
######################################
Thibaut JOMBART
CNRS UMR 5558 - Laboratoire de Biométrie et Biologie Evolutive
Universite Lyon 1
43 bd du 11 novembre 1918
69622 Villeurbanne Cedex
Tél. : 04.72.43.29.35
Fax : 04.72.43.13.88
jombart at biomserv.univ-lyon1.fr
http://lbbe.univ-lyon1.fr/-Jombart-Thibaut-.html?lang=en
http://pbil.univ-lyon1.fr/software/adegenet/
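
The claim that mean-imputed rows end up at the origin can be checked numerically. The sketch below repeats the same idea with base R's `prcomp` instead of `dudi.pca`, so it runs without ade4. The seed and the helper name `impute_mean` are arbitrary choices made for this illustration, not part of Thibaut's message.

```r
set.seed(1)                            # arbitrary seed, for reproducibility only
X <- matrix(runif(300), ncol = 10)     # 30 observations, 10 variables
idx <- sample(nrow(X), 5)
X[idx, ] <- NA                         # five fully missing rows

# Replace each NA by the mean of the observed values in its column.
impute_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}
Y <- apply(X, 2, impute_mean)

# Centred, unscaled PCA on the imputed matrix.
pca <- prcomp(Y, center = TRUE, scale. = FALSE)

# Largest absolute score among the imputed rows: zero up to rounding error.
max(abs(pca$x[idx, ]))
```

This works because imputing with the column mean leaves that mean unchanged, so a fully imputed row is exactly zero after centring. A row with only some values missing is imputed in those cells only, and its observed values still move it away from the origin.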
The 'factor.model.stat' function (available in the public domain area of http://www.burns-stat.com) fits a principal components factor model to data that can have NAs. You might be able to copy what it does for your purposes. It does depend on there being some variables (columns) that have no missing values.

If that doesn't work for you, then I would guess that doing missing value imputation could be another approach. I'm sure there be dragons there -- perhaps others on the list know where they lie.

Patrick Burns
patrick at burns-stat.com
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and "A Guide for the Unwilling S User")
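
Before choosing between these routes it may help to check whether any columns are in fact complete. The sketch below does that on simulated data, then shows one further option: a PCA taken from a covariance matrix estimated on pairwise-complete observations. This is not what `factor.model.stat` does; it is a different technique, shown only as an assumption-laden illustration, and the data, seed, and column names are made up.

```r
set.seed(2)                                   # arbitrary seed
X <- matrix(rnorm(200), ncol = 5,
            dimnames = list(NULL, paste0("v", 1:5)))
X[sample(length(X), 30)] <- NA                # scatter 30 NAs over the cells
X[, "v1"] <- rnorm(nrow(X))                   # keep one column fully observed

# Share of missing values per column, and the columns with none at all.
na_share <- colMeans(is.na(X))
complete_cols <- colnames(X)[colSums(is.na(X)) == 0]

# Covariance estimated from all pairwise-complete observations.
S <- cov(X, use = "pairwise.complete.obs")
eig <- eigen(S, symmetric = TRUE)

# Each entry of S uses a different subset of rows, so S is not guaranteed
# to be positive semi-definite; negative eigenvalues would reveal that.
any_negative <- any(eig$values < 0)

na_share
complete_cols
eig$values
```

The eigenvectors in `eig$vectors` play the role of loadings, but row scores cannot be computed directly for rows that still contain NAs, so this approach describes the variables rather than placing every observation on the axes.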
Thanks to all for your help.

Only to complete this: the NAs in my case mean that I have no information for this character in this species. These are not ecological data, so I have to deal with the NAs somehow without replacing them by zero.

I think Thibaut's help is very useful.

Thanks a lot,
Birgit

On 23.11.2007 at 17:26, Thibaut Jombart wrote:
> Hi,
>
> in centred PCA, missing data should be replaced by the mean of
> available data.
> [...]