It's not so obvious to me that this is an artifact. What prcomp() says is
that some of the eigenvectors have a lot of "activity" in some
relatively narrow ranges of SNPs (on the same chromosome, perhaps?). If
something artificial is going on, I could imagine effects not so much of
centering columns but maybe one of
(a) imputing zero for missing values
(b) the scaling by sqrt(pi*(1-pi)) implicitly requiring Hardy-Weinberg
equilibrium, so if your data are all 0 or 2 (aa or AA) there will be
overdispersion.
However, it could also be a biological effect. Are your ids by any chance from
the same pedigree? If so, you might be seeing something like the effect of a
crossover event in a distant ancestor. (Talk to a geneticist, I just "play
one on TV".)
To investigate further, you could go looking at the individual scores and see
who is having extreme values on component 2-4 and then go back and see if there
is something peculiar about their SNPs in the "strange" region.
Of course, you might have stumbled upon a bug in R, but I doubt so.
Happy hunting!
-pd
On Oct 3, 2013, at 11:41 , Hermann Norpois wrote:
> Hello,
>
> I did a pca with over 200000 snps for 340 observations (ids). If I plot the
> eigenvectors (called rotation in prcomp) 2,3 and 4 (e.g. plot
> (rotation[,2]) I see a strange "column" in my data (see
attachment). I
> suggest it is an artefact (but of what?).
>
> Suggestion:
> I used prcomp this way: prcomp (mat), where mat is a matrix with the column
> means already substracted followed by a normalisation procedure (see below
> for details). Is that okay? Or does prcomp repeat substraction steps?
>
> Originally my approach was driven by the idea to compute a covariation
> matrix followed by the use of eigen, but the covariation matrix was to huge
> to handle. So I switched to prcomp.
>
> As I guess that the "columns" in my plots reflect some artefact
production
> I hope to get some help. For the case that my use of prcomp was not okay,
> could you please give me instructions how to use it - including with the
> normalisation procedure that I need to include before doing a pca.
>
> Thanks
> Hermann
>
> #
> # mat: matrix with genotypes coded as 0,1 and 2 (columns); IDs
> (observations) as rows.
> #
> prcomp.snp <- function (mat)
> {
> m <- ncol (mat)
> n <- nrow (mat)
> snp.namen <- colnames (mat)
> for (i in 1:m)
> {
> # snps in columns
> ui <- mat[,i]
> n <- length (which (!is.na(ui)))
> # see methods Price et al. as correction
> pi <- (1+ sum(ui, na.rm=TRUE))/(2+2*n)
>
> # substract mean
> ui <- ui - mean (ui, na.rm=TRUE)
> # NAs set to zero
> ui[is.na(ui)] <- 0
> # normalisation of the genotype for each ID
> important normalisation step
> ui <- ui/ (sqrt (pi*(1-pi)))
> # fill matrix with ui
> mat[,i] <- ui
> }
> mat <- prcomp (mat)
> return (mat)
> }
> <rotplot.png>______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk Priv: PDalgd at gmail.com