Anna Pluzhnikov
2006-Jun-05 19:27 UTC
[R] Fastest way to do HWE.exact test on 100K SNP data?
Hi everyone,
I'm using the function 'HWE.exact' of the 'genetics' package to compute
p-values of the HWE test. My data set consists of ~600 subjects (cases and
controls) typed at ~10K SNP markers; the test is applied separately to cases
and controls. The genotypes are stored in a list of 'genotype' objects,
all.geno, and p-values are calculated inside a loop over all SNP markers.
I wish to repeat this procedure multiple times (~1000) permuting the cases and
controls (affection status). It seems straightforward to implement it like this:
#############################################
for (iter in 1:1000) {
    set.seed(iter)
    # get the permuted affection status
    permut <- sample(affSt)
    for (j in 1:nSNPs) {
        test <- tapply(all.geno[[j]], permut, HWE.exact)
        pvalControls[j] <- test$"1"$p.value
        pvalCases[j] <- test$"2"$p.value
    }
}
#############################################
The problem is that it takes ~1 min/iteration (on an AMD Opteron 252 processor
running Linux).
Is there a faster/more efficient way to do this?
Thanks,
--
Anna Pluzhnikov, PhD
Section of Genetic Medicine
Department of Medicine
The University of Chicago
Martin Morgan
2006-Jun-06 03:34 UTC
[R] Fastest way to do HWE.exact test on 100K SNP data?
Anna --
This might be a little faster ...
all.geno.df <- data.frame(all.geno)   # exploit shared structure
set.seed(123)                         # once only
for (iter in 1:1000) {
    permut <- sample(affSt)
    control <- all.geno.df[permut==1,]
    case <- all.geno.df[permut==2,]
    pvalControls <- unlist(sapply(control, HWE.exact)["p.value",])
    pvalCases <- unlist(sapply(case, HWE.exact)["p.value",])
    # presumably do something with these, before the next iteration
}
but probably most of the time is being spent in HWE.exact, so these
cosmetic changes won't help much.
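One way to confirm where the time goes is to profile a handful of SNPs with
base R's sampling profiler (a rough sketch, assuming 'all.geno' and 'affSt'
as in your post; 50 SNPs is an arbitrary subset):

Rprof("hwe.prof")                    # start the sampling profiler
permut <- sample(affSt)
for (j in 1:50)                      # an arbitrary subset of the SNPs
    test <- tapply(all.geno[[j]], permut, HWE.exact)
Rprof(NULL)                          # stop profiling
head(summaryRprof("hwe.prof")$by.total)   # functions ranked by total time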
Looking at the code for HWE.exact, it seems to generate a table based on the
numbers of each allele, and then use this table to determine the probability.
Any time two SNPs have the same allele counts, the same table gets generated.
It seems like this will happen a lot. So one
strategy is to only calculate the table once, 'remember' it, and then
the next time a value is needed from that table look it up rather than
calculate it.
This is a bad strategy if most SNPs have different numbers of alleles
(because then the tables get calculated, and a lot of space and some
time is used to store results that are never accessed again).
Here is the code (not tested extensively, so please use with care!):
HWE.exact.lookup <- function() {
    HWE.lookup.tbl <- new.env(hash=TRUE)   # per-closure cache of tables
    dhwe2 <- function(n11, n12, n22) {
        ## from body of HWE.exact
        f <- function(x) lgamma(x + 1)
        n <- n11 + n12 + n22
        n1 <- 2 * n11 + n12
        n2 <- 2 * n22 + n12
        exp(log(2) * (n12) + f(n) - f(n11) - f(n12) - f(n22) -
            f(2 * n) + f(n1) + f(n2))
    }
    function(x) {
        ## adapted from HWE.exact
        if (!is.genotype(x))
            stop("x must be of class 'genotype' or 'haplotype'")
        nallele <- length(na.omit(allele.names(x)))
        if (nallele != 2)
            stop("Exact HWE test can only be computed for 2 markers with 2 alleles")
        allele.tab <- table(factor(allele(x, 1), levels = allele.names(x)),
                            factor(allele(x, 2), levels = allele.names(x)))
        n11 <- allele.tab[1, 1]
        n12 <- allele.tab[1, 2] + allele.tab[2, 1]
        n22 <- allele.tab[2, 2]
        n1 <- 2 * n11 + n12
        n2 <- 2 * n22 + n12
        nm <- paste(n1, n2, sep=".")
        if (!exists(nm, envir=HWE.lookup.tbl)) {   # 'remember' this table
            x12 <- seq(n1 %% 2, min(n1, n2), 2)
            x11 <- (n1 - x12)/2
            x22 <- (n2 - x12)/2
            dist <- data.frame(n11 = x11, n12 = x12, n22 = x22,
                               density = dhwe2(x11, x12, x22))
            dist <- dist[order(dist$density), ]
            dist$density <- cumsum(dist$density)   # cumulative p-value
            assign(nm, dist, envir=HWE.lookup.tbl)
        }
        dist <- get(nm, HWE.lookup.tbl)
        dist$density[dist$n11 == n11 & dist$n12 == n12 & dist$n22 == n22]
    }
}
calcHWE <- HWE.exact.lookup()         # create a new lookup table & function
all.geno.df <- data.frame(all.geno)   # exploit shared structure
set.seed(123)                         # once only
for (iter in 1:1000) {
    permut <- sample(affSt)
    control <- all.geno.df[permut==1,]
    case <- all.geno.df[permut==2,]
    pvalControls <- sapply(control, calcHWE)
    pvalCases <- sapply(case, calcHWE)
}
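A quick sanity check and timing comparison, on the assumption that the lookup
reproduces HWE.exact (the two can differ when there are ties in the density):

p.lookup <- sapply(control, calcHWE)
p.exact  <- unlist(sapply(control, HWE.exact)["p.value",])
all.equal(unname(p.lookup), unname(p.exact))
system.time(sapply(control, calcHWE))                          # memoized
system.time(unlist(sapply(control, HWE.exact)["p.value",]))    # original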
Let me know how it goes if you adopt this approach!
Martin
Ritwik Sinha
2006-Jun-06
[R] Fastest way to do HWE.exact test on 100K SNP data?
Hi Anna,
I am not really answering your question, but here goes my (unsolicited)
suggestion. Some time back I did some simulations to see how the HWE tests
performed. In particular, I compared the exact test and the chi-squared test;
look at the attached figure Rplots.ps. I saw that for null simulations with 40
individuals, the HWE chi-squared test was reasonably close to the (expected)
uniform distribution. However, this was obtained using the function
HWE.chisq(X, simulate.p.value=F); the default seemed to have some issues.
Hence my suggestion would be, if you feel comfortable, to replace the exact
test with a chi-squared test, at least at the screening level. Once you
identify a set of SNPs with "small" p-values, you could follow them up with
the exact test (a rough sketch of this two-stage screen follows below).
--
Ritwik Sinha
Graduate Student
Epidemiology and Biostatistics
Case Western Reserve University
http://darwin.cwru.edu/~rsinha
[Attachment: Rplots.ps, application/postscript, 7129 bytes:
https://stat.ethz.ch/pipermail/r-help/attachments/20060606/d1bf16a8/attachment.ps]
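A minimal sketch of that two-stage screen, assuming 'all.geno' as in the
original post and that HWE.chisq returns an object with a 'p.value' component,
as the usage above suggests (the 0.01 cutoff is arbitrary):

p.screen <- sapply(all.geno, function(g)
                   HWE.chisq(g, simulate.p.value=FALSE)$p.value)
hits <- which(p.screen < 0.01)       # SNPs with "small" screening p-values
p.exact <- sapply(all.geno[hits], function(g) HWE.exact(g)$p.value)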
dhinds at sonic.net
2006-Jun-06 18:21 UTC
[R] Fastest way to do HWE.exact test on 100K SNP data?
Anna Pluzhnikov <apluzhni at bsd.uchicago.edu> wrote:
> I'm using the function 'HWE.exact' of the 'genetics' package to compute
> p-values of the HWE test. My data set consists of ~600 subjects (cases and
> controls) typed at ~10K SNP markers; the test is applied separately to
> cases and controls.

Just to concur with the previous two posters: when I've needed to calculate
lots of HWE values, I've done one of two things.

1. Use a faster test: either the chi-squared test or a likelihood ratio test.
Optionally, you could use the exact test when one of the allele counts is very
small, and use an asymptotic test in other cases. I often just use the fast
test on everything. Especially since you are doing a permutation analysis on
the test values, the exact test may not be buying you anything.

2. Cache the HWE values; this works great if you can get some reuse of
previously calculated values. If you're calculating for ~300 cases and ~300
controls, there would be 600*601/4, or only ~90K, possible sets of AA/AB/BB
allele counts assuming complete data (you can flip the counts around so that
the "AA" count is always for the more frequent allele); with missing data there
can be many more, but most of those possibilities will never be observed under
the null hypothesis of HWE. And since you are computing roughly 10 million HWE
values, you'll have a lot of reuse of previously calculated values (a rough
sketch of this caching idea follows below).

-- Dave
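A minimal sketch of the caching idea, reusing the allele-table construction
from Martin's lookup code; the names 'hwe.cache' and 'cachedHWE' are made up,
and the sketch assumes biallelic markers with the 'genetics' package loaded:

hwe.cache <- new.env(hash=TRUE)          # keys are "nAA.nAB.nBB" strings
cachedHWE <- function(g) {
    ## count the AA/AB/BB genotypes of a biallelic 'genotype' object
    tab <- table(factor(allele(g, 1), levels=allele.names(g)),
                 factor(allele(g, 2), levels=allele.names(g)))
    n11 <- tab[1, 1]
    n12 <- tab[1, 2] + tab[2, 1]
    n22 <- tab[2, 2]
    ## flip so the first homozygote count is the larger one; the exact
    ## test is symmetric in the allele labels, so the p-value is unchanged
    if (n11 < n22) { tmp <- n11; n11 <- n22; n22 <- tmp }
    key <- paste(n11, n12, n22, sep=".")
    if (!exists(key, envir=hwe.cache))
        assign(key, HWE.exact(g)$p.value, envir=hwe.cache)
    get(key, envir=hwe.cache)
}
## used in place of HWE.exact inside the permutation loop, e.g.
## pvalControls <- sapply(control, cachedHWE)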