Dear useRs,

First off, sorry about the long post. I figured it's better to give context to get good answers (I hope!). Some time ago I wrote an R function that will get all pairwise interactions of variables in a data frame. This worked fine at the time, but now a colleague would like me to do this with a much larger dataset. They don't know how many variables they are going to have in the end, but they are guessing approximately 2,500 - 3,000 variables with ~2,000 observations for each. My function below is way too slow for this (4 minutes for 100 variables). At the bottom of this post I have included timings for various numbers of variables and total numbers of interactions. I have the results of calling Rprof() on the 100-variable run of my function, so if anyone wants to take a look at it, let me know. I don't want to make a super long post any longer than it needs to be.

What I'd like to know is whether there is anything I can do to speed this function up. I tried looking at going directly to glm.fit, but as far as I understood, for that to be useful the computation of the design matrices, and all of that other stuff that I frankly don't understand, needs to be the same for each model, which is not the case for my analysis, although perhaps I am wrong about this. I also took a look at speedglm, but it seems to be aimed at the analysis of large datasets; I think my problem is that, in effect, there are many datasets, so I am not sure that package can help.

Any ideas on how to make this run faster would be greatly appreciated. I am planning on using parallelization to run the analysis in the end. I don't know how many CPUs I will have access to, but I'd say it won't be more than 8. In addition, I have been advised that the best way to account for multiple testing here is a permutation test, which in this context becomes almost infeasible, since I would have to run each interaction ~10,000 times.

Thanks in advance, cheers,
Davy.

getInteractions2 = function(data, fSNPcol, ccCol)
{
  # fSNPcol is the number of the column that contains the first SNP
  # ccCol is the number of the column that contains the outcome variable
  require(lmtest)
  a = data.frame()
  snps = names(data)[-1:-(fSNPcol - 1)]
  names(data)[ccCol] = "PHENOTYPE"
  terms = as.data.frame(t(combn(snps, 2)))
  attach(data)
  fit1 = c()
  fit2 = c()
  pval = c()
  for(i in 1:length(terms$V1))
  {
    # main-effects-only model
    fit1 = glm(PHENOTYPE ~ get(as.character(terms$V1[i])) + get(as.character(terms$V2[i])),
               family = "binomial")
    # main effects plus interaction
    fit2 = glm(PHENOTYPE ~ get(as.character(terms$V1[i])) + get(as.character(terms$V2[i])) +
                 I(get(as.character(terms$V1[i])) * get(as.character(terms$V2[i]))),
               family = "binomial")
    a = lrtest(fit1, fit2)
    pval = c(pval, a[2, "Pr(>Chisq)"])
  }
  detach(data)
  results = cbind(terms, pval)
  return(results)
}

The table below shows the system.time results for increasing numbers of variables being passed through the function. n is the number of variables, and Ints is the number of pairwise interactions given by that number of variables.

          n  Ints  user.self  sys.self  elapsed
time     10    45       1.20      0.00     1.30
time     15   105       3.40      0.00     3.43
time     20   190       6.62      0.00     6.85
...
time     90  4005     178.04      0.07   195.84
time     95  4465     199.97      0.13   218.30
time    100  4950     221.15      0.08   242.18

Below is some code to reproduce a data frame in case you want to look at the timings or the Rprof() results. Please don't run this unless your machine is super fast, or you're prepared to wait for about 15-20 minutes.
df = data.frame(paste("sid", 1:2000, sep = ""), rbinom(2000, 1, .5))
gtypes = matrix(nrow = 2000, ncol = 3000)
gtypes = apply(gtypes, 2, function(x){ x = sample(0:2, 2000, replace = T); x })
snps = paste("rs", 1000:3999, sep = "")
df = cbind(df, gtypes)
names(df) = c("sid", "status", snps)

times = c()
for(i in seq(10, 100, by = 5)){
  if(i == 100){ Rprof() }
  time = system.time((pvals = getInteractions2(df[, 1:i], 3, 2)))
  print(time)
  times = rbind(times, time)
  if(i == 100){ Rprof(NULL) }
}

numI = function(n){ return(((n^2) - n) / 2) }
timings = cbind(seq(10, 100, by = 5), sapply(seq(10, 100, by = 5), numI), times)

--
David Kavanagh
Nephrology Research, Centre of Public Health, Queen's University Belfast,
A floor, Tower Block, City Hospital, Lisburn Road, BT9 7AB, Belfast, United Kingdom
On Tue, Mar 13, 2012 at 7:11 AM, Davy <davykavanagh at gmail.com> wrote:
> [original message quoted above -- snipped]
The first step is to use glm.fit rather than glm. This leads to simpler as well as faster code in your context -- you feed it a design matrix rather than a formula. That's probably a factor of two or so. Using score tests rather than likelihood ratio tests would also save some time, but again only a factor of two or so.

It looks as though you have SNP data. If the SNPs are not in linkage disequilibrium and the main effects are fairly small, you can center each SNP variable at its mean to make the interaction term uncorrelated with the main effects, and prescreen by just testing for correlation between the interaction variable and the phenotype. Then go back and fit the models for the few pairs that look promising.

Note that in general you *cannot* do a permutation test to handle multiple testing for interactions [http://www.ncbi.nlm.nih.gov/pubmed/20384625]. In order for the permutation null hypothesis to be true you need additional structure, like zero main effects. You can do a parametric bootstrap, which is similar. Or just choose a threshold based on the expected number of false positives rather than trying to get strong control of Type I error.

   -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
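To make the glm.fit suggestion concrete, here is a rough, untested sketch (not Thomas's code; the names y, X and lrt_pair are made up for illustration, and it assumes the simulated df and snps from the original post). It builds the two design matrices directly and computes the likelihood-ratio p-value from the difference in deviances, since lrtest() expects full glm objects rather than the bare list that glm.fit returns.

y <- df$status                        # 0/1 phenotype
X <- as.matrix(df[, snps])            # numeric genotype matrix (0/1/2)

lrt_pair <- function(j, k) {
  x1 <- X[, j]; x2 <- X[, k]
  # null model: intercept + both main effects
  m0 <- glm.fit(cbind(1, x1, x2), y, family = binomial())
  # alternative: add the product (interaction) column
  m1 <- glm.fit(cbind(1, x1, x2, x1 * x2), y, family = binomial())
  # LR statistic = difference in deviances, 1 df
  pchisq(m0$deviance - m1$deviance, df = 1, lower.tail = FALSE)
}

lrt_pair(1, 2)   # p-value for the interaction between the first two SNPs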
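And a rough sketch of the centering/correlation prescreen (again illustrative only, with made-up names Xc, pair_idx, screen_stat; it reuses X and y from the sketch above): centre each SNP, form the product variable for each pair, rank pairs by the absolute correlation of that product with the phenotype, and only run the full model fits on the most promising pairs. For 3,000 SNPs this is ~4.5 million correlations, which is far cheaper than 4.5 million glm fits and is trivially split across cores.

Xc <- scale(X, center = TRUE, scale = FALSE)     # centre each SNP at its mean

pair_idx <- t(combn(seq_len(ncol(Xc)), 2))       # all pairs of column indices

# |correlation| between each centred interaction variable and the phenotype
screen_stat <- apply(pair_idx, 1, function(p) {
  abs(cor(Xc[, p[1]] * Xc[, p[2]], y))
})

# keep, say, the 500 most promising pairs for a proper model fit / LR test
top_pairs <- pair_idx[order(screen_stat, decreasing = TRUE)[1:500], ]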