thr3ads.net - R help - [R] Is there a fast way to do several hundred thousand ANOVA tests? [Aug 2009]

If this information is useful, please help other people find it:
Share via:

big permie

2009-Aug-24 00:55 UTC

[R] Is there a fast way to do several hundred thousand ANOVA tests?

Dear R users,

I have a matrix a and a classification vector b such that
> str(a)num [1:50, 1:800000]
and> str(b)Factor w/ 3 levels "cond1","cond2","cond3"

I'd like to do an anova on all 800000 columns and record the F statistic for
each test; I currently do this using

f.stat.vec <- numeric(length(a[1,])

for (i in 1:length(a[1,]) {
  f.test.frame <- data.frame(nums = a[,i], cond = b)
  aov.vox <- aov(nums ~ cond, data = f.test.frame)
  f.stat <- summary(aov.vox)[[1]][1,4]
  f.stat.vec[i] <- f.stat
}

The problem is that this code takes about 70 minutes to run.

Is there a faster way to do an anova & record the F stat for each column?

Any help would be appreciated.

Thanks
Heath

	[[alternative HTML version deleted]]

glen_b

2009-Aug-24 01:40 UTC

head link

[R] Is there a fast way to do several hundred thousand ANOVA tests?

The usual methods of avoiding loops may provide some speedup (but you've
already done the allocation of the full results vector outside the loop,
which is probably a major saving) - others may have more detailed advice on
that score.

Within the loop there are some speedups possible.

update() will allow you to avoid constantly recomputing the things based off
the design matrix, which doesn't look like it's changing.

see ?update

If you're just interested in the F-statistic, more direct approaches that
avoid all the other calculation done by aov or lm may be substantially
faster; you can work directly with the QR or choleski decomposition (done
outside the loop/"apply" part), for example. Faster (but less
numerically
stable) methods (e.g. involving use of a SWEEP operator) exist; if your
ANOVA is balanced, then things might be done even more directly.

big permie wrote:> 
> Dear R users,
> 
> I have a matrix a and a classification vector b such that
> 
>> str(a)
> num [1:50, 1:800000]
> and
>> str(b)
> Factor w/ 3 levels "cond1","cond2","cond3"
> 
> I'd like to do an anova on all 800000 columns and record the F
statistic
> for
> each test; I currently do this using
> 
> f.stat.vec <- numeric(length(a[1,])
> 
> for (i in 1:length(a[1,]) {
>   f.test.frame <- data.frame(nums = a[,i], cond = b)
>   aov.vox <- aov(nums ~ cond, data = f.test.frame)
>   f.stat <- summary(aov.vox)[[1]][1,4]
>   f.stat.vec[i] <- f.stat
> }
> 
> The problem is that this code takes about 70 minutes to run.
> 
> Is there a faster way to do an anova & record the F stat for each
column?
> 
> Any help would be appreciated.
> 
> Thanks
> Heath
> 
> 	[[alternative HTML version deleted]]
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
> 
> 
-- 
View this message in context:
http://www.nabble.com/Is-there-a-fast-way-to-do-several-hundred-thousand-ANOVA-tests--tp25109056p25109345.html
Sent from the R help mailing list archive at Nabble.com.

hadley wickham

2009-Aug-24 02:20 UTC

head link

[R] Is there a fast way to do several hundred thousand ANOVA tests?

You might find the article "Computing Thousands of Test Statistics
Simultaneously in R" in
http://stat-computing.org/newsletter/issues/scgn-18-1.pdf helpful.

Hadley

On Sun, Aug 23, 2009 at 7:55 PM, big permie<bigpermie at gmail.com>
wrote:> Dear R users,
>
> I have a matrix a and a classification vector b such that
>
>> str(a)
> num [1:50, 1:800000]
> and
>> str(b)
> Factor w/ 3 levels "cond1","cond2","cond3"
>
> I'd like to do an anova on all 800000 columns and record the F
statistic for
> each test; I currently do this using
>
> f.stat.vec <- numeric(length(a[1,])
>
> for (i in 1:length(a[1,]) {
> ?f.test.frame <- data.frame(nums = a[,i], cond = b)
> ?aov.vox <- aov(nums ~ cond, data = f.test.frame)
> ?f.stat <- summary(aov.vox)[[1]][1,4]
> ?f.stat.vec[i] <- f.stat
> }
>
> The problem is that this code takes about 70 minutes to run.
>
> Is there a faster way to do an anova & record the F stat for each
column?
>
> Any help would be appreciated.
>
> Thanks
> Heath
>
> ? ? ? ?[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
http://had.co.nz/

Charles C. Berry

2009-Aug-24 03:03 UTC

head link

[R] Is there a fast way to do several hundred thousand ANOVA tests?

On Mon, 24 Aug 2009, big permie wrote:
> Dear R users,
>
> I have a matrix a and a classification vector b such that
>
>> str(a)
> num [1:50, 1:800000]
> and
>> str(b)
> Factor w/ 3 levels "cond1","cond2","cond3"
>
> I'd like to do an anova on all 800000 columns and record the F
statistic for
> each test; I currently do this using
>
> f.stat.vec <- numeric(length(a[1,])
>
> for (i in 1:length(a[1,]) {
>  f.test.frame <- data.frame(nums = a[,i], cond = b)
>  aov.vox <- aov(nums ~ cond, data = f.test.frame)
>  f.stat <- summary(aov.vox)[[1]][1,4]
>  f.stat.vec[i] <- f.stat
> }
>
> The problem is that this code takes about 70 minutes to run.
Using lsfit(), my five year old windows XP PC does 100k columns in about 
40 seconds, so I reckon that in 5 minutes it could do 800000:
> x <- factor(sample(1:3,50,repl=T))
> x.mat <- model.matrix( ~x )[, 2:3 ] # drop intercept here
> y <- matrix(rnorm(50*100000),nc=100000)
> system.time({fit <- lsfit( x.mat, y );ls.pr <- ls.print( fit, 
print.it=FALSE )})    user  system elapsed
   39.16    0.58   39.79>
> # check F-statistic:
> summary(as.numeric( ls.pr[[ 'summary' ]][, "F-value" ]))    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  0.0000  0.2932  0.7095  1.0460  1.4270 17.1100>
> # theoretical 1st Qu., Median, 3rd Qu match
>
> qf(c(.25,0.5, 0.75 ), 2, 47 )
[1] 0.2894502 0.7034708 1.4280000>

If it needed to be faster, I would take the part out of ls.print() that 
does just the F-statistic and work on it.

Of course, a newer computer would be a lot faster, too.

HTH,

Chuck
>
> Is there a faster way to do an anova & record the F stat for each
column?
>
> Any help would be appreciated.
>
> Thanks
> Heath
>
> 	[[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

Benilton Carvalho

2009-Aug-24 16:43 UTC

head link

[R] Is there a fast way to do several hundred thousand ANOVA tests?

have you tried:

fits <- lm(a~b)
fstat <- sapply(summary(fits), function(x)
x[["fstatistic"]][["value"]])

it takes 3secs for 100K columns on my machine (running on batt)

b

On Aug 23, 2009, at 9:55 PM, big permie wrote:
> Dear R users,
>
> I have a matrix a and a classification vector b such that
>
>> str(a)
> num [1:50, 1:800000]
> and
>> str(b)
> Factor w/ 3 levels "cond1","cond2","cond3"
>
> I'd like to do an anova on all 800000 columns and record the F  
> statistic for
> each test; I currently do this using
>
> f.stat.vec <- numeric(length(a[1,])
>
> for (i in 1:length(a[1,]) {
>  f.test.frame <- data.frame(nums = a[,i], cond = b)
>  aov.vox <- aov(nums ~ cond, data = f.test.frame)
>  f.stat <- summary(aov.vox)[[1]][1,4]
>  f.stat.vec[i] <- f.stat
> }
>
> The problem is that this code takes about 70 minutes to run.
>
> Is there a faster way to do an anova & record the F stat for each  
> column?
>
> Any help would be appreciated.
>
> Thanks
> Heath
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Seemingly Similar Threads

Search for more reasonably related threads

R help - Aug 2009 - Is there a fast way to do several hundred thousand ANOVA tests?

[R] Is there a fast way to do several hundred thousand ANOVA tests?

[R] Is there a fast way to do several hundred thousand ANOVA tests?

[R] Is there a fast way to do several hundred thousand ANOVA tests?

[R] Is there a fast way to do several hundred thousand ANOVA tests?

[R] Is there a fast way to do several hundred thousand ANOVA tests?

Seemingly Similar Threads