big permie
2009-Aug-24 00:55 UTC
[R] Is there a fast way to do several hundred thousand ANOVA tests?
Dear R users, I have a matrix a and a classification vector b such that> str(a)num [1:50, 1:800000] and> str(b)Factor w/ 3 levels "cond1","cond2","cond3" I'd like to do an anova on all 800000 columns and record the F statistic for each test; I currently do this using f.stat.vec <- numeric(length(a[1,]) for (i in 1:length(a[1,]) { f.test.frame <- data.frame(nums = a[,i], cond = b) aov.vox <- aov(nums ~ cond, data = f.test.frame) f.stat <- summary(aov.vox)[[1]][1,4] f.stat.vec[i] <- f.stat } The problem is that this code takes about 70 minutes to run. Is there a faster way to do an anova & record the F stat for each column? Any help would be appreciated. Thanks Heath [[alternative HTML version deleted]]
glen_b
2009-Aug-24 01:40 UTC
[R] Is there a fast way to do several hundred thousand ANOVA tests?
The usual methods of avoiding loops may provide some speedup (but you've already done the allocation of the full results vector outside the loop, which is probably a major saving) - others may have more detailed advice on that score. Within the loop there are some speedups possible. update() will allow you to avoid constantly recomputing the things based off the design matrix, which doesn't look like it's changing. see ?update If you're just interested in the F-statistic, more direct approaches that avoid all the other calculation done by aov or lm may be substantially faster; you can work directly with the QR or choleski decomposition (done outside the loop/"apply" part), for example. Faster (but less numerically stable) methods (e.g. involving use of a SWEEP operator) exist; if your ANOVA is balanced, then things might be done even more directly. big permie wrote:> > Dear R users, > > I have a matrix a and a classification vector b such that > >> str(a) > num [1:50, 1:800000] > and >> str(b) > Factor w/ 3 levels "cond1","cond2","cond3" > > I'd like to do an anova on all 800000 columns and record the F statistic > for > each test; I currently do this using > > f.stat.vec <- numeric(length(a[1,]) > > for (i in 1:length(a[1,]) { > f.test.frame <- data.frame(nums = a[,i], cond = b) > aov.vox <- aov(nums ~ cond, data = f.test.frame) > f.stat <- summary(aov.vox)[[1]][1,4] > f.stat.vec[i] <- f.stat > } > > The problem is that this code takes about 70 minutes to run. > > Is there a faster way to do an anova & record the F stat for each column? > > Any help would be appreciated. > > Thanks > Heath > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > >-- View this message in context: http://www.nabble.com/Is-there-a-fast-way-to-do-several-hundred-thousand-ANOVA-tests--tp25109056p25109345.html Sent from the R help mailing list archive at Nabble.com.
hadley wickham
2009-Aug-24 02:20 UTC
[R] Is there a fast way to do several hundred thousand ANOVA tests?
You might find the article "Computing Thousands of Test Statistics Simultaneously in R" in http://stat-computing.org/newsletter/issues/scgn-18-1.pdf helpful. Hadley On Sun, Aug 23, 2009 at 7:55 PM, big permie<bigpermie at gmail.com> wrote:> Dear R users, > > I have a matrix a and a classification vector b such that > >> str(a) > num [1:50, 1:800000] > and >> str(b) > Factor w/ 3 levels "cond1","cond2","cond3" > > I'd like to do an anova on all 800000 columns and record the F statistic for > each test; I currently do this using > > f.stat.vec <- numeric(length(a[1,]) > > for (i in 1:length(a[1,]) { > ?f.test.frame <- data.frame(nums = a[,i], cond = b) > ?aov.vox <- aov(nums ~ cond, data = f.test.frame) > ?f.stat <- summary(aov.vox)[[1]][1,4] > ?f.stat.vec[i] <- f.stat > } > > The problem is that this code takes about 70 minutes to run. > > Is there a faster way to do an anova & record the F stat for each column? > > Any help would be appreciated. > > Thanks > Heath > > ? ? ? ?[[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >-- http://had.co.nz/
Charles C. Berry
2009-Aug-24 03:03 UTC
[R] Is there a fast way to do several hundred thousand ANOVA tests?
On Mon, 24 Aug 2009, big permie wrote:> Dear R users, > > I have a matrix a and a classification vector b such that > >> str(a) > num [1:50, 1:800000] > and >> str(b) > Factor w/ 3 levels "cond1","cond2","cond3" > > I'd like to do an anova on all 800000 columns and record the F statistic for > each test; I currently do this using > > f.stat.vec <- numeric(length(a[1,]) > > for (i in 1:length(a[1,]) { > f.test.frame <- data.frame(nums = a[,i], cond = b) > aov.vox <- aov(nums ~ cond, data = f.test.frame) > f.stat <- summary(aov.vox)[[1]][1,4] > f.stat.vec[i] <- f.stat > } > > The problem is that this code takes about 70 minutes to run.Using lsfit(), my five year old windows XP PC does 100k columns in about 40 seconds, so I reckon that in 5 minutes it could do 800000:> x <- factor(sample(1:3,50,repl=T)) > x.mat <- model.matrix( ~x )[, 2:3 ] # drop intercept here > y <- matrix(rnorm(50*100000),nc=100000) > system.time({fit <- lsfit( x.mat, y );ls.pr <- ls.print( fit, print.it=FALSE )})user system elapsed 39.16 0.58 39.79> > # check F-statistic: > summary(as.numeric( ls.pr[[ 'summary' ]][, "F-value" ]))Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0000 0.2932 0.7095 1.0460 1.4270 17.1100> > # theoretical 1st Qu., Median, 3rd Qu match > > qf(c(.25,0.5, 0.75 ), 2, 47 )[1] 0.2894502 0.7034708 1.4280000>If it needed to be faster, I would take the part out of ls.print() that does just the F-statistic and work on it. Of course, a newer computer would be a lot faster, too. HTH, Chuck> > Is there a faster way to do an anova & record the F stat for each column? > > Any help would be appreciated. > > Thanks > Heath > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. >Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:cberry at tajo.ucsd.edu UC San Diego http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
Benilton Carvalho
2009-Aug-24 16:43 UTC
[R] Is there a fast way to do several hundred thousand ANOVA tests?
have you tried: fits <- lm(a~b) fstat <- sapply(summary(fits), function(x) x[["fstatistic"]][["value"]]) it takes 3secs for 100K columns on my machine (running on batt) b On Aug 23, 2009, at 9:55 PM, big permie wrote:> Dear R users, > > I have a matrix a and a classification vector b such that > >> str(a) > num [1:50, 1:800000] > and >> str(b) > Factor w/ 3 levels "cond1","cond2","cond3" > > I'd like to do an anova on all 800000 columns and record the F > statistic for > each test; I currently do this using > > f.stat.vec <- numeric(length(a[1,]) > > for (i in 1:length(a[1,]) { > f.test.frame <- data.frame(nums = a[,i], cond = b) > aov.vox <- aov(nums ~ cond, data = f.test.frame) > f.stat <- summary(aov.vox)[[1]][1,4] > f.stat.vec[i] <- f.stat > } > > The problem is that this code takes about 70 minutes to run. > > Is there a faster way to do an anova & record the F stat for each > column? > > Any help would be appreciated. > > Thanks > Heath > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code.