Hello,

I have a question regarding how to speed up t.test on a large dataset. For example, I have a table "tab" which looks like:

        a       b       c       d       e       f       g       h ...
1
2
3
4
5
...
100000

dim(tab) is 100000 x 100.

I need to do a t.test for each row on two subsets of the columns, i.e. to compare the group a, b, d against the group e, f, g at each row:

subset 1: columns a, b, d (rows 1 ... 100000)
subset 2: columns e, f, g (rows 1 ... 100000)

The 100000 t.tests for one such pair of subsets take around 1 min. The problem is that I have around 10000 different combinations of such subsets, so 1 min * 10000 = 10000 min if I use a "for" loop like this:

n1 <- 10000   # number of subset combinations
n2 <- 100000  # number of rows
for (i1 in 1:n1) {
    for (i2 in 1:n2) {
        # v5 and v6 are vectors containing the variable (column) names for
        # the two subsets; they are different for each outer iteration
        t.test(tab[i2, v5], tab[i2, v6])$p.value
    }
}

My question: is there a more efficient way to do these computations in a short period of time? Any packages, like plyr? Maybe direct calculations instead of using the t.test function?

Thank you.
Have a look at "Computing Thousands of Test Statistics Simultaneously in R" by Holger Schwender and Tina Müller, in
http://stat-computing.org/newsletter/issues/scgn-18-1.pdf

Hadley

On Mon, Sep 13, 2010 at 4:26 PM, Alexey Ush <ushan26 at yahoo.com> wrote:
> [original question quoted in full above; snipped]

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
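In the spirit of that article, here is a minimal vectorized sketch, assuming (as in the original post) that "tab" is a numeric matrix and v5/v6 are vectors of column names. It computes Welch two-sample t p-values for all 100000 rows of one subset pair at once in base R, replacing the per-row t.test() calls:

row_welch_p <- function(tab, v5, v6) {
    x <- tab[, v5, drop = FALSE]
    y <- tab[, v6, drop = FALSE]
    nx <- ncol(x); ny <- ncol(y)
    mx <- rowMeans(x); my <- rowMeans(y)
    vx <- rowSums((x - mx)^2) / (nx - 1)   # row-wise sample variances
    vy <- rowSums((y - my)^2) / (ny - 1)
    se2 <- vx / nx + vy / ny               # squared std. error of the difference
    tstat <- (mx - my) / sqrt(se2)
    # Welch-Satterthwaite degrees of freedom
    df <- se2^2 / ((vx / nx)^2 / (nx - 1) + (vy / ny)^2 / (ny - 1))
    2 * pt(-abs(tstat), df)                # two-sided p-values
}

# p-values for one combination of subsets:
# pvals <- row_welch_p(tab, v5, v6)

Since all the work is done by rowMeans()/rowSums() on whole matrices, one subset pair should take a fraction of a second rather than a minute, and looping this function over the 10000 subset combinations becomes the cheap part.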
See if rowttests is any faster:

library(genefilter)
?rowttests

You have to install it from Bioconductor. I've used this on large datasets, but I haven't compared timings.

On Mon, Sep 13, 2010 at 4:26 PM, Alexey Ush <ushan26 at yahoo.com> wrote:
> [original question quoted in full above; snipped]
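For example, a sketch using the same "tab", v5, and v6 as above: rowttests() takes one matrix plus a factor saying which columns belong to which group, and runs all rows in compiled code.

## one-time install from Bioconductor, e.g.:
## source("http://bioconductor.org/biocLite.R"); biocLite("genefilter")
library(genefilter)

m   <- as.matrix(tab[, c(v5, v6)])   # columns for one subset pair
fac <- factor(rep(c("grp1", "grp2"), c(length(v5), length(v6))))
res <- rowttests(m, fac)             # one t-test per row
head(res$p.value)

One caveat: if I remember right, rowttests() computes the classical pooled-variance t-test, whereas t.test() defaults to the Welch (unequal-variance) version, so the p-values won't match t.test() exactly.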