Hello,
I have a question about how to speed up t.test on a large dataset. For
example, I have a table "tab" which looks like:
	a	b	c	d	e 	f	g	h....
1	
2
3
4
5
...
100000
dim(tab) is 100000 x 100
I need to run t.test for each row on two subsets of columns, i.e. to
compare the a, b, d group against the e, f, g group at each row.
subset 1:					
	a	b	d
1	
2
3
4
5
...
100000
subset 2:
	e	f	g
1	
2
3
4
5
...
100000
Running the 100000 t.tests (one per row) for these two subsets takes around
1 min. The problem is that I have around 10000 different combinations of such
subsets, so the total would be 1 min * 10000 = 10000 min if I use a "for" loop
like this:
n1 = 10000   # number of subset combinations
n2 = 100000  # number of rows
for (i1 in 1:n1) {
	# v5 and v6 are vectors containing the variable names for the two subsets
	# (they are different for each i1)
	for (i2 in 1:n2) {
		t.test(tab[i2,v5],tab[i2,v6])$p.value
	}
}
My question: is there a more efficient way to do these computations in a short
period of time? Any packages, like plyr? Maybe direct calculation instead of
using the t.test function?
Thank you.
Have a look at "Computing Thousands of Test Statistics Simultaneously in R"
by Holger Schwender and Tina Müller, in
http://stat-computing.org/newsletter/issues/scgn-18-1.pdf

Hadley

On Mon, Sep 13, 2010 at 4:26 PM, Alexey Ush <ushan26 at yahoo.com> wrote:
> Hello,
>
> I have a question regarding how to speed up the t.test on large dataset.
> [...]

--
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
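To illustrate the "direct calculation" idea raised in the question, here is a minimal sketch that computes all row-wise Welch t-tests for one pair of column subsets with matrix operations instead of calling t.test once per row. It is not taken from the article above. The function name row_welch_p and the simulated table are made up for illustration, and the sketch assumes a numeric table with no missing values.

```r
# Row-wise Welch t-tests via matrix operations (hypothetical helper).
# x: numeric matrix or data frame; g1, g2: column names or indices of the groups.
# Assumes no NA values and at least two columns per group.
row_welch_p <- function(x, g1, g2) {
  a <- as.matrix(x[, g1, drop = FALSE])
  b <- as.matrix(x[, g2, drop = FALSE])
  n1 <- ncol(a)
  n2 <- ncol(b)
  m1 <- rowMeans(a)
  m2 <- rowMeans(b)
  # Unbiased per-row variances; m1 and m2 recycle down the rows of a and b
  v1 <- rowSums((a - m1)^2) / (n1 - 1)
  v2 <- rowSums((b - m2)^2) / (n2 - 1)
  se2 <- v1 / n1 + v2 / n2
  tstat <- (m1 - m2) / sqrt(se2)
  # Welch-Satterthwaite degrees of freedom, as used by t.test by default
  df <- se2^2 / ((v1 / n1)^2 / (n1 - 1) + (v2 / n2)^2 / (n2 - 1))
  2 * pt(-abs(tstat), df)  # two-sided p-values
}

# Small simulated table standing in for "tab" (the real one is 100000 x 100)
set.seed(1)
tab <- matrix(rnorm(1000 * 8), nrow = 1000,
              dimnames = list(NULL, letters[1:8]))
v5 <- c("a", "b", "d")
v6 <- c("e", "f", "g")

p_fast <- row_welch_p(tab, v5, v6)

# Reference: the original one-row-at-a-time approach
p_ref <- sapply(seq_len(nrow(tab)),
                function(i) t.test(tab[i, v5], tab[i, v6])$p.value)
print(all.equal(p_fast, p_ref))
```

For the 10000 subset combinations, the same function could be called once per (v5, v6) pair, so only the outer loop remains.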
See if rowttests is any faster.

library(genefilter)
?rowttests

You have to install Bioconductor. I've used this on large datasets, but I
haven't compared timings.
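As a rough sketch of how the rowttests suggestion might be applied to one pair of subsets: the simulated table and the group labels "g1"/"g2" below are assumptions made for illustration. Note that rowttests performs an equal-variance t-test, whereas t.test defaults to the Welch test, so the comparison uses var.equal = TRUE. The genefilter part only runs if the package is installed.

```r
# Small simulated table standing in for "tab"
set.seed(1)
tab <- matrix(rnorm(1000 * 8), nrow = 1000,
              dimnames = list(NULL, letters[1:8]))
v5 <- c("a", "b", "d")
v6 <- c("e", "f", "g")

# Keep only the columns of the two subsets and label each column's group
sub <- tab[, c(v5, v6)]
grp <- factor(rep(c("g1", "g2"), times = c(length(v5), length(v6))))

# genefilter comes from Bioconductor, e.g. BiocManager::install("genefilter")
if (requireNamespace("genefilter", quietly = TRUE)) {
  res <- genefilter::rowttests(sub, grp)  # one equal-variance t-test per row
  p_row <- res$p.value

  # Reference: per-row t.test with the same equal-variance assumption
  p_ref <- sapply(seq_len(nrow(sub)),
                  function(i) t.test(sub[i, v5], sub[i, v6],
                                     var.equal = TRUE)$p.value)
  print(all.equal(p_row, p_ref))
}
```

If the Welch (unequal-variance) p-values of the default t.test are needed, rowttests is not a drop-in replacement.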