gauravbhatti
2010-Feb-12 16:47 UTC
[R] paired wilcox test on each row of a large dataframe
Hi,

I have to calculate the V statistic for each row of a large data frame (~28000 rows). I cannot use the multtest package for a paired Wilcoxon test. I have been using a for loop, which is slow. Is there a way to speed up the computation with another method, such as apply or tapply?

My data set looks like this:

           11573_MB   11911_MB   11966_MB   12091_MB  12168_MB   12420_MB ......
cg00000292 0.62123125 0.82663502 0.74687013 0.61774927 0.7337809 0.73203721
cg00002426 0.63631315 0.64408750 0.61975158 0.72500713 0.5753110 0.65146526
cg00003994 0.05035499 0.05189776 0.05882848 0.11198073 0.1313330 0.03883439
cg00005847 0.13936423 0.14967690 0.31874454 0.15876243 0.1111117 0.15070058
cg00006414 0.09059770 0.09915681 0.09952658 0.13955982 0.1757718 0.07566312
cg00007981 0.05622769 0.04143790 0.07167018 0.08051046 0.1378107 0.05439999
......
           11573_CB   11911_CB   11966_CB   12091_CB   12168_CB   12420_CB
cg00000292 0.83059018 0.65396035 0.74519819 0.76007659 0.70335691 0.7857631
cg00002426 0.61450928 0.59160923 0.69857198 0.73028911 0.71808719 0.6741295
cg00003994 0.04223668 0.07910444 0.05416764 0.06156407 0.06381321 0.0643354
cg00005847 0.13897704 0.06407313 0.20449931 0.15683154 0.18936196 0.1610695
cg00006414 0.06520757 0.12243180 0.11380134 0.10957321 0.15759518 0.1236715
cg00007981 0.04789030 0.11699024 0.07143036 0.05996888 0.10829510 0.1069037
...

There are 12 columns and 27000 rows. I have to perform a paired test on each row (columns 1:6 vs 7:12) and store the p-value and statistic in two columns. What's the fastest way?

Gaurav Bhatti

--
View this message in context: http://n4.nabble.com/paired-wilcox-test-on-each-row-of-a-large-dataframe-tp1478750p1478750.html
Sent from the R help mailing list archive at Nabble.com.
gauravbhatti <gaurav15984 <at> hotmail.com> writes:

> Hi,
> I have to calculate the V statistic for each row of a large data frame
> (~28000 rows). I cannot use the multtest package for a paired Wilcoxon
> test. I have been using a for loop, which is slow. Is there a way to
> speed up the computation with another method, such as apply or tapply?

Using a for loop is fine here (and basically unavoidable). If you need it to be faster, use a matrix rather than a data.frame (i.e. make a matrix containing columns 1-12, which are all numeric and so do not need to be in a data frame).

Below are versions using apply, sapply and an explicit for loop. There's not much difference in speed. But the last one, in which the data is in a data.frame with rownames, is much slower.

> d <- matrix(rnorm(12000), nrow=1000)
> system.time(ans <- apply(d, 1, function(row)
    unlist(wilcox.test(row[1:6], row[7:12])[c("p.value", "statistic")])))
   user  system elapsed
  2.660   0.064   2.730
> system.time(ans2 <- sapply(1:nrow(d), function(i)
    unlist(wilcox.test(d[i, 1:6], d[i, 7:12])[c("p.value", "statistic")])))
   user  system elapsed
  2.480   0.108   2.583
> system.time({ans3 <- matrix(nrow=nrow(d), ncol=2)
    for (i in 1:nrow(d)) {
      ans3[i, ] <- unlist(wilcox.test(d[i, 1:6],
                                      d[i, 7:12])[c("p.value", "statistic")])
    }})
   user  system elapsed
  2.504   0.000   2.503
> d <- as.data.frame(d)
> rownames(d) <- paste(letters, 1:nrow(d))
> system.time(ans2 <- sapply(1:nrow(d), function(i)
    unlist(wilcox.test(as.numeric(d[i, 1:6]),
                       as.numeric(d[i, 7:12]))[c("p.value", "statistic")])))
   user  system elapsed
  5.673   0.212   5.885

Dan
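One caveat worth noting: wilcox.test defaults to the unpaired (Mann-Whitney) test, whose statistic is reported as W; to get the V statistic of the paired signed-rank test that the question asks for, you would pass paired = TRUE. A minimal sketch along the lines of the apply version above (dat is a hypothetical name for the 27000 x 12 numeric matrix, with columns 1:6 and 7:12 assumed to be matched pairs in the same order):

```r
## Sketch only: dat is assumed to be a numeric matrix whose columns
## 1:6 and 7:12 hold the two members of each pair, in matching order.
res <- t(apply(dat, 1, function(row) {
  wt <- wilcox.test(row[1:6], row[7:12], paired = TRUE)
  c(V = unname(wt$statistic), p.value = wt$p.value)
}))
## res has one row per probe and two columns, "V" and "p.value",
## which can be cbind()-ed back onto the original matrix.
```

With only 6 pairs per row, exact p-values are used and ties will trigger warnings; the timing comparisons above should carry over unchanged, since paired = TRUE does not alter the looping structure.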