thr3ads.net - R help - [R] Which is more efficient? [Aug 2011]

Matt Curcio

2011-Aug-05 03:19 UTC

[R] Which is more efficient?

Greetings all,
I am curious to know if either of these two sets of code is more efficient?

Example1:
 ## t-test ##
colA <- temp [ , j ]
colB <- temp [ , k ]
ttr <- t.test ( colA, colB, var.equal=TRUE)
tt_pvalue [ i ] <- ttr$p.value

or
Example2:
tt_pvalue [ i ] <- t.test ( temp[ , j ], temp[ , k ], var.equal=TRUE)
-------------
I have three loops: i, j and k.
One runs over all <i> files in a directory. The other two pick out
column <j> and compare it, by means of a t-test, to column <k> in each
of the files.
---------------
for ( i in 1:num_files ) {
   temp <- read.table ( files_to_test [ i ], header=TRUE, sep="\t")
   num_cols <- ncol ( temp )
   ## Define Columns To Compare ##
   for ( j in 2 : num_cols ) {
      for ( k in 3 : num_cols ) {
          ## t-test ##
          colA <- temp [ , j ]
          colB <- temp [ , k ]
          ttr <- t.test ( colA, colB, var.equal=TRUE)
          tt_pvalue [ i ] <- ttr$p.value
      }
   }
}
--------------------------------
I am a novice writer of code and am interested to hear if there are
any (dis)advantages to one way or the other.
M


Matt Curcio
M: 401-316-5358
E: matt.curcio.ri at gmail.com

R. Michael Weylandt

2011-Aug-05 03:56 UTC

You can study this yourself using the system.time() utility: just wrap
system.time() around any block of code and R will time it for you.
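
As a rough sketch of that timing comparison (note that the function name is all lower case, system.time()): the data frame temp below is made up to stand in for one of the files, and n_rep is an arbitrary repeat count chosen only so the elapsed times are large enough to read. The actual numbers will vary from machine to machine.

```r
# Hypothetical stand-in for one of the tab-delimited files in the question
set.seed(1)
temp <- data.frame(id = 1:1000, a = rnorm(1000), b = rnorm(1000))
n_rep <- 200   # repeat each form so the elapsed time is measurable

# Example 1 style: intermediate objects, then extract the p-value
time1 <- system.time(
  for (r in seq_len(n_rep)) {
    colA <- temp[, 2]
    colB <- temp[, 3]
    ttr  <- t.test(colA, colB, var.equal = TRUE)
    p1   <- ttr$p.value
  }
)

# Example 2 style (with $p.value added): everything in one expression
time2 <- system.time(
  for (r in seq_len(n_rep)) {
    p2 <- t.test(temp[, 2], temp[, 3], var.equal = TRUE)$p.value
  }
)

# Elapsed seconds for each form
c(example1 = time1[["elapsed"]], example2 = time2[["elapsed"]])
```

Both forms compute the same p-value, so any difference in the timings comes only from how the code is written.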

Offhand, I'd guess example2 may be ever so slightly quicker since it doesn't
have to create colA and colB, but not to a degree that would be noticeable
for reasonably sized data. More importantly, you should probably notice that
the examples give different output: one puts just the p.value of the t.test
in tt_pvalue while the other puts the entire t.test object. You probably
meant

Example2:
tt_pvalue [ i ] <- t.test ( temp[ , j ], temp[ , k ], var.equal=TRUE)$p.value
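
To see the difference between the two forms, here is a small sketch on made-up data (the data frame temp and the seed are placeholders, not the poster's files). It compares what t.test() returns with what $p.value extracts from it.

```r
set.seed(42)
temp <- data.frame(id = 1:10, a = rnorm(10), b = rnorm(10))  # made-up data

# The full return value of t.test() versus just its p-value component
whole  <- t.test(temp[, 2], temp[, 3], var.equal = TRUE)
p_only <- t.test(temp[, 2], temp[, 3], var.equal = TRUE)$p.value

class(whole)     # "htest", a list holding the statistic, p-value, etc.
names(whole)     # the components available via $
length(p_only)   # a single number, which fits in one slot of tt_pvalue
```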

If you are a beginner, I'd strongly suggest you wait the extra 3.2
milliseconds and use code like example one: it will be easier to debug.

In your second block of code, you wind up t-testing a column against itself
many times (whenever j equals k), and you overwrite most of the p.values you
store, since every (j, k) pair in file i writes to the same slot
tt_pvalue[i]. Is this actual code or are you more interested in how
something would be vectorized? If the first, write back and I'll talk to you
about storing the results and doing the tests in a logical manner.
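
One possible way to avoid both problems for a single file, offered only as a sketch: enumerate each unordered column pair once with combn() and keep one p-value per pair. The data frame temp is made up, and the assumption that column 1 is an identifier (so the tests start at column 2) is inferred from the loop bounds in the question.

```r
set.seed(7)
# Hypothetical stand-in for one file: column 1 is an ID, columns 2-5 are data
temp <- data.frame(id = 1:12, w = rnorm(12), x = rnorm(12),
                   y = rnorm(12), z = rnorm(12))

# Every unordered pair (j, k) with j < k among columns 2..ncol(temp)
pairs <- t(combn(2:ncol(temp), 2))

# One p-value per pair, so nothing is overwritten
pvals <- apply(pairs, 1, function(p) {
  t.test(temp[, p[1]], temp[, p[2]], var.equal = TRUE)$p.value
})

# Keep each p-value next to the names of the columns it compares
results <- data.frame(colA = names(temp)[pairs[, 1]],
                      colB = names(temp)[pairs[, 2]],
                      p_value = pvals)
results
```

With four data columns this gives six rows, and no column is ever compared with itself.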

If you are only interested from a coding efficiency point of view, the first
for loop over all the files is probably best replaced by

L =  lapply(files_to_test, read.table, header=TRUE, sep="\t")

This will create a list object L containing all of the file information:
List objects are basically R's way of sticking any combination of objects
together in one big "super-object" that can contain anything. (I'm sure the
code experts will want to correct me, but for a beginner I think that gives
sufficient intuition.)
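
A self-contained sketch of that lapply() call: the two small files written to a temporary directory are invented purely so the example can run, and files_to_test is built from them rather than from a real data directory.

```r
# Write two small tab-delimited files to a temporary directory (made-up data)
tmp_dir <- tempfile("ttest_files")
dir.create(tmp_dir)
for (n in 1:2) {
  write.table(data.frame(id = 1:5, a = rnorm(5), b = rnorm(5)),
              file = file.path(tmp_dir, paste0("file", n, ".txt")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}
files_to_test <- list.files(tmp_dir, full.names = TRUE)

# Read every file in one call; the result is a list of data frames
L <- lapply(files_to_test, read.table, header = TRUE, sep = "\t")

length(L)        # one element per file
class(L[[1]])    # each element is a data frame
L[[2]][, "a"]    # column a of the second file

unlink(tmp_dir, recursive = TRUE)   # clean up the temporary files
```

Double brackets, as in L[[2]], pull out one element of the list; single brackets would return a shorter list instead.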

Once you have everything in R you have a wealth of opportunities depending
on what you want to do: there's an open thread started by J. Bouldin on how
to do things columnwise over different objects most efficiently in R right
now that will hopefully get some good answers. Let me know if there's a
specific thing you want to wind up doing and I'll try to give you a hand: if
it's just a theoretical interest, keep an eye on the other thread.

Hope this helps,

Michael Weylandt


Dennis Murphy

2011-Aug-05 07:07 UTC

Hi:

Your question about efficiency does not seem well-posed to me.
Efficient relative to what criterion?
Rather than address your question directly, I'll show how different
situations that could arise in the general context of your
problem can be handled.

One of the first rules in R programming is to learn the concepts of
vectorization and indexing. This saves a lot of code down the line. R
is not C(++) or Java, and it shouldn't be programmed as though it
were. As a result, iterative approaches to problem solving in R are
usually, but not always, inefficient. R has many vectorized functions
which should be used whenever possible. Usually, the apply family of
functions or one of the summarization packages (notably data.table,
doBy and plyr, although there are others) can be exploited to
repeatedly apply a function to different subsets of data. Consider
three different situations below in which one might want to apply a
t-test. Only one uses iteration. I'm using the plyr package because it
is most flexible in terms of the types of input and output objects it
can process.
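
As a small illustration of what "vectorized" means here, the sketch below computes column means three ways on a made-up matrix m: with an explicit loop, with the vectorized colMeans(), and with apply(). The matrix and seed are placeholders.

```r
set.seed(3)
m <- matrix(rnorm(50), nrow = 10)   # made-up 10 x 5 matrix

# Iterative version: fill a result vector one column at a time
means_loop <- numeric(ncol(m))
for (j in seq_len(ncol(m))) {
  means_loop[j] <- mean(m[, j])
}

# Vectorized version: one call does the work for every column
means_vec <- colMeans(m)

# apply-family version: the same function applied to each column (margin 2)
means_apply <- apply(m, 2, mean)

all.equal(means_loop, means_vec)
all.equal(means_vec, means_apply)
```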

Let's start by manufacturing some matrix data:

library(plyr)   # provides adply() and mdply(), used below

## function to generate a matrix
mgen <- function() matrix(rnorm(50), nrow = 10)
## use replicate() to generate an array
marr <- replicate(4, mgen())   # a 10 x 5 x 4 array
marr

# A matrix of column indices to use in t.test()
tcols <- matrix(c(1, 2, 1, 3, 1, 4, 1, 5), ncol = 2, byrow = TRUE)
colnames(tcols) <- c('i', 'j')
tcols

# ------------------------
# Situation 1: multiple matrices, test the same pair
#              of columns in each, in this case 2 and 4.

# The input argument m is a matrix. A data frame is
# returned because that's what the adply() function in
# the plyr package expects as output (a = array input,
# d = data frame output)
tfun1 <- function(m) {
   v <- t.test(m[, 2], m[, 4], var.equal = TRUE)
   data.frame(tstat = v$statistic, pval = v$p.value)
  }

# adply takes the input array marr, iterates over the third index
# and applies tfun1 to each marginal matrix
res1 <- adply(marr, 3, tfun1)
res1

# ------------------------
# Situation 2: one matrix, test multiple pairs of columns

mat <- mgen()    # generate a single matrix
tfun2 <- function(i, j) {
    v <- t.test(mat[, i], mat[, j], var.equal = TRUE)
    data.frame(tstat = v$statistic, pval = v$p.value)
  }

# mdply() takes the matrix of column indices as its first
# argument. Notice that tfun2 was written so that its
# arguments are i and j, the column names of tcols.
# This is required, and the order matters. For each
# row of tcols, the function tfun2 is applied to the
# matrix mat.
res2 <- mdply(tcols, tfun2)
res2

# -------------------
# Situation 3: n matrices, different pairs of columns
#              tested in each

# The idea is to perform a t-test on different pairs of
# columns in each submatrix of marr.

# The simplest thing to do in this situation is to
# iterate, although there is probably some clever way to
# do this using nested apply family calls. The reason for
# iteration is that we want to operate on the same
# relevant index of *both* marr and tcols. It's possible to
# use mapply() for this task, but that would take more
# explanation and this is long-winded enough.

outmat <- matrix(NA, nrow = nrow(tcols), ncol = 4)
for(k in seq_len(nrow(tcols))) {
   mat <- marr[, , k]      # take k-th submatrix of marr
   cols <- tcols[k, ]       # take k-th row of tcols
   v <- t.test(mat[, cols[1]], mat[, cols[2]], var.equal = TRUE)
   outmat[k, ] <- c(cols[1], cols[2], v$statistic, v$p.value)
  }
colnames(outmat) <- c('col1', 'col2', 'tstat', 'pval')
outmat
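
The mapply() alternative mentioned in the comments for Situation 3 might look roughly like the sketch below. It regenerates marr and tcols so that it runs on its own; the seed and the helper name one_test are illustrative choices, not part of the code above.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
marr  <- replicate(4, matrix(rnorm(50), nrow = 10))   # 10 x 5 x 4 array
tcols <- matrix(c(1, 2, 1, 3, 1, 4, 1, 5), ncol = 2, byrow = TRUE)

# One t-test per (submatrix, column pair); k indexes both marr and tcols
one_test <- function(k, c1, c2) {
  v <- t.test(marr[, c1, k], marr[, c2, k], var.equal = TRUE)
  c(col1 = c1, col2 = c2, tstat = unname(v$statistic), pval = v$p.value)
}

# mapply() walks the three argument vectors in parallel and returns one
# column per call; t() turns that into one row per test
res3 <- t(mapply(one_test, seq_len(nrow(tcols)), tcols[, 1], tcols[, 2]))
res3
```

The result has the same layout as outmat: one row per submatrix, with columns col1, col2, tstat and pval.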

Notice that the type of input matters, so the way in which the data
are arranged has much to do with the way you program in R, especially
with the apply family of functions and their offshoots in different
packages. The basic programming strategy is to write a utility
function that works for a generic subset of the input data, and then
use one of the **ply() functions or functions in the apply family to
map the function to different data subsets.

HTH,
Dennis
