thr3ads.net - R help - [R] How to optimize this codes ? [Dec 2008]

If this information is useful, please help other people find it:
Share via:

Daren Tan

2008-Dec-04 16:02 UTC

[R] How to optimize this codes ?

How to optimize the for-loop to be reasonably fast for sample.size=100000000 ?
You may want to change sample.size=1000 to have an idea what I am achieving.
 
 
set.seed(143)
A <- matrix(sample(0:1, sample.size, TRUE), ncol=10, dimnames=list(NULL,
LETTERS[1:10]))
 
B <- list()
for(i in 1:10) { 
  B[[i]] <- apply(combn(LETTERS[1:10], i), 2, function(x) {
sum(apply(data.frame(A[,x]), 1, all)) })
}
 
pdf("test.pdf")
boxplot(B, xlab = "Number of predictors", ylab = "Number of
agreements across predictors")
dev.off()

William Dunlap

2008-Dec-04 17:58 UTC

head link

[R] How to optimize this codes ?

[R] How to optimize this codes ?
   Daren Tan daren76 at hotmail.com
   Thu Dec 4 17:02:49 CET 2008

   How to optimize the for-loop to be reasonably fast for
sample.size=100000000 ? 
   You may want to change sample.size=1000 to have an idea what I am
achieving.  
 
   set.seed(143)
   A <- matrix(sample(0:1, sample.size, TRUE), ncol=10,
dimnames=list(NULL, LETTERS[1:10]))
 
   B <- list()
   for(i in 1:10) { 
     B[[i]] <- apply(combn(LETTERS[1:10], i), 2, function(x) {
sum(apply(data.frame(A[,x]), 1, all)) })
   }

Instead computing all(A[row,x]) each row of the matrix by looping
over the rows you could loop over the columns.  That is generally
quicker if there are many more rows than columns.  S+ and R have
functions called pmin and pmax that do this for min and max, but
no pany or pall functions.  In your 0/1 case all is the same as min
so you can replace
     apply(data.frame(A[,x]), 1, all)
with
     sum(do.call("pmin", unname(A[,x,drop=FALSE])))
after first converting A to a data.frame before starting to compute B.
(The do.call/unname combo is a hack to pass all the columns of a
data.frame
as separate arguments to a function.  If pmin took a list that wouldn't
be necessary.)

When you do timings, looking at one size of problem often isn't helpful,
as it doesn't tell you how the time depends on the size of the input.
The
relationship often is not linear.  E.g., I put your method in a function
called computeB0 and my modification of it into computeB1 and wrote a
makeA
that makes an A matrix with a given sample.size.  Here are the times, it
looks
like 1000 is below the linear range for this example:

sample.size       time:computeB0   time:computeB1
       1000                 5.36             0.98
      10000                36.82             2.49
     100000               381.34            18.97

and here are the functions used

makeA <-
function(sample.size){
   set.seed(143);
   A<-matrix(sample(0:1,size=sample.size,replace=TRUE), ncol=10,
dimnames=list(NULL, LETTERS[1:10]));
   A
}
computeB0 <-
function(A){B<-list()
  for(i in 1:10) {
    B[[i]] <- apply(combn(LETTERS[1:10], i), 2, function(x) {
sum(apply(data.frame(A[,x]), 1, all)) })
  }
  B
}
computeB1 <-
function(A){
  A<-as.data.frame(A)
  B<-list()
  for(i in 1:10) {
    B[[i]] <- apply(combn(LETTERS[1:10], i), 2,
                  FUN=function(x) { sum(do.call("pmin",
unname(A[,x,drop=FALSE]))) }
            )
  }
  B
}

You could probably same more time by restructuring this as a recursive
function,
still operating on columns, that traversed the binary tree of column
inclusion/exclusion.
I just wanted to point out the advantage of looping over columns instead
of looping over
rows.

sample.size=1e8 is pretty big for a 32-bit machine.  You may have to do
things
like make A a data.frame (with integer columns) from the start to keep
the memory
requirements down.

Putting the call to data.frame() inside a nested set of apply loop is a
waste
of time.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

William Dunlap

2008-Dec-04 19:49 UTC

head link

[R] How to optimize this codes ?

> -----Original Message-----
> From: William Dunlap 
> Sent: Thursday, December 04, 2008 9:59 AM
> To: 'daren76 at hotmail.com'
> Cc: 'R help'
> Subject: Re: [R] How to optimize this codes ?
> 
>    [R] How to optimize this codes ?
>    Daren Tan daren76 at hotmail.com
>    Thu Dec 4 17:02:49 CET 2008
> 
>    How to optimize the for-loop to be reasonably fast for 
> sample.size=100000000 ? 
>    You may want to change sample.size=1000 to have an idea 
> what I am achieving.  
>  
>    set.seed(143)
>    A <- matrix(sample(0:1, sample.size, TRUE), ncol=10, 
> dimnames=list(NULL, LETTERS[1:10]))
>  
>    B <- list()
>    for(i in 1:10) { 
>      B[[i]] <- apply(combn(LETTERS[1:10], i), 2, function(x) 
> { sum(apply(data.frame(A[,x]), 1, all)) })
>    }
> 
> Instead computing all(A[row,x]) each row of the matrix by looping
> over the rows you could loop over the columns.  That is generally
> quicker if there are many more rows than columns.  S+ and R have
> functions called pmin and pmax that do this for min and max, but
> no pany or pall functions.  In your 0/1 case all is the same as min
> so you can replace
>      apply(data.frame(A[,x]), 1, all)
> with
>      sum(do.call("pmin", unname(A[,x,drop=FALSE])))
> after first converting A to a data.frame before starting to compute B.
> (The do.call/unname combo is a hack to pass all the columns 
> of a data.frame
> as separate arguments to a function.  If pmin took a list 
> that wouldn't
> be necessary.)
> 
> When you do timings, looking at one size of problem often 
> isn't helpful,
> as it doesn't tell you how the time depends on the size of 
> the input.  The
> relationship often is not linear.  E.g., I put your method in 
> a function
> called computeB0 and my modification of it into computeB1 and 
> wrote a makeA
> that makes an A matrix with a given sample.size.  Here are 
> the times, it looks
> like 1000 is below the linear range for this example:
> 
> sample.size       time:computeB0   time:computeB1
>        1000                 5.36             0.98
>       10000                36.82             2.49
>      100000               381.34            18.97
> 
> and here are the functions used
> 
> makeA <-
> function(sample.size){
>    set.seed(143);
>    A<-matrix(sample(0:1,size=sample.size,replace=TRUE), 
> ncol=10, dimnames=list(NULL, LETTERS[1:10]));
>    A
> }
> computeB0 <-
> function(A){B<-list()
>   for(i in 1:10) {
>     B[[i]] <- apply(combn(LETTERS[1:10], i), 2, function(x) { 
> sum(apply(data.frame(A[,x]), 1, all)) })
>   }
>   B
> }
> computeB1 <-
> function(A){
>   A<-as.data.frame(A)
>   B<-list()
>   for(i in 1:10) {
>     B[[i]] <- apply(combn(LETTERS[1:10], i), 2,
>                   FUN=function(x) { sum(do.call("pmin", 
> unname(A[,x,drop=FALSE]))) }
>             )
>   }
>   B
> }
> 
> You could probably same more time by restructuring this as a 
> recursive function,
> still operating on columns, that traversed the binary tree of 
> column inclusion/exclusion.
> I just wanted to point out the advantage of looping over 
> columns instead of looping over
> rows.
A hastily written recursive version is:

computeB3 <-
function(A) {
   B <- rep(list(integer(0)), ncol(A))
   storage.mode(A) <- "logical"
   recurse <- function(thisCol = 1, includedCols = rep(NA, ncol(A)),
allSoFar = rep(TRUE, nrow(A))) {
      if (thisCol < ncol(A)) # skip this column
          Recall(thisCol+1, replace(includedCols, thisCol, FALSE),
allSoFar = allSoFar)
      else if (thisCol > ncol(A))
          return()
      includedCols[thisCol] <- TRUE
      allSoFar <- allSoFar & A[,thisCol]
      nIncludedCols <- sum(includedCols[1:thisCol])
      # cat(thisCol, ":", nIncludedCols, ":",
includedCols[1:thisCol],
sum(allSoFar), "\n")
      # Note B<<- in next line uses lexical scoping to change
computeB3:B.  Does not work in S+.
      B[[nIncludedCols]][length(B[[nIncludedCols]])+1] <<- sum(allSoFar)
      Recall(thisCol+1, includedCols, allSoFar = allSoFar)
   }
   recurse()
   B
}

The elements of B[[i]], for each i, are in a different order than they
are in
the other functions, but the histograms are the same.

I added its time to the above table:

sample.size       time:computeB0   time:computeB1    time:computeB3
       1000                 5.36             0.98              0.12    
      10000                36.82             2.49              0.25
     100000               381.34            18.97              1.62
    1000000                    ?           290.21             16.36

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

Maybe Matching Threads

Search for more seemingly similar threads

R help - Dec 2008 - How to optimize this codes ?

[R] How to optimize this codes ?

[R] How to optimize this codes ?

[R] How to optimize this codes ?

Maybe Matching Threads