thr3ads.net - R help - [R] performance gap between R 1.7.1 and 1.8.0 [Nov 2003]

If this information is useful, please help other people find it:
Share via:

Liaw, Andy

2003-Nov-29 03:19 UTC

[R] performance gap between R 1.7.1 and 1.8.0

Dear R-help,

A colleague of mine was running some code on two of our boxes, and noticed a
rather large difference in running time.  We've so far isolated the problem
to the difference between R 1.7.1 and 1.8.0, but not more than that.  The
exact same code took 933.5 seconds in 1.7.1, and 3594.4 seconds in 1.8.1, on
the same box.

Basically, the code calls boot() to bootstrap fitting mixture models by
calling flexmix() (in the flexmix package) with intercept-only models.
The code needs to be strip down further, but I thought some of you might be
able to tell what's wrong from the info we have so far:

I ran R profiling on the code, under both R 1.7.1 and 1.8.1 (but the
performance gap is there in 1.8.0 already).  Sorted by "self" part of
the
output, the top 10 lines are:

R 1.7.1
   %       self        %       total
 self     seconds    total    seconds    name
 10.75    100.36     10.75    100.36     ".Fortran"
  8.71     81.32     34.27    319.88     "lm.wfit"
  5.43     50.66     81.19    757.88     "FLXfit"
  4.30     40.16      4.30     40.16     "^"
  4.26     39.80      4.26     39.80     "=="
  4.15     38.74      4.99     46.62     "names"
  3.90     36.38     20.57    191.98     "initialize"
  3.51     32.80      5.37     50.14     "dnorm"
  2.29     21.34      4.94     46.14     "hclass"
  1.83     17.10      5.85     54.64     "inherits"

R 1.8.1:
   %       self        %       total
 self     seconds    total    seconds    name
  6.24    224.26     11.69    420.32     "paste"
  5.93    213.04     13.21    474.76     "read.dcf"
  4.37    157.24      5.17    185.92     "names"
  4.18    150.42      5.53    198.66     "exists"
  3.71    133.52     14.66    527.00     "lapply"
  3.15    113.16      4.43    159.32     "names<-"
  2.98    107.26      2.98    107.26     ".Fortran"
  2.57     92.46      5.75    206.82     "seq"
  2.54     91.42      3.18    114.14     "seq.default"
  2.37     85.24      8.93    320.96     "lm.wfit"

The ".Fortran" call took about the same amount of time, as does
"lm.wfit",
so that's a bit comforting.  (The code also fits the same model using
mclust, and that's probably where the .Fortran call is from.)  What puzzled
me are:

-  Several functions took much longer in 1.8.1; e.g., 
          1.7.1  1.8.1 
  "paste"    28    224
  "names"    38    157
  "names<-"  15    113
  "exists"    4    150

-  The presence of "read.dcf" in 1.8.1.  Where could this be from?

Any clues as to why we're seeing this?

Best,
Andy

Andy Liaw, PhD
Biometrics Research      PO Box 2000, RY33-300     
Merck Research Labs           Rahway, NJ 07065
mailto:andy_liaw at merck.com        732-594-0820

Martin Maechler

2003-Nov-29 11:58 UTC

head link

[R] performance gap between R 1.7.1 and 1.8.0

>>>>> "AndyL" == Liaw, Andy <andy_liaw at
merck.com>
>>>>>     on Fri, 28 Nov 2003 22:19:27 -0500 writes:
    AndyL> Dear R-help, A colleague of mine was running some
    AndyL> code on two of our boxes, and noticed a rather large
    AndyL> difference in running time.  We've so far isolated
    AndyL> the problem to the difference between R 1.7.1 and
    AndyL> 1.8.0, but not more than that.  The exact same code
    AndyL> took 933.5 seconds in 1.7.1, and 3594.4 seconds in
    AndyL> 1.8.1, on the same box.

I understand your concern.  I do believe that it must be some
peculiarity of your (or the non-standard packages you used
code).

Some wild guessing: 
Looking at your profiling table, and seeing  read.dcf()  makes
me think about library() -- which does call read.dcf() only when
called on a package that's not yet attached --- and the
bioconductor extra package management -- which for me makes only
the first time (!) attachment of bioconductor packages
relatively slow.  
I still have no idea how this could influence bootstrapping,
unless you detach() and attach() packages many times.
In any case, it might depend on the exact list (and order!?) of
packages present in search() when calling your code.

A very first step of diagnosis might be to activate
  trace(read.dcf)
  trace(library)
  options(verbose = TRUE)

A step further might be to patch read.dcf such that it prints
info 
       if(getOption("verbose")) { <<print what I am trying to
read>> }

Martin

Liaw, Andy

2003-Dec-04 18:20 UTC

head link

[R] performance gap between R 1.7.1 and 1.8.0

> From: Martin Maechler [mailto:maechler at stat.math.ethz.ch] 
[...]> A very first step of diagnosis might be to activate
>   trace(read.dcf)
>   trace(library)
>   options(verbose = TRUE)
> 
> A step further might be to patch read.dcf such that it prints
> info 
>        if(getOption("verbose")) { <<print what I am trying
to read>> }
> 
> Martin
To refresh people's memory, the problem is that the identical code took more
than twice as long in R-1.8.x than in R-1.7.x.  I've managed to strip the
code down to the following and still see the same problem.  It basically
bootstraps mixture models to groups of data.  It uses boot() from the boot
package and flexmix() with intercept only, in the flexmix package.

============================================fitmix <- function(x,
verbosefit=FALSE) {
  bpunlist <- function(x) {
    unlisted <- numeric()
    for(i in 1:length(x at components)) {
      unlisted <- c(unlisted, unlist(parameters(x, component=i)))
    }
    unlisted
  }
  unifit <-  flexmix(x ~ 1, k = 1, control=list(verbose=0))
  unlist(unlist(list(G = 1, proportion = unifit at prior,
                     param=matrix(bpunlist(unifit), nrow=2),
                     bic=BIC(unifit), loglik=unifit at logLik)))
}

bootmix <- function(data, i) {
  d <- data
  d.grps <- split.data.frame(d[i,], d[i, "id"])
  comps <- vector("list", length=length(d.grps))
  for (j in seq(along=comps)) {
    thefit <- try(fitmix(d.grps[[j]][,1], verbosefit=FALSE))
    comps[[j]] <- if (inherits(thefit, "try-error")) rep(NA, 48)
else thefit
  }
  return(unlist(comps))
}

## The actual commands:
require(flexmix)
require(boot)
set.seed(76421)
x <- rnorm(3e3)
id <- gl(10, 300)
mydf <- data.frame(x, id)
Rprof(filename=paste("Rprof.out", system("hostname -s",
int=TRUE), sep="."))
system.time(res <- boot(mydf, bootmix, strata=mydf[,"id"], R=5))
Rprof(NULL)
summaryRprof(filename=paste("Rprof.out", system("hostname
-s", int=TRUE),
               sep="."))$by.self[1:20,]
============================================
I ran this on a Xeon 2.4GHz running Mandrake 9.0, R compiled from source.

This took 2 seconds in R-1.7.1, with the output:

                  self.time self.pct total.time total.pct
"FUN"                  0.18        9       0.24        12
"names"                0.16        8       0.16         8
"initialize"           0.10        5       0.80        40
"apply"                0.08        4       0.30        15
"[.data.frame"         0.06        3       0.22        11
"inherits"             0.06        3       0.18         9
"is.null"              0.06        3       0.06         3
"match"                0.06        3       0.22        11
"seq.default"          0.06        3       0.06         3
"structure"            0.06        3       0.08         4
"any"                  0.04        2       0.14         7
"el"                   0.04        2       0.04         2
"FLXfit"               0.04        2       1.18        59
"lapply"               0.04        2       0.22        11
"lm.wfit"              0.04        2       0.16         8
"names<-.default"      0.04        2       0.04         2
"slot<-"               0.04        2       0.18         9
"<"                    0.02        1       0.02         1
"|"                    0.02        1       0.02         1
":"                    0.02        1       0.02         1

... and took over 7 seconds in R-1.8.1, with the output:

                self.time self.pct total.time total.pct
"paste"              0.52      6.7       1.10      14.1
"read.dcf"           0.38      4.9       0.96      12.3
"exists"             0.36      4.6       0.50       6.4
"lapply"             0.36      4.6       1.36      17.4
"names"              0.34      4.4       0.40       5.1
"names<-"            0.32      4.1       0.44       5.6
"inherits"           0.22      2.8       0.34       4.4
"=="                 0.20      2.6       0.20       2.6
".Call"              0.20      2.6       0.20       2.6
"seq"                0.18      2.3       0.50       6.4
"seq.default"        0.18      2.3       0.28       3.6
"dynGet"             0.16      2.1       0.46       5.9
"topenv"             0.16      2.1       0.18       2.3
"unique"             0.16      2.1       1.00      12.8
"apply"              0.14      1.8       0.20       2.6
"as.list"            0.14      1.8       0.18       2.3
".find.package"      0.14      1.8       1.88      24.1
"sapply"             0.14      1.8       1.54      19.7
"any"                0.12      1.5       1.68      21.5
"initialize"         0.12      1.5       4.98      63.8

I then did the following:
> trace(read.dcf, recover)
[1] "read.dcf"> res <- boot(mydf, bootmix, strata=mydf[,"id"], R=1)Tracing read.dcf(file.path(package.lib, package, "DESCRIPTION"),
fields "Namespace") on entry

Enter a frame number, or 0 to exit   
1:boot(mydf, bootmix, strata = mydf[, "id"], R = 1) 
2:statistic(data, original, ...) 
3:try(fitmix(d.grps[[j]][, 1], verbosefit = FALSE)) 
4:fitmix(d.grps[[j]][, 1], verbosefit = FALSE) 
5:flexmix(x ~ 1, k = 1, control = list(verbose = 0)) 
6:flexmix(x ~ 1, k = 1, control = list(verbose = 0)) 
7:flexmix(formula = formula, data = data, k = k, cluster = cluster, model list(F
8:FLXglm() 
9:new("FLXmodel", weighted = TRUE, formula = formula, name =
paste("FLXglm",
famil 
10:initialize(value, ...) 
11:initialize(value, ...) 
12:getClass(Class) 
13:.requirePackage(package) 
14:trySilent(loadNamespace(package)) 
15:eval.parent(call) 
16:eval(expr, p) 
17:eval(expr, envir, enclos) 
18:try(loadNamespace(package)) 
19:loadNamespace(package) 
20:packageHasNamespace(package, package.lib) 
21:read.dcf(file.path(package.lib, package, "DESCRIPTION"), fields
"Namespace")

I'm really out of ideas at this point.  Can anyone see the problem given the
above, or at least tell me what to try next?

Best,
Andy

Liaw, Andy

2003-Dec-04 19:04 UTC

head link

[R] performance gap between R 1.7.1 and 1.8.0

> From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk] 
> "Liaw, Andy" <andy_liaw at merck.com> writes:
> 
> > I'm really out of ideas at this point.  Can anyone see the 
> problem given the
> > above, or at least tell me what to try next?
> 
> Hmm. Looking into the actual Rprof.out file is sometimes revealing.
> Could you provide a link to it (I assume it might be a bit large for
> email).
The files aren't that big, but it certainly isn't pretty (not intended
for
human consumption, I suppose).

I don't have a place to post it, so I'm attaching it as .tgz.  Hopefully
it
at least makes it to some...

Best,
Andy

 > -- 
>    O__  ---- Peter Dalgaard             Blegdamsvej 3  
>   c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
>  (*) \(*) -- University of Copenhagen   Denmark      Ph: 
> (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: 
> (+45) 35327907
>

Maybe Matching Threads

Search for more maybe matching threads

R help - Nov 2003 - performance gap between R 1.7.1 and 1.8.0

[R] performance gap between R 1.7.1 and 1.8.0

[R] performance gap between R 1.7.1 and 1.8.0

[R] performance gap between R 1.7.1 and 1.8.0

[R] performance gap between R 1.7.1 and 1.8.0

Maybe Matching Threads