Henrik Bengtsson
2008-Jul-25 03:10 UTC
[Rd] serialize() via temporary file is heaps faster than doing it directly (on Windows)
Hi,

FYI, I just noticed that on Windows (but not Linux) it is orders of magnitude faster (about 50x in the example below) to serialize() an object to a temporary file and then read it back than to serialize it directly to a raw vector (connection=NULL). This affects, for instance, how fast digest::digest() can provide a checksum.

Example:

  x <- 1:1e7;
  t1 <- system.time(raw1 <- serialize(x, connection=NULL));
  print(t1);
  #    user  system elapsed
  #  174.23  129.35  304.70     ## 5 minutes
  t2 <- system.time(raw2 <- serialize2(x, connection=NULL));
  print(t2);
  #    user  system elapsed
  #    2.19    0.18    5.72     ## 5 seconds
  print(t1/t2);
  #      user    system   elapsed
  #  79.55708 718.61111  53.26923
  stopifnot(identical(raw1, raw2));

where serialize2() serializes to a file and reads the result back:

  serialize2 <- function(object, connection, ...) {
    if (is.null(connection)) {
      # It is faster to serialize to a temporary file and read it back
      pathname <- tempfile();
      con <- file(pathname, open="wb");
      # Clean up the connection and the file even if an error occurs
      on.exit({
        if (!is.null(con))
          close(con);
        if (file.exists(pathname))
          file.remove(pathname);
      });
      base::serialize(object, connection=con, ...);
      close(con);
      con <- NULL;  # signals on.exit() that the connection is already closed
      fileSize <- file.info(pathname)$size;
      readBin(pathname, what="raw", n=fileSize);
    } else {
      base::serialize(object, connection=connection, ...);
    }
  } # serialize2()

The above benchmarking was done in a fresh R v2.7.1 session on WinXP Pro:

> sessionInfo()
R version 2.7.1 Patched (2008-06-27 r46012)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

When I do the same on a Linux machine there is no difference:

> sessionInfo()
R version 2.7.1 (2008-06-23)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

Is there an obvious reason (and an obvious fix) for this?

Cheers

Henrik
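
For anyone who wants to check their own platform without waiting minutes, the comparison can be sketched at a smaller scale. The following is only a rough sketch: serialize_via_file() is a hypothetical helper name (not part of base R or of the post above) that mirrors the temporary-file idea, and the vector length 1e5 is an arbitrary choice made so that the direct call finishes quickly. The timings will differ by machine and R version, so no expected numbers are given.

```r
# Hypothetical helper (name is an assumption): serialize to a temporary
# file and read the bytes back, mirroring the idea behind serialize2().
serialize_via_file <- function(object, ...) {
  pathname <- tempfile()
  # Remove the temporary file when the function exits, even on error
  on.exit(unlink(pathname), add = TRUE)
  con <- file(pathname, open = "wb")
  # Always close the connection, whether or not serialize() succeeds
  tryCatch(serialize(object, connection = con, ...), finally = close(con))
  readBin(pathname, what = "raw", n = file.info(pathname)$size)
}

# Scaled-down input (assumption: 1e5 elements instead of 1e7)
x <- 1:1e5

t_direct <- system.time(raw_direct <- serialize(x, connection = NULL))
t_file   <- system.time(raw_file   <- serialize_via_file(x))

# Both routes should produce the same bytes and round-trip to the input
stopifnot(identical(raw_direct, raw_file))
stopifnot(identical(unserialize(raw_file), x))

# Side-by-side timings; values depend on platform and R version
print(rbind(direct = t_direct, via_file = t_file))
```

If the two rows are close, the platform is probably not affected; increasing the length of x step by step should show whether the direct route degrades faster than the file route.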
Henrik Bengtsson
2008-Aug-29 19:43 UTC
[Rd] serialize() via temporary file is heaps faster than doing it directly (on Windows)
I just want to re-post this thread in case it slipped through the "summer sieve" of someone who might be interested and/or has a real solution beyond my serialize2() patch.

Cheers

Henrik

On Thu, Jul 24, 2008 at 8:10 PM, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
> Hi,
>
> FYI, I just noticed that on Windows (but not Linux) it is orders of
> magnitude faster (about 50x in the example below) to serialize() an
> object to a temporary file and then read it back than to serialize it
> directly to a raw vector (connection=NULL). This affects, for instance,
> how fast digest::digest() can provide a checksum.
>
> Example:
>   x <- 1:1e7;
>   t1 <- system.time(raw1 <- serialize(x, connection=NULL));
>   print(t1);
>   #    user  system elapsed
>   #  174.23  129.35  304.70     ## 5 minutes
>   t2 <- system.time(raw2 <- serialize2(x, connection=NULL));
>   print(t2);
>   #    user  system elapsed
>   #    2.19    0.18    5.72     ## 5 seconds
>   print(t1/t2);
>   #      user    system   elapsed
>   #  79.55708 718.61111  53.26923
>   stopifnot(identical(raw1, raw2));
>
> where serialize2() serializes to a file and reads the result back:
>
>   serialize2 <- function(object, connection, ...) {
>     if (is.null(connection)) {
>       # It is faster to serialize to a temporary file and read it back
>       pathname <- tempfile();
>       con <- file(pathname, open="wb");
>       on.exit({
>         if (!is.null(con))
>           close(con);
>         if (file.exists(pathname))
>           file.remove(pathname);
>       });
>       base::serialize(object, connection=con, ...);
>       close(con);
>       con <- NULL;
>       fileSize <- file.info(pathname)$size;
>       readBin(pathname, what="raw", n=fileSize);
>     } else {
>       base::serialize(object, connection=connection, ...);
>     }
>   } # serialize2()
>
> The above benchmarking was done in a fresh R v2.7.1 session on WinXP Pro:
>
> > sessionInfo()
> R version 2.7.1 Patched (2008-06-27 r46012)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> When I do the same on a Linux machine there is no difference:
>
> > sessionInfo()
> R version 2.7.1 (2008-06-23)
> x86_64-unknown-linux-gnu
>
> locale:
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> Is there an obvious reason (and an obvious fix) for this?
>
> Cheers
>
> Henrik