Ashish Kulkarni
2007-Jan-25 11:32 UTC
[Rd] serialize() takes too long when serializing to a raw vector
Hello,

R version 2.4.1 (2006-12-18)
i386-pc-mingw32

Calling serialize() with a NULL connection serializes it to a raw vector. However, when the object to be serialized is large, it takes a very long time:

> system.time( serialize(matrix(0, 1000, 1000), NULL) )
[1] 38.25 40.73 81.54 NA NA

> system.time( serialize(matrix(0, 2000, 2000), NULL) )
[1] 609.72 664.75 1318.57 NA NA

I was using this in Rmpi, where a clustered call returned a large matrix. However, serializing to a file or sockets is very fast for the very same matrix -- hence I wrote this function which runs much faster:

.mpi.quick.serialize <- function (object)
{
    fname <- tempfile("Rmpi")
    stream <- file(fname, "wb")
    on.exit({
        close(stream)
        file.remove(fname)
    })
    serialize(object, stream)
    close(stream)
    size <- file.info(fname)$size
    stream <- file(fname, "rb")
    return(readBin(stream, "raw", n = size))
}

> system.time( .mpi.quick.serialize(matrix(0, 1000, 1000) ) )
[1] 0.2500000000000000 0.0499999999999545 0.3000000000001819
[4] NA NA

> system.time( .mpi.quick.serialize(matrix(0, 2000, 2000) ) )
[1] 1.059999999999945 0.220000000000027 1.289999999999964
[4] NA NA

Does anyone have an idea why the performance difference is so large? Also, I was wondering if there is a better way -- the above solution feels like a quick fix rather than a correct approach.

Regards,
ashish
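A workaround of this kind is only useful if the bytes it returns can be turned back into the original object. The following is a minimal sketch of such a round-trip check. The helper name serialize_via_tempfile is hypothetical and is not the function from the post; it uses the same idea of serializing to a temporary file and reading the bytes back, and it assumes a current R where readBin() accepts a file name.

```r
# Hypothetical helper (not the posted .mpi.quick.serialize): serialize to a
# temporary file, then read the bytes back as a raw vector.
serialize_via_tempfile <- function(object) {
    fname <- tempfile("ser")
    on.exit(unlink(fname), add = TRUE)      # always remove the temp file
    con <- file(fname, "wb")
    # close the write connection even if serialize() fails
    tryCatch(serialize(object, con), finally = close(con))
    readBin(fname, "raw", n = file.info(fname)$size)
}

m <- matrix(0, 100, 100)
bytes <- serialize_via_tempfile(m)

# The raw vector should deserialize to an object identical to the input.
stopifnot(is.raw(bytes), identical(unserialize(bytes), m))
```

The check uses a small matrix so that it runs quickly; it says nothing about timing, only that the file-based route preserves the object.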
Hin-Tak Leung
2007-Jan-25 13:03 UTC
[Rd] serialize() takes too long when serializing to a raw vector
Ashish Kulkarni wrote:
> Hello,
>
> R version 2.4.1 (2006-12-18)
> i386-pc-mingw32
>
> Calling serialize() with a NULL connection serializes it to a raw vector. However, when the object to be serialized is large, it takes a very long time:
>
>> system.time( serialize(matrix(0, 1000, 1000), NULL) )
> [1] 38.25 40.73 81.54 NA NA
>
>> system.time( serialize(matrix(0, 2000, 2000), NULL) )
> [1] 609.72 664.75 1318.57 NA NA

<snipped>

>> system.time( .mpi.quick.serialize(matrix(0, 1000, 1000) ) )
> [1] 0.2500000000000000 0.0499999999999545 0.3000000000001819
> [4] NA NA
>
>> system.time( .mpi.quick.serialize(matrix(0, 2000, 2000) ) )
> [1] 1.059999999999945 0.220000000000027 1.289999999999964
> [4] NA NA
>
> Does anyone have an idea why the performance difference is so
> large? Also, I was wondering if there is a better way -- the
> above solution feels like a quick fix rather than a correct
> approach.

It might be interesting to get some details on your hardware. On my box, native Linux seems to be a little slower than your quick.serialize times:

> system.time( serialize(matrix(0, 1000, 1000), NULL) )
[1] 0.372 0.288 0.692 0.000 0.000
> system.time( serialize(matrix(0, 2000, 2000), NULL) )
[1] 1.237 1.195 2.501 0.000 0.000

Running R 2.4.1 for Windows under Wine (same box) is a good deal slower, but nowhere near as slow as yours:

> system.time( serialize(matrix(0, 1000, 1000), NULL) )
[1] 0.00 0.00 6.08 NA NA
> system.time( serialize(matrix(0, 2000, 2000), NULL) )
[1] 0.01 0.01 78.00 NA NA

Since you mentioned that you are using Rmpi, is it possible that you are calling a different serialize() than base::serialize altogether?
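One way to check whether a session resolves serialize() to something other than base::serialize is sketched below. It assumes the utils package is attached, as it is by default, and the variable names are illustrative only.

```r
# List every attached package or environment that defines serialize(),
# in search-path order; the first entry is the one a bare call would use.
where <- find("serialize")
print(where)

# Compare the function found by a bare name with the one in base.
is_base <- identical(serialize, base::serialize)
print(is_base)
```

If another attached package masked serialize(), it would be listed ahead of "package:base" and the comparison would come out FALSE.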
Duncan Murdoch
2007-Jan-25 14:21 UTC
[Rd] serialize() takes too long when serializing to a raw vector
On 1/25/2007 6:32 AM, Ashish Kulkarni wrote:
> Hello,
>
> R version 2.4.1 (2006-12-18)
> i386-pc-mingw32
>
> Calling serialize() with a NULL connection serializes it to a raw vector. However, when the object to be serialized is large, it takes a very long time:
>
>> system.time( serialize(matrix(0, 1000, 1000), NULL) )
> [1] 38.25 40.73 81.54 NA NA
>
>> system.time( serialize(matrix(0, 2000, 2000), NULL) )
> [1] 609.72 664.75 1318.57 NA NA
>
> I was using this in Rmpi, where a clustered call returned a large matrix. However, serializing to a file or sockets is very fast for the very same matrix -- hence I wrote this function which runs much faster:
>
> .mpi.quick.serialize <- function (object)
> {
>     fname <- tempfile("Rmpi")
>     stream <- file(fname, "wb")
>     on.exit({
>         close(stream)
>         file.remove(fname)
>     })
>     serialize(object, stream)
>     close(stream)
>     size <- file.info(fname)$size
>     stream <- file(fname, "rb")
>     return(readBin(stream, "raw", n = size))
> }
>
>> system.time( .mpi.quick.serialize(matrix(0, 1000, 1000) ) )
> [1] 0.2500000000000000 0.0499999999999545 0.3000000000001819
> [4] NA NA
>
>> system.time( .mpi.quick.serialize(matrix(0, 2000, 2000) ) )
> [1] 1.059999999999945 0.220000000000027 1.289999999999964
> [4] NA NA
>
> Does anyone have an idea why the performance difference is so
> large? Also, I was wondering if there is a better way -- the
> above solution feels like a quick fix rather than a correct
> approach.

It looks like a bug in the serialize code: it's reallocating the output buffer far too often, and that's slowing things down. I'll confirm that's what's going on and fix it.

Duncan Murdoch
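To see why reallocating an output buffer too often can be so expensive, the sketch below counts how many bytes get copied under two growth policies. It is an illustration only, not R's actual C implementation: the function bytes_copied, the 8192-byte chunk size and the two policies are assumptions chosen so that 1000 chunks roughly match the 8 MB of matrix(0, 1000, 1000).

```r
# Count the bytes copied while a buffer grows to hold n_chunks writes of
# chunk_size bytes each. Every reallocation copies the bytes already stored.
#   grow = "exact":  enlarge to exactly the size needed (reallocate on every write)
#   grow = "double": at least double the capacity whenever it runs out
bytes_copied <- function(n_chunks, chunk_size, grow = c("exact", "double")) {
    grow <- match.arg(grow)
    capacity <- 0
    used <- 0
    copied <- 0
    for (i in seq_len(n_chunks)) {
        needed <- used + chunk_size
        if (needed > capacity) {
            copied <- copied + used     # existing contents move to the new buffer
            capacity <- if (grow == "exact") needed else max(needed, 2 * capacity)
        }
        used <- needed
    }
    copied
}

exact <- bytes_copied(1000, 8192, "exact")
doubling <- bytes_copied(1000, 8192, "double")
print(c(exact = exact, doubling = doubling, ratio = exact / doubling))
```

With exact-size growth the copying grows quadratically with the number of writes, while doubling keeps the total copied below twice the final size, which is why a geometric growth policy is the usual remedy for this kind of slowdown.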