Martinez de Salinas, Jorge
2015-Mar-17 17:37 UTC
[Rd] Reduce memory peak when serializing to raw vectors
Hi,

I've been doing some tests using serialize() to a raw vector:

    df <- data.frame(runif(50e6, 1, 10))
    ser <- serialize(df, NULL)

In this example the data frame and the serialized raw vector occupy ~400MB each (50e6 doubles at 8 bytes apiece), for a total of ~800MB. However, the memory peak during serialize() is ~1.2GB:

    $ cat /proc/15155/status | grep Vm
    ...
    VmHWM:  1207792 kB
    VmRSS:   817272 kB

We work with very large data frames, and in many cases this is killing R with an "out of memory" error.

This is the relevant code in R 3.1.3, in src/main/serialize.c:2494:

    InitMemOutPStream(&out, &mbs, type, version, hook, fun);
    R_Serialize(object, &out);
    val = CloseMemOutPStream(&out);

The serialized object is stored in a buffer pointed to by out.data. Then, in CloseMemOutPStream(), R copies the whole buffer into a newly allocated SEXP object (the raw vector that holds the final result):

    PROTECT(val = allocVector(RAWSXP, mb->count));
    memcpy(RAW(val), mb->buf, mb->count);
    free_mem_buffer(mb);
    UNPROTECT(1);

Before calling free_mem_buffer() the process is using ~1.2GB (the original data frame + the serialization buffer + the final serialized raw vector).

One possible solution would be to allocate the buffer for the final raw vector up front and store the serialization result directly into it. This would bring the memory peak down from ~1.2GB to ~800MB.

Thanks,
-Jorge
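For concreteness, here is a minimal sketch of what that proposal amounts to at the C level, assuming the total serialized size were somehow known before serialization starts (exactly the assumption questioned in the reply below). The names rawbuf_t and OutBytesRaw are invented for illustration, and the sketch uses R's public R_outpstream_st callback interface rather than the internal mem-buffer code:

    /* Hypothetical sketch: write serialized bytes directly into a
     * pre-allocated raw vector, avoiding both the intermediate malloc'd
     * buffer and the final memcpy().  Assumes the total size is known
     * up front.  rawbuf_t and OutBytesRaw are illustrative names, not
     * part of src/main/serialize.c. */

    #include <string.h>
    #include <Rinternals.h>

    typedef struct {
        SEXP vec;        /* pre-allocated RAWSXP that will hold the result */
        R_xlen_t count;  /* bytes written so far */
    } rawbuf_t;

    static void OutBytesRaw(R_outpstream_t stream, void *buf, int length)
    {
        rawbuf_t *rb = (rawbuf_t *) stream->data;
        /* copy straight into the result vector; no second buffer and no
           extra copy at close time */
        memcpy(RAW(rb->vec) + rb->count, buf, length);
        rb->count += length;
    }

With callbacks like this, the only large allocations during serialize() would be the data frame (~400MB) and the result vector (~400MB), matching the ~800MB target.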
Simon Urbanek
2015-Mar-17 21:03 UTC
[Rd] Reduce memory peak when serializing to raw vectors
Jorge,

What you propose is not possible because the size of the output is unknown; that's why a dynamically growing PStream buffer is used - it cannot be pre-allocated.

Cheers,
Simon
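To make the constraint concrete, here is an illustrative sketch (not the actual serialize.c code) of the kind of dynamically growing buffer being described: bytes are appended into a reallocated heap buffer because the final length is only known once R_Serialize() has finished, and only then can the raw vector be allocated and filled by the memcpy() quoted above.

    /* Illustrative growing buffer, in the spirit of the mem-buffer used
     * by serialize(); names and growth policy are for illustration only. */

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        unsigned char *buf;
        size_t size;     /* current allocation */
        size_t count;    /* bytes actually written */
    } membuf;

    static void membuf_write(membuf *mb, const void *data, size_t len)
    {
        if (mb->count + len > mb->size) {
            size_t newsize = mb->size ? mb->size : 1024;
            while (newsize < mb->count + len)
                newsize *= 2;                     /* grow geometrically */
            mb->buf = realloc(mb->buf, newsize);  /* error handling omitted */
            mb->size = newsize;
        }
        memcpy(mb->buf + mb->count, data, len);
        mb->count += len;
    }

Geometric growth keeps appends cheap, but the exact result length is unavailable until the stream is closed, which is why a RAWSXP of the right size cannot simply be allocated before serialization starts.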
Michael Lawrence
2015-Mar-17 21:48 UTC
[Rd] Reduce memory peak when serializing to raw vectors
Presumably one could stream over the data twice, the first pass to get the size, without storing the data. Slower but more memory efficient, unless I'm missing something.

Michael
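A hedged sketch of that two-pass idea, using R's public R_InitOutPStream()/R_Serialize() API: the first pass uses counting-only callbacks so nothing is buffered, and the resulting count then sizes the raw vector for a second, real pass. The names CountChar, CountBytes and serialize_two_pass are invented for illustration; they are not existing R functions.

    #include <Rinternals.h>

    static void CountChar(R_outpstream_t stream, int c)
    {
        *(R_xlen_t *) stream->data += 1;
    }

    static void CountBytes(R_outpstream_t stream, void *buf, int length)
    {
        *(R_xlen_t *) stream->data += length;   /* pass 1: count only, store nothing */
    }

    static SEXP serialize_two_pass(SEXP object, int version)
    {
        R_xlen_t total = 0;
        struct R_outpstream_st out;

        R_InitOutPStream(&out, (R_pstream_data_t) &total,
                         R_pstream_xdr_format, version,
                         CountChar, CountBytes, NULL, R_NilValue);
        R_Serialize(object, &out);               /* pass 1: measure the size */

        SEXP val = PROTECT(allocVector(RAWSXP, total));
        /* Pass 2 would call R_Serialize() again with callbacks that write
         * directly into RAW(val), as in the earlier sketch, so the only
         * large allocations are the object itself and the result vector. */
        UNPROTECT(1);
        return val;
    }

The trade-off Michael notes is visible here: the object is serialized twice, roughly doubling CPU time in exchange for dropping the intermediate buffer from the memory peak.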