Martinez de Salinas, Jorge
2015-Mar-17 22:09 UTC
[Rd] Reduce memory peak when serializing to raw vectors
Hi,

I've been doing some tests using serialize() to a raw vector:

    df <- data.frame(runif(50e6,1,10))
    ser <- serialize(df,NULL)

In this example the data frame and the serialized raw vector occupy ~400MB each, for a total of ~800MB. However, the memory peak during serialize() is ~1.2GB:

    $ cat /proc/15155/status | grep Vm
    ...
    VmHWM: 1207792 kB
    VmRSS: 817272 kB

We work with very large data frames, and in many cases this is killing R with an "out of memory" error.

This is the relevant code in R 3.1.3, in src/main/serialize.c:2494:

    InitMemOutPStream(&out, &mbs, type, version, hook, fun);
    R_Serialize(object, &out);
    val = CloseMemOutPStream(&out);

The serialized object is stored in a buffer pointed to by out.data. Then, in CloseMemOutPStream(), R copies the whole buffer into a newly allocated SEXP object (the raw vector that holds the final result):

    PROTECT(val = allocVector(RAWSXP, mb->count));
    memcpy(RAW(val), mb->buf, mb->count);
    free_mem_buffer(mb);
    UNPROTECT(1);

Until free_mem_buffer() is called, the process is using ~1.2GB (the original data frame + the serialization buffer + the final serialized raw vector).

One possible solution would be to allocate a buffer for the final raw vector and store the serialization result directly into that buffer. This would bring the memory peak down from ~1.2GB to ~800MB.

Thanks,
-Jorge
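
The proposal above (writing the serialized bytes straight into the final result instead of copying them at close time) can be illustrated with a minimal standalone C sketch. This is not R's code and not a patch for serialize.c: raw_vec, direct_stream and the stream_* functions are hypothetical names invented for the illustration, and plain malloc/realloc stand in for R's allocVector() and garbage collector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a raw vector: a small header followed by the
   payload. R's real RAWSXP is allocated with allocVector() and is managed
   by the garbage collector; this struct only mimics the layout idea. */
typedef struct {
    size_t length;          /* number of payload bytes in use */
    unsigned char data[];   /* payload (flexible array member) */
} raw_vec;

/* Hypothetical output stream that writes directly into the result. */
typedef struct {
    raw_vec *vec;           /* result under construction */
    size_t capacity;        /* payload bytes currently allocated */
} direct_stream;

/* Allocate the result up front. Returns 0 on success, -1 on failure. */
int stream_init(direct_stream *s, size_t initial_capacity)
{
    s->vec = malloc(sizeof(raw_vec) + initial_capacity);
    if (s->vec == NULL) return -1;
    s->vec->length = 0;
    s->capacity = initial_capacity;
    return 0;
}

/* Append n bytes, growing the result geometrically when it is full. */
int stream_write(direct_stream *s, const void *src, size_t n)
{
    assert(s->vec != NULL);
    size_t needed = s->vec->length + n;
    if (needed > s->capacity) {
        size_t new_cap = s->capacity ? s->capacity : 64;
        while (new_cap < needed) new_cap *= 2;
        raw_vec *grown = realloc(s->vec, sizeof(raw_vec) + new_cap);
        if (grown == NULL) return -1;   /* old buffer is still valid */
        s->vec = grown;
        s->capacity = new_cap;
    }
    memcpy(s->vec->data + s->vec->length, src, n);
    s->vec->length = needed;
    return 0;
}

/* Close the stream by handing over the buffer itself: there is no second
   allocation of the full size and no memcpy of the whole payload. */
raw_vec *stream_close(direct_stream *s)
{
    raw_vec *result = s->vec;
    /* Trim unused capacity; keep the untrimmed block if that fails. */
    raw_vec *trimmed = realloc(result, sizeof(raw_vec) + result->length);
    if (trimmed != NULL) result = trimmed;
    s->vec = NULL;
    s->capacity = 0;
    return result;   /* caller owns it and releases it with free() */
}
```

A caller would run stream_init(), any number of stream_write() calls, then stream_close(), and finally free() the returned vector. Two caveats apply to the sketch: realloc() may itself move the block while growing, so a transient copy can still occur during growth, and doubling the capacity can over-allocate before the final trim. A real change to R would also have to keep the growing vector protected from the garbage collector, which this sketch does not attempt to model.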