Alexander Belikoff
2009-Dec-04 21:39 UTC
[R] simple reshape of a large data frame (reshape() runs out of memory)
Hello everyone,
I'm having a problem performing reshape() on a large data frame. The
operation is fairly trivial but it makes R run out of memory.
The data frame has the following structure:
ID DATE1 DATE2 VALTYPE
VALUE
'abcd1233' 2009-11-12 2009-12-23 'TYPE1'
123.45
...
VALTYPE is a string and is a factor with only 2 values (say TYPE1 and
TYPE2). I need to transform it into the following data frame ("wide"
transpose) based on common ID and DATEs:
ID DATE1 DATE2 VALUE.TYPE1
VALUE.TYPE2
'abcd1233' 2009-11-12 2009-12-23
123.45 NA
...
This is fairly easily achieved by reshape():
tbl2 <- reshape(tbl, direction = "wide", idvar =
c("ID", "DATE1",
"DATE2"), timevar = "VALTYPE");
This works very well on toy data frames however it fails miserably on a
production dataset. The latter has more than 4,500,000 observations
(although about 70% of VALUEs are NA). When running on a production data
frame, I eventually get:
Error: cannot allocate vector of size 4.8 Gb
The machine is an Intel-based Linux workstation with 4Gb of RAM. Loading the
data (from a compressed Rdata file) into a fresh R process makes it grow to
about 250Mb which clearly leaves a lot of space for reshaping.
So far, I managed to deal with the issue by performing a manual transpose by
splitting/merging on VALTYPE:
sp <- by(tbl[c(1,2,3,5)], tbl$VALTYPE, function(x) x);
tbl <- merge(sp[["TYPE1"]], sp[["TYPE2"]], by =
c("ID", "DATE1",
"DATE2"), all = TRUE, sort = TRUE);
..but this is not as elegant, let alone inflexible, should we add a new kind
of VALTYPE.
So I would appreciate help with the following questions:
1. Am I doing something wrong or is reshape() hugely suboptimal in this
particular case? Given that the data set is already sorted by ID, DATE1,
DATE2, I preforming this transformation using a small Python/Perl/AWK script
that wouldn't even require loading the entire dataset instead operating on a
stream of data
2. Is there a better way to deal with this problem? Again, I can come up
with all kinds of Python/Perl patchwork but it would be nice to have R do
the job, especially considering that the problem is not that difficult.
Thanks in advance,
-- Alexander
P.S. If needed, I'll concoct a test dataset demonstrating the problem, in
case someone is willing to debug the issue.
[[alternative HTML version deleted]]
