Alexander Belikoff
2009-Dec-04 21:39 UTC
[R] simple reshape of a large data frame (reshape() runs out of memory)
Hello everyone, I'm having a problem performing reshape() on a large data frame. The operation is fairly trivial but it makes R run out of memory. The data frame has the following structure: ID DATE1 DATE2 VALTYPE VALUE 'abcd1233' 2009-11-12 2009-12-23 'TYPE1' 123.45 ... VALTYPE is a string and is a factor with only 2 values (say TYPE1 and TYPE2). I need to transform it into the following data frame ("wide" transpose) based on common ID and DATEs: ID DATE1 DATE2 VALUE.TYPE1 VALUE.TYPE2 'abcd1233' 2009-11-12 2009-12-23 123.45 NA ... This is fairly easily achieved by reshape(): tbl2 <- reshape(tbl, direction = "wide", idvar = c("ID", "DATE1", "DATE2"), timevar = "VALTYPE"); This works very well on toy data frames however it fails miserably on a production dataset. The latter has more than 4,500,000 observations (although about 70% of VALUEs are NA). When running on a production data frame, I eventually get: Error: cannot allocate vector of size 4.8 Gb The machine is an Intel-based Linux workstation with 4Gb of RAM. Loading the data (from a compressed Rdata file) into a fresh R process makes it grow to about 250Mb which clearly leaves a lot of space for reshaping. So far, I managed to deal with the issue by performing a manual transpose by splitting/merging on VALTYPE: sp <- by(tbl[c(1,2,3,5)], tbl$VALTYPE, function(x) x); tbl <- merge(sp[["TYPE1"]], sp[["TYPE2"]], by = c("ID", "DATE1", "DATE2"), all = TRUE, sort = TRUE); ..but this is not as elegant, let alone inflexible, should we add a new kind of VALTYPE. So I would appreciate help with the following questions: 1. Am I doing something wrong or is reshape() hugely suboptimal in this particular case? Given that the data set is already sorted by ID, DATE1, DATE2, I preforming this transformation using a small Python/Perl/AWK script that wouldn't even require loading the entire dataset instead operating on a stream of data 2. Is there a better way to deal with this problem? Again, I can come up with all kinds of Python/Perl patchwork but it would be nice to have R do the job, especially considering that the problem is not that difficult. Thanks in advance, -- Alexander P.S. If needed, I'll concoct a test dataset demonstrating the problem, in case someone is willing to debug the issue. [[alternative HTML version deleted]]