Hi,

I am developing a tool for converting a large data frame stored in an uncompressed binary (XDR) RData file to a delimited text file. The data frame is too large to load() and extract rows from on a typical PC. I'm looking to parse through the file and extract individual entries without loading the whole thing into memory.

In terms of some C source functions, instead of doing RestoreToEnv(R_Unserialize(connection)), which is essentially what load() does, I'm looking to get the documentation I would need to build a function "SaveToCSV()" so that I could do SaveToCSV(R_Unserialize(connection)).

Where can I get documentation on the RData file format? Does a spec document exist?

See details below.

Thanks,
Ian

Ian Cook | Advanced Micro Devices, Inc. | ian.cook at amd.com

-------------------------

Additional details:

I've browsed through the relevant source code (saveload.c, serialize.c) for ideas.

Here's a demo of the problem I'm looking to solve:

# create a sample data frame
ds <- data.frame(row1=c(1,2,3),row2=c('a','b','c'))
# save into an uncompressed binary R dataset
save(ds,file="ds.rdata",compress=FALSE)
rm(ds)

# Then load() can be simulated like this:

# create and open a file connection
con <- file("ds.rdata",open="rb")
# read the first 5 characters
readChar(con,5)
# unserialize the remainder and restore to the environment
ds <- unserialize(con,NULL)[["ds"]]
close(con)

But this takes up too much memory if the data set is too big. I can read in the file character-by-character, i.e. using readChar(), but it's obvious that the file format is not trivial. readChar(con,10000) for this demo yields:

RDX2\nX\n\0\0\0\002\0\002\004\001\0\002\003\0\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\002ds\0\0\003\023\0\0\0\002\0\0\0\016\0\0\0\003??\0\0\0\0\0\0@\0\0\0\0\0\0\0@\b\0\0\0\0\0\0\0\0\003\r\0\0\0\003\0\0\0\001\0\0\0\002\0\0\0\003\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\006levels\0\0\0\020\0\0\0\003\0\0\0\t\0\0\0\001a\0\0\0\t\0\0\0\001b\0\0\0\t\0\0\0\001c\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\005class\0\0\0\020\0\0\0\001\0\0\0\t\0\0\0\006factor\0\0\0?\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\005names\0\0\0\020\0\0\0\002\0\0\0\t\0\0\0\004row1\0\0\0\t\0\0\0\004row2\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\trow.names\0\0\0\r\0\0\0\002?\0\0\0\0\0\0\003\0\0\004\002\0\0\003?\0\0\0\020\0\0\0\001\0\0\0\t\0\0\0\ndata.frame\0\0\0?\0\0\0?

This would be parse-able if I had a file spec. Thanks.
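As a rough illustration of how the first bytes of that dump break down, here is a minimal Python sketch. The byte string is copied from the start of the dump. The interpretation of the three integers (serialization version, writing R version, minimum reading R version) is my reading of serialize.c rather than anything from a spec, and the function names are made up for this example. It only handles the uncompressed "RDX2" case.

```python
import io
import struct


def unpack_r_version(packed):
    """Split a packed R version number into (major, minor, patch)."""
    return (packed >> 16, (packed >> 8) & 0xFF, packed & 0xFF)


def read_rdata_header(stream):
    """Read the magic line and the serialization header (hypothetical helper)."""
    magic = stream.read(5)
    if magic != b"RDX2\n":
        raise ValueError(f"not an uncompressed RDX2 file: {magic!r}")
    fmt = stream.read(2)
    if fmt != b"X\n":
        raise ValueError(f"unexpected format marker: {fmt!r}")
    # Three big-endian 32-bit integers follow the format marker.
    version, writer, min_reader = struct.unpack(">3i", stream.read(12))
    return {
        "version": version,
        "writer": unpack_r_version(writer),          # assumed meaning
        "min_reader": unpack_r_version(min_reader),  # assumed meaning
    }


# First 19 bytes of the demo dump, written out as a Python bytes literal.
header_bytes = b"RDX2\nX\n\x00\x00\x00\x02\x00\x02\x04\x01\x00\x02\x03\x00"
header = read_rdata_header(io.BytesIO(header_bytes))
print(header)
# {'version': 2, 'writer': (2, 4, 1), 'min_reader': (2, 3, 0)}
```

The five bytes consumed by readChar(con,5) in the demo are the "RDX2\n" magic line, so in this layout the serialized data proper starts at byte 19 (5 + 2 + 12).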
Hi

Cook, Ian wrote:
> Where can I get documentation on the RData file format? Does a spec
> document exist?
>
> This would be parse-able if I had a file spec. Thanks.

See the "R Internals" manual
http://cran.r-project.org/doc/manuals/R-ints.html

You might also find page 5 of R News 7/1 useful for exploring the format
http://cran.r-project.org/doc/Rnews/Rnews_2007-1.pdf

Paul

> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

--
Dr Paul Murrell
Department of Statistics
The University of Auckland
Private Bag 92019
Auckland
New Zealand
64 9 3737599 x85392
paul at stat.auckland.ac.nz
http://www.stat.auckland.ac.nz/~paul/
I was going to write 'Use the source, Luke', but it seems that you have already found the relevant source files. Some time ago I wrote a Python-based Rdata writer and a reader using just that information, and I am not aware of any file spec, so I know those two files are sufficient.

For what you want to do, I think you'll have to write some fairly substantial code to process the Rdata as just an XDR stream (as my Python scripts do, using Python's built-in xdrlib), because as far as I know the API you are after is not exposed. You'll have to, and you can, cut and paste a substantial part of saveload.c and serialize.c for that matter, of course.

I think my Python-based Rdata reader would do most of what you want, except that it dumps a sort of general human-readable ASCII text format rather than CSV. (It was written mostly for diagnostic purposes: I was 'hand-crafting' R objects in C and saving them as Rdata, then reading them back to tell me what, if anything, was wrong with them.)

My suggestion would be to use a language you are comfortable with that comes with an XDR library, and just do it by hand...

Cook, Ian wrote:
> I am developing a tool for converting a large data frame stored in an uncompressed binary (XDR) RData file to a delimited text file. The data frame is too large to load() and extract rows from on a typical PC. I'm looking to parse through the file and extract individual entries without loading the whole thing into memory.
>
> Where can I get documentation on the RData file format? Does a spec document exist?
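Processing the file as a plain XDR stream, as suggested above, mostly comes down to reading big-endian integers and doubles in sequence. Here is a minimal sketch using Python's struct module in place of xdrlib. The class and function names are invented for this example. The string handling (a length followed by unpadded bytes, with -1 meaning NA) is an assumption based on the demo dump and my reading of serialize.c. The sample bytes are built by hand to match the numeric column c(1,2,3) from the demo.

```python
import io
import struct


class XdrReader:
    """Sequential reader for big-endian primitives (hypothetical helper)."""

    def __init__(self, stream):
        self.stream = stream

    def _read(self, n):
        data = self.stream.read(n)
        if len(data) != n:
            raise EOFError(f"wanted {n} bytes, got {len(data)}")
        return data

    def read_int(self):
        return struct.unpack(">i", self._read(4))[0]

    def read_double(self):
        return struct.unpack(">d", self._read(8))[0]

    def read_chars(self):
        # Assumption: a length then the raw bytes, without the padding that
        # standard XDR strings have; a length of -1 is taken to mean NA.
        n = self.read_int()
        if n == -1:
            return None
        return self._read(n)


def stream_real_vector(reader, out):
    """Copy a numeric vector (length, then doubles) to text, one value per line."""
    n = reader.read_int()
    for _ in range(n):
        # Only one value is held in memory at a time.
        out.write(f"{reader.read_double()!r}\n")
    return n


# Type code 14 (REALSXP), length 3, then the three doubles.
payload = struct.pack(">ii3d", 14, 3, 1.0, 2.0, 3.0)
reader = XdrReader(io.BytesIO(payload))
out = io.StringIO()
type_code = reader.read_int()
count = stream_real_vector(reader, out)
print(type_code, count, out.getvalue().split())
# 14 3 ['1.0', '2.0', '3.0']
```

Note that the columns of a data frame appear one after another in the stream, so writing delimited rows would need one file offset per column (or one pass per column), which this sketch does not attempt.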
Ian,

On Aug 23, 2007, at 4:21 PM, Cook, Ian wrote:

> I am developing a tool for converting a large data frame stored in
> an uncompressed binary (XDR) RData file to a delimited text file.
> The data frame is too large to load() and extract rows from on a
> typical PC. I'm looking to parse through the file and extract
> individual entries without loading the whole thing into memory.
>
> In terms of some C source functions, instead of doing
> RestoreToEnv(R_Unserialize(connection)) which is essentially what
> load() does, I'm looking to get the documentation I would need to
> build a function "SaveToCSV()" so that I could do
> SaveToCSV(R_Unserialize(connection)).
>
> Where can I get documentation on the RData file format? Does a
> spec document exist?

I don't think so - basically the sources are all the documentation I'm aware of. It's a bit messy, because R supports so many old formats. However, if you want a stand-alone program that handles (uncompressed) XDR2 only, then I may have saved you a bit of work. I have a utility (based on the R sources) that allows you to scan through XDR2 files and to extract individual objects into a separate XDR2 file (this happens to be quite useful when you have a workspace that doesn't load into R and yet you want to save some pieces of it).

Have a look at
http://urbanek.info/rdcopy.c

You can run it as "./rdcopy foo" to list the objects, as "./rdcopy foo -v" to show the full structure (all SEXPs with their offsets), or as "./rdcopy foo bar 19" to copy the SEXP at offset 19 from foo into a separate XDR2 file bar (use the offset from the first call to copy entire objects). It's not perfect, but it serves its purpose (it resolves references by copying them instead of re-indexing, but it doesn't detect loops).

Maybe it helps, even though the task you describe is still far from trivial.

Cheers,
Simon
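To give an idea of what scanning SEXPs with their offsets involves, here is a small Python sketch that finds the first saved object in the demo file. It is not taken from rdcopy.c. The flag layout (type in the low 8 bits, then object, attribute and tag bits, levels from bit 12, and a packed index for references) is my reading of serialize.c and should be treated as an assumption, and all function names are hypothetical. The sample bytes are the start of the demo dump, rebuilt by hand.

```python
import io
import struct

SEXP_NAMES = {1: "SYMSXP", 2: "LISTSXP", 9: "CHARSXP", 13: "INTSXP",
              14: "REALSXP", 16: "STRSXP", 19: "VECSXP",
              254: "NILVALUE_SXP", 255: "REFSXP"}


def decode_flags(word):
    """Split the 32-bit word that precedes each item (assumed layout)."""
    sexp_type = word & 0xFF
    if sexp_type == 255:
        # Assumption: a reference keeps its index in the upper bits.
        return {"type": 255, "ref_index": word >> 8}
    return {
        "type": sexp_type,
        "is_object": bool(word & (1 << 8)),
        "has_attr": bool(word & (1 << 9)),
        "has_tag": bool(word & (1 << 10)),
        "levels": word >> 12,
    }


def read_word(stream):
    data = stream.read(4)
    if len(data) != 4:
        raise EOFError("truncated stream")
    return struct.unpack(">I", data)[0]


def read_first_entry(stream):
    """Return (name, value_offset, value_flags) for the first saved object."""
    cell = decode_flags(read_word(stream))
    if cell["type"] != 2 or not cell["has_tag"] or cell["has_attr"]:
        raise ValueError("expected a tagged pairlist cell without attributes")
    if decode_flags(read_word(stream))["type"] != 1:
        raise ValueError("expected a symbol as the tag")
    if decode_flags(read_word(stream))["type"] != 9:
        raise ValueError("expected the symbol name as a CHARSXP")
    length = read_word(stream)
    name = stream.read(length).decode("latin-1")
    value_offset = stream.tell()
    return name, value_offset, decode_flags(read_word(stream))


# Magic line and header (19 bytes), then the start of the demo stream.
demo = (b"RDX2\nX\n" + struct.pack(">3i", 2, 0x020401, 0x020300)
        + struct.pack(">4I", 0x402, 1, 0x1009, 2) + b"ds"
        + struct.pack(">2I", 0x313, 2))
stream = io.BytesIO(demo)
stream.seek(19)  # skip the magic line and the serialization header
name, value_offset, value = read_first_entry(stream)
print(f"{name} at offset {value_offset}: {SEXP_NAMES[value['type']]}")
# ds at offset 37: VECSXP
```

A full scanner would also have to follow the rest of the pairlist, recurse into attributes, and keep a table of symbols already seen so that references can be resolved, none of which is shown here.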