R devel - Aug 2007 - RData File Specification?

Hi,

I am developing a tool for converting a large data frame stored in an
uncompressed binary (XDR) RData file to a delimited text file.  The data frame
is too large to load() on a typical PC in order to extract rows from it.  I'm
looking to parse through the file and extract individual entries without
loading the whole thing into memory.

In terms of the C source, instead of doing
RestoreToEnv(R_Unserialize(connection)), which is essentially what load() does,
I'm looking for the documentation I would need to build a function
"SaveToCSV()" so that I could do SaveToCSV(R_Unserialize(connection)).

Where can I get documentation on the RData file format?  Does a spec document
exist?

See details below.

Thanks,
Ian

Ian Cook | Advanced Micro Devices, Inc. | ian.cook at amd.com

-------------------------

Additional details:

I've browsed through the relevant source code (saveload.c, serialize.c) for
ideas.

Here's a demo of the problem I'm looking to solve:

# create a sample data frame
ds <- data.frame(row1=c(1,2,3),row2=c('a','b','c'))
# save into an uncompressed binary R dataset
save(ds,file="ds.rdata",compress=FALSE)
rm(ds)

# Then load() can be simulated like this:

# create and open a file connection
con <- file("ds.rdata",open="rb")
# read the first 5 characters (the "RDX2\n" magic string)
readChar(con,5)
# unserialize the remainder and restore to the environment
ds <- unserialize(con,NULL)[["ds"]]
close(con)

But this reads the whole object into memory, which is not feasible when the data
set is large.  I can read the file character by character, e.g. using readChar(),
but the file format is clearly not trivial.  readChar(con,10000) for this demo yields:

RDX2\nX\n\0\0\0\002\0\002\004\001\0\002\003\0\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\002ds\0\0\003\023\0\0\0\002\0\0\0\016\0\0\0\003??\0\0\0\0\0\0@\0\0\0\0\0\0\0@\b\0\0\0\0\0\0\0\0\003\r\0\0\0\003\0\0\0\001\0\0\0\002\0\0\0\003\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\006levels\0\0\0\020\0\0\0\003\0\0\0\t\0\0\0\001a\0\0\0\t\0\0\0\001b\0\0\0\t\0\0\0\001c\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\005class\0\0\0\020\0\0\0\001\0\0\0\t\0\0\0\006factor\0\0\0?\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\005names\0\0\0\020\0\0\0\002\0\0\0\t\0\0\0\004row1\0\0\0\t\0\0\0\004row2\0\0\004\002\0\0\0\001\0\0\020\t\0\0\0\trow.names\0\0\0\r\0\0\0\002?\0\0\0\0\0\0\003\0\0\004\002\0\0\003?\0\0\0\020\0\0\0\001\0\0\0\t\0\0\0\ndata.frame\0\0\0?\0\0\0?
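As a rough illustration, the first few fields of that dump can be decoded with readBin() rather than readChar().  This is only a sketch under an assumed layout inferred from the dump and from browsing serialize.c, not from any spec: a 5-byte magic string, a 2-byte format marker, three big-endian 4-byte integers, and then a 4-byte flags word whose low byte is taken to be the type of the first object.  The helper name read_rdata_header() is hypothetical.

```r
# Sketch: decode the leading fields of an uncompressed XDR .RData file.
# Assumed layout (inferred from the dump above, not from a spec):
#   5 bytes    magic string, "RDX2\n"
#   2 bytes    format marker, "X\n" for XDR
#   3 x int32  big-endian: serialization version, version of R that wrote
#              the file, minimum version of R able to read it
#   1 x int32  big-endian flags word; low 8 bits assumed to be the type code
read_rdata_header <- function(path) {   # hypothetical helper name
  con <- file(path, open = "rb")
  on.exit(close(con))
  magic  <- readChar(con, 5, useBytes = TRUE)
  format <- readChar(con, 2, useBytes = TRUE)
  ints   <- readBin(con, what = "integer", n = 3, size = 4, endian = "big")
  flags  <- readBin(con, what = "integer", n = 1, size = 4, endian = "big")
  list(magic      = magic,
       format     = format,
       version    = ints[1],
       writer     = ints[2],
       min_reader = ints[3],
       first_type = flags %% 256L)       # low byte of the flags word
}
```

Read this way, the bytes \0\0\0\002, \0\002\004\001 and \0\002\003\0 in the dump are the integers 2, 0x00020401 and 0x00020300, which look like a format version followed by two R versions (2.4.1 and 2.3.0) packed one byte per component.  The next word, \0\0\004\002, has low byte 2.  Everything after that point is the nested object data, which is the part that still needs a spec.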

This would be parse-able if I had a file spec.  Thanks.

