Mike Miller
2014-Apr-22  00:59 UTC
[R] reading data saved with writeBin() into anything other than R
After saving a file like so...
con <- gzcon(file("file.gz", "wb"))
writeBin(vector, con, size=2)
close(con)
I can read it back into R like so...
con <- gzcon(file("file.gz", "rb"))
vector <- readBin(con, integer(), 48000000, size=2, signed=FALSE)
close(con)
...and I'm wondering what other programs might be able to read in these 
data.  It seems to be very straightforward:  When I store 5436 integers 
for each of 7696 subjects, at two bytes per integer that ought to be 
5436*7696*2 = 83670912 bytes, and it is exactly that:
$ zcat file.gz | wc -c
83670912
So if I just convert every pair of bytes to an integer, I guess that will 
do it.  I stored them this way because it was compact, but I guess this 
system also can work well when other software needs to read the data. 
For me that other software would probably be Octave.  I'm interested if 
anyone here has read in these files using Octave, or a C program or 
anything else.  If I don't get a good answer here, I'll try the Octave 
list, and I'll send my best answers here.
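For readers who want to try this outside R, here is a minimal sketch in Python of the "convert every pair of bytes to an integer" idea. It is not from the original thread: the function name read_uint16_gz and its byteorder parameter are made up for illustration, and the little-endian default is an assumption (writeBin() uses the native byte order of the machine that wrote the file unless told otherwise, and that is little-endian on x86).

```python
import gzip
import sys
from array import array


def read_uint16_gz(path, byteorder="little"):
    """Read a gzipped stream of raw unsigned 16-bit integers.

    byteorder describes the file, not the host. "little" is an assumption
    that matches writeBin() defaults on x86 hardware.
    """
    with gzip.open(path, "rb") as fh:
        raw = fh.read()
    if len(raw) % 2 != 0:
        raise ValueError("odd byte count, so this is not a 2-byte stream")
    values = array("H")  # "H" = unsigned short
    if values.itemsize != 2:
        raise RuntimeError("unsigned short is not 2 bytes on this platform")
    values.frombytes(raw)  # interprets the bytes in host byte order
    if byteorder != sys.byteorder:
        values.byteswap()  # file and host disagree, so swap each pair
    return values
```

Calling read_uint16_gz("file.gz") on the file described above should give 5436 * 7696 = 41835456 values if the byte-order assumption holds. Values far above the expected range would suggest the file was written with the opposite byte order.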
The rest of this is some related info for readers of this list.  You don't 
need to read below to answer my question above.  Thanks.
In case anyone is interested, I did some comparisons of loading speed and 
file size for a number of ways of storing my data.  These data all consist 
of positive numbers between 0 and 2, with three digits to the right of the 
decimal, so I can save them as floating point double-precision, or 
multiply by 1000 and store them as integers.  The test here was for a 
matrix of 5000 x 7845 = 39,225,000 values.  These are the file sizes:
    202.1 MB  tab-delimited text file, original, uncompressed
     29.9 MB  tab-delimited text file, original, gzip compressed
    187.7 MB  tab-delimited text file, integers, uncompressed
     24.6 MB  tab-delimited text file, integers, gzip compressed
     38.9 MB  R save() original numeric values (doubles)
     27.0 MB  R save() integers
     19.7 MB  R writeBin() 16-bit integer gzipped
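The multiply-by-1000 storage behind the last row of the table can be sketched as follows. This is an illustration in Python, not the code used for the measurements: encode_fixed is a hypothetical helper, and little-endian packing is an assumption. Values between 0 and 2 become integers between 0 and 2000, which fit comfortably in an unsigned 16-bit integer (maximum 65535).

```python
import struct


def encode_fixed(values, scale=1000):
    """Scale decimals to integers and pack each one as 2 little-endian bytes."""
    ints = [round(v * scale) for v in values]
    if any(i < 0 or i > 0xFFFF for i in ints):
        raise ValueError("scaled value does not fit in an unsigned 16-bit integer")
    return struct.pack(f"<{len(ints)}H", *ints)


payload = encode_fixed([0.0, 0.001, 1.234, 2.0])
# len(payload) is 8: four values at two bytes each
```

The range check is there because anything outside 0..65535 cannot be stored in two unsigned bytes without losing information.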
So, for file size (important in my case), the gzipped writeBin() method 
storing 16-bit integers was the winner.  Impressively, storing the data 
that way and dividing by 1000 on the fly to return the original numbers 
was faster than reading an Rdata file of the matrix:
The integer text file:
> system.time( D <- matrix( scan( file = "D/D000",
what=integer(0) ), ncol=7845, byrow=TRUE ) )
Read 39225000 items
     user  system elapsed
   10.626   0.344  10.971
The R save() original numeric values (doubles):
> system.time( load("D000_test.Rdata") )
     user  system elapsed
    5.579   0.119   5.698
The R save() integers:
> system.time( load("D000_test.Rdata") )
     user  system elapsed
    4.863   0.050   4.913
The writeBin() 16-bit integer gzipped file:
> con <- gzcon(file("D000_test.gz", "rb"))
> system.time( D <- matrix( readBin( con, integer(), 7845*5000, size=2,
signed=FALSE ), ncol=7845, byrow=TRUE ) )
     user  system elapsed
    3.769   0.138   3.906
> close(con)
The writeBin() 16-bit integer gzipped file, converted to numeric by 
dividing by 1000 on the fly:
> system.time( D <- matrix( readBin( con, integer(), 7845*5000, size=2,
signed=FALSE ), ncol=7845, byrow=TRUE )/1000 )
     user  system elapsed
    4.159   0.237   4.397
> close(con)
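Any other program reading the file has to repeat the two steps in that last command: fill the matrix row by row (byrow=TRUE) and divide by 1000. A minimal sketch in Python, where to_rows is a hypothetical helper and the six sample values are made up:

```python
from array import array


def to_rows(values, ncol, scale=1000.0):
    """Split a flat sequence into rows of ncol values, dividing by scale.

    Consecutive values fill one row before the next row starts, which is
    what matrix(..., ncol=ncol, byrow=TRUE) does in R.
    """
    if len(values) % ncol != 0:
        raise ValueError("length is not a multiple of ncol")
    return [
        [v / scale for v in values[i:i + ncol]]
        for i in range(0, len(values), ncol)
    ]


stored = array("H", [0, 500, 1234, 2000, 1, 999])
rows = to_rows(stored, ncol=3)
# rows[0] is [0.0, 0.5, 1.234] and rows[1] is [2.0, 0.001, 0.999]
```

For the real data ncol would be 7845. A plain list of lists is used only to keep the sketch dependency-free; an array library would be the more usual choice for 39 million values.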
Best,
Mike
-- 
Michael B. Miller, Ph.D.
Minnesota Center for Twin and Family Research
Department of Psychology
University of Minnesota
http://scholar.google.com/citations?user=EV_phq4AAAAJ
William Dunlap
2014-Apr-22  01:20 UTC
[R] reading data saved with writeBin() into anything other than R
> For me that other software would probably be Octave.  I'm interested if
> anyone here has read in these files using Octave, or a C program or
> anything else.

I typed 'octave read binary file' into google.com and the first hit was the
Octave help file for its fread function.  In C, fread is also a good way to
go (C and Octave have different argument lists for their fread functions).
In the Linux shell you can use the od command.

% R --quiet
> con <- gzcon(file("/tmp/file.gz", "wb")) # file() is needed: gzcon("/tmp/file.gz", "wb") resulted in an error message
> writeBin(c(121:130,129:121), con, size=2)
> close(con)
> q("no")
% zcat /tmp/file.gz | od --format d2
0000000    121    122    123    124    125    126    127    128
0000020    129    130    129    128    127    126    125    124
0000040    123    122    121
0000046

Bill Dunlap
TIBCO Software
wdunlap tibco.com
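One detail of the od check is worth spelling out: "d2" prints signed 16-bit values, while the readBin() call in the original post uses signed=FALSE. For values up to 2000 the two agree, but above 32767 they would not. The sketch below rebuilds the same 19 values in Python; the little-endian byte order is an assumption and none of this code comes from the thread.

```python
import struct

# The 19 values written by writeBin(c(121:130, 129:121), con, size=2)
values = list(range(121, 131)) + list(range(129, 120, -1))
raw = struct.pack(f"<{len(values)}H", *values)  # little-endian is an assumption

count = len(raw) // 2
signed = struct.unpack(f"<{count}h", raw)    # signed view, like od --format d2
unsigned = struct.unpack(f"<{count}H", raw)  # unsigned view, like od --format u2
# len(raw) is 38 bytes, which is octal 46, the final offset od printed
```

Here signed and unsigned are identical because every value is small. A stored 40000 would read back as -25536 in the signed view.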