Jon Clayden
2007-Jan-26 11:25 UTC
[Rd] readBin is much slower for raw input than for a file
Dear all,

I'm trying to write an efficient binary file reader for a file type that is made up of several fields of variable length, and so requires many small reads. Doing this on the file directly with a sequence of readBin() calls is a bit too slow for my needs, so I tried buffering the file into a raw vector and reading from that ("loc" is the equivalent of the file pointer):

    fileSize <- file.info(fileName)$size
    connection <- file(fileName, "rb")
    bytes <- readBin(connection, "raw", n=fileSize)
    loc <- 0
    close(connection)

    # within a custom read function:
    if (loc == 0)
        data <- readBin(bytes, what, n, size, ...)
    else if (loc > 0)
        data <- readBin(bytes[-(1:loc)], what, n, size, ...)

However, this method runs almost 10 times slower for me than the sequence of file reads did. The initial readBin() call that buffers the file is very quick, but Rprof shows that the vast majority of the run time during the full parse is spent in readBin, so it does seem to be readBin itself that is slowing things down.

Can anyone shed any light on why this is? I'm not expecting miracles here - and I realise that writing the whole read routine in C would be much quicker - but surely reading from a raw vector should work out faster than reading from a file?

The system is R-2.4.1/Linux, Xeon 3.2 GHz, 2 GiB RAM; a typical file size is 44 KiB.

Thanks in advance,
Jon
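One way to avoid the copy that bytes[-(1:loc)] makes on every call - the whole remaining tail of the vector is reallocated each time a field is read - is to wrap the buffer in a raw connection (rawConnection(), available in the R 2.4 series) and let R track the read position itself. A minimal sketch, assuming the "bytes" vector read above; the field types and counts here are invented purely for illustration:

    # Hypothetical example: readBin() advances the connection's internal
    # position, which plays the role of "loc", so no per-read copy is made.
    con <- rawConnection(bytes, open = "rb")
    field1 <- readBin(con, "integer", n = 1, size = 2)   # say, one 2-byte integer
    field2 <- readBin(con, "double", n = 3, size = 8)    # say, three 8-byte doubles
    seek(con)                                            # current offset into "bytes"
    close(con)

Whether this actually beats the sequence of direct file reads depends on readBin()'s per-call overhead, which applies either way; Rprof would show where the remaining time goes.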
Jon Clayden
2007-Jan-31 11:03 UTC
[R] readBin is much slower for raw input than for a file
This hasn't generated any feedback after a few days on R-devel, so I'm forwarding it to R-help in case anyone here has any ideas...

Thanks,
Jon

---------- Forwarded message ----------
From: Jon Clayden <jon.clayden at gmail.com>
Date: 26-Jan-2007 11:25
Subject: readBin is much slower for raw input than for a file
To: r-devel at r-project.org

[Forwarded message identical to the post above.]