William Dunlap
2018-Oct-12 22:32 UTC
[R] readBin with connection of unknown compression type
I would like to use readBin to read parts of a compressed binary file whose compression type is not known (e.g., a *.RData file, which may be compressed with gz, xz, or bz compression or not compressed at all). If I use con <- file("theFile", "r") to create the connection, then the compression type is detected; summary(con)$class will give the compression type (or "file" if no compression type is recognized). However, I cannot use that connection in readBin because the uncompressed data that con produces is considered text, not binary. E.g., with the files produced by the code in the postscript I get

> con <- file(df["bz", "binaryFile"], "r")
> dput(summary(con))
list(description = "C:\\Users\\wdunlap\\AppData\\Local\\Temp\\RtmpAlyXT6\\file472c73e710cd/binary.bz",
    class = "bzfile", mode = "r", text = "text", opened = "opened",
    `can read` = "yes", `can write` = "no")
> readBin(con, what="raw", n=8)
Error in readBin(con, what = "raw", n = 8) :
  can only read from a binary connection

I can read compressed text files with scan(file(.., "r")) and I don't have to tell it what sort of compression was used:

> con <- file(df["bz", "textFile"], "r")
> scan(con, what="integer", n=4)
Read 4 items
[1] "2" "3" "5" "7"

I can read binary files with unknown compression by saving the class of the connection returned by file("r"), mapping that to one of file, bzfile, xzfile, or gzfile, and reopening the compressed file with "rb".
E.g.,

myBinaryFile <- function(filename) {
    con <- file(filename, "r")
    class <- summary(con)$class
    close(con)
    # rely on class of a connection also being the name of a connection creator
    con <- getFunction(class)(filename, "rb")
    con
}

> lapply(bn(df$binaryFile), FUN=function(f) {
      con <- myBinaryFile(f)
      on.exit(close(con))
      tryCatch(readBin(con, what="raw", n=12),
               error=function(e) conditionMessage(e))
  })
$`binary.gz`
 [1] 02 00 00 00 03 00 00 00 05 00 00 00

$binary.bz
 [1] 02 00 00 00 03 00 00 00 05 00 00 00

$binary.xz
 [1] 02 00 00 00 03 00 00 00 05 00 00 00

$binary.uncompressed
 [1] 02 00 00 00 03 00 00 00 05 00 00 00

Is this repeated opening of the file required to read binary files of unknown compression type, or did I miss a way to make readBin() work with just one call to a connection-creating function?

Bill Dunlap
TIBCO Software
wdunlap tibco.com

Code to produce compressed binary and text files:

tdata <- as.integer(c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43,
    47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107, 109, 113,
    127, 131, 137))
dir.create(tdir <- tempfile())
# label a vector of file paths with their base names
bn <- function(filename) structure(filename, names=basename(filename))
# one row per compression type; conMaker holds the connection creator
df <- data.frame(conMaker = I(list(gz = gzfile, bz = bzfile, xz = xzfile,
    uncompressed = file)))
df$binaryFile <- vapply(rownames(df), FUN.VALUE=NA_character_,
    FUN=function(nm) {
        con <- df[[nm, "conMaker"]](
            file <- file.path(tdir, paste(sep=".", "binary", nm)), "wb")
        on.exit(close(con))
        writeBin(tdata, con)
        file
    })
df$textFile <- vapply(rownames(df), FUN.VALUE=NA_character_,
    FUN=function(nm) {
        con <- df[[nm, "conMaker"]](
            file <- file.path(tdir, paste(sep=".", "text", nm)), "wt")
        on.exit(close(con))
        cat(tdata, sep="\n", file=con)
        file
    })
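
For reference, the reopening approach in myBinaryFile could be sketched a little more defensively as follows. This is only an illustration under assumptions: the name openBinaryAuto is hypothetical, and it still relies on summary(con)$class returning one of "file", "gzfile", "bzfile", or "xzfile", as observed above. It looks the creator up in an explicit list rather than via getFunction(), and it closes the probing connection even if summary() fails.

```r
# Hypothetical helper (not part of the original code): reopen a possibly
# compressed file in binary mode, using the class detected by file(..., "r").
openBinaryAuto <- function(filename) {
    probe <- file(filename, "r")
    # close the probing connection whether or not summary() succeeds
    cls <- tryCatch(summary(probe)$class, finally = close(probe))
    makers <- list(file = file, gzfile = gzfile, bzfile = bzfile,
                   xzfile = xzfile)
    if (!cls %in% names(makers))
        stop("unrecognized connection class: ", cls)
    makers[[cls]](filename, "rb")
}

# Usage: write a small bzip2-compressed binary file, then read it back
tmp <- tempfile(fileext = ".bz")
out <- bzfile(tmp, "wb")
writeBin(c(2L, 3L, 5L), out)
close(out)

con <- openBinaryAuto(tmp)
vals <- readBin(con, what = "integer", n = 3)
close(con)
unlink(tmp)
```

This still opens the file twice, so it does not answer the question above; it only makes the two-step workaround fail more predictably when an unexpected class is reported.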