Paul Schrimpf
2018-Jul-04 20:08 UTC
[Rd] unexpected behavior of unzip with list=T and unzip=/usr/bin/unzip
Hello,

I encountered some unexpected behavior of unzip when using Info-ZIP's unzip instead of R's internal program. Specifically, unzip("file.zip", list=TRUE, unzip="/usr/bin/unzip") produces incorrect output if the zip archive has filenames with spaces, and results in an error if the zip archive includes an archive comment or file comments.

Here is some code to reproduce, along with the attached files:

## (mostly) expected behavior
res.intern <- unzip("noSpaces.zip",list=TRUE)
res.infozip <- unzip("noSpaces.zip",list=TRUE,unzip="/usr/bin/unzip")

identical(res.intern,res.infozip) ## will be FALSE, but expected from
                                  ## documentation about dates
identical(res.infozip$Name,res.intern$Name) ## TRUE
res.infozip$Length==res.intern$Length ## TRUE
identical(res.infozip$Length,res.intern$Length) ## FALSE, because the
                                  ## former is numeric, the latter integer

## More problematic cases
print(unzip("fileNameWithSpaces.zip",list=TRUE))
print(unzip("fileNameWithSpaces.zip",list=TRUE,unzip="/usr/bin/unzip"))
## read.table is used to parse output of unzip -l, and gets
## confused by extra spaces

print(unzip("withArchiveComment.zip",list=TRUE))
print(unzip("withArchiveComment.zip",list=TRUE,unzip="/usr/bin/unzip"))
## produces an error

print(unzip("entryComments.zip",list=TRUE))
print(unzip("entryComments.zip",list=TRUE,unzip="/usr/bin/unzip"))
## produces an error

Looking at the code for R's unzip, the basic problem is that it makes a number of assumptions about the format of the output of "unzip -l" that are not always true and are not verified.

It's unclear to me whether R's unzip should be expected to be compatible with all sorts of external unzip programs, so perhaps a sufficient solution is simply to revise the documentation (which already mentions potential problems with dates when list=TRUE is used with external programs).
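The whitespace problem can be seen without the attached archives. The following is only a sketch: the listing lines are built by hand to resemble "unzip -l" output (they are not captured from a real archive), and the file names are made up. It counts whitespace-separated fields per line, which is how read.table splits lines by default.

```r
# Illustrative listing lines; NOT output captured from a real archive.
listing <- c(
  "  Length      Date    Time    Name",
  sprintf("%9d  %s %s   %s", 120L, "2018-07-04", "20:08", "plain.txt"),
  sprintf("%9d  %s %s   %s", 45L, "2018-07-04", "20:08", "name with spaces.txt")
)

# count.fields() splits on runs of whitespace, like read.table() with sep = "".
con <- textConnection(listing)
n_fields <- utils::count.fields(con)
close(con)

# The header and the first entry have four fields each; the entry whose
# name contains spaces has more, so the columns no longer line up.
print(n_fields)
```

Any parser that treats runs of whitespace as column separators will see a different number of columns on the last line, so the Name column cannot be recovered reliably that way.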
Alternatively, R's unzip function could be changed to work with Info-ZIP unzip by:
(1) using "-ql" instead of just "-l" when list=TRUE, to eliminate the printing of comments
(2) not using read.table to parse the output of unzip, and instead doing something like the following (which is an admittedly messy workaround)

res <- if (WINDOWS)
           system2(unzip, c("-ql", shQuote(zipfile)), stdout = TRUE)
       else system2(unzip, c("-ql", shQuote(zipfile)), stdout = TRUE,
                    env = c("TZ=UTC"))
dashes <- grep("--",res)
s <- dashes[1]+1
l <- dashes[2]-1
starts <- gregexpr("-+",res[dashes[1]])[[1]]
ends <- gregexpr("[[:space:]]+",res[dashes[1]])[[1]]
z <- data.frame(
  Name=sapply(res[s:l], function(x) {
    substr(x, starts[4], stop=nchar(x))
  }),
  Length=sapply(res[s:l], function(x) {
    as.numeric(substr(x, starts[1], stop=ends[1]))
  }),
  Date=sapply(res[s:l], function(x) {
    substr(x, starts[2], stop=ends[2])
  }),
  Time=sapply(res[s:l], function(x) {
    substr(x, starts[3], stop=ends[3])
  }),
  stringsAsFactors=FALSE
)
rownames(z) <- NULL

I can submit a patch if this is appropriate. I'm really not sure, though, because I am new to R-devel. Also, this has the downside of relying on the behavior of Info-ZIP unzip, which might change in future versions and is unlikely to be the same for other external unzip programs. On the other hand, the current code also relies on the behavior of Info-ZIP unzip, and still fails in some cases.

Thanks,
Paul

P.S.
My sessionInfo is

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Arch Linux

Matrix products: default
BLAS/LAPACK: /usr/lib/libopenblas_haswellp-r0.3.1.so

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
 [9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] devtools_1.13.5

loaded via a namespace (and not attached):
[1] compiler_3.5.1 tools_3.5.1 withr_2.1.2 memoise_1.1.0 digest_0.6.15

And unzip -v

UnZip 6.00 of 20 April 2009, by Info-ZIP. Maintained by C. Spieler. Send
bug reports using http://www.info-zip.org/zip-bug.html; see README for
details.

Latest sources and executables are at ftp://ftp.info-zip.org/pub/infozip/ ;
see ftp://ftp.info-zip.org/pub/infozip/UnZip.html for other sites.

Compiled with gcc 5.3.0 for Unix (Linux ELF) on Apr 17 2016.

UnZip special compilation options:
    ACORN_FTYPE_NFS
    COPYRIGHT_CLEAN (PKZIP 0.9x unreducing method not supported)
    SET_DIR_ATTRIB
    SYMLINKS (symbolic links supported, if RTL and file system permit)
    TIMESTAMP
    UNIXBACKUP
    USE_EF_UT_TIME
    USE_UNSHRINK (PKZIP/Zip 1.x unshrinking method supported)
    USE_DEFLATE64 (PKZIP 4.x Deflate64(tm) supported)
    UNICODE_SUPPORT [wide-chars, char coding: UTF-8] (handle UTF-8 paths)
    LARGE_FILE_SUPPORT (large files over 2 GiB supported)
    ZIP64_SUPPORT (archives using Zip64 for large files supported)
    USE_BZIP2 (PKZIP 4.6+, using bzip2 lib version 1.0.6, 6-Sept-2010)
    VMS_TEXT_CONV
    WILD_STOP_AT_DIR
    [decryption, version 2.11 of 05 Jan 2007]

UnZip and ZipInfo environment options:
    UNZIP: [none]
    UNZIPOPT: [none]
    ZIPINFO: [none]
    ZIPINFOOPT: [none]
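The fixed-width idea proposed in the message above can be sketched as a small standalone function. This is an illustration under stated assumptions, not the change that was made in R: parse_unzip_listing is a hypothetical helper name, the sample lines are built by hand to mimic a listing with a header, two dashed rule lines and a totals line, and the sketch assumes every entry lines up with the columns marked by the first dashed rule (a length too wide for its column would break that).

```r
# Hypothetical helper: slice listing lines at the columns marked by the
# first dashed rule line, instead of splitting on whitespace.
parse_unzip_listing <- function(lines) {
  rule <- grep("^-+([[:space:]]|$)", lines)   # lines that start with dashes
  if (length(rule) < 2L) stop("expected two dashed rule lines")
  m <- gregexpr("-+", lines[rule[1L]])[[1L]]
  starts <- as.integer(m)                      # first column of each field
  ends <- starts + attr(m, "match.length") - 1L
  if (length(starts) != 4L) stop("expected four columns in the rule line")
  # Entries are the lines strictly between the two rule lines.
  body <- lines[rule[1L] + seq_len(rule[2L] - rule[1L] - 1L)]
  data.frame(
    Name = substring(body, starts[4L]),        # rest of line, spaces kept
    Length = as.numeric(trimws(substr(body, starts[1L], ends[1L]))),
    Date = trimws(substr(body, starts[2L], ends[2L])),
    Time = trimws(substr(body, starts[3L], ends[3L])),
    stringsAsFactors = FALSE
  )
}

# Hand-built sample (an assumption about the layout, not captured output).
entry <- function(len, name)
  sprintf("%9d  %s %s   %s", len, "2018-07-04", "20:08", name)
listing <- c(
  "  Length      Date    Time    Name",
  paste0(strrep("-", 9), "  ", strrep("-", 10), " ", strrep("-", 5),
         "   ", strrep("-", 4)),
  entry(120L, "plain.txt"),
  entry(45L, "name with spaces.txt"),
  paste0(strrep("-", 9), strrep(" ", 21), strrep("-", 7)),
  sprintf("%9d%s%s", 165L, strrep(" ", 21), "2 files")
)

z <- parse_unzip_listing(listing)
print(z)
```

Anchoring the rule-line pattern at the start of the line avoids matching entries whose names merely contain "--", and taking the end of each column from the match length avoids relying on where the whitespace runs begin.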
Tomas Kalibera
2018-Oct-09 14:40 UTC
[Rd] unexpected behavior of unzip with list=T and unzip=/usr/bin/unzip
Hi Paul,

thanks for the report. Fixed in R-devel 75417.

Best
Tomas