jbrzusto at fastmail.fm
2006-Oct-02 18:09 UTC
[Rd] 2.3.1: interacting bugs in load() and gzfile() (PR#9271)
Hello, If repeated calls are made to save() using the same pre-opened gzfile connection to a file, and then the connection is closed, the objects saved by the second and subsequent calls are not correctly restored by repeated calls to load() with a new gzfile connection to the same file. What follows are a session exposing the bugs, analysis (see ANALYSIS), patches (see PATCHES), and a session showing correct behaviour after patching (see FIXED). The problematic code is still present in the trunk of the svn repository. This is really due to two minor bugs, but because they interact, I'm submitting this as one report. Regards, John Brzustowski ===============================================================================THE BUGS (with comments inserted)> sessionInfo()Version 2.3.1 (2006-06-01) i686-pc-linux-gnu attached base packages: [1] "methods" "stats" "graphics" "grDevices" "utils" "datasets" [7] "base" ## create two objects and save them to a gzfile:> x <- 1:10 > y <- x^2 > f <- gzfile("tmp.dat", "wb") > save(x, file=f)## notice where the second object will be written:> seek(f, NA)[1] 88> save(y, file=f) > close(f)## open the file and try to load the two objects, one at a time> f <- gzfile("tmp.dat", "rb") > e <- new.env() > load(f,e) > ls(e)[1] "x" ## x was loaded correctly, but the connection has been closed:> fdescription class mode text "gzcon(tmp.dat)" "gzcon" "rb" "binary" opened can read can write "closed" "yes" "no" ## <interaction> (see ANALYSIS) ## Try again:> load(f,e)Warning message: this is already a gzcon connection in: gzcon(con, level, allowNonCompressed)> ls(e)[1] "x" ## why the warning? ## and why was nothing apparently read this time?> fdescription class mode text "gzcon(tmp.dat)" "gzcon" "rb" "binary" opened can read can write "opened" "yes" "no"> load(f,e)Warning message: this is already a gzcon connection in: gzcon(con, level, allowNonCompressed)> ls(e)[1] "x" "y" ## this time, y was read correctly, but again why the warning?> fdescription class mode text "gzcon(tmp.dat)" "gzcon" "rb" "binary" opened can read can write "closed" "yes" "no" ## </interaction> ## try again, but first seek on the connection to the location of the second object> close(f) > f <- gzfile("tmp.dat", "rb") > e <- new.env() > seek(f,88)[1] 0> load(f,e) > ls(e)[1] "x" ## this should have loaded the object "y", which according to the earlier call to seek() ## begins at file offset 88. =======================================================================================ANALYSIS There are two minor bugs here: 1. The internal function loadFromConnection() incorrectly closes a connection that was already open when it was called, instead of closing a connection that was *not* already open. 2. gzfile() sets a class of its return value to "file", rather than "gzfile":> f <- gzfile("whatever.gz", "wb") > class(f)[1] "file" "connection" In contrast, bzfile() exhibits correct behaviour:> f2<-bzfile("more.bz2", "wb") > class(f2)[1] "bzfile" "connection" This prevents load() from noticing that the connection passed to it is already a gzfile, since it tests for inheritance from class "gzfile". Therefore, load() always re-opens the gzfile connection with gzcon(), which causes reading to proceed from the beginning of the file, regardless of any seek() on the original gzfile connection. This reopening of a gzfile with gzcon() causes the warning, and prevents the seek() before the load() from having any effect. Interaction: because (with bug 2) load() always reopens a gzfile connection, bug 1's effect is to not close that reopened connection. Ironically, this allows (with warnings) objects from multiple calls to save() to be restored by multiple calls to load(), as demonstrated by the <interaction></interaction> section in the session above, provided load() is being called with a filename or closed connection rather than an open gzfile connection. ===============================================================================PATCHES: for bug 1: diff -u R-2.3.1.original/src/main/saveload.c R-2.3.1/src/main/saveload.c --- R-2.3.1.original/src/main/saveload.c 2006-04-09 18:19:51.000000000 -0400 +++ R-2.3.1/src/main/saveload.c 2006-10-02 13:10:21.000000000 -0400 @@ -2301,6 +2301,7 @@ PROTECT(res = RestoreToEnv(R_Unserialize(&in), aenv)); if (wasopen) { endcontext(&cntxt); + } else { con->close(con); } UNPROTECT(1); for bug 2: diff -u R-2.3.1.original/src/main/connections.c R-2.3.1/src/main/connections.c --- R-2.3.1.original/src/main/connections.c 2006-05-18 05:13:54.000000000 -0400 +++ R-2.3.1/src/main/connections.c 2006-10-02 11:53:10.000000000 -0400 @@ -1172,7 +1172,7 @@ PROTECT(ans = allocVector(INTSXP, 1)); INTEGER(ans)[0] = ncon; PROTECT(class = allocVector(STRSXP, 2)); - SET_STRING_ELT(class, 0, mkChar("file")); + SET_STRING_ELT(class, 0, mkChar("gzfile")); SET_STRING_ELT(class, 1, mkChar("connection")); classgets(ans, class); UNPROTECT(2); ===============================================================================FIXED: Here's how the patched version operates: ## create two objects and serialize them to a gzfile> x <- 1:10 > y <- x^2 > f <- gzfile("tmp.dat", "wb") > save(x, file=f) > seek(f, NA)[1] 88> save(y, file=f) > close(f)## re-open the gzfile> f <- gzfile("tmp.dat", "rb")## seek to the second object and load it> seek(f,88)[1] 0> e <- new.env() > load(f,e) > ls(e)[1] "y"> eval(quote(y), e)[1] 1 4 9 16 25 36 49 64 81 100 ## seek to the first object and load it> seek(f,0)[1] 216> load(f,e) > ls(e)[1] "x" "y"> eval(quote(x), e)[1] 1 2 3 4 5 6 7 8 9 10>