Brandon Whitcher
2010-Jun-22 17:04 UTC
[Rd] seek() and gzfile() on 32-bit R2.12.0 in linux
I have installed both 32-bit and 64-bit versions of R2.12.0 (2010-06-15 r52300) on my Ubuntu 10.04 64-bit system. I observe the following behavior when running the examples from base::connections. There appears to be a problem with seek() on a .gz file when using a 32-bit installation of R2.12.0, but the problem doesn't appear in the 64-bit installation.

I realize that seek() has been difficult in the past, and I don't want to open old wounds, but is this a known problem? Is this easily fixable? I have a package that relies on seek() when accessing gzipped files.

Using the 32-bit installation...

> zz <- file("ex.data", "w")  # open an output file connection
> cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> cat("One more line\n", file = zz)
> close(zz)
> blah = file("ex.data", "r")
> seek(blah)
[1] 0
>
> zz <- gzfile("ex.gz", "w")  # compressed file
> cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> close(zz)
> blah = file("ex.gz", "r")
> seek(blah)
[1] 7.80707e+17
>
> zz <- bzfile("ex.bz2", "w")  # bzip2-ed file
> cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> close(zz)
> blah = file("ex.bz2", "r")
> seek(blah)
Error in seek.connection(blah) : 'seek' not enabled for this connection

Using the 64-bit installation...

> zz <- file("ex.data", "w")  # open an output file connection
> cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> cat("One more line\n", file = zz)
> close(zz)
> blah = file("ex.data", "r")
> seek(blah)
[1] 0
>
> zz <- gzfile("ex.gz", "w")  # compressed file
> cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> close(zz)
> blah = file("ex.gz", "r")
> seek(blah)
[1] 0
>
> zz <- bzfile("ex.bz2", "w")  # bzip2-ed file
> cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> close(zz)
> blah = file("ex.bz2", "r")
> seek(blah)
Error in seek.connection(blah) : 'seek' not enabled for this connection

thanks,

Brandon
You used file() to open "ex.gz", which ought to work, but relies on do_url to automatically detect that the file is a gzip file. It's a long shot, but you could verify that the file is a valid gzip file (R checks that the first two bytes == "\x1f\x8b"), then try the gzfile function on the 32-bit machine and see what happens. Also, it would be nice to see the output of your sessionInfo(), in order to reproduce your finding.

This might be a bug in the R source:

(1 - unlikely) The C function do_url (src/main/connections.c) fails to detect the gzip file on the 32-bit machine. Unfortunately, even if do_url does detect a gzip file, the class of the returned connection object is still marked c("file", "connection") rather than c("gzfile", "connection"), so there's no easy check for this. Even so, this doesn't explain why you get 7.80707e+17.

(2 - more likely) The zlib function gztell (declared: src/extra/zlib/zlib.h, defined: src/extra/zlib/gzlib.c) returns z_off_t. The bug may relate to the size of z_off_t on the two different machines and/or to casting z_off_t to double (which is done just before the value is returned by gzfile_seek, defined in src/main/connections.c).

What a headache. I need to reproduce the bug to investigate this further. I have been wondering why double was used in the prototype for the seek member of (struct Rconn), rather than an integer type. Presumably to solve problems such as this. I'll be very interested to see what the core team has to say here.

-Matt

-- 
Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina
http://biostatmatt.com

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
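The gzip check mentioned above (first two bytes equal to "\x1f\x8b") can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not R's actual code in connections.c: the function names has_gzip_magic and file_has_gzip_magic are hypothetical, and only the C standard library is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not from the R sources): returns 1 if the buffer
 * starts with the gzip magic bytes 0x1f 0x8b, and 0 otherwise. */
int has_gzip_magic(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    return len >= 2 && buf[0] == 0x1f && buf[1] == 0x8b;
}

/* Hypothetical helper: reads up to two bytes from the start of a file and
 * applies the check above. Returns -1 if the file cannot be opened. */
int file_has_gzip_magic(const char *path)
{
    unsigned char head[2];
    size_t got;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    got = fread(head, 1, sizeof head, fp);
    fclose(fp);
    return has_gzip_magic(head, got);
}
```

Calling file_has_gzip_magic("ex.gz") on both builds would be one way to rule out a damaged file before looking at the connection code itself.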
I was able to reproduce this bug. After some investigating, it's clearly localized to gztell (a zlib function) and the z_off_t type. However, there may be a broader cross-compiling problem. I don't know what procedure Brandon used to compile the 32-bit version (I used the gcc -m32 flag), but we should be sure that we're doing this correctly (and document it!) before going on a goose chase. The real issue may or may not be related to zlib, but only manifested there. A discussion of my findings is below.

-Matt

I checked to ensure that R's file function was recognizing the gzip file as such, so that's not the problem. I next modified some code in gzfile_seek, just above and below the call to gztell (line 1230 of connections.c), and defined a small function z_off_t_print to print the bits of the z_off_t offset, least significant bit first (assuming little endian):

    static void z_off_t_print(z_off_t u) {
        z_off_t mask = 1;
        /* print one bit per step, least significant first;
           the loop ends when mask reaches the sign bit */
        while (mask > 0) {
            printf("%u", (mask & u) > 0);
            mask <<= 1;
        }
        printf("\n");
    }

    static double gzfile_seek(Rconnection con, double where, int origin, int rw)
    {
        gzFile fp = ((Rgzfileconn)(con->private))->fp;
        /** begin modified code **/
        z_off_t pos = 0;  /* initialized so the "before" print is well defined */
        printf("sizeof(z_off_t): %u\n", (unsigned) sizeof(z_off_t));
        printf("sizeof(double): %u\n", (unsigned) sizeof(double));
        printf("before gztell():\n");
        z_off_t_print(pos);
        pos = gztell(fp);
        printf("after gztell():\n");
        z_off_t_print(pos);
        printf("(double) pos: %f\n", (double) pos);
        /** end modified code **/
        ...

Here's what happens running code similar to yours in the 32-bit build:

> zz <- gzfile("ex.gz", "w")  # compressed file
> cat("TITLE extra line", "2 3 5 7",
+     "", "11 13 17", file = zz, sep = "\n")
> close(zz)
> blah = file("ex.gz", "r")
> seek(blah, 5)
sizeof(z_off_t): 8
sizeof(double): 8
before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
000000000000000000000000000000000000110000111011110111001001000
(double) pos: 665367468683821056.000000
[1] 6.653675e+17
> seek(blah)
before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
101000000000000000000000000000000000110000111011110111001001000
(double) pos: 665367468683821056.000000
[1] 6.653675e+17

Hence, gztell is doing what we expect in the least significant 32 bits (the leading "101", read least significant bit first, is binary for decimal 5), but returns junk in the most significant 32 bits. Here are the results for the 64-bit build:

> zz <- gzfile("ex.gz", "w")  # compressed file
> cat("TITLE extra line", "2 3 5 7", "", "11 13 17", file = zz, sep = "\n")
> close(zz)
> blah = file("ex.gz", "r")
> seek(blah, 5)
sizeof(z_off_t): 8
sizeof(double): 8
before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
000000000000000000000000000000000000000000000000000000000000000
(double) pos: 0.000000
[1] 0
> seek(blah)
before gztell():
000000000000000000000000000000000000000000000000000000000000000
after gztell():
101000000000000000000000000000000000000000000000000000000000000
(double) pos: 5.000000
[1] 5

No problems with the 64-bit build.

-- 
Matthew S. Shotwell
Graduate Student
Division of Biostatistics and Epidemiology
Medical University of South Carolina
http://biostatmatt.com
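The observation above, that the low 32 bits of the returned offset are correct while the high 32 bits hold junk, can be illustrated with a small C sketch. It takes the value printed by the 32-bit build, 665367468683821056, adds the expected position 5, and splits the result into halves. The names OBSERVED_POS, low32_of_offset, high32_of_offset, and masked_offset_as_double are hypothetical. Masking is shown only as a diagnostic, not as a proposed fix; the thread does not confirm why the upper bits are wrong, and a disagreement about the width of z_off_t between caller and library is only one possible explanation.

```c
#include <assert.h>
#include <stdint.h>

/* Value implied by the 32-bit build's output in the thread:
 * "(double) pos: 665367468683821056.000000", plus the expected offset 5. */
#define OBSERVED_POS (INT64_C(665367468683821056) + 5)

/* Hypothetical helper: keep only the low 32 bits of a 64-bit offset. */
uint32_t low32_of_offset(int64_t off)
{
    return (uint32_t)((uint64_t)off & UINT64_C(0xFFFFFFFF));
}

/* Hypothetical helper: the high 32 bits, which held junk in the report. */
uint32_t high32_of_offset(int64_t off)
{
    return (uint32_t)((uint64_t)off >> 32);
}

/* Convert the masked offset to double, as gzfile_seek does with the
 * unmasked one. Every 32-bit integer is exactly representable as a double. */
double masked_offset_as_double(int64_t off)
{
    return (double)low32_of_offset(off);
}

/* Usage: checks the two claims made about the observed value. */
void check_observed_value(void)
{
    assert(low32_of_offset(OBSERVED_POS) == 5);   /* the real position */
    assert(high32_of_offset(OBSERVED_POS) != 0);  /* the junk */
}
```

A real fix would have to address whatever makes the upper bits wrong in the first place; masking would also break legitimate offsets of 4 GiB and beyond.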
Brandon Whitcher wrote:
> I have installed both 32-bit and 64-bit versions of R2.12.0 (2010-06-15
> r52300) on my Ubuntu 10.04 64-bit system.

Please notice that there is NO release of R 2.12.0 until some time around October. You are using a build from the UNSTABLE development branch. The stable branch is 2.11.1, with a release date of May 31. If Ubuntu is claiming that there is such a thing as an R 2.12.0 release, I'd say that they have a problem.

Not that we don't welcome reports on problems in the development branch, but do notice that it is by definition UNSTABLE, and that bugs can come and go without notice.

-pd

-- 
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
Brandon Whitcher
2010-Jun-23 11:14 UTC
[Rd] seek() and gzfile() on 32-bit R2.12.0 in linux
Peter, thanks for your comments. The reason I have taken this issue to R-devel is the advice of Kurt Hornik. An update to my package oro.nifti is being refused by CRAN because it fails on the _development_ version of R on 32-bit Linux. As we have just discussed (and thanks to Matt's input), the problem is not with my package but with the development version of R. Hence, I wanted to alert the R Core Development Team that the _unstable_ version of R appears to have a problem.

Obviously, I would prefer to have my new version of oro.nifti accepted by CRAN... but at the moment I am between a rock and a hard place. I agree that the 2.12.0 release of R is quite far in the future. Is there a possibility of relaxing the exclusion criteria for CRAN?

cheers...

Brandon