André Wildberg
2025-Apr-25 13:41 UTC
[R] Possible bug in gzcon() (6161:src/main/connections.c)
Hi developers,

originally sent as a bug report request, but it got re-routed here.

Problem: Connections established via gzcon() (also used by packages, e.g. vroom/readr) may stop reading prematurely (see https://stackoverflow.com/questions/79587028/read-csv-only-reads-a-fraction-of-rows-from-a-zipped-file-when-reading-from-url#comment140365314_79587028).

Reproducible example:

addr <- "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_station/USW00014839.csv.gz"

# online/stream
nrow(read.csv(gzcon(url(addr), text=T), header=F))
# [1] 1798

# local
download.file(addr, destfile=basename(addr))
nrow(read.csv(gzcon(file(basename(addr), "r"), text=T), header=F))
# [1] 429498

# or
nrow(read.csv(gzfile(basename(addr), "r"), header=F))
# [1] 429498

closeAllConnections()

> sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin24.4.0
Running under: macOS Sequoia 15.4.1

The problem is likely to be platform independent.
Ivan Krylov
2025-Apr-26 19:55 UTC
[R] Possible bug in gzcon() (6161:src/main/connections.c)
On Fri, 25 Apr 2025 13:41:35 +0000, André Wildberg <andre.wildberg at outlook.com> wrote:

> Reproducible example:
>
> addr <- "https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_station/USW00014839.csv.gz"
>
> # online/stream
> nrow(read.csv(gzcon(url(addr), text=T), header=F))
> # [1] 1798
>
> # local
> download.file(addr, destfile=basename(addr))
> nrow(read.csv(gzcon(file(basename(addr), "r"), text=T), header=F))
> # [1] 429498

I can reproduce the problem (with slightly different numbers):

length(readLines(gzcon(file("USW00014839.csv.gz", "rb"), text = TRUE)))
# [1] 28002
length(readLines("USW00014839.csv.gz"))
# [1] 429535

The underlying reason for the problem you're having with gzcon() is most likely that the gzip archive has been concatenated from multiple separate archives:

perl -MIO::Uncompress::Gunzip -E'
 my $z = IO::Uncompress::Gunzip::->new(shift);
 say "end of stream" while $z->nextStream() == 1;
' -- USW00014839.csv.gz
# end of stream
# end of stream
# end of stream
# end of stream
# end of stream

readLines("USW00014839.csv.gz") calls file(), which can transparently switch to a gzfile() connection. gzfile() supports concatenated archives, but gzcon() currently doesn't.

Feature request submitted at <https://bugs.r-project.org/show_bug.cgi?id=18887>.

-- 
Best regards,
Ivan
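The explanation above can be checked without network access. The following is a minimal sketch, not taken from the thread: it builds a small two-member gzip file by concatenating two separately compressed files, then counts the lines seen through gzfile() and through gzcon(). The file contents and the variable names (part1, part2, combined, n_gzfile, n_gzcon) are made up for illustration.

```r
# Two small gzip files, each a complete gzip stream (member) of its own.
part1 <- tempfile(fileext = ".gz")
part2 <- tempfile(fileext = ".gz")
combined <- tempfile(fileext = ".gz")

con <- gzfile(part1, "w"); writeLines(c("a,1", "b,2"), con); close(con)
con <- gzfile(part2, "w"); writeLines(c("c,3", "d,4"), con); close(con)

# Concatenate the raw bytes, as `cat part1 part2 > combined` would do.
writeBin(c(readBin(part1, "raw", n = file.size(part1)),
           readBin(part2, "raw", n = file.size(part2))),
         combined)

# gzfile() continues past the end of the first member.
con <- gzfile(combined, "r")
n_gzfile <- length(readLines(con))
close(con)

# gzcon() on a binary file connection, as in the example above.
# Assumption: on versions affected by the reported behaviour this
# stops after the first member, so the count may be smaller.
con <- gzcon(file(combined, "rb"), text = TRUE)
n_gzcon <- length(readLines(con))
close(con)

print(c(gzfile = n_gzfile, gzcon = n_gzcon))
```

If the two counts differ, the file is being truncated at a member boundary. For remote files, downloading first and reading with gzfile(), as in the original post, sidesteps the issue.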