Hervé Pagès
2013-May-08 08:51 UTC
[Rd] getting corrupted data when using readBin() after seek() on a gzfile connection
Hi,
I'm running into more issues when reading data from a gzfile connection.
If I read the data sequentially with successive calls to readBin(), the
data I get looks ok. But if I call seek() between the successive calls
to readBin(), I get corrupted data.
Here is a (hopefully) reproducible example. See my sessionInfo() at the
end (I'm not on Windows, where, according to the man page, seek() is
broken).
## Generate data with a repeated easy-to-recognize byte pattern
## of length 26:
mydata <- rep(charToRaw(paste(letters, collapse="")), 400)
## Write the data to test.gz file:
con <- gzfile("test.gz", open="wb")
writeBin(mydata, con)
close(con)
## Read the data from test.gz file. We'll read blocks of 26 bytes
## located at various offsets that are multiples of 26, so we expect
## to see our original pattern ("abc...xyz").
con <- gzfile("test.gz", open="rb")
## Offset 0: ok
> rawToChar(readBin(con, "raw", n=26))
[1] "abcdefghijklmnopqrstuvwxyz"
## Offset 78: still ok
> seek(con, where=78)
[1] 26
> seek(con)
[1] 78
> rawToChar(readBin(con, "raw", n=26))
[1] "abcdefghijklmnopqrstuvwxyz"
## Offset 520: data is messed up
> seek(con, where=520)
[1] 104
> seek(con)
[1] 520
> rawToChar(readBin(con, "raw", n=26))
[1] "zabcdefghijklmnopqrstuvvuv"
## Offset 2600: very messed up
> seek(con, where=2600)
[1] 546
> seek(con)
[1] 2600
> rawToChar(readBin(con, "raw", n=26))
[1] "xxxxxmpxxxxxxesxxxxxxxxxxp"
## Offset 10400: see previous email (subject: "error when calling
## seek() twice on a gzfile connection")
> seek(con, where=10400)
[1] 2626
Warning message:
In seek.connection(con, where = 10400) :
seek on a gzfile connection returned an internal error
close(con)
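For reference, the block expected at any of these offsets can be computed directly from the in-memory vector, without touching the file. This is only a minimal sketch: `expected_block` is a hypothetical helper name introduced here for illustration, and it assumes the offsets are 0-based, as they are for seek().

```r
## Same data as in the example above.
mydata <- rep(charToRaw(paste(letters, collapse = "")), 400)

## Hypothetical helper (not part of base R): return the n bytes that start
## at a 0-based offset of the uncompressed data, as a character string.
expected_block <- function(offset, n = 26) {
  rawToChar(mydata[offset + seq_len(n)])
}

## Every offset used above is a multiple of 26, so each block should be
## the full alphabet.
expected <- sapply(c(0, 78, 520, 2600), expected_block)
print(expected)
```

Comparing these strings with what readBin() returns after seek() makes the mismatch at offsets 520 and 2600 explicit.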
Reading the data sequentially with no calls to seek() returns the
expected pattern 400 times:
con <- gzfile("test.gz", open="rb")
blocks <- sapply(1:400, function(i) rawToChar(readBin(con, "raw", n=26)))
## Check the result:
> readBin(con, "raw", n=26) # no more data
raw(0)
> seek(con)
[1] 10400
> table(blocks)
blocks
abcdefghijklmnopqrstuvwxyz
400
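Since sequential reads return the expected data, one possible workaround is to avoid seek() entirely: reopen the file and read and discard bytes until the wanted offset is reached. The sketch below is only an illustration under that assumption. `read_at_offset` and its `chunk` argument are hypothetical names, not part of base R, and the approach decompresses everything before the offset on each call, so it may be slow for large files.

```r
## Hypothetical helper (not part of base R): read n bytes starting at a
## 0-based offset of the uncompressed stream, without calling seek().
read_at_offset <- function(path, offset, n, chunk = 65536L) {
  con <- gzfile(path, open = "rb")
  on.exit(close(con))
  remaining <- offset
  ## Skip forward by reading and discarding up to 'chunk' bytes at a time.
  while (remaining > 0) {
    skipped <- readBin(con, "raw", n = min(remaining, chunk))
    if (length(skipped) == 0L) break  # offset is at or past the end of the data
    remaining <- remaining - length(skipped)
  }
  readBin(con, "raw", n = n)
}

## Recreate the test file from the example in a temporary location.
mydata <- rep(charToRaw(paste(letters, collapse = "")), 400)
path <- tempfile(fileext = ".gz")
con <- gzfile(path, open = "wb")
writeBin(mydata, con)
close(con)

## Read the blocks at the offsets that were corrupted after seek().
block_520  <- rawToChar(read_at_offset(path, 520, 26))
block_2600 <- rawToChar(read_at_offset(path, 2600, 26))
print(c(block_520, block_2600))
```

Each call opens a fresh connection, so no state from an earlier read carries over between calls.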
Thanks,
H.
> sessionInfo()
R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
Henrik Bengtsson
2013-May-08 17:54 UTC
[Rd] getting corrupted data when using readBin() after seek() on a gzfile connection
I can reproduce this (exactly the same output) on Windows:
> sessionInfo()
R version 3.0.0 Patched (2013-04-29 r62694)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_3.0.0
/Henrik