Hi all,
I used `gzfile` and `gzcon` to read a compressed file, but `gzcon` gave me
a different result than `gzfile`. It seems that `gzcon` does not handle the
data correctly. I have posted an example below. In the example, the first
part of a compressed file (up to 4 MiB) is downloaded from Google Cloud as
a raw vector, and the data is saved to a temp file. If I use `gzfile` to
read the file, it returns the first 1000 lines successfully. However, if I
wrap the raw vector in a raw connection and use `gzcon` to read from that
connection, it returns only the first 884 lines, along with a warning (see
the output).
code:
> # install.packages("BiocManager")
> # BiocManager::install("GCSConnection", version = "devel")
> library(GCSConnection)
>
> ## Download data from cloud
> uri <- "gs://gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.chr1.vcf.bgz"
> con <- gcs_connection(uri)
> data <- readBin(con, raw(), 4*1024*1024)
> close(con)
>
> ## Write data to a file
> file_path <- tempfile()
> writeBin(data, file_path)
>
> ## Read the data using `gzfile`
> con1 <- gzfile(file_path)
> str(readLines(con1, 1000))
>
> ## Read the data using `gzcon`
> ## We create a raw connection from the raw vector
> con2 <- gzcon(rawConnection(data))
> str(readLines(con2, 1000))
output:
> > str(readLines(con1, 1000))
>  chr [1:1000] "##fileformat=VCFv4.2" "##hailversion=0.2.24-9cd88d97bedd" ...
> > str(readLines(con2, 1000))
>  chr [1:884] "##fileformat=VCFv4.2" "##hailversion=0.2.24-9cd88d97bedd" ...
> Warning message:
> In readLines(con2, 1000) : incomplete final line found on 'gzcon(data)'
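For anyone who wants to try this without cloud access, here is a rough local sketch. It rests on an assumption I have not verified as the cause: a `.bgz` file is block-gzipped, meaning it is a series of concatenated gzip members, and the two readers may treat the member boundaries differently. The file contents and variable names below are made up for illustration. The sketch builds a two-member gzip stream and reads it both ways.

```r
# Build two small gzip files, each holding one line of text.
# (Contents and names are hypothetical, for illustration only.)
part1 <- tempfile(fileext = ".gz")
part2 <- tempfile(fileext = ".gz")
con <- gzfile(part1, "wb"); writeLines("first member", con); close(con)
con <- gzfile(part2, "wb"); writeLines("second member", con); close(con)

# Concatenate the raw bytes into one multi-member gzip stream.
bytes <- c(readBin(part1, raw(), file.size(part1)),
           readBin(part2, raw(), file.size(part2)))
both <- tempfile(fileext = ".gz")
writeBin(bytes, both)

# Read the stream from disk through gzfile().
con1 <- gzfile(both)
via_gzfile <- readLines(con1)
close(con1)

# Read the same bytes through gzcon() over a raw connection.
con2 <- gzcon(rawConnection(bytes))
via_gzcon <- readLines(con2)
close(con2)

# Compare what each reader returned.
str(via_gzfile)
str(via_gzcon)
```

If `via_gzcon` comes back shorter than `via_gzfile`, that would suggest the difference lies in how multi-member streams are handled rather than in the GCSConnection download step. If both are the same, this sketch does not reproduce the problem and the cause is elsewhere.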
I am not sure whether this is caused by a bug in `gzcon` or by my misuse of
the function. The same result can be observed with R 4.0 and R 4.1 (devel)
on Windows. Here is my session info; I hope it is helpful. Any suggestions
or help would be appreciated.
> R Under development (unstable) (2020-06-27 r78747)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows 10 x64 (build 18363)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
> [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
> system code page: 65001
Best,
Jiefei