Joris Meys
2018-May-02 19:21 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
Dear all,

I've noticed a problem when trying to download gz files from here:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM907811

At the bottom one can download GSM907811.CEL.gz. If I download this manually and try

    oligo::read.celfiles("GSM907811.CEL.gz")

everything works fine. (oligo is a Bioconductor package.)

However, if I download using

    download.file("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM907811&format=file&file=GSM907811%2ECEL%2Egz",
                  destfile = "GSM907811.CEL.gz")

the file is downloaded, but oligo::read.celfiles() returns the following error:

    Error in checkChipTypes(filenames, verbose, "affymetrix", TRUE) :
      End of gz file reached unexpectedly. Perhaps this file is truncated.

Moreover, if I try to delete it after using download.file(), I get a warning that permission is denied. I can only remove it using Windows file explorer after I have closed the R session, which suggests that the connection is still open. Yet showConnections() doesn't show any open connections either.

Session info below. Note that I started from a completely fresh R session. oligo is needed due to the specific file format of these gz files. They're not standard tarred files.
Cheers
Joris

Session Info
-------------------------------------------------------------------------------------

R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods
[9] base

other attached packages:
 [1] pd.hugene.1.0.st.v1_3.14.1 DBI_0.8                    oligo_1.44.0
 [4] Biobase_2.39.2             oligoClasses_1.42.0        RSQLite_2.1.0
 [7] Biostrings_2.48.0          XVector_0.19.9             IRanges_2.13.28
[10] S4Vectors_0.17.42          BiocGenerics_0.25.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16               compiler_3.5.0
 [3] BiocInstaller_1.30.0       GenomeInfoDb_1.15.5
 [5] bitops_1.0-6               iterators_1.0.9
 [7] tools_3.5.0                zlibbioc_1.25.0
 [9] digest_0.6.15              bit_1.1-12
[11] memoise_1.1.0              preprocessCore_1.41.0
[13] lattice_0.20-35            ff_2.2-13
[15] pkgconfig_2.0.1            Matrix_1.2-14
[17] foreach_1.4.4              DelayedArray_0.5.31
[19] yaml_2.1.18                GenomeInfoDbData_1.1.0
[21] affxparser_1.52.0          bit64_0.9-7
[23] grid_3.5.0                 BiocParallel_1.13.3
[25] blob_1.1.1                 codetools_0.2-15
[27] matrixStats_0.53.1         GenomicRanges_1.31.23
[29] splines_3.5.0              SummarizedExperiment_1.9.17
[31] RCurl_1.95-4.10            affyio_1.49.2

--
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)

-----------
Biowiskundedagen 2017-2018
http://www.biowiskundedagen.ugent.be/

-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
Joris Meys
2018-May-03 09:48 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
Dear all,

I've been diving a bit deeper into this at the request of Tomas Kalibera, and found the following:

- The lock on the file only appears after trying to read it using oligo, so that's not an R problem in itself. The problem is independent of external packages.

- Using Windows' fc utility and cygwin's cmp utility I found out that every so often the download.file() function inserts an extra byte. There's no real obvious pattern in how these bytes are added, but the file downloaded using download.file() is actually larger (in this case by about 8 kb).

The file xxx_inR.CEL.gz is downloaded using:

    setwd("E:/Temp/genexpr/Compare")
    id <- "GSM907854"
    flink <- paste0("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM907854&format=file&file=GSM907854%2ECEL%2Egz")
    fname <- paste0(id, "_inR.CEL.gz")
    download.file(flink, destfile = fname)

The file xxx_direct.CEL.gz is downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM907854 (download link at the bottom of the page).

Output of dir in CMD:

    05/03/2018  11:02 AM  4,529,547 GSM907854_direct.CEL.gz
    05/03/2018  11:17 AM  4,537,668 GSM907854_inR.CEL.gz

or from R:

    > diff(file.size(dir())) # contains both CEL files.
    [1] 8121

Strangely enough I get the following message from download.file():

    Content type 'application/octet-stream' length 4529547 bytes (4.3 MB)
    downloaded 4.3 MB

So the reported length is exactly the same as when I download the file directly, but the file on disk is larger. It therefore seems that download.file() is adding bytes when saving the data to disk. This behaviour is independent of antivirus and/or firewalls being turned on or off.

Also keep in mind that these are NOT standard gzipped files. These files are a specific format for Affymetrix Human Gene 1.0 ST Arrays.

If I need to run other tests, please let me know.
Kind regards
Joris
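The fc/cmp comparison described in this message can also be approximated from within R. What follows is only a minimal sketch, not what was actually run in this thread: compare_bytes() is a hypothetical helper, and the two small temporary files are stand-ins for the _direct and _inR downloads.

```r
# Hypothetical helper: compare two files byte by byte.
# Returns the size difference and the first offset at which they diverge.
compare_bytes <- function(path_a, path_b) {
  a <- readBin(path_a, what = "raw", n = file.size(path_a))
  b <- readBin(path_b, what = "raw", n = file.size(path_b))
  n <- min(length(a), length(b))
  mismatch <- which(a[seq_len(n)] != b[seq_len(n)])
  list(
    size_diff      = length(b) - length(a),
    first_mismatch = if (length(mismatch)) mismatch[1] else NA_integer_
  )
}

# Stand-in files: the second one has one extra byte inserted at position 3.
f1 <- tempfile()
f2 <- tempfile()
writeBin(as.raw(c(0x1f, 0x8b, 0x0a, 0x00)), f1)
writeBin(as.raw(c(0x1f, 0x8b, 0xff, 0x0a, 0x00)), f2)

res <- compare_bytes(f1, f2)
print(res)
```

Inspecting the bytes around first_mismatch in the larger file is usually enough to see what kind of byte was inserted.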
Martin Morgan
2018-May-03 12:10 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
On 05/02/2018 03:21 PM, Joris Meys wrote:
> However, if I download using
>
> download.file("https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM907811&format=file&file=GSM907811%2ECEL%2Egz",
>               destfile = "GSM907811.CEL.gz")

On Windows, the 'mode' argument to download.file() needs to be "wb" (write binary) for binary files.

Martin
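As an illustration of the advice above, here is a small sketch of a wrapper that always passes mode = "wb". The name download_binary() is made up for this example; only download.file() and its mode argument come from the thread.

```r
# Hypothetical wrapper: always download in binary mode ("wb") so that
# no line-ending translation is applied on Windows.
download_binary <- function(url, destfile, ...) {
  status <- utils::download.file(url, destfile = destfile, mode = "wb", ...)
  # download.file() returns 0 (invisibly) on success.
  if (status != 0L) stop("download failed with status ", status)
  invisible(destfile)
}

# Usage with the URL from the original report (needs network access):
# url <- "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSM907811&format=file&file=GSM907811%2ECEL%2Egz"
# download_binary(url, "GSM907811.CEL.gz")
```

On Linux and macOS the mode makes no practical difference, which is why scripts written there can silently break when run on Windows.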
Joris Meys
2018-May-03 12:15 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
Using the correct mode absolutely solves it. Apologies for not trying the obvious.

Cheers
Joris
Henrik Bengtsson
2018-May-03 12:42 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
Use mode="wb" when you download the file. See https://github.com/HenrikBengtsson/Wishlist-for-R/issues/30.

R core, and others, is there a good argument for why we are not making this the default download mode? It seems like such a simple fix to such a common "mistake".

Henrik
Duncan Murdoch
2018-May-03 13:02 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
On 03/05/2018 8:42 AM, Henrik Bengtsson wrote:
> R core, and others, is there a good argument for why we are not making this
> the default download mode? It seems like such a simple fix to such a
> common "mistake".

Many downloads are text files (HTML, CSV, etc.), and if those are downloaded in binary, a Windows user might end up with a file that Notepad can't handle, because it would have Unix-style line endings. (It's possible Notepad no longer requires CR LF endings; I haven't used it in years. But there are probably other brain-dead Windows programs that do.)

Duncan Murdoch
Martin Morgan
2018-May-03 13:40 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
On 05/03/2018 05:48 AM, Joris Meys wrote:
> - using Windows' fc utility and cygwin's cmp utility I found out that every
> so often the download.file() function inserts an extra byte. There's no
> real obvious pattern in how these bytes are added, but the file downloaded
> using download.file() is actually larger (in this case by about 8 kb).

I believe the difference between mode = "w" and "wb", and the reason this is restricted to Windows downloads, is due to the difference in text file line endings: with mode = "w", download.file() (and many other utilities outside R) writes "foo\n" as "foo\r\n". Obviously this messes up binary files. I guess the CEL.gz file contains about 8k "\n" characters.

Henrik's suggestion (default = "wb") would introduce the complementary problem -- text files would have incorrect line endings.

Martin
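The line-ending explanation above can be illustrated with a small simulation. This is a sketch under the assumption that text mode turns every LF byte (0x0a) into CR LF (0x0d 0x0a); translate_lf() is a hypothetical function written for this example and is not how download.file() is implemented.

```r
# Simulate text-mode output on Windows: each LF byte becomes CR LF.
translate_lf <- function(bytes) {
  is_lf <- bytes == as.raw(0x0a)
  out <- raw(length(bytes) + sum(is_lf))
  # Final position of each original byte after the CRs are inserted.
  pos <- seq_along(bytes) + cumsum(is_lf)
  out[pos] <- bytes
  # Put a CR immediately before every LF.
  out[pos[is_lf] - 1L] <- as.raw(0x0d)
  out
}

# A few made-up "binary" bytes that happen to contain two LF bytes.
payload <- as.raw(c(0x1f, 0x8b, 0x0a, 0x00, 0x0a))
mangled <- translate_lf(payload)

# The file grows by exactly the number of LF bytes in the original.
growth <- length(mangled) - length(payload)
print(growth == sum(payload == as.raw(0x0a)))
```

If this is what happened, the 8121 extra bytes reported earlier would correspond to 8121 bytes with value 0x0a in the original file, which fits the "about 8k" estimate.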
Jeroen Ooms
2018-May-03 14:21 UTC
[Rd] download.file does not process gz files correctly (truncates them?)
On Thu, May 3, 2018 at 2:42 PM, Henrik Bengtsson <henrik.bengtsson at gmail.com> wrote:
> Use mode="wb" when you download the file. See
> https://github.com/HenrikBengtsson/Wishlist-for-R/issues/30.
>
> R core, and others, is there a good argument for why we are not making this
> the default download mode? It seems like such a simple fix to such a
> common "mistake".

I'd like to second this feature request. This default behaviour is unexpected and often causes R scripts that were written on Mac/Linux to produce corrupted files, checksum mismatches, etc. on Windows.

Even for text files, the default should be to download the file as-is. Trying to "fix" line endings should be opt-in, never the default. Downloading a file via a browser or FTP client on Windows also doesn't change the file, so why should R?

On Thu, May 3, 2018 at 3:02 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> Many downloads are text files (HTML, CSV, etc.), and if those are downloaded
> in binary, a Windows user might end up with a file that Notepad can't
> handle, because it would have Unix-style line endings.

True, but I don't think this is relevant. The same holds e.g. for the R files in source packages, which also have Unix line endings. Most Windows users will use an actual editor that understands both types of line endings, or can convert between the two. Downloading a file should do just that.
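The checksum mismatches mentioned above are easy to reproduce offline. The sketch below uses tools::md5sum() on small temporary files; same_content() is a hypothetical helper, and the files only imitate a text file saved with Unix and with Windows line endings.

```r
# Hypothetical helper: TRUE if both files have the same MD5 checksum.
same_content <- function(path_a, path_b) {
  sums <- unname(tools::md5sum(c(path_a, path_b)))
  !anyNA(sums) && sums[1] == sums[2]
}

f_lf   <- tempfile()
f_copy <- tempfile()
f_crlf <- tempfile()

# "a\nb\n" with Unix line endings, written in binary so it is stored as-is.
writeBin(as.raw(c(0x61, 0x0a, 0x62, 0x0a)), f_lf)
file.copy(f_lf, f_copy)
# The same text with Windows (CR LF) line endings.
writeBin(as.raw(c(0x61, 0x0d, 0x0a, 0x62, 0x0d, 0x0a)), f_crlf)

print(same_content(f_lf, f_copy))  # byte-identical copy
print(same_content(f_lf, f_crlf))  # line endings were rewritten
```

A published checksum therefore only matches a download if the bytes were written unchanged.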