hi, thank you for attempting this.  it looks like your unix machine
unzipped the txt file without corruption -- if you copied over the same
txt file to windows 7, i don't think that would reproduce the problem?
i think it needs to be the corrupted text file where
R.utils::countLines( txtfile ) gives 809367.  i am able to reproduce on
two distinct windows machines but no guarantee i'm not doing something
dumb

On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:

> I am not able to reproduce your segfault on a Windows 7 platform either:
>
> ##########################
> fn1 <- "d:/DADOS_ENEM_2009.txt"
> sessionInfo()
> ## R version 3.4.1 (2017-06-30)
> ## Platform: x86_64-w64-mingw32/x64 (64-bit)
> ## Running under: Windows 7 x64 (build 7601) Service Pack 1
> ##
> ## Matrix products: default
> ##
> ## locale:
> ## [1] LC_COLLATE=English_United States.1252
> ## [2] LC_CTYPE=English_United States.1252
> ## [3] LC_MONETARY=English_United States.1252
> ## [4] LC_NUMERIC=C
> ## [5] LC_TIME=English_United States.1252
> ##
> ## attached base packages:
> ## [1] stats graphics grDevices utils datasets methods base
> ##
> ## loaded via a namespace (and not attached):
> ## [1] compiler_3.4.1
> tools::md5sum( fn1 )
> ## d:/DADOS_ENEM_2009.txt
> ## "83e61c96092285b60d7bf6b0dbc7072e"
> dat <- readLines( fn1 )
> length( dat )
> ## [1] 4148721
>
> On Sat, 15 Jul 2017, Jeff Newmiller wrote:
>
>> I am not able to reproduce this on a Linux platform:
>>
>> #######################3
>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
>> sessionInfo()
>> ## R version 3.4.1 (2017-06-30)
>> ## Platform: x86_64-pc-linux-gnu (64-bit)
>> ## Running under: Ubuntu 14.04.5 LTS
>> ##
>> ## Matrix products: default
>> ## BLAS: /usr/lib/libblas/libblas.so.3.0
>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
>> ##
>> ## locale:
>> ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> ## [9] LC_ADDRESS=C LC_TELEPHONE=C
>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> ##
>> ## attached base packages:
>> ## [1] stats graphics grDevices utils datasets methods base
>> ##
>> ## loaded via a namespace (and not attached):
>> ## [1] compiler_3.4.1
>> tools::md5sum( fn1 )
>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
>> ## "83e61c96092285b60d7bf6b0dbc7072e"
>> dat <- readLines( fn1 )
>> length( dat )
>> ## [1] 4148721
>>
>> No segfault occurs.
>>
>> On Sat, 15 Jul 2017, Anthony Damico wrote:
>>
>>> hi, i realized that the segfault happens on the text file in a new R
>>> session.  so, creating the segfault-generating text file requires a
>>> contributed package, but prompting the actual segfault does not --
>>> pretty sure that means this is a base R bug?  submitted here:
>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311
>>> hopefully i am not doing something remarkably stupid.  the text file
>>> itself is 4GB so cannot upload it to bugzilla, and from the
>>> R_AllocStringBuffer error in the previous message, i think most or
>>> all of it needs to be there to trigger the segfault.  thanks!
>>>
>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico <ajdamico at gmail.com> wrote:
>>>
>>>> hi, thanks Dr. Murdoch
>>>>
>>>> i'd appreciate if anyone on r-help could help me narrow this down?
>>>> i believe the segfault occurs because there's a single line with
>>>> 4GB and also embedded nuls, but i am not sure how to artificially
>>>> construct that?
>>>>
>>>> the lodown package can be removed from my example..  it is just for
>>>> file download cacheing, so `lodown::cachaca` can be replaced with
>>>> `download.file`  my current example requires a huge download, so
>>>> sort of painful to repeat but i'm pretty confident that's not the
>>>> issue.
>>>>
>>>> the archive::archive_extract() function unzips a (probably corrupt)
>>>> .RAR file and creates a text file with 80,937 lines.  this file is
>>>> 4GB:
>>>>
>>>> > file.size(infile)
>>>> [1] 4078192743
>>>>
>>>> i am pretty sure that nearly all of that 4GB is contained on a
>>>> single line in the file.  here's what happens when i create a file
>>>> connection and scan through..
>>>>
>>>> > file_con <- file( infile , 'r' )
>>>> >
>>>> > first_80936_lines <- readLines( file_con , n = 80936 )
>>>> > scan( w , n = 1 , what = character() )
>>>> Read 1 item
>>>> [1] "1000023930632009"
>>>> > scan( w , n = 1 , what = character() )
>>>> Read 1 item
>>>> [1] "36F2924009PAULO"
>>>> > scan( w , n = 1 , what = character() )
>>>> Read 1 item
>>>> [1] "AFONSO"
>>>> > scan( w , n = 1 , what = character() )
>>>> Read 1 item
>>>> [1] "BA11"
>>>> > scan( w , n = 1 , what = character() )
>>>> Read 1 item
>>>> [1] "00000"
>>>> > scan( w , n = 1 , what = character() )
>>>> Read 1 item
>>>> [1] "00"
>>>> > scan( w , n = 1 , what = character() )
>>>> Read 1 item
>>>> [1] "2924009PAULO"
>>>> > scan( w , n = 1 , what = character() )
>>>> Read 1 item
>>>> [1] "AFONSO"
>>>> > scan( w , n = 1 , what = character() )
>>>> Read 1 item
>>>> [1] "BA1111"
>>>> > scan( w , n = 1 , what = character() )
>>>> Read 1 item
>>>> [1] "467.20"
>>>> > scan( w , n = 1 , what = character() )
>>>> Read 1 item
>>>> [1] "346.10"
>>>> > scan( w , n = 1 , what = character() )
>>>> Read 1 item
>>>> [1] "414.40"
>>>> > scan( w , n = 1 , what = character() )
>>>> Error in scan(w, n = 1, what = character()) :
>>>>   could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
>>>>
>>>> making a huge single-line file does not reproduce the problem, i
>>>> think the embedded nuls have something to do with it--
>>>>
>>>> # WARNING do not run with less than 64GB RAM
>>>> tf <- tempfile()
>>>> a <- rep( "a" , 1000000000 )
>>>> b <- paste( a , collapse = '' )
>>>> writeLines( b , tf ) ; rm( b ) ; gc()
>>>> d <- readLines( tf )
>>>>
>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
>>>>
>>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>>>>>
>>>>>> hello, the last line of the code below causes a segfault for me
>>>>>> on 3.4.1.  i think i should submit to https://bugs.r-project.org/
>>>>>> unless others have advice?  thanks
>>>>>
>>>>> Segfaults are usually worth reporting as bugs.  Try to come up
>>>>> with a self-contained example, not using the lodown and archive
>>>>> packages.  I imagine you can do this by uploading the file you
>>>>> downloaded, or enough of a subset of it to trigger the segfault.
>>>>> If you can't do that, then likely the bug is with one of those
>>>>> packages, not with R.
>>>>>
>>>>> Duncan Murdoch
>>>>>
>>>>>> install.packages( "devtools" )
>>>>>> devtools::install_github("ajdamico/lodown")
>>>>>> devtools::install_github("jimhester/archive")
>>>>>>
>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>>>>
>>>>>> tf <- tempfile()
>>>>>>
>>>>>> # large download!  cachaca saves on your local disk if already downloaded
>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>>>>
>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>>>
>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>>>>
>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>>>
>>>>>> # works
>>>>>> R.utils::countLines( infile )
>>>>>>
>>>>>> # works with warning
>>>>>> my_file <- readLines( infile , skipNul = TRUE )
>>>>>>
>>>>>> # crash
>>>>>> my_file <- readLines( infile )
>>>>>>
>>>>>> # run just before crash
>>>>>> sessionInfo()
>>>>>> # R version 3.4.1 (2017-06-30)
>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>> # Running under: Windows 10 x64 (build 15063)
>>>>>>
>>>>>> # Matrix products: default
>>>>>>
>>>>>> # locale:
>>>>>> # [1] LC_COLLATE=English_United States.1252
>>>>>> # [2] LC_CTYPE=English_United States.1252
>>>>>> # [3] LC_MONETARY=English_United States.1252
>>>>>> # [4] LC_NUMERIC=C
>>>>>> # [5] LC_TIME=English_United States.1252
>>>>>>
>>>>>> # attached base packages:
>>>>>> # [1] stats graphics grDevices utils datasets methods base
>>>>>>
>>>>>> # loaded via a namespace (and not attached):
>>>>>> # [1] httr_1.2.1 compiler_3.4.1 R6_2.2.1 withr_1.0.2
>>>>>> # [5] tibble_1.3.3 curl_2.6 Rcpp_0.12.11 memoise_1.1.0
>>>>>> # [9] R.methodsS3_1.7.1 git2r_0.18.0 digest_0.6.12 lodown_0.1.0
>>>>>> # [13] R.utils_2.5.0 rlang_0.1.1 devtools_1.13.2 R.oo_1.21.0
>>>>>> # [17] archive_0.0.0.9000
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>       Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------
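On the question of how to artificially construct such a file: the sketch below is a scaled-down attempt that writes a few ordinary lines followed by one long line containing embedded nul bytes. It is an assumption that these two ingredients are what matter, the sizes (50 blocks of 1000 bytes) are arbitrary, and at this scale it is not expected to reproduce the segfault. It only shows that `writeBin()` on a binary connection can place nuls inside a line, which `writeLines()` cannot do.

```r
# scaled-down sketch: three short lines, then one long line with embedded nuls
tf <- tempfile( fileext = ".txt" )
con <- file( tf , "wb" )   # binary mode, so no newline translation on windows

# three ordinary lines (29 bytes including the newlines)
writeBin( charToRaw( "line one\nline two\nline three\n" ) , con )

# one long line: 50 blocks of 1000 "a" bytes, each followed by a single nul
block <- charToRaw( strrep( "a" , 1000 ) )
for ( i in 1:50 ) {
  writeBin( block , con )
  writeBin( as.raw( 0 ) , con )
}

# terminate the long line
writeBin( charToRaw( "\n" ) , con )
close( con )

# count the nul bytes directly from the raw contents
bytes <- readBin( tf , what = "raw" , n = file.size( tf ) )
n_nul <- sum( bytes == as.raw( 0 ) )

# skipNul = TRUE drops the nuls and keeps the rest of each line
clean <- readLines( tf , skipNul = TRUE )
```

Scaling the loop up until the long line passes 2GB would be the obvious next step, but that is untested here and would need a machine with plenty of disk and memory.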
sorry, typo, 80937 not 809367

On Sun, Jul 16, 2017 at 6:21 AM, Anthony Damico <ajdamico at gmail.com> wrote:

> hi, thank you for attempting this.  it looks like your unix machine
> unzipped the txt file without corruption -- if you copied over the same
> txt file to windows 7, i don't think that would reproduce the problem?
> i think it needs to be the corrupted text file where
> R.utils::countLines( txtfile ) gives 809367.  i am able to reproduce on
> two distinct windows machines but no guarantee i'm not doing something
> dumb
So you are saying there are two problems... one that produces a corrupt file from a valid compressed file, and one that segfaults when presented with that corrupt file? Can you please confirm the file name and run md5sum on it and share the result so we can tell when the file problem has been reproduced? -- Sent from my phone. Please excuse my brevity. On July 16, 2017 3:21:21 AM PDT, Anthony Damico <ajdamico at gmail.com> wrote:>hi, thank you for attempting this. it looks like your unix machine >unzipped >the txt file without corruption -- if you copied over the same txt file >to >windows 7, i don't think that would reproduce the problem? i think it >needs to be the corrupted text file where R.utils::countLines( >txtfile >) gives 809367. i am able to reproduce on two distinct windows >machines >but no guarantee i'm not doing something dumb > >On Sat, Jul 15, 2017 at 6:29 PM, Jeff Newmiller ><jdnewmil at dcn.davis.ca.us> >wrote: > >> I am not able to reproduce your segfault on a Windows 7 platform >either: >> >> ########################## >> fn1 <- "d:/DADOS_ENEM_2009.txt" >> sessionInfo() >> ## R version 3.4.1 (2017-06-30) >> ## Platform: x86_64-w64-mingw32/x64 (64-bit) >> ## Running under: Windows 7 x64 (build 7601) Service Pack 1 >> ## >> ## Matrix products: default >> ## >> ## locale: >> ## [1] LC_COLLATE=English_United States.1252 >> ## [2] LC_CTYPE=English_United States.1252 >> ## [3] LC_MONETARY=English_United States.1252 >> ## [4] LC_NUMERIC=C >> ## [5] LC_TIME=English_United States.1252 >> ## >> ## attached base packages: >> ## [1] stats graphics grDevices utils datasets methods >base >> ## >> ## loaded via a namespace (and not attached): >> ## [1] compiler_3.4.1 >> tools::md5sum( fn1 ) >> ## d:/DADOS_ENEM_2009.txt >> ## "83e61c96092285b60d7bf6b0dbc7072e" >> dat <- readLines( fn1 ) >> length( dat ) >> ## [1] 4148721 >> >> >> On Sat, 15 Jul 2017, Jeff Newmiller wrote: >> >> I am not able to reproduce this on a Linux platform: >>> >>> 
#######################3 >>> fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem >>> 2009/DADOS_ENEM_2009.txt" >>> sessionInfo() >>> ## R version 3.4.1 (2017-06-30) >>> ## Platform: x86_64-pc-linux-gnu (64-bit) >>> ## Running under: Ubuntu 14.04.5 LTS >>> ## >>> ## Matrix products: default >>> ## BLAS: /usr/lib/libblas/libblas.so.3.0 >>> ## LAPACK: /usr/lib/lapack/liblapack.so.3.0 >>> ## >>> ## locale: >>> ## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C >>> ## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 >>> ## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 >>> ## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C >>> ## [9] LC_ADDRESS=C LC_TELEPHONE=C >>> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C >>> ## >>> ## attached base packages: >>> ## [1] stats graphics grDevices utils datasets methods >base >>> ## >>> ## loaded via a namespace (and not attached): >>> ## [1] compiler_3.4.1 >>> tools::md5sum( fn1 ) >>> ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem >>> 2009/DADOS_ENEM_2009.txt >>> ## >>> "83e61c96092285b60d7bf6b0dbc7072e" >>> dat <- readLines( fn1 ) >>> length( dat ) >>> ## [1] 4148721 >>> >>> No segfault occurs. >>> >>> On Sat, 15 Jul 2017, Anthony Damico wrote: >>> >>> hi, i realized that the segfault happens on the text file in a new R >>>> session. so, creating the segfault-generating text file requires a >>>> contributed package, but prompting the actual segfault does not -- >pretty >>>> sure that means this is a base R bug? submitted here: >>>> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311 >hopefully i >>>> am >>>> not doing something remarkably stupid. the text file itself is 4GB >so >>>> cannot upload it to bugzilla, and from the R_AllocStringBugger >error in >>>> the >>>> previous message, i think most or all of it needs to be there to >trigger >>>> the segfault. thanks! >>>> >>>> >>>> On Sat, Jul 15, 2017 at 10:32 AM, Anthony Damico ><ajdamico at gmail.com> >>>> wrote: >>>> >>>> hi, thanks Dr. 
Murdoch >>>>> >>>>> >>>>> i'd appreciate if anyone on r-help could help me narrow this down? > i >>>>> believe the segfault occurs because there's a single line with 4GB >and >>>>> also >>>>> embedded nuls, but i am not sure how to artificially construct >that? >>>>> >>>>> >>>>> the lodown package can be removed from my example.. it is just >for file >>>>> download cacheing, so `lodown::cachaca` can be replaced with >>>>> `download.file` my current example requires a huge download, so >sort of >>>>> painful to repeat but i'm pretty confident that's not the issue. >>>>> >>>>> >>>>> the archive::archive_extract() function unzips a (probably >corrupt) .RAR >>>>> file and creates a text file with 80,937 lines. this file is 4GB: >>>>> >>>>> > file.size(infile) >>>>> [1] 4078192743 <(407)%20819-2743> >>>>> >>>>> >>>>> i am pretty sure that nearly all of that 4GB is contained on a >single >>>>> line >>>>> in the file. here's what happens when i create a file connection >and >>>>> scan >>>>> through.. 
>>>>> >>>>> > file_con <- file( infile , 'r' ) >>>>> > >>>>> > first_80936_lines <- readLines( file_con , n = 80936 ) >>>>> > scan( w , n = 1 , what = character() ) >>>>> Read 1 item >>>>> [1] "1000023930632009" >>>>> > scan( w , n = 1 , what = character() ) >>>>> Read 1 item >>>>> [1] "36F2924009PAULO" >>>>> > scan( w , n = 1 , what = character() ) >>>>> Read 1 item >>>>> [1] "AFONSO" >>>>> > scan( w , n = 1 , what = character() ) >>>>> Read 1 item >>>>> [1] "BA11" >>>>> > scan( w , n = 1 , what = character() ) >>>>> Read 1 item >>>>> [1] "00000" >>>>> > scan( w , n = 1 , what = character() ) >>>>> Read 1 item >>>>> [1] "00" >>>>> > scan( w , n = 1 , what = character() ) >>>>> Read 1 item >>>>> [1] "2924009PAULO" >>>>> > scan( w , n = 1 , what = character() ) >>>>> Read 1 item >>>>> [1] "AFONSO" >>>>> > scan( w , n = 1 , what = character() ) >>>>> Read 1 item >>>>> [1] "BA1111" >>>>> > scan( w , n = 1 , what = character() ) >>>>> Read 1 item >>>>> [1] "467.20" >>>>> > scan( w , n = 1 , what = character() ) >>>>> Read 1 item >>>>> [1] "346.10" >>>>> > scan( w , n = 1 , what = character() ) >>>>> Read 1 item >>>>> [1] "414.40" >>>>> > scan( w , n = 1 , what = character() ) >>>>> Error in scan(w, n = 1, what = character()) : >>>>> could not allocate memory (2048 Mb) in C function >>>>> 'R_AllocStringBuffer' >>>>> >>>>> >>>>> >>>>> making a huge single-line file does not reproduce the problem, i >think >>>>> the >>>>> embedded nuls have something to do with it-- >>>>> >>>>> >>>>> # WARNING do not run with less than 64GB RAM >>>>> tf <- tempfile() >>>>> a <- rep( "a" , 1000000000 ) >>>>> b <- paste( a , collapse = '' ) >>>>> writeLines( b , tf ) ; rm( b ) ; gc() >>>>> d <- readLines( tf ) >>>>> >>>>> >>>>> >>>>> On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch < >>>>> murdoch.duncan at gmail.com> >>>>> wrote: >>>>> >>>>> On 15/07/2017 7:35 AM, Anthony Damico wrote: >>>>>> >>>>>> hello, the last line of the code below causes a segfault for me >on >>>>>>> 3.4.1. 
>>>>>>> i think i should submit to https://bugs.r-project.org/ unless others
>>>>>>> have advice?  thanks
>>>>>>
>>>>>> Segfaults are usually worth reporting as bugs.  Try to come up with a
>>>>>> self-contained example, not using the lodown and archive packages.  I
>>>>>> imagine you can do this by uploading the file you downloaded, or enough
>>>>>> of a subset of it to trigger the segfault.  If you can't do that, then
>>>>>> likely the bug is with one of those packages, not with R.
>>>>>>
>>>>>> Duncan Murdoch
>>>>>>
>>>>>>> install.packages( "devtools" )
>>>>>>> devtools::install_github("ajdamico/lodown")
>>>>>>> devtools::install_github("jimhester/archive")
>>>>>>>
>>>>>>> file_folder <- file.path( tempdir() , "file_folder" )
>>>>>>>
>>>>>>> tf <- tempfile()
>>>>>>>
>>>>>>> # large download!  cachaca saves on your local disk if already downloaded
>>>>>>> lodown::cachaca( 'http://download.inep.gov.br/microdados/microdados_enem2009.rar' , tf , mode = 'wb' )
>>>>>>>
>>>>>>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>>>>>>
>>>>>>> unzipped_files <- list.files( file_folder , recursive = TRUE , full.names = TRUE )
>>>>>>>
>>>>>>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>>>>>>
>>>>>>> # works
>>>>>>> R.utils::countLines( infile )
>>>>>>>
>>>>>>> # works with warning
>>>>>>> my_file <- readLines( infile , skipNul = TRUE )
>>>>>>>
>>>>>>> # crash
>>>>>>> my_file <- readLines( infile )
>>>>>>>
>>>>>>> # run just before crash
>>>>>>> sessionInfo()
>>>>>>> # R version 3.4.1 (2017-06-30)
>>>>>>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>>>>> # Running under: Windows 10 x64 (build 15063)
>>>>>>>
>>>>>>> # Matrix products: default
>>>>>>>
>>>>>>> # locale:
>>>>>>> # [1] LC_COLLATE=English_United States.1252
>>>>>>> # [2] LC_CTYPE=English_United States.1252
>>>>>>> # [3] LC_MONETARY=English_United States.1252
>>>>>>> # [4] LC_NUMERIC=C
>>>>>>> # [5] LC_TIME=English_United States.1252
>>>>>>>
>>>>>>> # attached base packages:
>>>>>>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>>>>
>>>>>>> # loaded via a namespace (and not attached):
>>>>>>> # [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>>>>>>> # [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>>>>>>> # [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>>>>>>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>>>>>>> # [17] archive_0.0.0.9000
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>> ---------------------------------------------------------------------------
>>> Jeff Newmiller                        The     .....       .....  Go Live...
>>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>>                                       Live:   OO#.. Dead: OO#..  Playing
>>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>>> ---------------------------------------------------------------------------
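
The quoted example reads the file with `skipNul = TRUE` but crashes on a plain `readLines( infile )`, and the thread suspects embedded nuls on one very long line. Below is a minimal sketch, using only base R, of how a file containing embedded nul bytes could be written and read back. The file here is a tiny hypothetical stand-in, not the 4GB ENEM file, and it is not expected to reproduce the segfault; it only shows how nul bytes can be placed in a line and counted.

```r
# hypothetical stand-in file: two lines, the second with two embedded nul bytes
tf <- tempfile( fileext = ".txt" )

bytes <-
    c(
        charToRaw( "first line\n" ) ,
        charToRaw( "abc" ) ,
        as.raw( 0 ) , as.raw( 0 ) ,
        charToRaw( "def\n" )
    )

# writeBin() writes the raw vector byte-for-byte, nuls included
writeBin( bytes , tf )

# skipNul = TRUE skips the nul bytes and keeps the rest of each line
clean <- readLines( tf , skipNul = TRUE )

# count the nul bytes by reading the file back as raw
raw_contents <- readBin( tf , what = "raw" , n = file.size( tf ) )
n_nul <- sum( raw_contents == as.raw( 0 ) )
```

Scaling the same idea up to a multi-gigabyte single line would be one way to try to build a shareable test file, though whether that triggers the crash is untested here.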
hi, yep, there are two problems -- but i think only the segfault is within
the scope of a base R issue?  i need to look closer at the corrupted
decompression and figure out whether i should talk to the brazilian
government agency that creates that .rar file or open an issue with the
archive package maintainer.

my goal in this thread is only to figure out how to replicate the goofy text
file so the r team can turn it into an error instead of a segfault.

the original example i sent stores the .txt file somewhere inside the
tempdir(), but when i copy it over elsewhere on my machine, the md5sum()
gives the same result.  thanks again for looking at this

> tools::md5sum(infile)
C:\\Users\\AnthonyD\\AppData\\Local\\Temp\\RtmpIBy7qt/file_folder/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
"30beb57419486108e98d42ec7a2f8b19"
> tools::md5sum( "S:/temp/crash.txt" )
S:/temp/crash.txt
"30beb57419486108e98d42ec7a2f8b19"


On Sun, Jul 16, 2017 at 10:10 AM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:

> So you are saying there are two problems... one that produces a corrupt
> file from a valid compressed file, and one that segfaults when presented
> with that corrupt file? Can you please confirm the file name and run md5sum
> on it and share the result so we can tell when the file problem has been
> reproduced?
> --
> Sent from my phone. Please excuse my brevity.
>
> On July 16, 2017 3:21:21 AM PDT, Anthony Damico <ajdamico at gmail.com> wrote:
> >hi, thank you for attempting this.  it looks like your unix machine
> >unzipped the txt file without corruption -- if you copied over the same
> >txt file to windows 7, i don't think that would reproduce the problem?
> >i think it needs to be the corrupted text file where
> >R.utils::countLines( txtfile ) gives 809367.
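
The md5sum() values exchanged in this thread are what distinguish the intact extraction from the one that triggers the segfault. Here is a minimal sketch of checking a local file against them. The two hashes are the ones reported in the messages above; the helper name `which_version` is hypothetical and not part of any package.

```r
# md5 values reported in this thread
md5_intact  <- "83e61c96092285b60d7bf6b0dbc7072e"  # Jeff Newmiller's extraction, no segfault
md5_corrupt <- "30beb57419486108e98d42ec7a2f8b19"  # the file that triggers the segfault

# hypothetical helper: label a file by comparing its md5 to the two values above
which_version <- function( path ) {
    this_md5 <- unname( tools::md5sum( path ) )
    # tools::md5sum() returns NA for files it cannot read
    if ( is.na( this_md5 ) ) return( "missing" )
    if ( this_md5 == md5_corrupt ) return( "corrupt" )
    if ( this_md5 == md5_intact ) return( "intact" )
    "unknown"
}

# self-contained check: a copied file has the same md5 as its source
src <- tempfile( fileext = ".txt" )
writeLines( c( "a" , "b" ) , src )
dst <- tempfile( fileext = ".txt" )
file.copy( src , dst )

sums <- tools::md5sum( c( src , dst ) )
copy_is_identical <- sums[[ 1 ]] == sums[[ 2 ]]
```

Calling `which_version( infile )` on the extracted DADOS_ENEM_2009.txt would then say whether a given machine has reproduced the corrupt file before anyone attempts the `readLines()` call.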