hi, thanks Dr. Murdoch
i'd appreciate if anyone on r-help could help me narrow this down?  i
believe the segfault occurs because there's a single line with 4GB and also
embedded nuls, but i am not sure how to artificially construct that?
the lodown package can be removed from my example.  it is just for file
download caching, so `lodown::cachaca` can be replaced with
`download.file`.  my current example requires a huge download, so it is sort
of painful to repeat, but i'm pretty confident that's not the issue.
the archive::archive_extract() function unzips a (probably corrupt) .RAR
file and creates a text file with 80,937 lines.  this file is 4GB:
    > file.size(infile)
    [1] 4078192743
i am pretty sure that nearly all of that 4GB is contained on a single line
in the file.  here's what happens when i create a file connection and scan
through..
    > w <- file( infile , 'r' )
    >
    > first_80936_lines <- readLines( w , n = 80936 )
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "1000023930632009"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "36F2924009PAULO"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "AFONSO"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "BA11"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "00000"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "00"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "2924009PAULO"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "AFONSO"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "BA1111"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "467.20"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "346.10"
    > scan( w , n = 1 , what = character() )
    Read 1 item
    [1] "414.40"
    > scan( w , n = 1 , what = character() )
    Error in scan(w, n = 1, what = character()) :
      could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'
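One way to check the "nearly all of it is on one line" guess without asking R to hold that line in a single string is to scan the raw bytes for newline characters. The following is a minimal sketch, not something from the thread: `line_lengths` and `chunk_size` are hypothetical names, and it assumes lines end in `\n` (a `\r` before it would be counted as part of the line).

```r
# Sketch: measure line lengths by scanning raw bytes for newlines (0x0a),
# so no single line ever has to fit in one R string.
line_lengths <- function(path, chunk_size = 1e6) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  lengths <- numeric()
  current <- 0                       # bytes seen so far on the current line
  repeat {
    bytes <- readBin(con, what = "raw", n = chunk_size)
    if (length(bytes) == 0) break
    nl <- which(bytes == as.raw(0x0a))
    if (length(nl)) {
      seg <- diff(c(0, nl)) - 1      # bytes between consecutive newlines
      seg[1] <- seg[1] + current     # first segment continues the prior chunk
      lengths <- c(lengths, seg)
      current <- length(bytes) - nl[length(nl)]
    } else {
      current <- current + length(bytes)
    }
  }
  if (current > 0) lengths <- c(lengths, current)  # unterminated last line
  lengths
}

# Tiny demonstration file (three lines, the last one without a newline).
demo <- tempfile()
writeBin(charToRaw("ab\ncde\nfgh"), demo)
lens <- line_lengths(demo, chunk_size = 4)
lens              # length in bytes of each line
which.max(lens)   # index of the longest line
```

On the real file, `max(line_lengths(infile))` would show how many bytes the longest line holds, which is the number that matters for the string buffer allocation.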
making a huge single-line file does not reproduce the problem, i think the
embedded nuls have something to do with it--
    # WARNING do not run with less than 64GB RAM
    tf <- tempfile()
    a <- rep( "a" , 1000000000 )
    b <- paste( a , collapse = '' )
    writeLines( b , tf ) ; rm( b ) ; gc()
    d <- readLines( tf )
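To construct a line with embedded nuls artificially, the bytes can be written with `writeBin()` on a binary connection, since R character strings cannot contain nul bytes. This is a small-scale sketch only: the sizes and nul positions are made up for illustration, and it is not known whether scaling it up reproduces the segfault.

```r
# Sketch: a short first line, then one long line of "a" with embedded nuls.
# The sizes and nul positions below are hypothetical.
tf <- tempfile()
con <- file(tf, open = "wb")
line_bytes <- as.raw(rep(utf8ToInt("a"), 1000))
line_bytes[c(10, 500, 990)] <- as.raw(0)     # embed three nul bytes
writeBin(charToRaw("first line\n"), con)
writeBin(line_bytes, con)
writeBin(charToRaw("\n"), con)
close(con)

file.size(tf)

# skipNul = TRUE drops the nul bytes while reading.
with_skip <- suppressWarnings(readLines(tf, skipNul = TRUE))
nchar(with_skip)
```

Raising the `1000` toward the multi-gigabyte range would be the obvious next experiment, with the same memory caveat as the example above.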
On Sat, Jul 15, 2017 at 9:17 AM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
> On 15/07/2017 7:35 AM, Anthony Damico wrote:
>
>> hello, the last line of the code below causes a segfault for me on 3.4.1.
>> i think i should submit to https://bugs.r-project.org/  unless others have
>> advice?  thanks
>>
>
> Segfaults are usually worth reporting as bugs.  Try to come up with a
> self-contained example, not using the lodown and archive packages.  I
> imagine you can do this by uploading the file you downloaded, or enough of
> a subset of it to trigger the segfault.  If you can't do that, then likely
> the bug is with one of those packages, not with R.
>
> Duncan Murdoch
>
>
>>
>>
>>
>>
>> install.packages( "devtools" )
>> devtools::install_github("ajdamico/lodown")
>> devtools::install_github("jimhester/archive")
>>
>>
>> file_folder <- file.path( tempdir() , "file_folder" )
>>
>> tf <- tempfile()
>>
>> # large download!  cachaca saves on your local disk if already downloaded
>> lodown::cachaca(
>>   'http://download.inep.gov.br/microdados/microdados_enem2009.rar' ,
>>   tf , mode = 'wb' )
>>
>> archive::archive_extract( tf , dir = normalizePath( file_folder ) )
>>
>> unzipped_files <- list.files( file_folder , recursive = TRUE ,
>>   full.names = TRUE )
>>
>> infile <- grep( "DADOS(.*)\\.txt$" , unzipped_files , value = TRUE )
>>
>> # works
>> R.utils::countLines( infile )
>>
>> # works with warning
>> my_file <- readLines( infile , skipNul = TRUE )
>>
>> # crash
>> my_file <- readLines( infile )
>>
>>
>> # run just before crash
>> sessionInfo()
>> # R version 3.4.1 (2017-06-30)
>> # Platform: x86_64-w64-mingw32/x64 (64-bit)
>> # Running under: Windows 10 x64 (build 15063)
>>
>> # Matrix products: default
>>
>> # locale:
>> # [1] LC_COLLATE=English_United States.1252
>> # [2] LC_CTYPE=English_United States.1252
>> # [3] LC_MONETARY=English_United States.1252
>> # [4] LC_NUMERIC=C
>> # [5] LC_TIME=English_United States.1252
>>
>> # attached base packages:
>> # [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>> # loaded via a namespace (and not attached):
>> #  [1] httr_1.2.1         compiler_3.4.1     R6_2.2.1           withr_1.0.2
>> #  [5] tibble_1.3.3       curl_2.6           Rcpp_0.12.11       memoise_1.1.0
>> #  [9] R.methodsS3_1.7.1  git2r_0.18.0       digest_0.6.12      lodown_0.1.0
>> # [13] R.utils_2.5.0      rlang_0.1.1        devtools_1.13.2    R.oo_1.21.0
>> # [17] archive_0.0.0.9000
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>
hi, i realized that the segfault happens on the text file in a new R
session.  so, creating the segfault-generating text file requires a
contributed package, but prompting the actual segfault does not -- pretty
sure that means this is a base R bug?  submitted here:
https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i am
not doing something remarkably stupid.  the text file itself is 4GB so i
cannot upload it to bugzilla, and from the R_AllocStringBuffer error in the
previous message, i think most or all of it needs to be there to trigger
the segfault.  thanks!
On 15/07/2017 11:33 AM, Anthony Damico wrote:
> hi, i realized that the segfault happens on the text file in a new R
> session.  so, creating the segfault-generating text file requires a
> contributed package, but prompting the actual segfault does not --
> pretty sure that means this is a base R bug?  submitted here:
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i
> am not doing something remarkably stupid.  the text file itself is 4GB
> so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
> in the previous message, i think most or all of it needs to be there to
> trigger the segfault.  thanks!
Hopefully someone can debug it with the info you provided.
Duncan Murdoch
I am not able to reproduce this on a Linux platform:
    #######################
    fn1 <- "/home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt"
    sessionInfo()
    ## R version 3.4.1 (2017-06-30)
    ## Platform: x86_64-pc-linux-gnu (64-bit)
    ## Running under: Ubuntu 14.04.5 LTS
    ##
    ## Matrix products: default
    ## BLAS: /usr/lib/libblas/libblas.so.3.0
    ## LAPACK: /usr/lib/lapack/liblapack.so.3.0
    ##
    ## locale:
    ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
    ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
    ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
    ##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
    ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
    ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    ##
    ## attached base packages:
    ## [1] stats     graphics  grDevices utils     datasets  methods   base
    ##
    ## loaded via a namespace (and not attached):
    ## [1] compiler_3.4.1
    tools::md5sum( fn1 )
    ## /home/jdnewmil/Downloads/Microdados ENEM 2009/Dados Enem 2009/DADOS_ENEM_2009.txt
    ##                                                "83e61c96092285b60d7bf6b0dbc7072e"
    dat <- readLines( fn1 )
    length( dat )
    ## [1] 4148721
No segfault occurs.
---------------------------------------------------------------------------
Jeff Newmiller
DCN:<jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
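The Linux run reports 4,148,721 lines, while the file described earlier in the thread has 80,937 lines, so the two extracted files may not be byte-identical. Comparing sizes and checksums would settle that. A minimal sketch follows; `same_file` is a hypothetical helper, and the demonstration uses a tiny temporary file rather than the ENEM data.

```r
# Sketch: compare a file's MD5 checksum against a reference value.
# same_file() is a hypothetical helper, not part of any package.
same_file <- function(path, expected_md5) {
  unname(tools::md5sum(path)) == expected_md5
}

# Demonstration on a five-byte file containing "hello" (no newline).
demo <- tempfile()
writeBin(charToRaw("hello"), demo)
same_file(demo, "5d41402abc4b2a76b9719d911017c592")
```

On the Windows machine, `same_file(infile, "83e61c96092285b60d7bf6b0dbc7072e")` together with `file.size(infile)` would show whether the file that segfaults matches the one read successfully on Linux.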
On 15/07/2017 11:33 AM, Anthony Damico wrote:
> hi, i realized that the segfault happens on the text file in a new R
> session.  so, creating the segfault-generating text file requires a
> contributed package, but prompting the actual segfault does not --
> pretty sure that means this is a base R bug?  submitted here:
> https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=17311  hopefully i
> am not doing something remarkably stupid.  the text file itself is 4GB
> so cannot upload it to bugzilla, and from the R_AllocStringBuffer error
> in the previous message, i think most or all of it needs to be there to
> trigger the segfault.  thanks!
I don't want to download the big file or install the archive package.
Could you run the code below on the bad file?  If you're right and it's
only nulls that matter, this might allow me to create a file that triggers
the bug.
    f <- "FILENAME"  # put the filename of the bad file here
    con <- file(f, open="rb")
    zeros <- numeric()
    count <- 0       # bytes read so far
    repeat {
      # read up to 1,000,000 single-byte integers per pass
      bytes <- readBin(con, "int", 1000000, size=1)
      zeros <- c(zeros, count + which(bytes == 0))
      count <- count + length(bytes)
      if (length(bytes) < 1000000) break
    }
    close(con)
    cat("File length=", count, "\n")
    cat("Nulls:\n")
    zeros
Here's some code to recreate a file of the same length with nulls in the
same places, and spaces everywhere else:
    size <- count
    f2 <- tempfile()
    con <- file(f2, open="wb")
    count <- 0
    while (count < size) {
      # number of non-null bytes to write before the next null (or chunk end)
      nonzeros <- min(c(size - count, 1000000, zeros - 1))
      if (nonzeros) {
        writeBin(rep(32L, nonzeros), con, size = 1)
        count <- count + nonzeros
      }
      zeros <- zeros - nonzeros
      if (length(zeros) && min(zeros) == 1) {
        writeBin(0L, con, size = 1)
        count <- count + 1
        zeros <- zeros[-1] - 1
      }
    }
    close(con)
Duncan Murdoch
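Before running the two scripts on the 4GB file, the round trip can be tried on a tiny file. The sketch below wraps the same two steps in functions and checks that the rebuilt file has nulls in the same places. `find_nuls` and `write_placeholder` are hypothetical names, the chunk size is a parameter so that chunk boundaries get exercised on a small input, and bytes are read as `"raw"` rather than as one-byte integers, which is a swap made here for brevity.

```r
# Locate nul bytes (1-based offsets) and the total length, reading in chunks.
find_nuls <- function(path, chunk = 1000000) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  zeros <- numeric()
  count <- 0
  repeat {
    bytes <- readBin(con, "raw", chunk)
    zeros <- c(zeros, count + which(bytes == as.raw(0)))
    count <- count + length(bytes)
    if (length(bytes) < chunk) break
  }
  list(size = count, zeros = zeros)
}

# Write `size` bytes: nuls at offsets `zeros`, spaces (0x20) everywhere else.
write_placeholder <- function(path, size, zeros, chunk = 1000000) {
  con <- file(path, open = "wb")
  on.exit(close(con))
  count <- 0
  while (count < size) {
    nonzeros <- min(c(size - count, chunk, zeros - 1))
    if (nonzeros) {
      writeBin(as.raw(rep(32L, nonzeros)), con)
      count <- count + nonzeros
    }
    zeros <- zeros - nonzeros
    if (length(zeros) && min(zeros) == 1) {
      writeBin(as.raw(0), con)
      count <- count + 1
      zeros <- zeros[-1] - 1
    }
  }
  invisible(path)
}

# Tiny test input: 20 bytes of "A" with nuls at offsets 5, 6 and 20.
src <- tempfile()
bytes <- as.raw(rep(65L, 20))
bytes[c(5, 6, 20)] <- as.raw(0)
writeBin(bytes, src)

info <- find_nuls(src, chunk = 8)
copy <- write_placeholder(tempfile(), info$size, info$zeros, chunk = 8)
copy_info <- find_nuls(copy, chunk = 8)
```

If `copy_info` matches `info`, the same pair of steps should carry the null layout of the real file into a placeholder file that needs only the list of offsets to rebuild.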