Jennifer Lyon
2017-Sep-03 18:50 UTC
[Rd] readLines() segfaults on large file & question on how to work around
Jeroen:

Thank you for pointing me to ndjson, which I had not heard of and is
exactly my case.

My experience:
jsonlite::stream_in - segfaults
ndjson::stream_in - my fault, I am running Ubuntu 14.04 and it is too old
    so it won't compile the package
corpus::read_ndjson - works!!! Of course it does a different simplification
    than jsonlite::fromJSON, so I have to change some code, but it works
    beautifully, at least in simple tests. The memory-map option may be of
    use in the future.

Another correspondent said that a string in R can only be 2^31-1 characters
long, which is why any "solution" that tries to load the whole file into R
first as a single string will fail.

Thanks for suggesting a path forward for me!

Jen

On Sun, Sep 3, 2017 at 2:15 AM, Jeroen Ooms <jeroenooms at gmail.com> wrote:

> On Sat, Sep 2, 2017 at 8:58 PM, Jennifer Lyon
> <jennifer.s.lyon at gmail.com> wrote:
> > I have a 2.1GB JSON file. Typically I use readLines() and
> > jsonlite::fromJSON() to extract data from a JSON file.
>
> If your data consists of one JSON object per line, this is called
> 'ndjson'. There are several packages specialized to read ndjson files:
>
> - corpus::read_ndjson
> - ndjson::stream_in
> - jsonlite::stream_in
>
> In particular, the 'corpus' package handles large files really well
> because it has an option to memory-map the file instead of reading all
> of its data into memory.
>
> If the data is too large to read, you can preprocess it using
> stedolan.github.io/jq to extract the fields that you need.
>
> You really don't need hadoop/spark/etc for this.
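For reference, a minimal sketch of the two streaming approaches discussed
above, assuming an ndjson file named "big.ndjson"; the mmap and pagesize
argument names follow the corpus and jsonlite documentation of the time,
so verify them against your installed versions:

# corpus: can memory-map the file instead of loading it all into RAM
library(corpus)
recs <- read_ndjson("big.ndjson", mmap = TRUE)

# jsonlite: streams the file through a connection, one page of lines
# at a time, so the whole file is never held as a single string
library(jsonlite)
recs2 <- stream_in(file("big.ndjson"), pagesize = 10000)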
rhelp at eoos.dds.nl
2017-Sep-04 06:46 UTC
[Rd] readLines() segfaults on large file & question on how to work around
Although the problem can apparently be avoided in this case, readLines
causing a segfault still seems unwanted behaviour to me. I can
replicate this with the example below (sessionInfo is further down):

# Generate an example file: 2500 writes of 1e6 characters with sep = ""
# produce a single line of 2.5e9 characters, which exceeds R's 2^31 - 1
# limit on the length of a single string
l <- paste0(sample(c(letters, LETTERS), 1E6, replace = TRUE),
            collapse = "")
con <- file("test.txt", "wt")
for (i in seq_len(2500)) {
  writeLines(l, con, sep = "")
}
close(con)

# Causes segfault:
readLines("test.txt")

The error reported by readr is also reproduced (a more informative
error message and checking for integer overflows would be nice). I will
report this with readr.

library(readr)
read_file("test.txt")
# Error in read_file_(ds, locale) : negative length vectors are not
# allowed

--
Jan

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_1.1.1

loaded via a namespace (and not attached):
[1] compiler_3.4.1 R6_2.2.2       hms_0.3        tools_3.4.1
[5] tibble_1.3.3   Rcpp_0.12.12   rlang_0.1.2

On 03-09-17 20:50, Jennifer Lyon wrote:
> Thank you for pointing me to ndjson, which I had not heard of and is
> exactly my case.
> [...]
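Since the example file consists of a single line of 2.5e9 characters, no
call that tries to return that line as one string can succeed. A minimal
base-R sketch of a workaround is to process the file in fixed-size pieces;
the chunk size and the processing step are placeholders:

con <- file("test.txt", "rb")
repeat {
  # read up to 1e6 characters at a time; each piece stays well under
  # the 2^31 - 1 per-string limit
  chunk <- readChar(con, nchars = 1e6, useBytes = TRUE)
  if (length(chunk) == 0) break  # readChar returns character(0) at EOF
  # process `chunk` here (placeholder)
}
close(con)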
Tomas Kalibera
2017-Sep-04 11:36 UTC
[Rd] readLines() segfaults on large file & question on how to work around
As of R-devel 72925 one gets a proper error message instead of the crash.

Tomas

On 09/04/2017 08:46 AM, rhelp at eoos.dds.nl wrote:
> Although the problem can apparently be avoided in this case, readLines
> causing a segfault still seems unwanted behaviour to me. I can
> replicate this with the example below (sessionInfo is further down):
> [...]
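Until a build with that fix is available, a defensive wrapper in base R
can fail early instead of crashing. This is a sketch only; the function
name and the conservative size check are illustrative assumptions, not
anything provided by R itself:

# Refuse whole-file reads large enough that a single line could exceed
# R's 2^31 - 1 limit on string length (conservative: assumes the worst
# case of the entire file being one line)
safe_read_lines <- function(path) {
  if (file.size(path) >= 2^31 - 1) {
    stop("file may contain a line longer than R's maximum string ",
         "length; read it in chunks instead")
  }
  readLines(path)
}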