thr3ads.net - R devel - [Rd] readLines() segfaults on large file & question on how to work around [Sep 2017]


Jennifer Lyon

2017-Sep-03 18:50 UTC

[Rd] readLines() segfaults on large file & question on how to work around

Jeroen:

Thank you for pointing me to ndjson, which I had not heard of and is
exactly my case.

My experience:
jsonlite::stream_in - segfaults
ndjson::stream_in - my fault, I am running Ubuntu 14.04 and it is too old
      so it won't compile the package
corpus::read_ndjson - works!!! Of course it does a different simplification
     than jsonlite::fromJSON, so I have to change some code, but it works
     beautifully at least in simple tests. The memory-map option may be of
     use in the future.

Another correspondent said that strings in R can only be 2^31-1 bytes long,
which is why any "solution" that tries to load the whole file into R first
as a single string will fail.
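
The limit above can be checked before attempting to read a file. A minimal sketch in base R, assuming the whole file would end up in a single string; the helper name `fits_in_r_string` is hypothetical and not from any package:

```r
# Largest number of bytes a single R character string can hold (2^31 - 1).
max_string_bytes <- .Machine$integer.max

# Hypothetical helper: TRUE if the whole file could fit in one R string.
fits_in_r_string <- function(path) {
  size <- file.size(path)   # returned as a double, so no overflow here
  !is.na(size) && size <= max_string_bytes
}

# Small demonstration file
tmp <- tempfile(fileext = ".txt")
writeLines("hello", tmp)
fits_in_r_string(tmp)
```

A file that fails this check has to be read in pieces or handled by a tool that does not build one large string.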

Thanks for suggesting a path forward for me!

Jen

On Sun, Sep 3, 2017 at 2:15 AM, Jeroen Ooms <jeroenooms at gmail.com> wrote:
> On Sat, Sep 2, 2017 at 8:58 PM, Jennifer Lyon <jennifer.s.lyon at gmail.com> wrote:
> > I have a 2.1GB JSON file. Typically I use readLines() and
> > jsonlite::fromJSON() to extract data from a JSON file.
>
> If your data consists of one JSON object per line, this is called
> 'ndjson'. There are several packages specialized to read ndjson files:
>
>  - corpus::read_ndjson
>  - ndjson::stream_in
>  - jsonlite::stream_in
>
> In particular the 'corpus' package handles large files really well
> because it has an option to memory-map the file instead of reading all
> of its data into memory.
>
> If the data is too large to read, you can preprocess it using
> stedolan.github.io/jq to extract the fields that you need.
>
> You really don't need hadoop/spark/etc for this.
>
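
For illustration, the idea of not holding the whole file in memory can be sketched in base R by reading a fixed number of lines at a time from an open connection. This is only a sketch under stated assumptions: `process_in_chunks` and its arguments are hypothetical names, and it helps only when each individual line is small (it would not help with a file that consists of one enormous line).

```r
# Hypothetical helper: read a line-delimited file in chunks of `chunk_size`
# lines and call `handler` on each chunk, so only one chunk is in memory.
process_in_chunks <- function(path, handler, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size, warn = FALSE)
    if (length(lines) == 0L) break   # end of file reached
    handler(lines)
    total <- total + length(lines)
  }
  invisible(total)                   # number of lines processed
}

# Demonstration on a tiny ndjson-like file with 25 records
tmp_ndjson <- tempfile(fileext = ".ndjson")
writeLines(sprintf('{"id": %d}', 1:25), tmp_ndjson)

chunk_sizes <- integer(0)
n_lines <- process_in_chunks(
  tmp_ndjson,
  handler = function(lines) chunk_sizes <<- c(chunk_sizes, length(lines)),
  chunk_size = 10L
)
```

In real use the handler could parse each line, for example with jsonlite::fromJSON(), and keep only the fields that are needed.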

rhelp at eoos.dds.nl

2017-Sep-04 06:46 UTC


[Rd] readLines() segfaults on large file & question on how to work around

Although the problem can apparently be avoided in this case, readLines()
causing a segfault still seems like unwanted behaviour to me. I can replicate
this with the example below (sessionInfo is further down):


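# Generate an example file: a single line of 2500 * 1e6 = 2.5e9 characters,
# which is more than 2^31 - 1
l <- paste0(sample(c(letters, LETTERS), 1E6, replace = TRUE),
   collapse = "")
con <- file("test.txt", "wt")
for (i in seq_len(2500)) {
   writeLines(l, con, sep = "")   # sep = "" means no newline is written
}
close(con)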


# Causes segfault:
readLines("test.txt")

The error reported by readr is also reproduced (a more informative
error message and a check for integer overflow would be nice). I will
report this to readr.

library(readr)
read_file("test.txt")
# Error in read_file_(ds, locale) : negative length vectors are not
# allowed
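
One plausible reading of the "negative length" message is that the file size is stored in a signed 32-bit integer at some point and wraps around. That is an assumption and is not confirmed in this thread. The arithmetic can be sketched in base R; `wrap_int32` is a hypothetical helper that only simulates the wraparound:

```r
# Size of the test file above: 2500 writes of 1e6 characters each.
file_bytes <- 2500 * 1e6            # 2.5e9, held as a double

int_max <- .Machine$integer.max     # 2^31 - 1 = 2147483647
exceeds_int <- file_bytes > int_max # TRUE: too large for a signed 32-bit int

# Simulate two's-complement wraparound of a 32-bit signed integer.
# (R's own as.integer() returns NA here instead of wrapping.)
wrap_int32 <- function(x) {
  ((x + 2^31) %% 2^32) - 2^31
}

wrapped <- wrap_int32(file_bytes)
wrapped                             # -1794967296, a negative "length"
```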


--
Jan








 > sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=nl_NL.UTF-8
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=nl_NL.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=nl_NL.UTF-8       LC_NAME=C                  LC_ADDRESS=C
[10] LC_TELEPHONE=C             LC_MEASUREMENT=nl_NL.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] readr_1.1.1

loaded via a namespace (and not attached):
[1] compiler_3.4.1 R6_2.2.2       hms_0.3        tools_3.4.1    tibble_1.3.3
[6] Rcpp_0.12.12   rlang_0.1.2








Tomas Kalibera

2017-Sep-04 11:36 UTC


[Rd] readLines() segfaults on large file & question on how to work around

As of R-devel 72925 one gets a proper error message instead of the crash.

Tomas


