Spencer Graves
2023-Feb-11 05:38 UTC
[Rd] scan(..., skip=1e11): infinite loop; cannot interrupt
Hello, All:

I have a 4.54 GB file that I'm trying to read in chunks using "scan(..., skip=__)". It works as expected for small values of "skip" but goes into an infinite loop for "skip=1e11" and similar large values of skip: I cannot even interrupt it; I must kill R. Below please find sessionInfo() with a toy example.

My real problem is a large corrupted Thunderbird email file. Its file type is "Mork", which is mostly standard characters with "\n" between records of varying length. Is there some other function in R that allows me to read chunks of a large file like this?

Thanks,
Spencer Graves

writeLines(as.character(1:11), 'tstNums.txt')
(Tst2 <- scan('tstNums.txt', n=12, skip=5))
# works: 6 7 8 9 10 11
(Tst13 <- scan('tstNums.txt', n=12, skip=13))
# works: numeric(0)
(tst1e11 <- scan('tstNums.txt', n=12, skip=1e11))
# Goes into an infinite loop that I cannot even interrupt.
# I must kill R and start over.

sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.7.3

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
 [1] compiler_4.2.2 fastmap_1.1.0 cli_3.6.0 htmltools_0.5.4
 [5] tools_4.2.2 rstudioapi_0.14 yaml_2.3.6 rmarkdown_2.20
 [9] knitr_1.41 xfun_0.36 digest_0.6.31 rlang_1.0.6
[13] evaluate_0.20
Ivan Krylov
2023-Feb-11 08:33 UTC
[Rd] scan(..., skip=1e11): infinite loop; cannot interrupt
On Fri, 10 Feb 2023 23:38:55 -0600
Spencer Graves <spencer.graves at prodsyse.com> wrote:

> I have a 4.54 GB file that I'm trying to read in chunks using
> "scan(..., skip=__)". It works as expected for small values of
> "skip" but goes into an infinite loop for "skip=1e11" and similar
> large values of skip: I cannot even interrupt it; I must kill R.

Skipping lines is done by two nested loops. The outer loop counts the lines to skip; the inner loop reads characters until it encounters a newline or end of file. The outer loop doesn't check for EOF and keeps asking for more characters until the inner loop runs at least once for every line it wants to skip.

The following patch should avoid the wait in such cases:

--- src/main/scan.c (revision 83797)
+++ src/main/scan.c (working copy)
@@ -835,7 +835,7 @@
 attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
 {
     SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
-    int c, flush, fill, blskip, multiline, escapes, skipNul;
+    int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
     R_xlen_t nmax, nlines, nskip;
     const char *p, *encoding;
     RCNTXT cntxt;
@@ -952,7 +952,7 @@
         if(!data.con->canread)
             error(_("cannot read from this connection"));
     }
-    for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
+    for (R_xlen_t i = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
         while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
     }

Making it interruptible is a bit more work: we need to ensure that a valid context is set up and check regularly for an interrupt.
--- src/main/scan.c (revision 83797)
+++ src/main/scan.c (working copy)
@@ -835,7 +835,7 @@
 attribute_hidden SEXP do_scan(SEXP call, SEXP op, SEXP args, SEXP rho)
 {
     SEXP ans, file, sep, what, stripwhite, dec, quotes, comstr;
-    int c, flush, fill, blskip, multiline, escapes, skipNul;
+    int c = 0, flush, fill, blskip, multiline, escapes, skipNul;
     R_xlen_t nmax, nlines, nskip;
     const char *p, *encoding;
     RCNTXT cntxt;
@@ -952,8 +952,6 @@
         if(!data.con->canread)
             error(_("cannot read from this connection"));
     }
-    for (R_xlen_t i = 0; i < nskip; i++) /* MBCS-safe */
-        while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF);
     }
 
     ans = R_NilValue; /* -Wall */
@@ -966,6 +964,10 @@
     cntxt.cend = &scan_cleanup;
     cntxt.cenddata = &data;
 
+    if (ii) for (R_xlen_t i = 0, j = 0; i < nskip && c != R_EOF; i++) /* MBCS-safe */
+        while ((c = scanchar(FALSE, &data)) != '\n' && c != R_EOF)
+            if (j++ % 10000 == 9999) R_CheckUserInterrupt();
+
     switch (TYPEOF(what)) {
     case LGLSXP:
     case INTSXP:

This way, even if you pour a Decanter of Endless Lines (e.g. mkfifo LINES; perl -E'print "A"x42 while 1;' > LINES) into scan(), it can still be interrupted, even if neither newline nor EOF ever arrives.

(We never skip lines when reading from the console? I suppose it makes sense. I think this needs to be documented and can write a documentation patch.)

If you actually have 1e11 lines in your file and would like to read it in chunks, it may help to use

f <- file('...', open = 'r')  # open explicitly so the read position persists between calls
chunk1 <- scan(f, n = n1, skip = nskip1)
# the following will continue reading where chunk1 had ended
chunk2 <- scan(f, n = n2, skip = nskip2)
close(f)

...in order to avoid having to skip over chunks you have already read, which otherwise makes the algorithm quadratic in the number of lines instead of linear. (I couldn't determine whether you're already doing this, sorry.)

Skipping a fixed number of lines is hard: since they have variable length, it's required to read every character in order to determine whether it starts a new line.
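As a rough illustration of reading through one open connection, the sketch below walks a small text file in chunks of lines. It swaps in readLines() for scan(), since the records in question are lines of text. The temporary file, the chunk size of 4, and the variable names are assumptions made for the example, not part of the original post.

```r
# Sketch: read a text file in chunks of lines through ONE open connection,
# so each call resumes where the previous one stopped (linear total work).
# 'path' and 'chunk_size' are illustrative assumptions.
path <- tempfile(fileext = ".txt")
writeLines(as.character(1:11), path)

con <- file(path, open = "r")   # explicitly opened: the read position is kept
chunk_size <- 4L
chunks <- list()
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0L) break           # nothing left: end of file
  chunks[[length(chunks) + 1L]] <- lines   # process or store the chunk here
}
close(con)

lengths(chunks)   # 4 4 3
```

If the connection were not opened beforehand, each call would open it, read from the start, and close it again, so the chunks would all begin at line 1.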
With byte ranges, it would have been possible to use seek(), but not here.

-- 
Best regards,
Ivan
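For contrast, here is a minimal sketch of what the byte-range case mentioned above could look like: when the offset is known in bytes rather than in lines, seek() can jump there without reading the bytes in between. The temporary file, the offset 100, and the block size 16 are assumptions chosen for the example; the R documentation also cautions that seek() is unreliable on some platforms, notably Windows.

```r
# Sketch: jump to a known byte offset with seek() and read a fixed-size block.
# File contents, offset and block size are illustrative assumptions.
path <- tempfile(fileext = ".bin")
writeBin(as.raw(0:255), path)   # 256 bytes; the byte at offset k has value k

con <- file(path, open = "rb")  # binary mode, opened explicitly
seek(con, where = 100, origin = "start")    # move to byte offset 100 (0-based)
block <- readBin(con, what = "raw", n = 16L)
close(con)

as.integer(block)   # 100 101 ... 115
```

This does not help with skipping a given number of variable-length lines, because the byte offset of line k is not known until the preceding lines have been read.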