Sun Yijiang
2019-Aug-02 08:23 UTC
[Rd] Infrequent but steady NULL-pointer caused segfault in as.POSIXlt.POSIXct (R 3.4.4)
The R script I run daily for hours looks like this: while (!finish) { Sys.sleep(0.1) time = as.integer(format(Sys.time(), "%H%M")) # always crash here if (new.data.timestamp() <= time) next # ... do some jobs for about 2 minutes ... gc() } Basically it waits for new data, which comes in every 10 minutes, and do some jobs, then gc(), then loop again. It works great most of the time, but crashes strangely once a month or so. Although infrequent, it always crashes at the same place and gives the same error info, like this: *** caught segfault *** address (nil), cause 'memory not mapped' Traceback: 1: as.POSIXlt.POSIXct(x, tz) 2: as.POSIXlt(x, tz) 3: format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) 4: structure(format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...), names = names(x)) 5: format.POSIXct(Sys.time(), format = "%H%M") 6: format(Sys.time(), format = "%H%M") 7: format(Sys.time(), format = "%H%M") ? ? I looked into the dumped core with gdb, and found something very strange: gdb /usr/lib64/R/bin/exec/R ~/core.30387 (gdb) bt 5 #0 0x00007f1dca844ff1 in __strlen_sse2_pminub () from /lib64/libc.so.6 #1 0x00007f1dcb20e8f9 in Rf_mkChar (name=0x0) at envir.c:3725 #2 0x00007f1dcb1dc225 in do_asPOSIXlt (call=<optimized out>, op=<optimized out>, args=<optimized out>, env=<optimized out>) at datetime.c:705 #3 0x00007f1dcb22197f in bcEval (body=body at entry=0x4064b28, rho=rho at entry=0xc449d38, useCache=useCache at entry=TRUE) at eval.c:6473 #4 0x00007f1dcb230370 in Rf_eval (e=0x4064b28, rho=rho at entry=0xc449d38) at eval.c:624 (More stack frames follow?) Tracing into src/main/datetime.c:705, it?s a simple string-making code: SET_STRING_ELT(tzone, 1, mkChar(R_tzname[0])); mkChar function is defined in envir.c:3725: 3723 SEXP mkChar(const char *name) 3724 { 3725 size_t len = strlen(name); ? ? gdb shows that the string pointer (name=0x0) mkChar received is NULL, and subsequently strlen(NULL) caused the segfault. But quite contradictorily, gdb shows the value passed to mkChar in the caller is valid: (gdb) frame 2 #2 0x00007f1dcb1dc225 in do_asPOSIXlt (call=<optimized out>, op=<optimized out>, args=<optimized out>, env=<optimized out>) at datetime.c:705 705 datetime.c: No such file or directory. (gdb) p tzname[0] $1 = 0x4cf39c0 ?CST? R_tzname is an alias of tzname. (#define R_tzname tzname in the same file.) At first, I suspect that some library may have messed up the memory and accidentally zeroed tzname (a global variable). But with this gdb trace, it shows that tzname is good, only that the pointer passed to mkChar magically changed to zero. Like this: mkChar(tzname[0]) // tzname[0] is ?CST?, address 0x4cf39c ? ? SEXP mkChar(const char *name) // name should be 0x4cf39c, but gdb shows 0x0 { size_t len = strlen(name); // segfault, as name is NULL ? ? The only theory I can think of so far is that, on calling mkChar, the parameter passed on stack somehow got wiped out to zero by some buggy code in R or library. At a higher level, what I see is this: If you run format(Sys.time(), "%H%M?) a million times a day (together with other codes of course), once in a month or so this simple line can segfault. I?m lost in this confusion, could someone please help me find the right direction to further look into this problem? Regards, Steve
Tomas Kalibera
2019-Aug-02 08:59 UTC
[Rd] Infrequent but steady NULL-pointer caused segfault in as.POSIXlt.POSIXct (R 3.4.4)
In an optimized build, debug info is just an approximation. It might help to debug in a build of R and packages without compiler optimizations (-O0), where the debug information is accurate. However, first I would try to modify the example to trigger more often, or try to find external ways to make it trigger more often (e.g. via gctorture). Then I would try to make the example smaller (not call gc() explicitly, not call any external code - e.g. the jobs, etc) - any time the example is reduced but still triggers the errors, the reasoning is made easier. Once you have a repeatable situation in a build with reliable debug symbols, debugging is easier too, e.g. sometimes a watchpoint helps to find memory corruption. Please feel free to ask more when you have more information/updates. If this ends up being a bug in R, please report (and with a reproducible example, if it is not obvious from the source code). Best Tomas On 8/2/19 10:23 AM, Sun Yijiang wrote:> The R script I run daily for hours looks like this: > > while (!finish) { > Sys.sleep(0.1) > time = as.integer(format(Sys.time(), "%H%M")) # always crash here > if (new.data.timestamp() <= time) > next > # ... do some jobs for about 2 minutes ... > gc() > } > > Basically it waits for new data, which comes in every 10 minutes, and > do some jobs, then gc(), then loop again. It works great most of the > time, but crashes strangely once a month or so. Although infrequent, > it always crashes at the same place and gives the same error info, > like this: > > *** caught segfault *** > address (nil), cause 'memory not mapped' > > Traceback: > 1: as.POSIXlt.POSIXct(x, tz) > 2: as.POSIXlt(x, tz) > 3: format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) > 4: structure(format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...), > names = names(x)) > 5: format.POSIXct(Sys.time(), format = "%H%M") > 6: format(Sys.time(), format = "%H%M") > 7: format(Sys.time(), format = "%H%M") > ? ? > > I looked into the dumped core with gdb, and found something very strange: > > gdb /usr/lib64/R/bin/exec/R ~/core.30387 > (gdb) bt 5 > #0 0x00007f1dca844ff1 in __strlen_sse2_pminub () from /lib64/libc.so.6 > #1 0x00007f1dcb20e8f9 in Rf_mkChar (name=0x0) at envir.c:3725 > #2 0x00007f1dcb1dc225 in do_asPOSIXlt (call=<optimized out>, > op=<optimized out>, args=<optimized out>, > env=<optimized out>) at datetime.c:705 > #3 0x00007f1dcb22197f in bcEval (body=body at entry=0x4064b28, > rho=rho at entry=0xc449d38, useCache=useCache at entry=TRUE) > at eval.c:6473 > #4 0x00007f1dcb230370 in Rf_eval (e=0x4064b28, > rho=rho at entry=0xc449d38) at eval.c:624 > (More stack frames follow?) > > Tracing into src/main/datetime.c:705, it?s a simple string-making code: > SET_STRING_ELT(tzone, 1, mkChar(R_tzname[0])); > > mkChar function is defined in envir.c:3725: > 3723 SEXP mkChar(const char *name) > 3724 { > 3725 size_t len = strlen(name); > ? ? > > gdb shows that the string pointer (name=0x0) mkChar received is NULL, > and subsequently strlen(NULL) caused the segfault. But quite > contradictorily, gdb shows the value passed to mkChar in the caller is > valid: > > (gdb) frame 2 > #2 0x00007f1dcb1dc225 in do_asPOSIXlt (call=<optimized out>, > op=<optimized out>, args=<optimized out>, > env=<optimized out>) at datetime.c:705 > 705 datetime.c: No such file or directory. > (gdb) p tzname[0] > $1 = 0x4cf39c0 ?CST? > > R_tzname is an alias of tzname. (#define R_tzname tzname in the same file.) > > At first, I suspect that some library may have messed up the memory > and accidentally zeroed tzname (a global variable). But with this gdb > trace, it shows that tzname is good, only that the pointer passed to > mkChar magically changed to zero. Like this: > > mkChar(tzname[0]) // tzname[0] is ?CST?, address 0x4cf39c > ? ? > SEXP mkChar(const char *name) // name should be 0x4cf39c, but gdb shows 0x0 > { > size_t len = strlen(name); // segfault, as name is NULL > ? ? > > The only theory I can think of so far is that, on calling mkChar, the > parameter passed on stack somehow got wiped out to zero by some buggy > code in R or library. At a higher level, what I see is this: If you > run format(Sys.time(), "%H%M?) a million times a day (together with > other codes of course), once in a month or so this simple line can > segfault. > > I?m lost in this confusion, could someone please help me find the > right direction to further look into this problem? > > Regards, > Steve > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel
Dirk Eddelbuettel
2019-Aug-02 14:24 UTC
[Rd] Infrequent but steady NULL-pointer caused segfault in as.POSIXlt.POSIXct (R 3.4.4)
On 2 August 2019 at 16:23, Sun Yijiang wrote: | The R script I run daily for hours looks like this: | | while (!finish) { | Sys.sleep(0.1) | time = as.integer(format(Sys.time(), "%H%M")) # always crash here | if (new.data.timestamp() <= time) | next | # ... do some jobs for about 2 minutes ... | gc() | } | | Basically it waits for new data, which comes in every 10 minutes, and | do some jobs, then gc(), then loop again. It works great most of the | time, but crashes strangely once a month or so. Although infrequent, | it always crashes at the same place and gives the same error info, | like this: A really long time ago in a galaxy not too far away I also shephearded such jobs where the job were left running for a long time from a single "outer" call, and aimed to control resources via gc(), Sys.sleep()and alike. Doing that, I learned a different session. As R is an fact _an environment_ taking care of a great many things, robustness can be had more easily via fresh processes. Since then I mostly control the jobs _from the outside_ via cron, and aim to have well-defined taks (with few dependencies, a different topic). I do the same for e.g. package testing and loading/unloading of shared libraries: hard to get right, easier to test in vanilla sessions. These are also some of the reasons I joined Jeff Horner in his then-nascient littler project, and continue to look after it. I run many (automated) jobs with it from cron, and this generally works _great_. Rscript came just a little bit later, has also improved over time and is a fine alternative. Lastly, I would recommend to take advantage of the fact that POSIXct is in fact numeric internally and rewrite this as currentTime <- as.numeric(Sys.time()) if (currentTime > prevTime + minDelta) { # do stuff ... prevTime <- currentTime } That alone may help you as you no longer need all the temporary string objects you were chasing through the debugger. Good luck, and keep re-engineering your scripts. R can be a very, very reliable tool when driven the right way. Cheers, Dirk -- http://dirk.eddelbuettel.com | @eddelbuettel | edd at debian.org
William Dunlap
2019-Aug-02 14:50 UTC
[Rd] Infrequent but steady NULL-pointer caused segfault in as.POSIXlt.POSIXct (R 3.4.4)
If you can run things on LInux try running a few iterations of that loop under valgrind, setting gctorture(TRUE) before the loop. % R --debugger=valgrind --silent> gctorture(TRUE) > for(i in 1:5) { ... body of your loop ... }valgrind can show memory misuse that eventually will cause R to crash. Bill Dunlap TIBCO Software wdunlap tibco.com On Fri, Aug 2, 2019 at 1:23 AM Sun Yijiang <sunyijiang at gmail.com> wrote:> The R script I run daily for hours looks like this: > > while (!finish) { > Sys.sleep(0.1) > time = as.integer(format(Sys.time(), "%H%M")) # always crash here > if (new.data.timestamp() <= time) > next > # ... do some jobs for about 2 minutes ... > gc() > } > > Basically it waits for new data, which comes in every 10 minutes, and > do some jobs, then gc(), then loop again. It works great most of the > time, but crashes strangely once a month or so. Although infrequent, > it always crashes at the same place and gives the same error info, > like this: > > *** caught segfault *** > address (nil), cause 'memory not mapped' > > Traceback: > 1: as.POSIXlt.POSIXct(x, tz) > 2: as.POSIXlt(x, tz) > 3: format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) > 4: structure(format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...), > names = names(x)) > 5: format.POSIXct(Sys.time(), format = "%H%M") > 6: format(Sys.time(), format = "%H%M") > 7: format(Sys.time(), format = "%H%M") > ? ? > > I looked into the dumped core with gdb, and found something very strange: > > gdb /usr/lib64/R/bin/exec/R ~/core.30387 > (gdb) bt 5 > #0 0x00007f1dca844ff1 in __strlen_sse2_pminub () from /lib64/libc.so.6 > #1 0x00007f1dcb20e8f9 in Rf_mkChar (name=0x0) at envir.c:3725 > #2 0x00007f1dcb1dc225 in do_asPOSIXlt (call=<optimized out>, > op=<optimized out>, args=<optimized out>, > env=<optimized out>) at datetime.c:705 > #3 0x00007f1dcb22197f in bcEval (body=body at entry=0x4064b28, > rho=rho at entry=0xc449d38, useCache=useCache at entry=TRUE) > at eval.c:6473 > #4 0x00007f1dcb230370 in Rf_eval (e=0x4064b28, > rho=rho at entry=0xc449d38) at eval.c:624 > (More stack frames follow?) > > Tracing into src/main/datetime.c:705, it?s a simple string-making code: > SET_STRING_ELT(tzone, 1, mkChar(R_tzname[0])); > > mkChar function is defined in envir.c:3725: > 3723 SEXP mkChar(const char *name) > 3724 { > 3725 size_t len = strlen(name); > ? ? > > gdb shows that the string pointer (name=0x0) mkChar received is NULL, > and subsequently strlen(NULL) caused the segfault. But quite > contradictorily, gdb shows the value passed to mkChar in the caller is > valid: > > (gdb) frame 2 > #2 0x00007f1dcb1dc225 in do_asPOSIXlt (call=<optimized out>, > op=<optimized out>, args=<optimized out>, > env=<optimized out>) at datetime.c:705 > 705 datetime.c: No such file or directory. > (gdb) p tzname[0] > $1 = 0x4cf39c0 ?CST? > > R_tzname is an alias of tzname. (#define R_tzname tzname in the same file.) > > At first, I suspect that some library may have messed up the memory > and accidentally zeroed tzname (a global variable). But with this gdb > trace, it shows that tzname is good, only that the pointer passed to > mkChar magically changed to zero. Like this: > > mkChar(tzname[0]) // tzname[0] is ?CST?, address 0x4cf39c > ? ? > SEXP mkChar(const char *name) // name should be 0x4cf39c, but gdb shows > 0x0 > { > size_t len = strlen(name); // segfault, as name is NULL > ? ? > > The only theory I can think of so far is that, on calling mkChar, the > parameter passed on stack somehow got wiped out to zero by some buggy > code in R or library. At a higher level, what I see is this: If you > run format(Sys.time(), "%H%M?) a million times a day (together with > other codes of course), once in a month or so this simple line can > segfault. > > I?m lost in this confusion, could someone please help me find the > right direction to further look into this problem? > > Regards, > Steve > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >[[alternative HTML version deleted]]
Sun Yijiang
2019-Aug-04 13:30 UTC
[Rd] Infrequent but steady NULL-pointer caused segfault in as.POSIXlt.POSIXct (R 3.4.4)
A reply from stackoverflow suggests I might have hit this bug: https://sourceware.org/bugzilla/show_bug.cgi?id=14023 I can confirm that this glibc bug affects my system (latest CentOS 7). However, as far as I know, R is not multithreaded in its core. Is it possible that some library triggered this? Regards, Steve Tomas Kalibera <tomas.kalibera at gmail.com> ?2019?8?2??? ??4:59???> > In an optimized build, debug info is just an approximation. It might > help to debug in a build of R and packages without compiler > optimizations (-O0), where the debug information is accurate. However, > first I would try to modify the example to trigger more often, or try to > find external ways to make it trigger more often (e.g. via gctorture). > Then I would try to make the example smaller (not call gc() explicitly, > not call any external code - e.g. the jobs, etc) - any time the example > is reduced but still triggers the errors, the reasoning is made easier. > Once you have a repeatable situation in a build with reliable debug > symbols, debugging is easier too, e.g. sometimes a watchpoint helps to > find memory corruption. Please feel free to ask more when you have more > information/updates. If this ends up being a bug in R, please report > (and with a reproducible example, if it is not obvious from the source > code). > > Best > Tomas > > > On 8/2/19 10:23 AM, Sun Yijiang wrote: > > The R script I run daily for hours looks like this: > > > > while (!finish) { > > Sys.sleep(0.1) > > time = as.integer(format(Sys.time(), "%H%M")) # always crash here > > if (new.data.timestamp() <= time) > > next > > # ... do some jobs for about 2 minutes ... > > gc() > > } > > > > Basically it waits for new data, which comes in every 10 minutes, and > > do some jobs, then gc(), then loop again. It works great most of the > > time, but crashes strangely once a month or so. Although infrequent, > > it always crashes at the same place and gives the same error info, > > like this: > > > > *** caught segfault *** > > address (nil), cause 'memory not mapped' > > > > Traceback: > > 1: as.POSIXlt.POSIXct(x, tz) > > 2: as.POSIXlt(x, tz) > > 3: format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...) > > 4: structure(format.POSIXlt(as.POSIXlt(x, tz), format, usetz, ...), > > names = names(x)) > > 5: format.POSIXct(Sys.time(), format = "%H%M") > > 6: format(Sys.time(), format = "%H%M") > > 7: format(Sys.time(), format = "%H%M") > > ? ? > > > > I looked into the dumped core with gdb, and found something very strange: > > > > gdb /usr/lib64/R/bin/exec/R ~/core.30387 > > (gdb) bt 5 > > #0 0x00007f1dca844ff1 in __strlen_sse2_pminub () from /lib64/libc.so.6 > > #1 0x00007f1dcb20e8f9 in Rf_mkChar (name=0x0) at envir.c:3725 > > #2 0x00007f1dcb1dc225 in do_asPOSIXlt (call=<optimized out>, > > op=<optimized out>, args=<optimized out>, > > env=<optimized out>) at datetime.c:705 > > #3 0x00007f1dcb22197f in bcEval (body=body at entry=0x4064b28, > > rho=rho at entry=0xc449d38, useCache=useCache at entry=TRUE) > > at eval.c:6473 > > #4 0x00007f1dcb230370 in Rf_eval (e=0x4064b28, > > rho=rho at entry=0xc449d38) at eval.c:624 > > (More stack frames follow?) > > > > Tracing into src/main/datetime.c:705, it?s a simple string-making code: > > SET_STRING_ELT(tzone, 1, mkChar(R_tzname[0])); > > > > mkChar function is defined in envir.c:3725: > > 3723 SEXP mkChar(const char *name) > > 3724 { > > 3725 size_t len = strlen(name); > > ? ? > > > > gdb shows that the string pointer (name=0x0) mkChar received is NULL, > > and subsequently strlen(NULL) caused the segfault. But quite > > contradictorily, gdb shows the value passed to mkChar in the caller is > > valid: > > > > (gdb) frame 2 > > #2 0x00007f1dcb1dc225 in do_asPOSIXlt (call=<optimized out>, > > op=<optimized out>, args=<optimized out>, > > env=<optimized out>) at datetime.c:705 > > 705 datetime.c: No such file or directory. > > (gdb) p tzname[0] > > $1 = 0x4cf39c0 ?CST? > > > > R_tzname is an alias of tzname. (#define R_tzname tzname in the same file.) > > > > At first, I suspect that some library may have messed up the memory > > and accidentally zeroed tzname (a global variable). But with this gdb > > trace, it shows that tzname is good, only that the pointer passed to > > mkChar magically changed to zero. Like this: > > > > mkChar(tzname[0]) // tzname[0] is ?CST?, address 0x4cf39c > > ? ? > > SEXP mkChar(const char *name) // name should be 0x4cf39c, but gdb shows 0x0 > > { > > size_t len = strlen(name); // segfault, as name is NULL > > ? ? > > > > The only theory I can think of so far is that, on calling mkChar, the > > parameter passed on stack somehow got wiped out to zero by some buggy > > code in R or library. At a higher level, what I see is this: If you > > run format(Sys.time(), "%H%M?) a million times a day (together with > > other codes of course), once in a month or so this simple line can > > segfault. > > > > I?m lost in this confusion, could someone please help me find the > > right direction to further look into this problem? > > > > Regards, > > Steve > > > > ______________________________________________ > > R-devel at r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-devel > >