Dear r-devel readers, I have a pretty deep problem with package loading and unloading in the development version of the lme4 package <https://github.com/lme4/lme4>; it's not boiled down to a properly minimal example yet (this has been difficult), but I am posting anyway in the hopes that someone has ideas about how to proceed farther, since I'm nearly stumped. Apologies in advance for the long post. EXECUTIVE SUMMARY: after one cycle of loading, testing (e.g. by running example(lmer)) and unloading lme4, then loading and unloading the nlme package, re-loading and exercising lme4 becomes very unstable, leading eventually to a segmentation fault. More detail is available at <https://github.com/lme4/lme4/issues/35> . Because it's a segmentation fault, exactly _where_ the crash happens varies a bit according to platform and precise incantation, but it seems I can always get a segfault eventually. We had previously forgotten to write a .onUnload() function to call library.dynam.unload() [it's not entirely clear to me why R handles dynamic loading automatically when there is a useDynLib() in the NAMESPACE, but not unloading: does anyone know if this is intentional?]; before we had the appropriate unloading hook, instead of segfaults we got errors about objects from the DLL not being found. The package has many complexities that might make it an especially tough case: it uses reference classes as well as making use of the Rcpp and RcppEigen packages ... I have tried running under valgrind and have seen what appear to be memory leaks but no memory access errors. The beginning and end of the valgrind output are appended below -- I can send the full output if anyone wants to see it ... I plan to continue to try to isolate the problem by (1) minimizing the amount of package-exercising code run between loading and unloading; (2) trying to create a minimal version of the package. In the meantime if anyone can think of any shortcuts or clever diagnostics I would love to hear about it ... thanks Ben Bolker --24640-- Valgrind options: --24640-- --suppressions=/usr/lib/valgrind/debian-libc6-dbg.supp --24640-- -v --24640-- --leak-check=full ==24640== LEAK SUMMARY: ==24640== definitely lost: 80 bytes in 2 blocks ==24640== indirectly lost: 240 bytes in 20 blocks ==24640== possibly lost: 56,014,306 bytes in 32,200 blocks ==24640== still reachable: 11,032,257 bytes in 993 blocks ==24640== suppressed: 0 bytes in 0 blocks ==24640== Reachable blocks (those to which a pointer was found) are not shown. ==24640== To see them, rerun with: --leak-check=full --show-reachable=yes ==24640===24640== ERROR SUMMARY: 2493 errors from 2493 contexts (suppressed: 151 from 11) --24640-- --24640-- used_suppression: 151 dl-hack3-cond-1 ==24640===24640== ERROR SUMMARY: 2493 errors from 2493 contexts (suppressed: 151 from 11)
Ben Bolker <bbolker <at> gmail.com> writes:> > > Dear r-devel readers, > > I have a pretty deep problem with package loading and unloading in > the development version of the lme4 package > <https://github.com/lme4/lme4>; it's not boiled down to a properly > minimal example yet (this has been difficult), but I am posting anyway > in the hopes that someone has ideas about how to proceed farther, > since I'm nearly stumped. Apologies in advance for the long post. > > EXECUTIVE SUMMARY: after one cycle of loading, testing (e.g. by > running example(lmer)) and unloading lme4, then loading and unloading > the nlme package, re-loading and exercising lme4 becomes very > unstable, leading eventually to a segmentation fault. More detail is > available at <https://github.com/lme4/lme4/issues/35> . Because it's > a segmentation fault, exactly _where_ the crash happens varies a bit > according to platform and precise incantation, but it seems I can > always get a segfault eventually. >[snip] UPDATE: after some useful advice off-list, I tried with a fully valgrind-instrumented version of R. No suspicious memory accesses occurred until the very end, right before the crash: Attempt #1 loading lme4 loaded DLLs: Rcpp RcppEigen minqa lme4 detaching lme4 ==18150== Jump to the invalid address stated on the next line ==18150== at 0x9E46D00: ??? ==18150== by 0x410AB8A: RunFinalizers (memory.c:1357) ==18150== by 0x410D314: R_gc_internal (memory.c:2709) ==18150== by 0x410E73C: Rf_allocVector (memory.c:2421) ==18150== by 0x4153D30: ReadItem (serialize.c:1685) ==18150== by 0x4152D6D: ReadBC1 (serialize.c:1825) ==18150== by 0x4153983: ReadItem (serialize.c:1851) ==18150== by 0x41531C0: ReadItem (serialize.c:1601) ==18150== by 0x41531A1: ReadItem (serialize.c:1599) ==18150== by 0x41531C0: ReadItem (serialize.c:1601) ==18150== by 0x41531C0: ReadItem (serialize.c:1601) ==18150== by 0x41531C0: ReadItem (serialize.c:1601) ==18150== Address 0x9e46d00 is not stack'd, malloc'd or (recently) free'd ==18150== Does this suggest anything to anyone, or is it just the final symptom of a mysterious problem that occurred earlier ... ?
On 13-06-12 07:09 PM, Duncan Murdoch wrote:> On 13-06-12 1:47 PM, Ben Bolker wrote: >> Ben Bolker <bbolker <at> gmail.com> writes: >> >>> >>> >>> Dear r-devel readers, >>> >>> I have a pretty deep problem with package loading and unloading in >>> the development version of the lme4 package >>> <https://github.com/lme4/lme4>; it's not boiled down to a properly >>> minimal example yet (this has been difficult), but I am posting anyway >>> in the hopes that someone has ideas about how to proceed farther, >>> since I'm nearly stumped. Apologies in advance for the long post. >>> >>> EXECUTIVE SUMMARY: after one cycle of loading, testing (e.g. by >>> running example(lmer)) and unloading lme4, then loading and unloading >>> the nlme package, re-loading and exercising lme4 becomes very >>> unstable, leading eventually to a segmentation fault. More detail is >>> available at <https://github.com/lme4/lme4/issues/35> . Because it's >>> a segmentation fault, exactly _where_ the crash happens varies a bit >>> according to platform and precise incantation, but it seems I can >>> always get a segfault eventually. >>> >> >> [snip] >> >> UPDATE: after some useful advice off-list, I tried with a >> fully valgrind-instrumented version of R. No suspicious memory >> accesses occurred until the very end, right before the crash: >> >> Attempt #1 >> loading lme4 >> loaded DLLs: Rcpp RcppEigen minqa lme4 >> detaching lme4 >> >> >> ==18150== Jump to the invalid address stated on the next line >> ==18150== at 0x9E46D00: ??? >> ==18150== by 0x410AB8A: RunFinalizers (memory.c:1357) >> ==18150== by 0x410D314: R_gc_internal (memory.c:2709) >> ==18150== by 0x410E73C: Rf_allocVector (memory.c:2421) >> ==18150== by 0x4153D30: ReadItem (serialize.c:1685) >> ==18150== by 0x4152D6D: ReadBC1 (serialize.c:1825) >> ==18150== by 0x4153983: ReadItem (serialize.c:1851) >> ==18150== by 0x41531C0: ReadItem (serialize.c:1601) >> ==18150== by 0x41531A1: ReadItem (serialize.c:1599) >> ==18150== by 0x41531C0: ReadItem (serialize.c:1601) >> ==18150== by 0x41531C0: ReadItem (serialize.c:1601) >> ==18150== by 0x41531C0: ReadItem (serialize.c:1601) >> ==18150== Address 0x9e46d00 is not stack'd, malloc'd or (recently) >> free'd >> ==18150=>> >> Does this suggest anything to anyone, or is it just the >> final symptom of a mysterious problem that occurred earlier ... ? > > One other suggestion: run under gctorture(). This will make it very > slow, but might trigger the fault earlier. > > Duncan Murdoch >Yes, thanks -- Bill Dunlap already suggested this. Your and Bill's warning about how slow gctorture makes things is correct -- I gave up after running for 3.5 hours when it had gotten only partway through loading the Matrix package; I will have to find a machine with a decent cooling system (i.e. not my laptop) where I can replicate the error. I've just re-run the regular valgrind, with a fresh build right after an SVN update. I got exactly the same results as above. We're certainly *not* calling reg.finalizer() anywhere in our package, and I don't think Rcpp or RcppEigen or minqa do ... there looks to be some kind of default finalization done on the reference class objects (based on running 'strings' on the object files ... I tried gdb'ing in and setting a breakpoint at memory.c:1357, but this breakpoint gets hit a lot, and I'm sort of stabbing in the dark at this point. Ben Bolker
Ben Have you compiled R form source yourself? If so, I would be tempted to mark up memory. c with some debug log statements - especially around line 1357, and possibly inside the finalizers function as it attempts to run the C finalizers....not pretty I know, but may be the quickest approach to quickly identify whats failing... Cheers -- Rory>> Yes, thanks -- Bill Dunlap already suggested this. Your and Bill's > warning about how slow gctorture makes things is correct -- I gave up > after running for 3.5 hours when it had gotten only partway through > loading the Matrix package; I will have to find a machine with a decent > cooling system (i.e. not my laptop) where I can replicate the error. > I've just re-run the regular valgrind, with a fresh build right after > an SVN update. I got exactly the same results as above. We're > certainly *not* calling reg.finalizer() anywhere in our package, and I > don't think Rcpp or RcppEigen or minqa do ... there looks to be some > kind of default finalization done on the reference class objects (based > on running 'strings' on the object files ... > I tried gdb'ing in and setting a breakpoint at memory.c:1357, but this > breakpoint gets hit a lot, and I'm sort of stabbing in the dark at this > point. > > Ben Bolker > > > > ------------------------------ > > _______________________________________________ > R-devel at r-project.org mailing list DIGESTED > https://stat.ethz.ch/mailman/listinfo/r-devel > > > End of R-devel Digest, Vol 124, Issue 12 > ****************************************