?I've hit a roadblock debugging a new update to the survival package.?? I do debugging in a developement envinment, i.e. I don't create and load a package but rather? source all the .R files and dyn.load an .so file, which makes things a bit easier. ? Running with R -d "valgrind --tool=memcheck --leak-check=full" one of my test files crashes in simple R code a dozen lines after the? likely culprit has been called, i.e, the survfit function for an Aalen-Johansen, containing a .Call to the new C code.???? The valgrind approach had already allowed me to find a few other (mostly dumb) errors that led to an out of bounds access, e.g., the wrong endpoint variable in a for( ) loop.??? What would others advise as a next step? Here is the last part of the screen > fit2 <- coxph(list(Surv(tstart, tstop, bstat) ~ 1, +??????????????????? c(1:4):5 ~ age / common + shared), id= id, istate=bili4, +?????????????? data=pbc2, ties='breslow', x=TRUE) > surv2 <- survfit(fit2, newdata=list(age=50), p0=c(.4, .3, .2, .1, 0)) > test2 <- mysurv(fit2, pbc2$bili4, p0= 4:0/10, fit2, x0 =50) ==31730== Invalid read of size 8 ==31730==??? at 0x298A07: Rf_allocVector3 (memory.c:2861) ==31730==??? by 0x299B2C: Rf_allocVector (Rinlinedfuns.h:595) ==31730==??? by 0x299B2C: R_alloc (memory.c:2330) ==31730==??? by 0x3243C6: do_which (summary.c:1152) ==31730==??? by 0x23D8EF: bcEval (eval.c:7724) ==31730==??? by 0x25731F: Rf_eval (eval.c:1152) ==31730==??? by 0x25927D: R_execClosure (eval.c:2362) ==31730==??? by 0x25A35A: R_execMethod (eval.c:2535) ==31730==??? by 0x887E93F: R_dispatchGeneric (methods_list_dispatch.c:1151) ==31730==??? by 0x2A0E72: do_standardGeneric (objects.c:1344) ==31730==??? by 0x2577E7: Rf_eval (eval.c:1254) ==31730==??? by 0x25927D: R_execClosure (eval.c:2362) ==31730==??? by 0x25A01C: applyClosure_core (eval.c:2250) ==31730==? Address 0x10 is not stack'd, malloc'd or (recently) free'd ==31730= ?*** caught segfault *** address 0x10, cause 'memory not mapped' Traceback: ?1: which(smap == j) ?2: which(smap == j) ?3: mysurv(fit2, pbc2$bili4, p0 = 4:0/10, fit2, x0 = 50) The offending call is amost certainly the one to survfit; mysurv() is a local function that caculates some things 'by hand'. ? It does nothing complex: counts, loops, etc, the only non-base action is a call to Matrix::exp near the end, but the which() failure is well before that. The session info just before the offending material: > sessionInfo() R Under development (unstable) (2024-02-07 r85873) Platform: x86_64-pc-linux-gnu Running under: Ubuntu 22.04.3 LTS Matrix products: default BLAS:?? /usr/local/src/R-devel/lib/libRblas.so LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 locale: ?[1] LC_CTYPE=en_US.UTF-8?????? LC_NUMERIC=C ?[3] LC_TIME=en_US.UTF-8??????? LC_COLLATE=C ?[5] LC_MONETARY=en_US.UTF-8??? LC_MESSAGES=en_US.UTF-8 ?[7] LC_PAPER=en_US.UTF-8?????? LC_NAME=C ?[9] LC_ADDRESS=C?????????????? LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C time zone: America/Chicago tzcode source: system (glibc) attached base packages: [1] splines?? stats???? graphics? grDevices utils???? datasets methods [8] base other attached packages: [1] Matrix_1.6-0 loaded via a namespace (and not attached): [1] compiler_4.4.0 tools_4.4.0??? grid_4.4.0???? lattice_0.22-5 --- Footnote.? The impetus for this is realizing that the robust variance for an Aalen-Johansen was incorrect when there are case weights for a subject that vary over time;? a rare case but will occur with time dependent IPC weights.? Carefully figuring this out has been all I did for the last week, leading to a new routine survfitaj.c and approx 14 pages of derivation and explanation in the methods.Rnw vignette.?? Subjects who "change horses in midstream", i.e., swap from one curve to another mid-followup make the code more complex.?? This arises out of the "extended Kaplan-Meier"; I am not a fan of this statistically, but some will use it and expect my code to work. -- Terry M Therneau, PhD Department of Quantitative Health Sciences Mayo Clinic therneau at mayo.edu "TERR-ree THUR-noh" [[alternative HTML version deleted]]
On Wed, 07 Feb 2024 14:01:44 -0600 "Therneau, Terry M., Ph.D. via R-devel" <r-devel at r-project.org> wrote:> > test2 <- mysurv(fit2, pbc2$bili4, p0= 4:0/10, fit2, x0 =50) > ==31730== Invalid read of size 8 > ==31730==??? at 0x298A07: Rf_allocVector3 (memory.c:2861) > ==31730==??? by 0x299B2C: Rf_allocVector (Rinlinedfuns.h:595) > ==31730==??? by 0x299B2C: R_alloc (memory.c:2330) > ==31730==??? by 0x3243C6: do_which (summary.c:1152)<...>> ==31730==? Address 0x10 is not stack'd, malloc'd or (recently) free'd<...>> ?*** caught segfault *** > address 0x10, cause 'memory not mapped'An unrelated allocation function suddenly dereferencing a null pointer is likely indication of heap corruption. Valgrind may be silent about it because the C heap (that it knows how to override and track) is still intact, but the R memory management metadata got corrupted (which looks like a valid memory access to Valgrind). An easy solution could be brought by more instrumentation. R can tell Valgrind to consider some memory accesses invalid if you configure it using --with-valgrind-instrumentation [*], but I'm not sure it will be able to trap overwriting GC metadata, so let's set it aside for now. If you compile your own R, you can configure it with -fsanitize=address added to the compiler and linker flags [**]. I'm not sure whether the bounds checks performed by AddressSanitizer would be sufficient to catch the problem, but it's worth a try. Instead of compiling R with sanitizers, it should be also possible to use the container image docker.io/rocker/r-devel-san. The hard option is left if no instrumentation lets you pinpoint the error. Since the first (as far as Valgrind is concerned) memory error already happens to result in a SIGSEGV, you can run R in a regular debugger and try to work backwards from the local variables at the location of the crash. Maybe there's a way to identify the block containing the pointer that gets overwritten and set a watchpoint on it for the next run of R. Maybe you can read the overwritten value as double and guess where the number came from. If your processor is sufficiently new, you can try `rr`, the time-travelling debugger [***], to rewind the process execution back to the point where the pointer gets overwritten. -- Best regards, Ivan [*] https://cran.r-project.org/doc/manuals/R-exts.html#Using-valgrind [**] https://cran.r-project.org/doc/manuals/R-exts.html#Using-Address-Sanitizer [***] https://rr-project.org Judging by the domain name, it's practically designed to fix troublesome bugs in R packages!
I haven't done any R memory debugging lately, but https://www.mail-archive.com/rcpp-devel at lists.r-forge.r-project.org/msg10289.html shows how I used to have gdb break where valgrind finds a problem so you could examine the details. Also, running your code after running gctorture(TRUE) can help track down memory problems. -Bill On Wed, Feb 7, 2024 at 12:03?PM Therneau, Terry M., Ph.D. via R-devel < r-devel at r-project.org> wrote:> I've hit a roadblock debugging a new update to the survival package. I > do debugging in > a developement envinment, i.e. I don't create and load a package but > rather source all > the .R files and dyn.load an .so file, which makes things a bit easier. > > Running with R -d "valgrind --tool=memcheck --leak-check=full" one of > my test files > crashes in simple R code a dozen lines after the likely culprit has been > called, i.e, the > survfit function for an Aalen-Johansen, containing a .Call to the new C > code. The > valgrind approach had already allowed me to find a few other (mostly dumb) > errors that led > to an out of bounds access, e.g., the wrong endpoint variable in a for( ) > loop. What > would others advise as a next step? > > Here is the last part of the screen > > fit2 <- coxph(list(Surv(tstart, tstop, bstat) ~ 1, > + c(1:4):5 ~ age / common + shared), id= id, > istate=bili4, > + data=pbc2, ties='breslow', x=TRUE) > > surv2 <- survfit(fit2, newdata=list(age=50), p0=c(.4, .3, .2, .1, 0)) > > test2 <- mysurv(fit2, pbc2$bili4, p0= 4:0/10, fit2, x0 =50) > ==31730== Invalid read of size 8 > ==31730== at 0x298A07: Rf_allocVector3 (memory.c:2861) > ==31730== by 0x299B2C: Rf_allocVector (Rinlinedfuns.h:595) > ==31730== by 0x299B2C: R_alloc (memory.c:2330) > ==31730== by 0x3243C6: do_which (summary.c:1152) > ==31730== by 0x23D8EF: bcEval (eval.c:7724) > ==31730== by 0x25731F: Rf_eval (eval.c:1152) > ==31730== by 0x25927D: R_execClosure (eval.c:2362) > ==31730== by 0x25A35A: R_execMethod (eval.c:2535) > ==31730== by 0x887E93F: R_dispatchGeneric (methods_list_dispatch.c:1151) > ==31730== by 0x2A0E72: do_standardGeneric (objects.c:1344) > ==31730== by 0x2577E7: Rf_eval (eval.c:1254) > ==31730== by 0x25927D: R_execClosure (eval.c:2362) > ==31730== by 0x25A01C: applyClosure_core (eval.c:2250) > ==31730== Address 0x10 is not stack'd, malloc'd or (recently) free'd > ==31730=> > *** caught segfault *** > address 0x10, cause 'memory not mapped' > > Traceback: > 1: which(smap == j) > 2: which(smap == j) > 3: mysurv(fit2, pbc2$bili4, p0 = 4:0/10, fit2, x0 = 50) > > The offending call is amost certainly the one to survfit; mysurv() is a > local function > that caculates some things 'by hand'. It does nothing complex: counts, > loops, etc, the > only non-base action is a call to Matrix::exp near the end, but the > which() failure is > well before that. > > The session info just before the offending material: > > > sessionInfo() > R Under development (unstable) (2024-02-07 r85873) > Platform: x86_64-pc-linux-gnu > Running under: Ubuntu 22.04.3 LTS > > Matrix products: default > BLAS: /usr/local/src/R-devel/lib/libRblas.so > LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0 > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=C > [9] LC_ADDRESS=C LC_TELEPHONE=C > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C > > time zone: America/Chicago > tzcode source: system (glibc) > > attached base packages: > [1] splines stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] Matrix_1.6-0 > > loaded via a namespace (and not attached): > [1] compiler_4.4.0 tools_4.4.0 grid_4.4.0 lattice_0.22-5 > > > --- > Footnote. The impetus for this is realizing that the robust variance for > an > Aalen-Johansen was incorrect when there are case weights for a subject > that vary over > time; a rare case but will occur with time dependent IPC weights. > Carefully figuring > this out has been all I did for the last week, leading to a new routine > survfitaj.c and > approx 14 pages of derivation and explanation in the methods.Rnw > vignette. Subjects who > "change horses in midstream", i.e., swap from one curve to another > mid-followup make the > code more complex. This arises out of the "extended Kaplan-Meier"; I am > not a fan of > this statistically, but some will use it and expect my code to work. > > -- > Terry M Therneau, PhD > Department of Quantitative Health Sciences > Mayo Clinic > therneau at mayo.edu > > "TERR-ree THUR-noh" > > [[alternative HTML version deleted]] > > ______________________________________________ > R-devel at r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel >[[alternative HTML version deleted]]