Liaw, Andy
2002-Jun-18 13:25 UTC
can't find array overruns (was: help debugging segfaults)
Dear R-devel,

Last week I got several responses to my question about debugging segfaults
in my code (original post below). After I changed the S_alloc() calls to
Calloc()/Free(), the symptom was gone, but I was told to keep looking. So I
did:

 o Switched to Calloc/Free. Electric Fence did not find any problem.

 o Put assert(index < bound); assert(index >= 0); everywhere in the C routine
   where arrays are accessed. Everything ran fine. (I did not do the same
   thing for the Fortran subroutines, mostly Breiman's original code, called
   by the C function, since I don't really know an easy way to.)

 o Changed to malloc()/free(). Still didn't find anything with Electric
   Fence.

Can someone suggest how to proceed? Is it still not safe to assume the bug
is gone?

Regards,
Andy

> -----Original Message-----
> From: Liaw, Andy [mailto:andy_liaw@merck.com]
> Sent: Wednesday, June 12, 2002 9:26 AM
> To: 'r-devel@stat.math.ethz.ch'; 'r-help@stat.math.ethz.ch'
> Subject: [R] help debugging segfaults
>
> (Sorry for the cross-post; I wasn't sure which list is more appropriate...)
>
> Hi everyone,
>
> I've run into segfaults when using my randomForest package on a large
> dataset (e.g., 100 x 15200) and a large number of trees (e.g., ntree=7000
> and mtry=3000). I'm wondering if anyone can give me some hints on where to
> look for the problem.
>
> The randomForest package mainly consists of two things: rf.c contains rf(),
> a C wrapper function that calls the Fortran subroutines in rfsub.f that do
> most of the work (slightly altered from Breiman's original code). All memory
> allocations are done in rf.c, using S_alloc(). When I run random forest with
> the data and settings mentioned above, it is able to finish growing the 7000
> trees, but segfaults when returning from rf() to R.
> GDB gave the following (gdb prompts removed):
>
> do_dotCode (call=0x873aff4, op=0x8a5f620, args=0x8a5d010, env=0x86fd0a4)
>     at dotcode.c:1413
> 1413            break;
> 1845        PROTECT(ans = allocVector(VECSXP, nargs));
> 1846        havenames = 0;
> 1847        if (dup) {
> 1849            info.cargs = cargs;
> 1850            info.allArgs = args;
> 1851            info.nargs = nargs;
> 1852            info.functionName = buf;
> 1853            nargs = 0;
> 1854            for (pargs = args ; pargs != R_NilValue ; pargs = CDR(pargs)) {
> 1855                if(argConverters[nargs]) {
> 1864                    PROTECT(s = CPtrToRObj(cargs[nargs], CAR(pargs), which));
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x080ddc6a in RunGenCollect (size_needed=1515400) at memory.c:1133
> 1133            SEXP next = NEXT_NODE(s);
>
> This was obtained on Linux (Mandrake 8.2 w/enterprise kernel 2.4.8) running
> on a dual P3-866 Xeon with 2GB RAM, using R-1.5.0 compiled from source.
>
> Any help/hints/comments are greatly appreciated!
>
> Regards,
> Andy
>
> Andy I. Liaw, PhD
> Biometrics Research          Phone: (732) 594-0820
> Merck & Co., Inc.            Fax: (732) 594-1565
> P.O. Box 2000, RY70-38 Rahway, NJ 07065
> mailto:andy_liaw@merck.com
>
> ------------------------------------------------------------------------------
> Notice: This e-mail message, together with any attachments, contains
> information of Merck & Co., Inc. (Whitehouse Station, New Jersey, USA) that
> may be confidential, proprietary copyrighted and/or legally privileged, and
> is intended solely for the use of the individual or entity named on this
> message. If you are not the intended recipient, and have received this
> message in error, please immediately return this by e-mail and then delete
> it.
> ==============================================================================
>
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)
> To: r-help-request@stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
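A minimal sketch of the assert()-based bounds checking Andy describes, with every array access routed through one pair of checks instead of scattering assert() calls by hand. The helper names checked_get() and checked_set() are hypothetical; they are not part of the randomForest sources or of R's API, and the arrays are assumed to be double precision.

```c
#include <assert.h>

/* Hypothetical helpers (not from randomForest or R). Each access performs
   the same two checks Andy used:
       assert(index < bound); assert(index >= 0);
   Compiling with -DNDEBUG removes the checks entirely. */
static double checked_get(const double *a, int index, int bound)
{
    assert(index >= 0);
    assert(index < bound);
    return a[index];
}

static void checked_set(double *a, int index, int bound, double value)
{
    assert(index >= 0);
    assert(index < bound);
    a[index] = value;
}
```

For example, `for (i = 0; i < n; i++) checked_set(x, i, n, 0.0);` passes every check, while `checked_get(x, n, n)` stops with an assertion failure. Inside R, calling error() from R's C API instead of assert() would return control to R with a message rather than aborting the whole R process.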
Peter Dalgaard BSA
2002-Jun-18 14:03 UTC
can't find array overruns (was: help debugging segfaults)
"Liaw, Andy" <andy_liaw@merck.com> writes:

> Can someone suggest how to proceed? Is it still not safe to assume the bug
> is gone?

The hardcore way is to use the original code (the version that crashed) and
backtrack until you find the source of the memory corruption. For instance,
in the trace below it seems that "s" got corrupted, so that NEXT_NODE(s)
triggers the segfault. So:

 1. Find the exact memory location holding the corrupted value.
 2. Set a hardware watchpoint on that location.
 3. Rerun the program with well-defined input and check every time the
    value at the watched location changes.

Very likely, the culprit will be the last change prior to the crash, so you'd
have to check the program logic carefully around that point. If it happens
at an assignment to something seemingly unrelated, chances are that you have
an array overrun. If the location changes frequently, it can be useful to
make the watchpoint conditional (the number of garbage collections so far can
be useful for this). The precise way to do this kind of stuff is in your
friendly gdb manual...
(sorry, but it would take all day to flesh out the details)

> > GDB gave the following (gdb prompts removed):
> >
> > do_dotCode (call=0x873aff4, op=0x8a5f620, args=0x8a5d010, env=0x86fd0a4)
> >     at dotcode.c:1413
> > 1413            break;
> > 1845        PROTECT(ans = allocVector(VECSXP, nargs));
> > 1846        havenames = 0;
> > 1847        if (dup) {
> > 1849            info.cargs = cargs;
> > 1850            info.allArgs = args;
> > 1851            info.nargs = nargs;
> > 1852            info.functionName = buf;
> > 1853            nargs = 0;
> > 1854            for (pargs = args ; pargs != R_NilValue ; pargs = CDR(pargs)) {
> > 1855                if(argConverters[nargs]) {
> > 1864                    PROTECT(s = CPtrToRObj(cargs[nargs], CAR(pargs), which));
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > 0x080ddc6a in RunGenCollect (size_needed=1515400) at memory.c:1133
> > 1133            SEXP next = NEXT_NODE(s);

-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907
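Peter's watchpoint recipe starts from the address of a corrupted value. One way to obtain such an address, which also covers the arrays handed to the Fortran subroutines that Andy could not instrument with assert(), is to surround each work array with guard words and verify them after every call. This is a swapped-in technique (similar in spirit to what debugging allocators do), not something proposed in the thread. The sketch assumes double-precision arrays; guarded_alloc(), guarded_check(), guarded_free(), GUARD_WORDS and GUARD_VALUE are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical debugging helpers, not part of R or randomForest.
   Each array gets GUARD_WORDS sentinel doubles before and after it. */
#define GUARD_WORDS 4
#define GUARD_VALUE (-123456789.0)

/* Allocate n usable doubles surrounded by guard words.
   Returns a pointer to the first usable element, or NULL on failure. */
static double *guarded_alloc(size_t n)
{
    double *block = malloc((n + 2 * GUARD_WORDS) * sizeof *block);
    if (block == NULL)
        return NULL;
    for (size_t k = 0; k < GUARD_WORDS; k++) {
        block[k] = GUARD_VALUE;                   /* front guard */
        block[GUARD_WORDS + n + k] = GUARD_VALUE; /* back guard  */
    }
    return block + GUARD_WORDS;
}

/* Returns 0 if both guards are intact, -1 if a write landed just before
   the array (underrun), 1 if a write landed just past its end (overrun). */
static int guarded_check(const double *p, size_t n)
{
    const double *block = p - GUARD_WORDS;
    for (size_t k = 0; k < GUARD_WORDS; k++)
        if (block[k] != GUARD_VALUE)
            return -1;
    for (size_t k = 0; k < GUARD_WORDS; k++)
        if (block[GUARD_WORDS + n + k] != GUARD_VALUE)
            return 1;
    return 0;
}

/* Verify the guards one last time, then release the whole block. */
static void guarded_free(double *p, size_t n)
{
    if (p == NULL)
        return;
    assert(guarded_check(p, n) == 0);
    free(p - GUARD_WORDS);
}
```

In rf.c the idea would be to allocate each work array with guarded_alloc(), call the Fortran subroutine, and then run guarded_check() on every array. A nonzero result says that an overrun happened and which array was hit, but not which statement did it; the address of the clobbered guard word is then a natural target for the hardware watchpoint in step 2 of Peter's recipe. Unlike Electric Fence, which makes the offending write fault immediately, this only notices the damage when guarded_check() runs, and only for writes that land inside the guard region.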