On 10/18/2012 10:36 AM, David Blaikie wrote:
> On Thu, Oct 18, 2012 at 12:48 AM, Renato Golin <rengolin at systemcall.org> wrote:
>> On 18 October 2012 05:11, Rick Foos <rfoos at codeaurora.org> wrote:
>>> I don't think the GDB testsuite should block a commit; it can vary by a
>>> few tests, and the tests rarely if ever all pass 100%. Tracking the
>>> results over time can catch big regressions, as well as the ones that
>>> slowly increase the failed tests.
>>
>> Agreed. It should at least serve as a comparison between two branches,
>> but hopefully be actively monitored.
>>
>> Maybe it would be good to add a directory (if there isn't one yet) to
>> the testsuite repository, or at least the code necessary to make it run
>> on LLVM.
>
> The clang-tests repository (
> http://llvm.org/viewvc/llvm-project/clang-tests/ ) includes an Apple
> GCC 4.2 compatible version of the GCC and GDB test suites that Apple
> run internally. I'm working on bringing up an equivalent public
> buildbot, at least for the GDB suite, here (
> http://lab.llvm.org:8011/builders/clang-x86_64-darwin10-gdb-gcc ) -
> just a few timing-out tests I need to look at to get that green.
> Apparently it's fairly stable.
>
> Beyond that I'll be trying to bring up one with the latest suite (7.4
> is what I've been playing with) on Linux as well.

Since you're going to bring a bot up in zorg, I'll stop working on bringing
my testsuite runner forward. A couple of thoughts:

1) I've been running on the latest test suite, polling once a day. I think
Eric and anyone working on DWARF 4/5 should be running against the upstream
testsuite. (I have no problem with running 7.4 too.) It's been stable to
run at the tip of GDB this way; the test results aren't varying much.

2) A surprise benefit of running this way is that hundreds of obsolete or
broken tests are getting removed. This hasn't resulted in any broken
backwards compatibility here, at least. It saves tons of time debugging
tests that don't work, and developing around compatibility things that
reasonable people have decided no longer matter.

3) Running the testsuite against two compilers at a time makes it easier to
see regressions. By comparing against a known stable compiler, or GCC,
regressions are visible from the summary numbers alone.

4) I have plots of the summary numbers online with a window of a month or
two. The trend allows you to see regressions occurring, and remaining as
regressions. Sometimes the GDB testsuite or a compiler has a bad day; the
trend lets you see a stable regression and, when you get around to it,
tells you when the regression started.

<soapbox>
I've been doing this with Jenkins. It's fairly easy to set up, and it does
the plotting. Developers can grab a copy of the script to duplicate a run
on their broken compiler. Running the testsuite under JNLP increased the
number of executed tests - I don't know why, it just did.
</soapbox>

> - David

-- 
Rick Foos
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation
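(Editor's note: a minimal sketch of the summary-number comparison described in point 3 above. The parsing of the "# of ..." counter lines matches the standard DejaGnu gdb.sum summary format; the function names and the idea of diffing a Clang run against a GCC baseline are illustrative, not taken from Rick's actual script.)

```python
import re

# DejaGnu's gdb.sum summary section contains counter lines such as:
#   # of expected passes            1234
#   # of unexpected failures          56
SUMMARY_RE = re.compile(r"# of (\w[\w ]*\w)\s+(\d+)")

def parse_summary(text):
    """Return {category: count} from the summary section of a .sum file."""
    return {m.group(1): int(m.group(2)) for m in SUMMARY_RE.finditer(text)}

def compare_summaries(baseline, candidate):
    """Per-category delta (candidate - baseline).

    A positive delta in 'unexpected failures' is the kind of movement
    that shows up as a regression on the trend plots."""
    cats = set(baseline) | set(candidate)
    return {c: candidate.get(c, 0) - baseline.get(c, 0) for c in sorted(cats)}
```

Feeding the summary sections of a GCC run and a Clang run (e.g. `gcc.sum` and `clang.sum`) through `compare_summaries` gives the same at-a-glance regression signal as comparing the raw summary numbers by hand.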
On Thu, Oct 18, 2012 at 11:19 AM, Rick Foos <rfoos at codeaurora.org> wrote:
> Since you're going to bring a bot up in zorg, I'll stop working on
> bringing my testsuite runner forward.

I'm still interested in any details you have about issues you've
resolved/learnt, etc.

> A couple of thoughts:
>
> 1) I've been running on the latest test suite, polling once a day. I
> think Eric and anyone working on DWARF 4/5 should be running against the
> upstream testsuite. (I have no problem with running 7.4 too.)

Interesting thought. (Just so we're all on the same page: when you say
"test suite" you're talking about the GDB DejaGnu test suite - the same
one, well, a more recent version of the one, that's in clang-tests.)
Though I hesitate to have such a moving target, I can see how it could be
useful.

> It's been stable to run at the tip of GDB this way; the test results
> aren't varying much.

With the right heuristics I suppose this could be valuable, but it will
require more work to find the right signal in the (even small) noise.

> 2) A surprise benefit of running this way is that hundreds of obsolete
> or broken tests are getting removed. This hasn't resulted in any broken
> backwards compatibility here, at least. It saves tons of time debugging
> tests that don't work, and developing around compatibility things that
> reasonable people have decided no longer matter.

Fair point.

> 3) Running the testsuite against two compilers at a time makes it easier
> to see regressions. By comparing against a known stable compiler, or
> GCC, regressions are visible from the summary numbers alone.

I assume GDB runs their own test suite against some version (or the
simultaneous latest) of GCC? If we can't scrape those existing results we
can reproduce them (running the full suite with both GCC & Clang
side-by-side).

> 4) I have plots of the summary numbers online with a window of a month
> or two. The trend allows you to see regressions occurring, and remaining
> as regressions. Sometimes the GDB testsuite or a compiler has a bad day;
> the trend lets you see a stable regression and, when you get around to
> it, tells you when the regression started.

Yep. Also, if we're trying to address all these issues, one approach would
be to prioritize the very stable failures (where Clang fails a test that
GCC passes, and does so consistently for a long time) first. Then look at
the unstable ones last - figure out which compiler's to blame, "XFAIL:
clang" them, or whatever is necessary.

> <soapbox>
> I've been doing this with Jenkins. It's fairly easy to set up, and it
> does the plotting. Developers can grab a copy of the script to duplicate
> a run on their broken compiler. Running the testsuite under JNLP
> increased the number of executed tests - I don't know why, it just did.
> </soapbox>

I wouldn't mind seeing your Jenkins setup/config/tweaks/etc. as a
reference point, if you've got it lying around.

- David
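(Editor's note: a sketch of how the side-by-side GCC/Clang run mentioned above might be driven. It assumes the usual DejaGnu convention of passing `CC_FOR_TARGET` through `RUNTESTFLAGS` to pick the compiler that builds the test programs; the directory names and compiler paths are placeholders, not part of either poster's actual setup.)

```python
import subprocess

def check_command(compiler, builddir):
    """Build a 'make check' invocation that runs the GDB testsuite with
    one particular compiler building the test executables."""
    return [
        "make", "-C", builddir, "check",
        # DejaGnu convention: override the compiler used for test programs.
        "RUNTESTFLAGS=CC_FOR_TARGET=%s" % compiler,
    ]

def run_side_by_side(compilers, dry_run=True):
    """Run (or, in dry_run mode, just print) the testsuite once per
    compiler, each in its own build directory so the gdb.sum files
    don't clobber each other."""
    cmds = [check_command(cc, "gdb-build-%s" % name) for name, cc in compilers]
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            # Individual test FAILs are expected; don't abort the driver.
            subprocess.run(cmd, check=False)
    return cmds
```

The two resulting `gdb.sum` files can then be compared by summary numbers, exactly as described earlier in the thread.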
On 10/18/2012 01:39 PM, David Blaikie wrote:
> On Thu, Oct 18, 2012 at 11:19 AM, Rick Foos <rfoos at codeaurora.org> wrote:
>> Since you're going to bring a bot up in zorg, I'll stop working on
>> bringing my testsuite runner forward.
>
> I'm still interested in any details you have about issues you've
> resolved/learnt, etc.
>
>> 1) I've been running on the latest test suite, polling once a day. I
>> think Eric and anyone working on DWARF 4/5 should be running against
>> the upstream testsuite. (I have no problem with running 7.4 too.)
>
> Interesting thought. (Just so we're all on the same page: when you say
> "test suite" you're talking about the GDB DejaGnu test suite - the same
> one, well, a more recent version of it, that's in clang-tests.) Though
> I hesitate to have such a moving target, I can see how it could be
> useful.

Yes, the sourceware.org site. I hesitated as well, but I tried it and it's
OK.

>> It's been stable to run at the tip of GDB this way; the test results
>> aren't varying much.
>
> With the right heuristics I suppose this could be valuable, but it will
> require more work to find the right signal in the (even small) noise.

I wrote a 10-line awk script to create a CSV file out of the test
summaries, to make a one-to-one comparison of tests. It's over 70 lines
now... It's not that I should have used Python; it's that everything is an
exception to the rules. You can get close, but I can't say you can get
perfect signals.

From a compiler developer's point of view, the spreadsheet was worthless.
We're not testing GDB, but rather what the compiler feeds to GDB. Take the
log file, check out the suite, rerun a failing test, use dwarfdump and
llvm-dwarfdump, and find the "bad" DWARF records produced by the compiler.
All the eventual bugs are about DWARF records, plus a GDB testsuite test
to duplicate them. A bad/confused DWARF record fails multiple tests
without a way to map a failure back to DWARF. In the end, a fine-grained
signal doesn't do what you might want.

>> 2) A surprise benefit of running this way is that hundreds of obsolete
>> or broken tests are getting removed. It saves tons of time debugging
>> tests that don't work.
>
> Fair point.
>
>> 3) Running the testsuite against two compilers at a time makes it
>> easier to see regressions. By comparing against a known stable
>> compiler, or GCC, regressions are visible from the summary numbers
>> alone.
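(Editor's note: a sketch of what the summary-to-CSV comparison Rick describes might start out as. It parses the per-test `PASS:`/`FAIL:` lines of two `.sum` files and emits only the tests whose result differs. As Rick notes, the real suite is full of exceptions - duplicate and nondeterministic test names, for one - which this simplified dict-based version deliberately ignores; the function names are illustrative.)

```python
import csv
import io
import re

# Each per-test line in a gdb.sum file looks like:
#   FAIL: gdb.base/break.exp: run to main
RESULT_RE = re.compile(
    r"^(PASS|FAIL|XPASS|XFAIL|KFAIL|UNTESTED|UNSUPPORTED|UNRESOLVED): (.*)$",
    re.M)

def results(sum_text):
    """Map test name -> result for one .sum file.

    Simplification: duplicate test names overwrite each other, which the
    real testsuite output does produce."""
    return {m.group(2): m.group(1) for m in RESULT_RE.finditer(sum_text)}

def diff_csv(baseline_text, candidate_text):
    """Return CSV text listing tests whose result differs between runs."""
    base, cand = results(baseline_text), results(candidate_text)
    out = io.StringIO()
    w = csv.writer(out)
    w.writerow(["test", "baseline", "candidate"])
    for name in sorted(set(base) | set(cand)):
        b = base.get(name, "MISSING")
        c = cand.get(name, "MISSING")
        if b != c:
            w.writerow([name, b, c])
    return out.getvalue()
```

Run over a GCC `.sum` as the baseline and a Clang `.sum` as the candidate, the rows that come out are exactly the stable-failure candidates David suggests prioritizing.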
> I assume GDB runs their own test suite against some version (or the
> simultaneous latest) of GCC? If we can't scrape those existing results
> we can reproduce them (running the full suite with both GCC & Clang
> side-by-side).

gdb-testers at sourceware.org has a run every night. Yes, I reproduce
them.

A non-x86 target has very different results, so I look for a good GCC
cross compiler to establish a baseline. In the case of Clang, all the
architectures share the DWARF processing, so an x86 run covers a lot more
of the DWARF processing than worrying too much about a cross-compiler run.
(But some worry about limiting fixes to regressions from a cross-compiled
GCC, so I have to run that as well.)

>> 4) I have plots of the summary numbers online with a window of a month
>> or two. The trend allows you to see regressions occurring, and
>> remaining as regressions.
>
> Yep. Also, if we're trying to address all these issues, one approach
> would be to prioritize the very stable failures (where Clang fails a
> test that GCC passes, and does so consistently for a long time) first.
> Then look at the unstable ones last - figure out which compiler's to
> blame, "XFAIL: clang" them, or whatever is necessary.

I avoid the XFAIL thing. Plotting all the lines from the summary makes
more sense, and it's what you see when you run the test manually. When you
move a FAIL artificially to XFAIL, the plot just has a few V's in it where
the FAIL line drops and the XFAIL line goes up. No new information. I
prefer leaving the actual summary numbers in place; all the data you need
is there.

As you might have guessed, I like tests that fail, and want to get rid of
the ones that pass too often :)

> I wouldn't mind seeing your Jenkins setup/config/tweaks/etc. as a
> reference point, if you've got it lying around.

I'll see what I can send, or it's just as easy to walk through it. Jenkins
isn't really like buildbot. Do you have Jenkins running there?

> - David

-- 
Rick Foos
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
hosted by The Linux Foundation