----- Original Message -----
> From: "Gerolf Hoflehner" <ghoflehner at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "LLVM Dev" <llvmdev at cs.uiuc.edu>, "Jiangning Liu" <liujiangning1 at gmail.com>, "George Burgess IV" <george.burgess.iv at gmail.com>
> Sent: Monday, September 15, 2014 7:58:59 PM
> Subject: Re: [LLVMdev] Testing the new CFL alias analysis
>
> I filed bugzilla pr20954.

Thanks!

-Hal

> -Gerolf
>
> On Sep 15, 2014, at 2:56 PM, Gerolf Hoflehner <ghoflehner at apple.com> wrote:
>
> On CINT2006 ARM64/ref input/lto+pgo I measure practically no performance difference for the 7 benchmarks that compile. This includes bzip2 (although with a different source base than in CINT2000), mcf, hmmer, sjeng, h264ref, astar, and xalancbmk.
>
> On Sep 15, 2014, at 11:59 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> ----- Original Message -----
>> From: "Gerolf Hoflehner" <ghoflehner at apple.com>
>> To: "Jiangning Liu" <liujiangning1 at gmail.com>, "George Burgess IV" <george.burgess.iv at gmail.com>, "Hal Finkel" <hfinkel at anl.gov>
>> Cc: "LLVM Dev" <llvmdev at cs.uiuc.edu>
>> Sent: Sunday, September 14, 2014 12:15:02 AM
>> Subject: Re: [LLVMdev] Testing the new CFL alias analysis
>>
>> In lto+pgo, some of the CINT2006 benchmarks (5 out of 12, with the usual suspects like perlbench and gcc among them, using -flto -Wl,-mllvm,-use-cfl-aa -Wl,-mllvm,-use-cfl-aa-in-codegen) don't compile.
>
> On what platform? Could you bugpoint it and file a report?
>
> Ok, I'll see that I can get a small test case.
>
>> Has the implementation been tested with lto?
>
> I've not.
>
>> If not, please stress the implementation more.
>> Do we know reasons for gains? Where did you expect the biggest gains?
>
> I don't want to make a global statement here. My expectation is that we'll see wins from increasing register pressure ;) -- hoisting more loads out of loops (there are certainly cases involving multiple levels of dereferencing and insert/extract instructions where CFL can provide a NoAlias answer where BasicAA gives up). Obviously, we'll also have problems if we increase pressure too much.
>
> Maybe. But I prefer the OoO HW to handle hoisting. It is hard to tune in the compiler.
> I'm also curious about the impact on loop transformations.
>
>> Some of the losses will likely boil down to increased register pressure.
>
> Agreed.
>
>> Looks like the current performance numbers pose a good challenge for gaining new and refreshing insights into our heuristics (and for smoothing out the implementation along the way).
>
> It certainly seems that way.
>
> Thanks again,
> Hal
>
> Cheers
> Gerolf
>
> On Sep 12, 2014, at 1:27 AM, Jiangning Liu <liujiangning1 at gmail.com> wrote:
>
> Hi Hal,
>
> I ran SPEC2000 on cortex-a57 (AArch64) and got the following results
> (they measure run-time reduction, so negative is better performance):
>
> spec.cpu2000.ref.183_equake   33.77%
> spec.cpu2000.ref.179_art      13.44%
> spec.cpu2000.ref.256_bzip2     7.80%
> spec.cpu2000.ref.186_crafty    3.69%
> spec.cpu2000.ref.175_vpr       2.96%
> spec.cpu2000.ref.176_gcc       1.77%
> spec.cpu2000.ref.252_eon       1.77%
> spec.cpu2000.ref.254_gap       1.19%
> spec.cpu2000.ref.197_parser    1.15%
> spec.cpu2000.ref.253_perlbmk   1.11%
> spec.cpu2000.ref.300_twolf    -1.04%
>
> So we can see almost all of them got worse performance.
>
> The command-line options I'm using are "-O3 -std=gnu89 -ffast-math -fslp-vectorize -fvectorize -mcpu=cortex-a57 -mllvm -use-cfl-aa -mllvm -use-cfl-aa-in-codegen".
>
> I didn't try compile time, and I think your test on a POWER7 native build should already mean something for other hosts. Also, I don't have a good benchmark suite for compile-time testing. My past experience shows that both the llvm-test-suite (single/multiple) and the SPEC benchmarks are not good benchmarks for compile-time testing.
>
> Thanks,
> -Jiangning
>
> 2014-09-04 1:11 GMT+08:00 Hal Finkel <hfinkel at anl.gov>:
>
> Hello everyone,
>
> One of Google's summer interns, George Burgess IV, created an implementation of the CFL pointer-aliasing analysis algorithm, and this has now been added to LLVM trunk. Now we should determine whether it is worthwhile adding this to the default optimization pipeline. For ease of testing, I've added the command-line option -use-cfl-aa, which will cause the CFL analysis to be added to the optimization pipeline. This can be used with the opt program, and also via Clang by passing: -mllvm -use-cfl-aa.
>
> For the purpose of testing with those targets that make use of aliasing analysis during code generation, there is also a corresponding -use-cfl-aa-in-codegen option.
>
> Running the test suite on one of our IBM POWER7 systems (comparing -O3 -mcpu=native to -O3 -mcpu=native -mllvm -use-cfl-aa -mllvm -use-cfl-aa-in-codegen [testing without use in code generation was essentially the same]), I see no significant compile-time changes, and the following performance results:
>
> speedup:
> MultiSource/Benchmarks/mafft/pairlocalalign: -11.5862% +/- 5.9257%
>
> slowdown:
> MultiSource/Benchmarks/FreeBench/neural/neural: 158.679% +/- 22.3212%
> MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset: 0.627176% +/- 0.290698%
> MultiSource/Benchmarks/Ptrdist/ks/ks: 57.5457% +/- 21.8869%
>
> I ran the test suite 20 times in each configuration, using make -j48 each time, so I'll only pick up large changes. I've not yet investigated the cause of the slowdowns (or the speedup), and I really need people to try this on x86, ARM, etc. It appears, however, that the better aliasing-analysis results might have some negative unintended consequences, and we'll need to look at those closely.
>
> Please let me know how this fares on your systems!
>
> Thanks again,
> Hal
>
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

--
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory
For CPU2006 4-copy specint rate runs, we measured some small gains (2%, 3%, and 6%, respectively) for bzip2, gcc, and sjeng, and some small losses (-3% and -3%, respectively) for h264ref and astar. This is for x86 and did not use PGO, but used LTO and -m32 (along with the new CFL alias flags). Overall, there is about a 0.5% gain in specint rate.

-Dibyendu Das
AMD Compiler Group

-----Original Message-----
From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Hal Finkel
Sent: Tuesday, September 16, 2014 7:06 AM
To: Gerolf Hoflehner
Cc: George Burgess IV; LLVM Dev
Subject: Re: [LLVMdev] Testing the new CFL alias analysis
Thanks all for the feedback. :)

- George

> On Sep 18, 2014, at 1:10 PM, Das, Dibyendu <Dibyendu.Das at amd.com> wrote:
>
> For CPU2006 4-copy specint rate runs, we measured some small gains (2%, 3%, and 6%, respectively) for bzip2, gcc, and sjeng, and some small losses (-3% and -3%, respectively) for h264ref and astar. This is for x86 and did not use PGO, but used LTO and -m32 (along with the new CFL alias flags). Overall, there is about a 0.5% gain in specint rate.
>
> -Dibyendu Das
> AMD Compiler Group