On Oct 8, 2011, at 12:05 PM, Duncan Sands wrote:

> PS: With -fplugin-arg-dragonegg-enable-gcc-optzns the LLVM optimizers are run at
> the following levels:
>
>   Command line option    LLVM optimizers run at
>   -------------------    ----------------------
>   -O1                    tiny amount of optimization
>   -O2 or -O3             -O1
>   -O4 or -O5             -O2
>   -O6 or better          -O3

Hi Duncan,

Out of curiosity, why do you follow this approach? People generally use -O2 or -O3. I'd recommend switching dragonegg to line those up with whatever you want people to use.

-Chris
Hi Chris,

>> PS: With -fplugin-arg-dragonegg-enable-gcc-optzns the LLVM optimizers are run at
>> the following levels:
>>
>>   Command line option    LLVM optimizers run at
>>   -------------------    ----------------------
>>   -O1                    tiny amount of optimization
>>   -O2 or -O3             -O1
>>   -O4 or -O5             -O2
>>   -O6 or better          -O3
>
> Hi Duncan,
>
> Out of curiosity, why do you follow this approach? People generally use -O2 or -O3. I'd recommend switching dragonegg to line those up with whatever you want people to use.

Note that this is done only when the GCC optimizers are run. The basic observation is that running the LLVM optimizers at -O3 after running the GCC optimizers (at -O3) results in slower code! I mean slower than what you get by running the LLVM optimizers at -O1 or -O2. I haven't found time to analyse this curiosity yet. It might simply be that the LLVM inlining level is too high given that inlining has already been done by GCC. Anyway, I didn't want to run LLVM at -O3 because of this.

The next question was: which is better, LLVM at -O1 or at -O2? My first experiments showed that code quality was essentially the same. Since -O1 gives a nice compile-time speedup, I settled on using -O1. Also, -O1 makes some sense if the GCC optimizers did a good job and all that is needed is to clean up the mess that converting to LLVM IR can produce. However, later experiments showed that -O2 does seem to consistently result in slightly better code, so I've been thinking of using -O2 instead. This is one reason I encouraged Jack to use -O4 in his benchmarks (i.e. GCC at -O3, LLVM at -O2): to see if they show the same thing.

Ciao, Duncan.

PS: Dragonegg is a nice platform for understanding what the GCC optimizers do better than LLVM. It's a pity no one seems to have used it for this.
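For readers skimming the thread, the level mapping Duncan describes can be summarised as a small C sketch. This is illustrative only, not dragonegg's actual source: the function name llvm_opt_level is invented, and returning 0 for the "tiny amount of optimization" case is just a stand-in for whatever dragonegg really does there.

    /* Sketch (not real dragonegg code): map the GCC -O level given on the
     * command line to the LLVM optimization level dragonegg runs when
     * -fplugin-arg-dragonegg-enable-gcc-optzns is in effect.
     * Assumes -O1 or higher was passed. */
    static int llvm_opt_level(int gcc_opt_level)
    {
        if (gcc_opt_level == 1)
            return 0;   /* -O1: only a tiny amount of LLVM optimization */
        if (gcc_opt_level <= 3)
            return 1;   /* -O2 or -O3: LLVM optimizers at -O1 */
        if (gcc_opt_level <= 5)
            return 2;   /* -O4 or -O5: LLVM optimizers at -O2 */
        return 3;       /* -O6 or better: LLVM optimizers at -O3 */
    }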
On Wed, Oct 12, 2011 at 12:40 AM, Duncan Sands wrote:
> The basic
> observation is that running the LLVM optimizers at -O3 after running the
> GCC optimizers (at -O3) results in slower code! I mean slower than what
> you get by running the LLVM optimizers at -O1 or -O2. I didn't find time
> to analyse this curiosity yet. It might simply be that the LLVM inlining
> level is too high given that inlining has already been done by GCC. Anyway,
> I didn't want to run LLVM at -O3 because of this.

If you inline too much you will get slower code, because you make poorer use of the instruction cache in most modern processors.

C99 and C++ allow one to declare functions inline at the point that they are declared. For earlier C standards, I believe GCC has an attribute that allows one to mark a function inline at its declaration as a language extension. Lots of other languages do inlining; for example, I understand Java JITs will inline JIT-compiled native code even though the Java language itself doesn't support inlining.

For modern processors with code caches it would be better to inline functions at the point they are used rather than where they are declared. That way one has the choice of better cache usage or avoiding function call overhead. For example:

    int foo( float bar );

    int baz( void )
    {
        return foo( 3 ) inline;  // This call will be fast
    }

    int boo( void )
    {
        return foo( 5 );  // This will make a hot spot at foo's definition
    }

Profile-guided optimizations could take care of this without needing any language extensions. I understand that that is what the Java HotSpot JIT does.

Don Quixote
--
Don Quixote de la Mancha
Dulcinea Technologies Corporation
Software of Elegance and Beauty
http://www.dulcineatech.com
quixote at dulcineatech.com
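As an aside, the per-call-site choice proposed above can be roughly approximated with existing GCC function attributes by exposing the same body through a forced-inline entry point and a forced-out-of-line one, and letting each caller pick. A minimal sketch, assuming GCC's always_inline and noinline attributes; the names foo_inline and foo_outline (and the body) are invented for illustration:

    /* Sketch only: two entry points to the same body, so each call site can
     * choose between inlining (no call overhead on hot paths) and a single
     * shared out-of-line copy (one "hot spot", smaller cache footprint). */
    static inline __attribute__((always_inline)) int foo_inline(float bar)
    {
        return (int)(bar * 2.0f);     /* stand-in for foo's real work */
    }

    __attribute__((noinline)) int foo_outline(float bar)
    {
        return foo_inline(bar);       /* the one out-of-line copy */
    }

    int baz(void)
    {
        return foo_inline(3);         /* body inlined here */
    }

    int boo(void)
    {
        return foo_outline(5);        /* shares foo_outline's code */
    }

This is only an approximation of a true call-site inline hint, and profile-guided inlining, as mentioned above, remains the less intrusive way to get the same effect.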
On Wed, Oct 12, 2011 at 09:40:53AM +0200, Duncan Sands wrote:
> Hi Chris,
>
>>> PS: With -fplugin-arg-dragonegg-enable-gcc-optzns the LLVM optimizers are run at
>>> the following levels:
>>>
>>>   Command line option    LLVM optimizers run at
>>>   -------------------    ----------------------
>>>   -O1                    tiny amount of optimization
>>>   -O2 or -O3             -O1
>>>   -O4 or -O5             -O2
>>>   -O6 or better          -O3
>>
>> Hi Duncan,
>>
>> Out of curiosity, why do you follow this approach? People generally use -O2 or -O3. I'd recommend switching dragonegg to line those up with whatever you want people to use.
>
> note that this is done only when the GCC optimizers are run. The basic
> observation is that running the LLVM optimizers at -O3 after running the
> GCC optimizers (at -O3) results in slower code! I mean slower than what
> you get by running the LLVM optimizers at -O1 or -O2. I didn't find time
> to analyse this curiosity yet. It might simply be that the LLVM inlining
> level is too high given that inlining has already been done by GCC. Anyway,
> I didn't want to run LLVM at -O3 because of this. The next question was:
> what is better: LLVM at -O1 or at -O2? My first experiments showed that
> code quality was essentially the same. Since at -O1 you get a nice compile
> time speedup, I settled on using -O1. Also -O1 makes some sense if the GCC
> optimizers did a good job and all that is needed is to clean up the mess that
> converting to LLVM IR can produce. However later experiments showed that -O2
> does seem to consistently result in slightly better code, so I've been thinking
> of using -O2 instead. This is one reason I encouraged Jack to use -O4 in his
> benchmarks (i.e. GCC at -O3, LLVM at -O2) - to see if they show the same thing.

Duncan,

My preliminary runs of the pb05 benchmarks at -O4, -O5 and -O6 using -fplugin-arg-dragonegg-enable-gcc-optzns didn't show any significant run time performance changes compared to -fplugin-arg-dragonegg-enable-gcc-optzns -O3. I'll rerun those and post the tabulated results this weekend. I am using -ffast-math -funroll-loops as well in the optimization flags. Perhaps I should repeat the benchmarks without those flags.

IMHO, the more important thing is to fish out the remaining regressions in the llvm vectorization code by defaulting -fplugin-arg-dragonegg-enable-gcc-optzns on in dragonegg svn once llvm 3.0 has branched. Hopefully this will get us wider testing of the llvm vectorization support and some additional smaller test cases that expose the remaining bugs in that code.

Jack

> Ciao, Duncan.
>
> PS: Dragonegg is a nice platform for understanding what the GCC optimizers
> do better than LLVM. It's a pity no-one seems to have used it for this.
The Polyhedron 2005 benchmark results for dragonegg svn at r141775 using FSF gcc 4.6.2svn measured on x86_64-apple-darwin11 are listed below. The benchmarks used the optimization flags...

a) gfortran-fsf-4.6 -msse4 -ffast-math -funroll-loops -O3 %n.f90 -o %n
b) de-gfortran46 -msse4 -ffast-math -funroll-loops -O3 %n.f90 -o %n
c) de-gfortran46 -msse4 -ffast-math -funroll-loops -O3 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n
d) de-gfortran46 -msse4 -ffast-math -funroll-loops -O4 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n
e) de-gfortran46 -msse4 -ffast-math -funroll-loops -O5 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n
f) de-gfortran46 -msse4 -ffast-math -funroll-loops -O6 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n

and no runtime regressions are observed in any of the cases.

                                Run time (seconds)
Benchmark    gfortran  dragonegg  de+optnz  de+optnz+O4  de+optnz+O5  de+optnz+O6
----------------------------------------------------------------------------------
ac               8.84      10.80      8.90         8.90         8.90         8.90
aermod          17.65      15.98     14.55        14.53        14.52        14.47
air              5.50       7.11      6.62         6.62         6.61         6.81
capacita        32.56      41.44     36.48        36.50        36.49        36.60
channel          1.84       2.53      2.06         2.06         2.07         2.06
doduc           26.66      30.30     27.75        28.08        28.08        28.19
fatigue          8.47       9.14      8.36         8.21         8.24         8.08
gas_dyn          4.27      11.75      4.44         4.44         4.44         4.45
induct          13.09      24.02     12.20        12.17        12.14        12.25
linpk           15.46      15.56     15.75        15.76        15.76        15.75
mdbx            11.21      12.20     11.84        11.85        11.85        11.85
nf              27.85      28.71     29.31        29.30        29.31        29.24
protein         33.43      39.10     37.44        37.49        37.50        37.44
rnflow          24.02      31.95     26.44        26.51        26.51        26.46
test_fpu         8.05      11.52      9.39         9.37         9.38         9.39
tfft             1.87       1.91      1.93         1.93         1.93         1.93

mean time       10.87      13.68     11.38        11.38        11.38        11.39

                              Compile time (seconds)
Benchmark    gfortran  dragonegg  de+optnz  de+optnz+O4  de+optnz+O5  de+optnz+O6
--------------------------------------------------------------------------------
ac               1.17       0.30      0.60         0.62         0.62         0.62
aermod          44.13      25.67     32.26        32.80        32.78        33.06
air              2.22       1.02      1.48         1.49         1.49         1.54
capacita         1.77       0.49      0.92         0.93         0.93         0.96
channel          0.62       0.23      0.40         0.41         0.41         0.41
doduc            5.34       1.61      3.16         3.22         3.22         3.27
fatigue          1.76       0.89      1.20         1.21         1.21         1.26
gas_dyn          3.02       0.65      1.18         1.22         1.21         1.24
induct           4.01       1.71      2.68         2.80         2.68         2.78
linpk            0.78       0.21      0.46         0.47         0.47         0.54
mdbx             1.85       0.68      1.19         1.20         1.20         1.22
nf               1.83       0.34      0.78         0.76         0.76         0.77
protein          4.01       0.99      1.78         1.77         1.77         1.78
rnflow           5.51       1.30      2.63         2.63         2.65         2.68
test_fpu         4.38       1.00      2.10         2.10         2.11         2.14
tfft             0.56       0.19      0.32         0.33         0.33         0.33

                                Code Size (bytes)
Benchmark    gfortran  dragonegg  de+optnz  de+optnz+O4  de+optnz+O5  de+optnz+O6
--------------------------------------------------------------------------------
ac              50968      26736     39120        39120        39120        39120
aermod        1265640    1035724   1051600      1050504      1050504      1066424
air             73988      61908     53740        53740        53740        57884
capacita        78000      41416     45584        45552        45552        49648
channel         34784      22696     26792        26792        26792        26792
doduc          197240     124408    141144       140856       140856       140648
fatigue         86080      69824     69984        69840        69840        73936
gas_dyn        119744      59112     67488        67384        67384        67360
induct         174976     171344    167344       167344       167344       171320
linpk           38648      18872     27080        27056        27056        26840
mdbx            82112      53692     61980        57884        57884        61916
nf              75896      23992     36200        36200        36200        36200
protein        132040      75032     87208        87208        87208        87208
rnflow         181120      76024    100712       100608       100608       100488
test_fpu       155072      58368     82752        78632        78632        82632
tfft            30768      18640     18488        18488        18488        18488