Hi Jack,

On 29/05/13 22:04, Jack Howarth wrote:
> On Wed, May 29, 2013 at 03:25:30PM +0200, Duncan Sands wrote:
>> Hi Jack, I pulled the loop vectorizer and fast math changes into the 3.3 branch,
>> so hopefully they will be part of 3.3 rc3 (and 3.3 final!). It would be great
>> if you could redo the benchmarks rc3.
>>
>
> Duncan,
> As requested, appended are the updated Polyhedron 2005 benchmark results with both RC1 and RC3 llvm 3.3 testing.

thanks for doing this. As rc3 hasn't been tagged yet, I assume you used latest 3.3svn?

> There is a small improvement in the dragonegg results (without -fplugin-arg-dragonegg-enable-gcc-optzns) in RC3. I assume
> we still only have partial coverage of all of the -ffast-math optimizations performed by FSF gcc in llvm's fast-math
> support, correct?

These results are very disappointing, I was hoping to see a big improvement
somewhere instead of no real improvement anywhere (except for gas_dyn) or a
regression (eg: mdbx). I think LLVM now has a reasonable array of fast-math
optimizations. I will try to find time to poke at gas_dyn and induct: since
turning on gcc's optimizations there halve the run-time, LLVM's IR optimizers
are clearly missing something important.

Ciao, Duncan.

> Jack
>
> Tested on x86_apple-darwin12
>
> Compile Flags: -ffast-math -funroll-loops -O3
>
> de-gfc47: /sw/lib/gcc4.7/bin/gfortran -fplugin=/sw/lib/gcc4.7/lib/dragonegg.so -specs=/sw/lib/gcc4.7/lib/integrated-as.specs
> de-gfc48: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
> de-gfc47+optzns: /sw/lib/gcc4.7/bin/gfortran -fplugin=/sw/lib/gcc4.7/lib/dragonegg.so -specs=/sw/lib/gcc4.7/lib/integrated-as.specs
> +-fplugin-arg-dragonegg-enable-gcc-optzns
> de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
> +-fplugin-arg-dragonegg-enable-gcc-optzns
> gfortran47: /sw/bin/gfortran-fsf-4.7
> gfortran48: /sw/bin/gfortran-fsf-4.8
>
> Run time (secs)
>
> Benchmark   de-gfc47  de-gfc47  de-gfc48  de-gfc48  de-gfc47  de-gfc47  de-gfc48  de-gfc48  gfortran47  gfortran48
>                                                     +optzns   +optzns   +optzns   +optzns
>             RC1       RC3       RC1       RC3       RC1       RC3       RC1       RC3
> ac          11.39     11.66     11.39     11.58     8.09      8.07      8.14      8.14      8.18        8.05
> aermod      16.35     16.47     16.00     16.44     14.50     14.61     15.28     14.43     16.45       16.23
> air         6.88      6.87      6.77      6.77      5.42      5.42      5.28      5.27      5.83        5.73
> capacita    39.85     37.80     39.83     37.86     34.71     34.81     33.47     33.53     32.51       33.02
> channel     2.05      2.06      2.05      2.06      2.15      2.15      1.99      1.99      1.83        1.83
> doduc       27.10     27.43     27.37     27.39     26.75     27.03     26.31     26.24     25.91       25.76
> fatigue     8.85      8.84      8.81      8.88      7.72      7.75      5.60      5.42      8.26        5.60
> gas_dyn     11.76     8.25      11.50     7.94      4.51      4.52      4.21      4.20      3.88        3.59
> induct      24.01     24.45     24.04     24.04     11.86     11.90     11.85     11.85     12.08       12.21
> linpk       15.43     15.48     15.48     15.49     15.40     15.47     15.83     15.81     15.37       15.64
> mdbx        11.92     12.14     11.91     12.15     11.30     11.29     11.27     11.27     11.18       11.42
> nf          29.57     30.08     30.04     30.11     29.50     29.82     29.59     29.86     27.21       27.25
> protein     36.15     36.15     35.21     35.17     35.93     36.02     34.16     34.06     31.88       31.81
> rnflow      27.02     27.08     25.92     26.12     26.77     26.83     22.20     22.21     24.67       21.21
> test_fpu    11.49     11.55     11.47     11.52     9.11      9.11      9.30      9.30      7.90        8.01
> tfft        1.92      1.94      1.92      1.92      1.92      1.92      1.89      1.90      1.86        1.90
>
> Geom. Mean  13.19     12.95     13.10     12.83     10.99     11.02     10.52     10.47     10.60       10.22
>
> Compile time (secs)
>
> Benchmark   de-gfc47  de-gfc47  de-gfc48  de-gfc48  de-gfc47  de-gfc47  de-gfc48  de-gfc48  gfortran47  gfortran48
>                                                     +optzns   +optzns   +optzns   +optzns
>             RC1       RC3       RC1       RC3       RC1       RC3       RC1       RC3
> ac          0.62      1.63      0.29      0.93      2.20      1.02      0.71      0.73      2.88        2.08
> aermod      35.19     35.57     20.44     35.86     43.50     43.39     42.90     43.08     42.75       55.97
> air         1.16      1.23      1.11      1.26      2.72      2.68      2.40      2.35      4.48        4.28
> capacita    0.52      0.60      0.52      0.62      1.02      0.94      1.04      0.96      1.90        1.89
> channel     0.26      0.28      0.23      0.30      0.47      0.45      0.50      0.47      0.65        0.75
> doduc       1.74      1.89      1.74      1.91      3.78      3.71      3.53      3.55      6.03        5.68
> fatigue     0.91      0.91      0.87      0.91      1.33      1.30      1.49      1.49      1.97        2.04
> gas_dyn     0.70      0.87      0.63      0.88      1.40      1.37      1.39      1.39      3.39        2.44
> induct      1.95      1.83      1.77      1.83      2.87      2.81      2.99      3.02      4.08        4.42
> linpk       0.25      0.32      0.21      0.32      0.53      0.52      0.72      0.73      0.92        1.25
> mdbx        0.66      0.73      0.61      0.75      1.30      1.26      1.24      1.15      2.16        1.90
> nf          0.39      0.55      0.35      0.55      0.80      0.80      0.74      0.74      2.12        1.67
> protein     1.12      1.18      1.03      1.20      2.01      1.99      1.79      1.77      4.39        3.62
> rnflow      1.26      1.55      1.19      1.55      2.93      2.84      2.72      2.73      6.43        5.47
> test_fpu    0.91      1.12      0.85      1.13      2.27      5.06      2.22      2.23      5.28        4.26
> tfft        0.22      0.24      0.18      0.22      0.39      0.40      0.46      0.46      0.59        0.78
>
> Executable (bytes)
>
> Benchmark   de-gfc47  de-gfc47  de-gfc48  de-gfc48  de-gfc47  de-gfc47  de-gfc48  de-gfc48  gfortran47  gfortran48
>                                                     +optzns   +optzns   +optzns   +optzns
>             RC1       RC3       RC1       RC3       RC1       RC3       RC1       RC3
> ac          26776     30896     26792     30912     47160     47160     34928     34928     59120       42784
> aermod      1023024   1035312   1023064   1031248   1052728   1052728   1031576   1031568   1392840     1286136
> air         61940     61940     61948     61948     65964     65964     61876     61876     110768      106680
> capaci      41344     45440     41144     41144     45440     45440     45040     45040     77920       73248
> channe      22736     22600     22744     22608     26696     22600     22552     22552     34704       34656
> doduc       128376    120188    128384    120196    140580    140580    136296    136296    205320      189040
> fatigu      65648     69744     65640     69736     69808     69808     73848     73848     90240       82040
> gas_dy      54840     58936     54936     59032     63144     63144     71304     71304     123680      99184
> induct      163064    163064    158792    162888    163192    167288    166920    171024    179080      170872
> linpk       18680     22896     18688     22904     22896     22896     34920     34920     42640       50936
> mdbx        49492     57684     49508     57700     57692     57692     53604     53604     90232       78032
> nf          23880     32080     23888     27984     32088     32088     32104     32104     84072       67744
> protei      74960     79056     75048     79144     87144     87144     83128     83128     131976      115688
> rnflow      67704     79992     67712     80000     88248     88248     96152     96152     205584      176912
> test_f      50000     62296     50008     62304     70440     70440     78456     78456     179464      142608
> tfft        18568     18568     18576     18576     18416     18416     22544     22544     30680       34832
>
On Sat, Jun 01, 2013 at 06:45:48AM +0200, Duncan Sands wrote:
>
> These results are very disappointing, I was hoping to see a big improvement
> somewhere instead of no real improvement anywhere (except for gas_dyn) or a
> regression (eg: mdbx). I think LLVM now has a reasonable array of fast-math
> optimizations. I will try to find time to poke at gas_dyn and induct: since
> turning on gcc's optimizations there halve the run-time, LLVM's IR optimizers
> are clearly missing something important.
>
> Ciao, Duncan.

Duncan,
Appended is another set of benchmark runs where I attempted to decouple the
fast math optimizations from the vectorization by passing -fno-tree-vectorize.
I am unclear whether dragonegg really honors -fno-tree-vectorize to disable the llvm
vectorization.

Tested on x86_apple-darwin12

Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize

de-gfc48: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
gfortran48: /sw/bin/gfortran-fsf-4.8

Run time (secs)

Benchmark   de-gfc48  de-gfc48  gfortran48
                      +optzns

ac          11.33     8.10      8.02
aermod      16.03     14.45     16.13
air         6.80      5.28      5.73
capacita    39.89     35.21     34.96
channel     2.06      2.29      2.69
doduc       27.35     26.13     25.74
fatigue     8.83      4.82      4.67
gas_dyn     11.41     9.79      9.60
induct      23.95     21.75     21.14
linpk       15.49     15.48     15.69
mdbx        11.91     11.28     11.39
nf          29.92     29.57     27.99
protein     36.34     33.94     31.91
rnflow      25.97     25.27     22.78
test_fpu    11.48     10.91     9.64
tfft        1.92      1.91      1.91

Geom. Mean  13.12     11.70     11.64

Assuming that the de-gfc48+optzns run really has disabled the llvm vectorization,
I am hoping that additional benchmarking of de-gfc48+optzns with individual
-ffast-math optimizations disabled (such as passing -fno-unsafe-math-optimizations)
may give us a clue as to the origin of the performance delta between the stock
dragonegg results with -ffast-math and those with -fplugin-arg-dragonegg-enable-gcc-optzns.

Jack
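The "Geom. Mean" row summarizes each column as the geometric mean of its per-benchmark run times. The following is a minimal sketch of that calculation, assuming the row is a plain unweighted geometric mean; the function name `geometric_mean` and the dictionary name are illustrative and do not come from the benchmark harness. The values are the de-gfc48 column from the table above.

```python
import math


def geometric_mean(values):
    """Unweighted geometric mean, computed via logs to avoid overflow."""
    if not values:
        raise ValueError("geometric_mean requires at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# de-gfc48 run times (secs) with -fno-tree-vectorize, copied from the table.
DE_GFC48_SECS = {
    "ac": 11.33, "aermod": 16.03, "air": 6.80, "capacita": 39.89,
    "channel": 2.06, "doduc": 27.35, "fatigue": 8.83, "gas_dyn": 11.41,
    "induct": 23.95, "linpk": 15.49, "mdbx": 11.91, "nf": 29.92,
    "protein": 36.34, "rnflow": 25.97, "test_fpu": 11.48, "tfft": 1.92,
}

if __name__ == "__main__":
    mean_secs = geometric_mean(list(DE_GFC48_SECS.values()))
    print(f"de-gfc48 geometric mean: {mean_secs:.2f} secs")
```

The printed value can be compared against the Geom. Mean row. A geometric mean keeps a long-running benchmark such as capacita from dominating the summary the way it would in an arithmetic mean.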
On Sat, Jun 01, 2013 at 06:45:48AM +0200, Duncan Sands wrote:
>
> These results are very disappointing, I was hoping to see a big improvement
> somewhere instead of no real improvement anywhere (except for gas_dyn) or a
> regression (eg: mdbx). I think LLVM now has a reasonable array of fast-math
> optimizations. I will try to find time to poke at gas_dyn and induct: since
> turning on gcc's optimizations there halve the run-time, LLVM's IR optimizers
> are clearly missing something important.
>
> Ciao, Duncan.
>

Duncan,
In case it helps, I benchmarked disabling individual -ffast-math optimizations
(with partial results appended). The most important optimization for the benchmark
runtimes seems to be -funsafe-math-optimizations (as can be seen from the runtime
regression caused by -fno-unsafe-math-optimizations). Does llvm currently support
all of the features of FSF gcc's -funsafe-math-optimizations?

Jack

Tested on x86_apple-darwin12

Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize

de-gfc48: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
gfortran48: /sw/bin/gfortran-fsf-4.8
de-gfc48+nounsafe+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns -fno-unsafe-math-optimizations
de-gfc48+math-errno+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns -fmath-errno
de-gfc48+math-signans+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns -fsignaling-nans

Run time (secs)

Benchmark   de-gfc48  de-gfc48  gfortran48  de-gfc48+nounsafe  de-gfc48+math-errno  de-gfc48+math-signans
                      +optzns               +optzns            +optzns              +optzns

ac          11.33     8.10      8.02        9.20               8.10                 8.10
aermod      16.03     14.45     16.13       14.83              14.20                14.51
air         6.80      5.28      5.73        6.84               5.26                 5.31
capacita    39.89     35.21     34.96       36.72              35.21                35.51
channel     2.06      2.29      2.69        2.30               2.29                 2.30
doduc       27.35     26.13     25.74       29.90              26.42                26.99
fatigue     8.83      4.82      4.67        5.60               4.87                 4.82
gas_dyn     11.41     9.79      9.60        12.97              10.56                12.13
induct      23.95     21.75     21.14       22.34              21.39                21.91
linpk       15.49     15.48     15.69       15.49              15.49                15.52
mdbx        11.91     11.28     11.39       11.85              11.27                11.83
nf          29.92     29.57     27.99       29.67              29.67                29.47
protein     36.34     33.94     31.91       34.23              33.62                33.97
rnflow      25.97     25.27     22.78       27.99              28.00                28.00
test_fpu    11.48     10.91     9.64        10.95              10.94                10.93
tfft        1.92      1.91      1.91        1.91               1.90                 1.91

Geom. Mean  13.12     11.70     11.64       12.62              11.82                12.01
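One way to read the per-flag columns is to rank each configuration by how much its geometric-mean run time regresses against the de-gfc48+optzns baseline. The sketch below does that using only the Geom. Mean values from the table above; the helper name `slowdowns_vs_baseline` and the dictionary layout are assumptions made for illustration and are not part of any benchmark tooling.

```python
# Geom. Mean run times (secs) copied from the table; keys name the flag that
# was added on top of the de-gfc48+optzns configuration.
BASELINE = "de-gfc48+optzns"
GEOM_MEAN_SECS = {
    BASELINE: 11.70,
    "-fno-unsafe-math-optimizations": 12.62,
    "-fmath-errno": 11.82,
    "-fsignaling-nans": 12.01,
}


def slowdowns_vs_baseline(geom_means, baseline_key):
    """Return (flag, percent slowdown) pairs, largest slowdown first."""
    base = geom_means[baseline_key]
    pairs = [
        (flag, 100.0 * (secs - base) / base)
        for flag, secs in geom_means.items()
        if flag != baseline_key
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for flag, pct in slowdowns_vs_baseline(GEOM_MEAN_SECS, BASELINE):
        print(f"{flag}: {pct:+.1f}% vs {BASELINE}")
```

A ranking on the geometric mean hides per-benchmark behaviour (rnflow, for example, regresses by a similar amount under all three flags), so it is only a first-pass summary.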
Hi Jack, thanks for splitting out what the effects of LLVM's / GCC's vectorizers are.

On 01/06/13 21:34, Jack Howarth wrote:
> On Sat, Jun 01, 2013 at 06:45:48AM +0200, Duncan Sands wrote:
>>
>> These results are very disappointing, I was hoping to see a big improvement
>> somewhere instead of no real improvement anywhere (except for gas_dyn) or a
>> regression (eg: mdbx). I think LLVM now has a reasonable array of fast-math
>> optimizations. I will try to find time to poke at gas_dyn and induct: since
>> turning on gcc's optimizations there halve the run-time, LLVM's IR optimizers
>> are clearly missing something important.
>>
>> Ciao, Duncan.
>
> Duncan,
> Appended are another set of benchmark runs where I attempted to decouple the
> fast math optimizations from the vectorization by passing -fno-tree-vectorize.
> I am unclear if dragonegg really honors -fno-tree-vectorize to disable the llvm
> vectorization.

Yes, it does disable LLVM vectorization.

>
> Tested on x86_apple-darwin12
>
> Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize

Maybe -march=native would be a good addition.

>
> de-gfc48: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
> de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
> gfortran48: /sw/bin/gfortran-fsf-4.8
>
> Run time (secs)

What is the standard deviation for each benchmark? If each run varies by +-5%
then that means that the changes in runtime of around 3% measured below don't
mean anything.

Comparing with your previous benchmarks, I see:

>
> Benchmark   de-gfc48  de-gfc48  gfortran48
>                       +optzns
>
> ac          11.33     8.10      8.02

Turning on LLVM's vectorizer gives a 2% slowdown.

> aermod      16.03     14.45     16.13

Turning on LLVM's vectorizer gives a 2.5% slowdown.

> air         6.80      5.28      5.73
> capacita    39.89     35.21     34.96

Turning on LLVM's vectorizer gives a 5% speedup. GCC gets a 5.5% speedup from
its vectorizer.

> channel     2.06      2.29      2.69

GCC gets a 30% speedup from its vectorizer which LLVM doesn't get. On the
other hand, without vectorization LLVM's version runs 23% faster than GCC's, so
while GCC's vectorizer leaps GCC into the lead, the final speed difference is
more in the order of GCC 10% faster.

> doduc       27.35     26.13     25.74
> fatigue     8.83      4.82      4.67

GCC gets a 17% speedup from its vectorizer which LLVM doesn't get. This is a
good one to look at, because all the difference between GCC and LLVM is coming
from the mid-level optimizers: turning on GCC optzns in dragonegg speeds up the
program to GCC levels, so it is possible to get LLVM IR with and without the
effect of GCC optimizations, which should make it fairly easy to understand what
GCC is doing right here.

> gas_dyn     11.41     9.79      9.60

Turning on LLVM's vectorizer gives a 30% speedup. GCC gets a comparable speedup
from its vectorizer.

> induct      23.95     21.75     21.14

GCC gets a 40% speedup from its vectorizer which LLVM doesn't get. Like
fatigue, this is a case where we can get IR showing all the improvements that
the GCC optimizers made.

> linpk       15.49     15.48     15.69
> mdbx        11.91     11.28     11.39

Turning on LLVM's vectorizer gives a 2% slowdown.

> nf          29.92     29.57     27.99
> protein     36.34     33.94     31.91

Turning on LLVM's vectorizer gives a 3% speedup.

> rnflow      25.97     25.27     22.78

GCC gets a 7% speedup from its vectorizer which LLVM doesn't get.

> test_fpu    11.48     10.91     9.64

GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.

> tfft        1.92      1.91      1.91
>
> Geom. Mean  13.12     11.70     11.64

Ciao, Duncan.

>
> Assuming that the de-gfc48+optzns run really has disabled the llvm vectorization,
> I am hoping that additional benchmarking of de-gfc48+optzns with individual
> -ffast-math optimizations disabled (such as passing -fno-unsafe-math-optimizations)
> may give us a clue as to the origin of the performance delta between the stock
> dragonegg results with -ffast-math and those with -fplugin-arg-dragonegg-enable-gcc-optzns.
> Jack
>
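The standard-deviation question above can be made concrete with a short sketch: estimate run-to-run noise from repeated timings, then treat a runtime change as meaningful only when it exceeds that noise. The function names, the 5% figure (taken from the hypothetical "+-5%" in the message, not from any measurement), and the simple threshold rule are all assumptions for illustration; a proper analysis would use repeated runs of every configuration. The run times are the de-gfc48 values from the thread, without vectorization and with it (3.3 RC3).

```python
import statistics


def relative_noise_pct(run_times):
    """Sample standard deviation as a percentage of the mean run time."""
    return 100.0 * statistics.stdev(run_times) / statistics.mean(run_times)


def percent_change(baseline, candidate):
    """Signed runtime change relative to baseline; negative means faster."""
    return 100.0 * (candidate - baseline) / baseline


def is_meaningful(baseline, candidate, noise_pct):
    """Crude rule: the change counts only if it is larger than the noise."""
    return abs(percent_change(baseline, candidate)) > noise_pct


# de-gfc48 run times (secs) from the thread.
NO_VECTORIZE = {"ac": 11.33, "gas_dyn": 11.41}
VECTORIZE_RC3 = {"ac": 11.58, "gas_dyn": 7.94}
ASSUMED_NOISE_PCT = 5.0  # hypothetical value, not measured

if __name__ == "__main__":
    for name, base in NO_VECTORIZE.items():
        change = percent_change(base, VECTORIZE_RC3[name])
        verdict = is_meaningful(base, VECTORIZE_RC3[name], ASSUMED_NOISE_PCT)
        print(f"{name}: {change:+.1f}% meaningful={verdict}")
```

With real repeated timings, `relative_noise_pct` would replace the assumed constant for each benchmark.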