Hi Jack, thanks for splitting out the effects of LLVM's / GCC's vectorizers.

On 01/06/13 21:34, Jack Howarth wrote:
> On Sat, Jun 01, 2013 at 06:45:48AM +0200, Duncan Sands wrote:
>>
>> These results are very disappointing. I was hoping to see a big improvement
>> somewhere instead of no real improvement anywhere (except for gas_dyn) or a
>> regression (e.g. mdbx). I think LLVM now has a reasonable array of fast-math
>> optimizations. I will try to find time to poke at gas_dyn and induct: since
>> turning on gcc's optimizations there halves the run-time, LLVM's IR optimizers
>> are clearly missing something important.
>>
>> Ciao, Duncan.
>
> Duncan,
>    Appended are another set of benchmark runs where I attempted to decouple the
> fast-math optimizations from the vectorization by passing -fno-tree-vectorize.
> I am unclear if dragonegg really honors -fno-tree-vectorize to disable the llvm
> vectorization.

Yes, it does disable LLVM vectorization.

> Tested on x86_apple-darwin12
>
> Compile Flags: -ffast-math -funroll-loops -O3 -fno-tree-vectorize

Maybe -march=native would be a good addition.

> de-gfc48:        /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs
> de-gfc48+optzns: /sw/lib/gcc4.8/bin/gfortran -fplugin=/sw/lib/gcc4.8/lib/dragonegg.so -specs=/sw/lib/gcc4.8/lib/integrated-as.specs -fplugin-arg-dragonegg-enable-gcc-optzns
> gfortran48:      /sw/bin/gfortran-fsf-4.8
>
> Run time (secs)

What is the standard deviation for each benchmark? If each run varies by +-5%,
then the changes in runtime of around 3% measured below don't mean anything.

Comparing with your previous benchmarks, I see:

> Benchmark    de-gfc48   de-gfc48+optzns   gfortran48
>
> ac              11.33       8.10       8.02

Turning on LLVM's vectorizer gives a 2% slowdown.

> aermod          16.03      14.45      16.13

Turning on LLVM's vectorizer gives a 2.5% slowdown.

> air              6.80       5.28       5.73
> capacita        39.89      35.21      34.96

Turning on LLVM's vectorizer gives a 5% speedup. GCC gets a 5.5% speedup from
its vectorizer.

> channel          2.06       2.29       2.69

GCC gets a 30% speedup from its vectorizer which LLVM doesn't get. On the
other hand, without vectorization LLVM's version runs 23% faster than GCC's, so
while GCC's vectorizer puts GCC in the lead, the final speed difference is more
on the order of GCC being 10% faster.

> doduc           27.35      26.13      25.74
> fatigue          8.83       4.82       4.67

GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.
This is a good one to look at, because all the difference between GCC
and LLVM comes from the mid-level optimizers: turning on GCC optzns
in dragonegg speeds the program up to GCC levels, so it is possible to
get LLVM IR with and without the effect of GCC's optimizations, which should
make it fairly easy to understand what GCC is doing right here.

> gas_dyn         11.41       9.79       9.60

Turning on LLVM's vectorizer gives a 30% speedup. GCC gets a comparable
speedup from its vectorizer.

> induct          23.95      21.75      21.14

GCC gets a 40% speedup from its vectorizer which LLVM doesn't get. Like
fatigue, this is a case where we can get IR showing all the improvements that
the GCC optimizers made.

> linpk           15.49      15.48      15.69
> mdbx            11.91      11.28      11.39

Turning on LLVM's vectorizer gives a 2% slowdown.

> nf              29.92      29.57      27.99
> protein         36.34      33.94      31.91

Turning on LLVM's vectorizer gives a 3% speedup.

> rnflow          25.97      25.27      22.78

GCC gets a 7% speedup from its vectorizer which LLVM doesn't get.

> test_fpu        11.48      10.91       9.64

GCC gets a 17% speedup from its vectorizer which LLVM doesn't get.

> tfft             1.92       1.91       1.91
>
> Geom. Mean      13.12      11.70      11.64

Ciao, Duncan.

> Assuming that the de-gfc48+optzns run really has disabled the llvm vectorization,
> I am hoping that additional benchmarking of de-gfc48+optzns with individual
> -ffast-math optimizations disabled (such as passing -fno-unsafe-math-optimizations)
> may give us a clue as to the origin of the performance delta between the stock
> dragonegg results with -ffast-math and those with -fplugin-arg-dragonegg-enable-gcc-optzns.
>              Jack
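[As a side note on the summary row: the "Geom. Mean" figures are geometric means of the per-benchmark run times, which weight relative changes equally across benchmarks regardless of each benchmark's absolute run time. A minimal sketch of the computation — illustrative only, not from the thread:]

```python
def geom_mean(times):
    """Geometric mean: the n-th root of the product of n run times."""
    product = 1.0
    for t in times:
        product *= t
    return product ** (1.0 / len(times))

# With run times of 2s and 8s the geometric mean is 4s, whereas the
# arithmetic mean would be 5s: a benchmark that halves its run time
# pulls the summary down by the same factor whether it ran for
# seconds or minutes.
print(geom_mean([2.0, 8.0]))  # 4.0
```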
Jack,

Can you please file a bug report and attach the BC files for the major loops
that we miss?

Thanks,
Nadav

On Jun 2, 2013, at 1:27, Duncan Sands <duncan.sands at gmail.com> wrote:
> Hi Jack, thanks for splitting out the effects of LLVM's / GCC's vectorizers.
> [...]
Actually this kind of opportunity, as outlined below, was one of my contrived
motivating examples for fast-math. But last year we didn't see such
opportunities in the real applications we care about.

    t1 = x1/y
    ...
    t2 = x2/y

I think this is better taken care of by GVN/PRE -- blindly converting
x/y => x * (1/y) is not necessarily beneficial. Or maybe we can blindly
perform the transformation at an early stage, and later convert it back
if the reciprocals are not CSEed away.

On 6/3/13 8:53 AM, Duncan Sands wrote:
> Hi Nadav,
> [...]
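[Shuxin's caution can be made concrete: x / y is rounded once, while x * (1/y) rounds the reciprocal and then the product, so the two can differ in the last bit. That is why the rewrite is only legal under fast-math, and why it only pays off when the reciprocal is actually shared. An illustrative sketch of the numerics, not from the thread:]

```python
# A single division is correctly rounded; going through the reciprocal
# introduces a second rounding step, so results can differ by one ulp.
def div_direct(x, y):
    return x / y

def div_via_recip(x, y):
    return x * (1.0 / y)

# y / y is exactly 1.0, but y * (1/y) is not always: some denominators
# survive the reciprocal round-trip and some do not.
mismatches = [y for y in range(1, 100)
              if div_direct(float(y), float(y)) != div_via_recip(float(y), float(y))]
print(len(mismatches) > 0)
```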
[Resending without the bitcode attached, which was too big for the mailing list.]

Hi Nadav,

On 02/06/13 19:08, Nadav Rotem wrote:
> Jack,
>
> Can you please file a bug report and attach the BC files for the major loops
> that we miss?

I took a look and it's not clear what vectorization has to do with it; it seems
to be a missed fast-math optimization. I've attached bitcode where only LLVM
optimizations are run (fatigue0.ll) and where GCC optimizations are run before
LLVM optimizations (fatigue1.ll). The hottest instruction is the same in both:

fatigue0.ll:
    %329 = fsub fast double %327, %328, !dbg !1077

fatigue1.ll:
    %1504 = fsub fast double %1501, %1503, !dbg !1148

However, in the GCC version it is twice as hot as in the LLVM-only version,
i.e. in the LLVM-only version instructions elsewhere are consuming a lot of
time. In the LLVM-only version there are 9 fdiv instructions in that basic
block while GCC has only one. From the profile it looks like each of them is
consuming quite some time, and together they chew up a lot of time. I think
this explains the speed difference.

All of the fdivs have the same denominator:
    %260 = fdiv fast double %253, %259
    ...
    %262 = fdiv fast double %219, %259
    ...
    %264 = fdiv fast double %224, %259
    ...
    %266 = fdiv fast double %230, %259
and so on. It looks like GCC takes the reciprocal
    %1445 = fdiv fast double 1.000000e+00, %1439
and then turns the fdivs into fmuls.

I'm not sure what the best way to implement this optimization in LLVM is.
Maybe Shuxin has some ideas.

So it looks like a missed fast-math optimization rather than anything to do
with vectorization, which is strange as GCC only gets the big speedup when
vectorization is turned on.

Ciao, Duncan.
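[The rewrite Duncan describes GCC performing — one reciprocal, then multiplications — can be sketched at source level. This is an illustrative Python sketch of the shape of the transformation only; the real pass operates on LLVM/GIMPLE IR:]

```python
# Before: one (expensive) division per use of the shared denominator y.
def scale_each(xs, y):
    return [x / y for x in xs]

# After the fast-math rewrite: a single division computes 1/y, and
# every use becomes a (much cheaper) multiplication.
def scale_each_recip(xs, y):
    r = 1.0 / y
    return [x * r for x in xs]

print(scale_each([2.0, 4.0, 6.0], 2.0))        # [1.0, 2.0, 3.0]
print(scale_each_recip([2.0, 4.0, 6.0], 2.0))  # identical here, since 1/2 is exact
```

For general denominators the two versions can differ in the last bit, which is why the transformation is gated on fast-math.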
Hi Shuxin,

On 03/06/13 19:12, Shuxin Yang wrote:
> Actually this kind of opportunity, as outlined below, was one of my contrived
> motivating examples for fast-math. But last year we didn't see such
> opportunities in the real applications we care about.
>
>     t1 = x1/y
>     ...
>     t2 = x2/y
>
> I think this is better taken care of by GVN/PRE -- blindly converting
> x/y => x * (1/y) is not necessarily beneficial. Or maybe we can blindly
> perform the transformation at an early stage, and later convert it back
> if the reciprocals are not CSEed away.

I've opened PR16218 to track this.

Ciao, Duncan.
Possibly Parallel Threads
- [LLVMdev] Polyhedron 2005 results for dragonegg 3.3svn