On Oct 8, 2011, at 12:05 PM, Duncan Sands wrote:

> PS: With -fplugin-arg-dragonegg-enable-gcc-optzns the LLVM optimizers are run at
> the following levels:
>
>   Command line option    LLVM optimizers run at
>   -------------------    ----------------------
>   -O1                    tiny amount of optimization
>   -O2 or -O3             -O1
>   -O4 or -O5             -O2
>   -O6 or better          -O3

Hi Duncan,

Out of curiosity, why do you follow this approach? People generally use -O2 or -O3. I'd recommend switching dragonegg to line those up with whatever you want people to use.

-Chris
Hi Chris,

>> PS: With -fplugin-arg-dragonegg-enable-gcc-optzns the LLVM optimizers are run at
>> the following levels:
>>
>>   Command line option    LLVM optimizers run at
>>   -------------------    ----------------------
>>   -O1                    tiny amount of optimization
>>   -O2 or -O3             -O1
>>   -O4 or -O5             -O2
>>   -O6 or better          -O3
>
> Hi Duncan,
>
> Out of curiosity, why do you follow this approach? People generally use -O2 or -O3. I'd recommend switching dragonegg to line those up with whatever you want people to use.

Note that this is done only when the GCC optimizers are run. The basic observation is that running the LLVM optimizers at -O3 after running the GCC optimizers (at -O3) results in slower code! I mean slower than what you get by running the LLVM optimizers at -O1 or -O2. I haven't found time to analyse this curiosity yet. It might simply be that the LLVM inlining level is too high given that inlining has already been done by GCC. Anyway, I didn't want to run LLVM at -O3 because of this.

The next question was: which is better, LLVM at -O1 or at -O2? My first experiments showed that code quality was essentially the same. Since -O1 gives a nice compile-time speedup, I settled on using -O1. Also, -O1 makes some sense if the GCC optimizers did a good job and all that is needed is to clean up the mess that converting to LLVM IR can produce. However, later experiments showed that -O2 does seem to consistently result in slightly better code, so I've been thinking of using -O2 instead. This is one reason I encouraged Jack to use -O4 in his benchmarks (i.e. GCC at -O3, LLVM at -O2): to see if they show the same thing.

Ciao, Duncan.

PS: Dragonegg is a nice platform for understanding what the GCC optimizers do better than LLVM. It's a pity no one seems to have used it for this.
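For readers skimming the thread, the level mapping Duncan describes can be summarised as a small C sketch. This is illustrative only, not dragonegg's actual source: the function name llvm_opt_level is invented, and returning 0 for the "tiny amount of optimization" case is just a stand-in for whatever dragonegg really does there.

    /* Sketch (not real dragonegg code): map the GCC -O level given on the
     * command line to the LLVM optimization level dragonegg runs when
     * -fplugin-arg-dragonegg-enable-gcc-optzns is in effect.
     * Assumes -O1 or higher was passed. */
    static int llvm_opt_level(int gcc_opt_level)
    {
        if (gcc_opt_level == 1)
            return 0;   /* -O1: only a tiny amount of LLVM optimization */
        if (gcc_opt_level <= 3)
            return 1;   /* -O2 or -O3: LLVM optimizers at -O1 */
        if (gcc_opt_level <= 5)
            return 2;   /* -O4 or -O5: LLVM optimizers at -O2 */
        return 3;       /* -O6 or better: LLVM optimizers at -O3 */
    }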
On Wed, Oct 12, 2011 at 12:40 AM, Duncan Sands wrote:
> The basic
> observation is that running the LLVM optimizers at -O3 after running the
> GCC optimizers (at -O3) results in slower code! I mean slower than what
> you get by running the LLVM optimizers at -O1 or -O2. I didn't find time
> to analyse this curiosity yet. It might simply be that the LLVM inlining
> level is too high given that inlining has already been done by GCC. Anyway,
> I didn't want to run LLVM at -O3 because of this.

If you inline too much you will get slower code, because you make poorer use of the instruction cache in most modern processors.

C99 and C++ allow one to declare functions inline at the point that they are declared. For earlier C standards, I believe GCC has an attribute that allows one to mark a function inline at its declaration as a language extension. Lots of other languages do inlining; for example, I understand Java JITs will inline JIT-compiled native code even though the Java language itself doesn't support inlining.

For modern processors with code caches it would be better to inline functions at the point they are used rather than where they are declared. That way one has the choice of better cache usage or avoiding function call overhead. For example:

    int foo( float bar );

    int baz( void )
    {
        return foo( 3 ) inline;  // This call will be fast
    }

    int boo( void )
    {
        return foo( 5 );  // This will make a hot spot at foo's definition
    }

Profile-guided optimizations could take care of this without needing any language extensions. I understand that that is what the Java HotSpot JIT does.

Don Quixote
--
Don Quixote de la Mancha
Dulcinea Technologies Corporation
Software of Elegance and Beauty
http://www.dulcineatech.com
quixote at dulcineatech.com
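As an aside, the per-call-site choice proposed above can be roughly approximated with existing GCC function attributes by exposing the same body through a forced-inline entry point and a forced-out-of-line one, and letting each caller pick. A minimal sketch, assuming GCC's always_inline and noinline attributes; the names foo_inline and foo_outline (and the body) are invented for illustration:

    /* Sketch only: two entry points to the same body, so each call site can
     * choose between inlining (no call overhead on hot paths) and a single
     * shared out-of-line copy (one "hot spot", smaller cache footprint). */
    static inline __attribute__((always_inline)) int foo_inline(float bar)
    {
        return (int)(bar * 2.0f);     /* stand-in for foo's real work */
    }

    __attribute__((noinline)) int foo_outline(float bar)
    {
        return foo_inline(bar);       /* the one out-of-line copy */
    }

    int baz(void)
    {
        return foo_inline(3);         /* body inlined here */
    }

    int boo(void)
    {
        return foo_outline(5);        /* shares foo_outline's code */
    }

This is only an approximation of a true call-site inline hint, and profile-guided inlining, as mentioned above, remains the less intrusive way to get the same effect.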
On Wed, Oct 12, 2011 at 09:40:53AM +0200, Duncan Sands wrote:
> Hi Chris,
>
>>> PS: With -fplugin-arg-dragonegg-enable-gcc-optzns the LLVM optimizers are run at
>>> the following levels:
>>>
>>>   Command line option    LLVM optimizers run at
>>>   -------------------    ----------------------
>>>   -O1                    tiny amount of optimization
>>>   -O2 or -O3             -O1
>>>   -O4 or -O5             -O2
>>>   -O6 or better          -O3
>>
>> Hi Duncan,
>>
>> Out of curiosity, why do you follow this approach? People generally use -O2 or -O3. I'd recommend switching dragonegg to line those up with whatever you want people to use.
>
> note that this is done only when the GCC optimizers are run. The basic
> observation is that running the LLVM optimizers at -O3 after running the
> GCC optimizers (at -O3) results in slower code! I mean slower than what
> you get by running the LLVM optimizers at -O1 or -O2. I didn't find time
> to analyse this curiosity yet. It might simply be that the LLVM inlining
> level is too high given that inlining has already been done by GCC. Anyway,
> I didn't want to run LLVM at -O3 because of this. The next question was:
> what is better: LLVM at -O1 or at -O2? My first experiments showed that
> code quality was essentially the same. Since at -O1 you get a nice compile
> time speedup, I settled on using -O1. Also -O1 makes some sense if the GCC
> optimizers did a good job and all that is needed is to clean up the mess that
> converting to LLVM IR can produce. However later experiments showed that -O2
> does seem to consistently result in slightly better code, so I've been thinking
> of using -O2 instead. This is one reason I encouraged Jack to use -O4 in his
> benchmarks (i.e. GCC at -O3, LLVM at -O2) - to see if they show the same thing.

Duncan,

My preliminary runs of the pb05 benchmarks at -O4, -O5 and -O6 using -fplugin-arg-dragonegg-enable-gcc-optzns didn't show any significant run time performance changes compared to -fplugin-arg-dragonegg-enable-gcc-optzns -O3. I'll rerun those and post the tabulated results this weekend. I am using -ffast-math -funroll-loops as well in the optimization flags. Perhaps I should repeat the benchmarks without those flags.

IMHO, the more important thing is to fish out the remaining regressions in the llvm vectorization code by defaulting -fplugin-arg-dragonegg-enable-gcc-optzns on in dragonegg svn once llvm 3.0 has branched. Hopefully this will get us wider testing of the llvm vectorization support and some additional smaller test cases that expose the remaining bugs in that code.

Jack

> Ciao, Duncan.
>
> PS: Dragonegg is a nice platform for understanding what the GCC optimizers
> do better than LLVM. It's a pity no-one seems to have used it for this.
The Polyhedron 2005 benchmark results for dragonegg svn at r141775 using FSF gcc 4.6.2svn measured on x86_64-apple-darwin11 are listed below. The benchmarks used the optimization flags...

a) gfortran-fsf-4.6 -msse4 -ffast-math -funroll-loops -O3 %n.f90 -o %n
b) de-gfortran46 -msse4 -ffast-math -funroll-loops -O3 %n.f90 -o %n
c) de-gfortran46 -msse4 -ffast-math -funroll-loops -O3 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n
d) de-gfortran46 -msse4 -ffast-math -funroll-loops -O4 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n
e) de-gfortran46 -msse4 -ffast-math -funroll-loops -O5 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n
f) de-gfortran46 -msse4 -ffast-math -funroll-loops -O6 -fplugin-arg-dragonegg-enable-gcc-optzns %n.f90 -o %n

and no runtime regressions are observed in any of the cases.

                                Run time (seconds)
Benchmark    gfortran  dragonegg  de+optnz  de+optnz+O4  de+optnz+O5  de+optnz+O6
----------------------------------------------------------------------------------
ac               8.84      10.80      8.90         8.90         8.90         8.90
aermod          17.65      15.98     14.55        14.53        14.52        14.47
air              5.50       7.11      6.62         6.62         6.61         6.81
capacita        32.56      41.44     36.48        36.50        36.49        36.60
channel          1.84       2.53      2.06         2.06         2.07         2.06
doduc           26.66      30.30     27.75        28.08        28.08        28.19
fatigue          8.47       9.14      8.36         8.21         8.24         8.08
gas_dyn          4.27      11.75      4.44         4.44         4.44         4.45
induct          13.09      24.02     12.20        12.17        12.14        12.25
linpk           15.46      15.56     15.75        15.76        15.76        15.75
mdbx            11.21      12.20     11.84        11.85        11.85        11.85
nf              27.85      28.71     29.31        29.30        29.31        29.24
protein         33.43      39.10     37.44        37.49        37.50        37.44
rnflow          24.02      31.95     26.44        26.51        26.51        26.46
test_fpu         8.05      11.52      9.39         9.37         9.38         9.39
tfft             1.87       1.91      1.93         1.93         1.93         1.93

mean time       10.87      13.68     11.38        11.38        11.38        11.39

                              Compile time (seconds)
Benchmark    gfortran  dragonegg  de+optnz  de+optnz+O4  de+optnz+O5  de+optnz+O6
--------------------------------------------------------------------------------
ac               1.17       0.30      0.60         0.62         0.62         0.62
aermod          44.13      25.67     32.26        32.80        32.78        33.06
air              2.22       1.02      1.48         1.49         1.49         1.54
capacita         1.77       0.49      0.92         0.93         0.93         0.96
channel          0.62       0.23      0.40         0.41         0.41         0.41
doduc            5.34       1.61      3.16         3.22         3.22         3.27
fatigue          1.76       0.89      1.20         1.21         1.21         1.26
gas_dyn          3.02       0.65      1.18         1.22         1.21         1.24
induct           4.01       1.71      2.68         2.80         2.68         2.78
linpk            0.78       0.21      0.46         0.47         0.47         0.54
mdbx             1.85       0.68      1.19         1.20         1.20         1.22
nf               1.83       0.34      0.78         0.76         0.76         0.77
protein          4.01       0.99      1.78         1.77         1.77         1.78
rnflow           5.51       1.30      2.63         2.63         2.65         2.68
test_fpu         4.38       1.00      2.10         2.10         2.11         2.14
tfft             0.56       0.19      0.32         0.33         0.33         0.33

                                Code Size (bytes)
Benchmark    gfortran  dragonegg  de+optnz  de+optnz+O4  de+optnz+O5  de+optnz+O6
--------------------------------------------------------------------------------
ac              50968      26736     39120        39120        39120        39120
aermod        1265640    1035724   1051600      1050504      1050504      1066424
air             73988      61908     53740        53740        53740        57884
capacita        78000      41416     45584        45552        45552        49648
channel         34784      22696     26792        26792        26792        26792
doduc          197240     124408    141144       140856       140856       140648
fatigue         86080      69824     69984        69840        69840        73936
gas_dyn        119744      59112     67488        67384        67384        67360
induct         174976     171344    167344       167344       167344       171320
linpk           38648      18872     27080        27056        27056        26840
mdbx            82112      53692     61980        57884        57884        61916
nf              75896      23992     36200        36200        36200        36200
protein        132040      75032     87208        87208        87208        87208
rnflow         181120      76024    100712       100608       100608       100488
test_fpu       155072      58368     82752        78632        78632        82632
tfft            30768      18640     18488        18488        18488        18488