On Sat, 19 Jun 2004, [koi8-r] "Valery A.Khamenya[koi8-r] " wrote:
> i took a look into LLVM benchmarks from nightly tester and
> ran Shootout tests on my own. Below go just few outlines.
>
> 1. results on my AMD AthlonXP and Xeon used by LLVM
> team are different sometime. In particular, both Shootout
> and Shootout-C++ show great speed up with LLVM (in
> comparison to GCC) on ackerman test on my AthlonXP.
> But here:
> llvm.cs.uiuc.edu/testresults/X86
> GCC is reported to be quicker on
> Benchmarks/Shootout-C++/ackermann test.
This test seems PARTICULARLY noisy for some reason. Take a look at this:
[zion Shootout-C++]$ time Output/ackermann.native
Ack(3,12): 32765
3.130u 0.010s 0:03.14 100.0% 0+0k 0+0io 196pf+0w
[zion Shootout-C++]$ time Output/ackermann.native
Ack(3,12): 32765
3.280u 0.000s 0:03.28 100.0% 0+0k 0+0io 196pf+0w
[zion Shootout-C++]$ time Output/ackermann.native
Ack(3,12): 32765
3.210u 0.000s 0:03.21 100.0% 0+0k 0+0io 196pf+0w
[zion Shootout-C++]$ time Output/ackermann.native
Ack(3,12): 32765
3.340u 0.010s 0:03.34 100.2% 0+0k 0+0io 196pf+0w
[zion Shootout-C++]$ time Output/ackermann.llc-ls
Ack(3,12): 32765
3.890u 0.000s 0:03.89 100.0% 0+0k 0+0io 120pf+0w
[zion Shootout-C++]$ time Output/ackermann.llc-ls
Ack(3,12): 32765
3.790u 0.000s 0:03.78 100.2% 0+0k 0+0io 120pf+0w
[zion Shootout-C++]$ time Output/ackermann.llc-ls
Ack(3,12): 32765
4.000u 0.010s 0:04.00 100.2% 0+0k 0+0io 120pf+0w
[zion Shootout-C++]$ time Output/ackermann.llc-ls
Ack(3,12): 32765
3.830u 0.000s 0:03.83 100.0% 0+0k 0+0io 120pf+0w
... the native version swings from 3.13-3.34s and the llc-ls version
swings from 3.83-4.0s. The C backend is also noisy, swinging from
3.98-4.58s. FWIW, these tests are on a Intel P4 Xeon 3.06Ghz machine with
HT enabled.
In any case, it appears that we're slower than GCC on this one. If you
wanted to dig into the test to see what is going wrong (is the LLVM code
missing an optimization, or is it bad code generation?), that would be
quite helpful.
> 2. the tests based (e.g. ackerman,fibonacci) on function
> calls are tend to be quicker with LLVM then with GCC
> :)
Yeah I guess so! A 31x speedup on ary (66s -> 2.11 for llc-ls), 10x
speedup on ary2, and 4.8x speedup on except is not bad at all. :) We
tend to do pretty well on C++ programs though.
> 3. some generic tests (like nested loops) are slower with LLVM then GCC.
> Could anyone check why Benchmarks/Shootout*/nestedloop is so slow with
> LLVM? Nested loops is a quite generic construction, so if this test
> will be boosted then it might influence very much other test too.
You're right, this one is pretty bad. Looking at the (most) inner loop in
the "benchmark":
for (f=0; f<n; f++)
x++;
It looks like we're generating this machine code:
.LBBmain_10: # no_exit.5
mov %EDX, %EBX
mov %EBP, %EDX
inc %EBP
mov %EBX, %EDX
inc %EBX
cmp %EBP, %ESI
jl .LBBmain_10 # no_exit.5
The code looks somewhat reasonable except for the three extra copies in
it. GCC destroys us on this loop though, producing:
.L30:
incl %eax
decl %edx
jne .L30
There are two things we need to get this nice code:
1. The register allocator needs a better coallescing heuristic or a
better PHI elimination pass to help it. Alkis is the one to talk to
about this, but he's out of town (should be back soon?). This is a
serious problem though, and needs to be addressed before 1.3.
This problem is far reaching (not just limited to this testcase), and
is a good example of how a (toy) microbenchmark like this can expose a
real problem.
2. We need to implement a loop indvar reversal pass that counts down with
the loop induction variable instead of up. That will allow the
instruction selector to branch on the condition codes produced by the
'dec' instruction (eliminating the cmp instruction), and allow usage
of hardware support on PPC and other targets. This optimization is
related to loop strength reduction which we also do not have yet. Note
that we do have all of the infrastructure needed to implement this, if
anyone is interested in doing so.
In this particular loop, we could do much better by just compiling the
loop into:
x += n;
Doing so would not be hard at all. All we have to do is teach the trip
count analysis code that a loop like 'for (f=0; f<n; f++)' executes
'n'
times and existing optimizations will do the hard work (this will actually
end up turning this benchmark into a constant time operation). Currently
the trip count computation code only handles loops that execute a constant
number of iterations.
This has been on my todo list for a long time, but I've been busy with
many other things. :) I'd much rather get at least problem #1 addressed
first in any case.
Thanks for looking into this. We have (obviously) spent very limited time
looking into performance issues, it's great that you're able to do so!
-Chris
--
llvm.cs.uiuc.edu
nondot.org/~sabre/Projects