thr3ads.net - llvm dev - [llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try! [May 2017]

If this information is useful, please help other people find it:
Share via:

Kristof Beyls via llvm-dev

2017-May-09 10:41 UTC

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

Great Quentin :).

I've rerun the benchmarks comparing "-O0 -g" with "-O0 -g
-mllvm -global-isel -mllvm -global-isel-abort=2" on r302453, on AArch64
Cortex-A57.
I indeed see almost no moves between GPR and FPR registers anymore (see details
below for where I still see some).
On geomean, I see 13% slow down (down from 17% on my previous run).
On geomean, code size increase is about 3% (down from 11% on my previous run).
I also checked debug info quality with the DIVA tool on the test-suite; and
asked one of our DS-5 validation guys to test debug experience for GlobalISel in
combination with the ARM DS-5 Development
Studio<https://developer.arm.com/products/software-development-tools/ds-5-development-studio>
debugger. Both of us didn't find any issues with the debug info produced at
-O0 -g with GlobalISel.
I didn't check compile time, nor do I intend to check that as I don't
have a good infrastructure set up for that.

Overall, my impression is that GlobalISel is ready to be enabled by default for
AArch64 at -O0, if others also believe it's ready.

Thanks,

Kristof

Detailed observations:

1. Debug info analysis with
DIVA<https://github.com/SNSystems/DIVA/blob/master/linux/doc/DIVA_user_guide.pdf>:
I used DIVA options "-a --show-generated --show-level --show-location"
to dump and diff debug info differences for the test-suite for "-O0
-g" comparing GlobalISel with non-GlobalISel.
The only difference I found was on MultiSource/Benchmarks/tramp3d-v4:
001 37577 {Function} "operator*<double>" ->
"Zero<double>"
- No declaration
- Template
002 {TemplateParameter} "T" <- "double"
- 002 37577 {Parameter} <- "Zero<double>"
002 37577 {Parameter} <- "const double &"
Which indicates that for non-GlobalISel, the Zero<double> parameter
isn't present in the debug info for the instantiation of the below template
function with T=double:
template<class T>
inline Zero<T> operator*(Zero<T>, const T&) { return
Zero<T>(); }
In other words, the debug info looks ever so slightly better with GlobalISel.
2. Quick analysis of reasons for slow down for the 10 programs that regress
the most when enabling GlobalISel at -O0:
* SingleSource/Benchmarks/BenchmarkGame/nsieve-bits (114% slowdown): no
conversion of divide by power-of-2 to shift right. I think this is an
improvement for -O0: no such optimization should be done when not requesting
optimizations.
* MultiSource/Benchmarks/MallocBench/gs/gs (88%): "interp"
function: switch statement lowered as cascaded-sequence-of-conditional-branches.
Same issue causes MultiSource/Applications/sqlite3/sqlite3 (71%).
Same issue causes MultiSource/Applications/lua/lua (46%).
* SingleSource/Benchmarks/Misc/flops-2 (75%): Poor lowering of fneg:
* FastISel:
ldur d0, [x29,#-16]
fneg d0, d0
stur d0, [x29,#-16]
* GlobalISel:
ldur d0, [x29,#-64]
orr x8, xzr, #0x8000000000000000
fmov d1, x8
fsub d0, d1, d0
fmov x8, d0
stur x8, [x29,#-64]
* MultiSource/Benchmarks/Prolangs-C++/city/city (74%): a call to memcpy
for copying 4 bytes is present with GlobalISel that isn't present with
FastISel, in function vehicle::select_move().
Same issue causes SingleSource/Benchmarks/Shootout-C++/Shootout-C++-moments
(58.5%)
* MultiSource/Applications/siod/siod (72%): Seems to be mainly due to
loading constants in the entry BB, but I'm not sure that indeed is the
biggest cause. (https://bugs.llvm.org//show_bug.cgi?id=32561)
* MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset (48%):
Due to creating all constants in the entry BB, see function CopyObject.
(https://bugs.llvm.org//show_bug.cgi?id=32561)
* MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode (46%):
Function Reference_IDCT: Probably due to creating all constants in the entry BB
+ spilling floating point data through an X register:
* FastISel:
fadd d0, d1, d0
str d0, [sp,#528]
* GlobalISel:
fadd d0, d1, d0
fmov x9, d0
stur x9, [x29,#-48]
3. Other overall impression when comparing code generation: The code size
increase is probably mainly due to creating constants always in the entry BB of
the function. For functions with many constants, this leads to lots and lots of
constant creation and then immediately spilling it to the stack.
(https://bugs.llvm.org//show_bug.cgi?id=32561)

On 8 May 2017, at 20:38, Quentin Colombet <qcolombet at
apple.com<mailto:qcolombet at apple.com>> wrote:

Hi Kristof,

I made a fix for PR32550, that is much less involved in r302453.

In particular, it does not rely on the greedy mode of the regbankselect pass,
thus there is no compile time implication.

Could you try it and check where we stand?

Thanks,
-Quentin
On Apr 27, 2017, at 10:40 AM, Quentin Colombet via llvm-dev <llvm-dev at
lists.llvm.org<mailto:llvm-dev at lists.llvm.org>> wrote:

Hi Kristof,

On Apr 27, 2017, at 9:47 AM, Kristof Beyls <kristof.beyls at
arm.com<mailto:kristof.beyls at arm.com>> wrote:

Hi Quentin,

On 27 Apr 2017, at 00:48, Quentin Colombet <qcolombet at
apple.com<mailto:qcolombet at apple.com>> wrote:

Hi Kristof,

On Apr 6, 2017, at 6:53 AM, Kristof Beyls <kristof.beyls at
arm.com<mailto:kristof.beyls at arm.com>> wrote:

I've been digging a little bit deeper into the biggest performance
regressions I've observed.

What I've observed so far is:
* A lot of the biggest regressions are caused by unnecessarily moving floating
point values through general purpose registers. I've raised
http://bugs.llvm.org/show_bug.cgi?id=32550 for this. I think this one definitely
needs fixing before enabling GlobalISel by default at -O0.

I commented in the PR. This is a known problem and we have a solution. Given
this is an optimization in the sense that it does not affect the correctness of
the program, we didn’t push for fixing it now.

For O0 we wanted to focus ourselves on generating correct code. Unless the
regressions you are seeing are preventing debugging/running of the program, I
wouldn’t block the flip of the switch on that.

What do you think?

For O0, I think most users care about fast compile time and an excellent debug
experience.
Next to that, I am told that some users also use -O0 to have a straightforward,
simple, mapping between source code and the generated assembly.

Out of the issues I've seen so far, I think this is the single
"optimization" issue that I feel gives a negative first impression of
GlobalISel.
If users look at the generated assembly for floating point code it looks more
like active de-optimization rather than "no optimization".
My guess is also that this may be the biggest reason for the about 20%
performance slow-down and 11% code size increase I measured earlier.
OTOH, I clearly agree this is an optimization issue, not a correctness issue.
Combining the above, I think it would be best if this issue got fixed before we
turn on GlobalISel by default at -O0, or we may end up hearing quite a few
(non-critical) complaints about this from users.
Basically I think this is a tradeoff between giving a better first impression of
GlobalISel vs getting more people to use and test it earlier.

Thanks for the write-up on the PR, that is very useful.
Do you have any rough idea how much effort would be involved in getting this
fixed?

I’d say 2-3 weeks. Could actually be shorter if we don’t do all the refactoring
I have in mind to make the table for the alternative mappings smaller, but I
don’t think it is worth taking any shortcut here.

I got the impression Daniel is making good progress on the tablegen generation,
which is key to getting this issue fixed?

To be accurate, Daniel’s work on table gen is for the select phase, not
regbankselect. In other words, right now, all the table for mappings for
regbankselect are hand written and Daniel’s work is not changing that. The
(weak) rationale is that it is not on the critical path :).

I think that no matter whether we decide to switch the default before or after
fixing this issue, this is one of the most urgent issues to fix as far as I can
see.

Agree. This is one of the top items of my todo list.

If there is some way I can help contribute to fixing PR32550, I would like to
help; but with the dependency on tablegen generation, I'm not sure what the
best way is to help make that PR get fixed faster?

Thanks for offering to help. I agree that with the dependency on the tabelgen
generation in the way, this is probably not a good use of your time. Depending
when I can spare some time on this, I’ll either do it or explain the refactoring
in the google doc.

In the meantime, I’d suggest to focus on validating the debug info on your side.

* FastISel seems to transform division-by-constant-power-of-2 into right shift
(see
https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468).
GlobalISel does not. It seems to me that at -O0 there may be reasons not perform
this transformation, but maybe there is a good reason why FastISel does this?

I think FastISel tries to generate the best code it can no matter what. For
GISel O0 however, not doing this optimization sounds sensible to me.
Now, I would say that the same remark as the previous bullet point apply: we
shouldn’t do it unless it gets in the way of running/debugging the program.

I agree that these optimizations should not be done at -O0. I think not doing
them is actually an improvement: you give the user what they asked, i.e.
"no optimization", and an as-straigthforward-as-possible mapping from
source to assembly.

* FastISel doesn’t\ seem to handle functions with switch statements, so it falls
back to DAGISel. DAGISel produces code that's a lot better than GlobalISel
for switch statement at -O0. I'm not sure if we need to do something here
before enabling GlobalISel by default. I'm thinking we may need to add a
smarter way to lower switch statements rather than just a cascaded sequence of
conditional branches.

Sounds optimization-ish to me. Same remark.

Agreed, optimization-ish. In comparison to the above point on FastISel
"peepholes", I think that lowering switch statements to something else
than a cascaded sequence of conditional branches doesn't make the generated
code harder to map to the source. So, in comparison to the above point on
FastISel "peepholes", I'm not actively against being smarter about
lowering switch statements at -O0.
But I agree, this shouldn't hold up turning on GlobalISel by default at -O0.

Other than the above remarks, before turning on GlobalISel by default, we'd
better test/verify that debug info quality remains good.
I haven't looked into that at all, but am hoping to start looking into that
soon, with the help of the DIVA tool (https://github.com/SNSystems/DIVA)
presented at last EuroLLVM
(http://llvm.org/devmtg/2017-03//assets/slides/diva_debug_information_visual_analyzer.pdf).
I don't recall anyone so far making any statements about the quality of the
debug info they've measured or experienced with GlobalISel enabled?

We ran the lldb test suite with GISel. IIRC, Tim has the details, GISel was on
part with SDISel.

Other than the all of the above, I just wanted to mention that Oliver recently
started running csmith on AArch64 GlobalISel and found one issue so far that
already got fixed (https://bugs.llvm.org//show_bug.cgi?id=32733).
If he finds other correctness issues, I'm sure he'll keep on reporting
them.

Great!

Thanks,
-Quentin

I'll try to add the above content to the document Diana created at
https://goo.gl/IS2Bdw too.

Thanks,

Kristof

On 3 Apr 2017, at 17:10, Kristof Beyls <Kristof.Beyls at
arm.com<mailto:Kristof.Beyls at arm.com>> wrote:

I've kicked off a run to compare "-O0 -g" versus "-O0 -g
-mllvm -global-isel -mllvm -global-isel-abort=2".
I've selected the test-suite (albeit a version which is a couple of months
old now) and a few short-running proprietary benchmarks to get data back quickly
for an initial feel of where things are.
This was running on Cortex-A57 AArch64 Linux.

I saw one assertion failure in GlobalISel, see
http://bugs.llvm.org/show_bug.cgi?id=32471. This is in a program compiled at -O2
(my out-dated test-suite still overrides -O0 and instead uses -O for that
program). The root cause of the failure seems to be due to LowLevelType not
supporting vectors of pointers. I think this demonstrates that for correctness,
we should be trying to test more than -O0, or even more than just LLVM-IR
produced by clang, as other front-ends could run into this even at -O0.

Due to this assertion failure and the infrastructure I used, the numbers below
do not include test-suite/MultiSource/Benchmarks results.

On the non-correctness aspects, LNT tells me that:
- The programs that report execution time, on geomean are about 17% slower.
- The programs that report scores, on geomean are about 21% slower.
- Code size is up on geomean about 11%.
I'm afraid I don't have compile time numbers, nor any feel for debug
info quality.

I'll need quite a bit more time to dig into the details to come up with
something actionable, although the fact that LowLevelType doesn't support
vectors of pointers is already actionable.
Nevertheless, I thought to share what I see as is, to see if others see similar
results so far.

I thought Diana was going to look into fallback rate on the test-suite on
AArch64 linux?

Thanks,

Kristof

On 30 Mar 2017, at 10:54, Renato Golin <renato.golin at
linaro.org<mailto:renato.golin at linaro.org>> wrote:

On 30 March 2017 at 00:27, Quentin Colombet <qcolombet at
apple.com<mailto:qcolombet at apple.com>> wrote:
On iOS we are at 100% pass rate in 00 g for the LLVM test suite, standard
benchmarks and unit tests. In about 5% of all functions GlobalIsel falls
back to SDIsel.
(Kristof Beyls would have the linux numbers.)
The self host compiler correctly builds and runs the LLVM test suite in O0.

Having done no tests at all on my side, I think we need to have
similar numbers on Linux to be able to flip across the board.

I don't want to flip it only for Darwin and not Linux, as that will
fragment the effort too much.

I'll check with Diana and Kristof to know what's the best way forward,
but it should be reasonably quick.

cheers,
--renato

_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org<mailto:llvm-dev at lists.llvm.org>
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170509/8fa700ae/attachment.html>

Quentin Colombet via llvm-dev

2017-May-09 18:47 UTC

head link

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

Hi Kristof,
> On May 9, 2017, at 3:41 AM, Kristof Beyls <kristof.beyls at arm.com>
wrote:
> 
> Great Quentin :).
> 
> I've rerun the benchmarks comparing "-O0 -g" with "-O0
-g -mllvm -global-isel -mllvm -global-isel-abort=2" on r302453, on AArch64
Cortex-A57.
> I indeed see almost no moves between GPR and FPR registers anymore (see
details below for where I still see some).
> On geomean, I see 13% slow down (down from 17% on my previous run).
> On geomean, code size increase is about 3% (down from 11% on my previous
run).
> I also checked debug info quality with the DIVA tool on the test-suite; and
asked one of our DS-5 validation guys to test debug experience for GlobalISel in
combination with the ARM DS-5 Development Studio
<https://developer.arm.com/products/software-development-tools/ds-5-development-studio>
debugger. Both of us didn't find any issues with the debug info produced at
-O0 -g with GlobalISel.
> I didn't check compile time, nor do I intend to check that as I
don't have a good infrastructure set up for that.
> 
> Overall, my impression is that GlobalISel is ready to be enabled by default
for AArch64 at -O0, if others also believe it's ready.
Sounds good to me!

Renato, Diana, how does that sound?
> 
> Thanks,
> 
> Kristof
> 
> 
> 
> Detailed observations:
> 
> Debug info analysis with DIVA
<https://github.com/SNSystems/DIVA/blob/master/linux/doc/DIVA_user_guide.pdf>:
> I used DIVA options "-a --show-generated --show-level
--show-location" to dump and diff debug info differences for the test-suite
for "-O0 -g" comparing GlobalISel with non-GlobalISel.
> The only difference I found was on MultiSource/Benchmarks/tramp3d-v4:
>   001 37577         {Function} "operator*<double>" ->
"Zero<double>"
>                         - No declaration
>                         - Template
>   002                 {TemplateParameter} "T" <-
"double"
> - 002 37577           {Parameter} <- "Zero<double>"
>   002 37577           {Parameter} <- "const double &"
> Which indicates that for non-GlobalISel, the Zero<double> parameter
isn't present in the debug info for the instantiation of the below template
function with T=double:
> template<class T>
>   inline Zero<T> operator*(Zero<T>, const T&) { return
Zero<T>(); }
> In other words, the debug info looks ever so slightly better with
GlobalISel.
> Quick analysis of reasons for slow down for the 10 programs that regress
the most when enabling GlobalISel at -O0:
> SingleSource/Benchmarks/BenchmarkGame/nsieve-bits (114% slowdown): no
conversion of divide by power-of-2 to shift right. I think this is an
improvement for -O0: no such optimization should be done when not requesting
optimizations.
> MultiSource/Benchmarks/MallocBench/gs/gs (88%): "interp"
function: switch statement lowered as cascaded-sequence-of-conditional-branches.
> Same issue causes MultiSource/Applications/sqlite3/sqlite3 (71%).
> Same issue causes MultiSource/Applications/lua/lua (46%).
> SingleSource/Benchmarks/Misc/flops-2 (75%): Poor lowering of fneg:
> FastISel:
> ldur d0, [x29,#-16]
> fneg d0, d0
> stur d0, [x29,#-16]
> GlobalISel:
> ldur d0, [x29,#-64]
> orr x8, xzr, #0x8000000000000000
> fmov d1, x8
> fsub d0, d1, d0
> fmov x8, d0
> stur x8, [x29,#-64]
> MultiSource/Benchmarks/Prolangs-C++/city/city (74%): a call to memcpy for
copying 4 bytes is present with GlobalISel that isn't present with FastISel,
in function vehicle::select_move().
> Same issue causes SingleSource/Benchmarks/Shootout-C++/Shootout-C++-moments
(58.5%)
> MultiSource/Applications/siod/siod (72%): Seems to be mainly due to loading
constants in the entry BB, but I'm not sure that indeed is the biggest
cause. (https://bugs.llvm.org//show_bug.cgi?id=32561
<https://bugs.llvm.org//show_bug.cgi?id=32561>)
> MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset (48%): Due
to creating all constants in the entry BB, see function CopyObject.
(https://bugs.llvm.org//show_bug.cgi?id=32561
<https://bugs.llvm.org//show_bug.cgi?id=32561>)
> MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode (46%):
Function Reference_IDCT: Probably due to creating all constants in the entry BB
+ spilling floating point data through an X register:
> FastISel:
> fadd d0, d1, d0
> str d0, [sp,#528]
> GlobalISel:
> fadd d0, d1, d0
> fmov x9, d0
> stur x9, [x29,#-48]
Good finding, I forgot to do stores in my previous fix. I’ll do them shortly.
> 
> Other overall impression when comparing code generation: The code size
increase is probably mainly due to creating constants always in the entry BB of
the function. For functions with many constants, this leads to lots and lots of
constant creation and then immediately spilling it to the stack.
(https://bugs.llvm.org//show_bug.cgi?id=32561
<https://bugs.llvm.org//show_bug.cgi?id=32561>)
About a month (r299283), I mistakenly committed a prototype I was working on to
solve this kind of issue. I reverted it in r299287, but you can give it a try.

Basically, that’s an additional generic pass that runs only at O0. It localizes
(as in basic block local) the definition of the constants to workaround the
limitations of O0 regalloc.

Cheers,
-Quentin > 
> 
>> On 8 May 2017, at 20:38, Quentin Colombet <qcolombet at apple.com
<mailto:qcolombet at apple.com>> wrote:
>> 
>> Hi Kristof,
>> 
>> I made a fix for PR32550, that is much less involved in r302453.
>> 
>> In particular, it does not rely on the greedy mode of the regbankselect
pass, thus there is no compile time implication.
>> 
>> Could you try it and check where we stand?
>> 
>> Thanks,
>> -Quentin
>>> On Apr 27, 2017, at 10:40 AM, Quentin Colombet via llvm-dev
<llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
wrote:
>>> 
>>> Hi Kristof,
>>> 
>>>> On Apr 27, 2017, at 9:47 AM, Kristof Beyls <kristof.beyls at
arm.com <mailto:kristof.beyls at arm.com>> wrote:
>>>> 
>>>> Hi Quentin,
>>>> 
>>>>> On 27 Apr 2017, at 00:48, Quentin Colombet <qcolombet at
apple.com <mailto:qcolombet at apple.com>> wrote:
>>>>> 
>>>>> Hi Kristof,
>>>>> 
>>>>>> On Apr 6, 2017, at 6:53 AM, Kristof Beyls
<kristof.beyls at arm.com <mailto:kristof.beyls at arm.com>> wrote:
>>>>>> 
>>>>>> I've been digging a little bit deeper into the
biggest performance regressions I've observed.
>>>>>> 
>>>>>> What I've observed so far is:
>>>>>> * A lot of the biggest regressions are caused by
unnecessarily moving floating point values through general purpose registers.
I've raised http://bugs.llvm.org/show_bug.cgi?id=32550
<http://bugs.llvm.org/show_bug.cgi?id=32550> for this. I think this one
definitely needs fixing before enabling GlobalISel by default at -O0.
>>>>> 
>>>>> I commented in the PR. This is a known problem and we have
a solution. Given this is an optimization in the sense that it does not affect
the correctness of the program, we didn’t push for fixing it now.
>>>>> 
>>>>> For O0 we wanted to focus ourselves on generating correct
code. Unless the regressions you are seeing are preventing debugging/running of
the program, I wouldn’t block the flip of the switch on that.
>>>>> 
>>>>> What do you think? 
>>>> 
>>>> For O0, I think most users care about fast compile time and an
excellent debug experience.
>>>> Next to that, I am told that some users also use -O0 to have a
straightforward, simple, mapping between source code and the generated assembly.
>>>> 
>>>> Out of the issues I've seen so far, I think this is the
single "optimization" issue that I feel gives a negative first
impression of GlobalISel.
>>>> If users look at the generated assembly for floating point code
it looks more like active de-optimization rather than "no
optimization".
>>>> My guess is also that this may be the biggest reason for the
about 20% performance slow-down and 11% code size increase I measured earlier.
>>>> OTOH, I clearly agree this is an optimization issue, not a
correctness issue.
>>>> Combining the above, I think it would be best if this issue got
fixed before we turn on GlobalISel by default at -O0, or we may end up hearing
quite a few (non-critical) complaints about this from users.
>>>> Basically I think this is a tradeoff between giving a better
first impression of GlobalISel vs getting more people to use and test it
earlier.
>>>> 
>>>> Thanks for the write-up on the PR, that is very useful.
>>>> Do you have any rough idea how much effort would be involved in
getting this fixed?
>>> 
>>> I’d say 2-3 weeks. Could actually be shorter if we don’t do all the
refactoring I have in mind to make the table for the alternative mappings
smaller, but I don’t think it is worth taking any shortcut here.
>>> 
>>>> I got the impression Daniel is making good progress on the
tablegen generation, which is key to getting this issue fixed?
>>> 
>>> To be accurate, Daniel’s work on table gen is for the select phase,
not regbankselect. In other words, right now, all the table for mappings for
regbankselect are hand written and Daniel’s work is not changing that. The
(weak) rationale is that it is not on the critical path :).
>>> 
>>>> I think that no matter whether we decide to switch the default
before or after fixing this issue, this is one of the most urgent issues to fix
as far as I can see.
>>> 
>>> Agree. This is one of the top items of my todo list.
>>> 
>>>> 
>>>> If there is some way I can help contribute to fixing PR32550, I
would like to help; but with the dependency on tablegen generation, I'm not
sure what the best way is to help make that PR get fixed faster?
>>> 
>>> Thanks for offering to help. I agree that with the dependency on
the tabelgen generation in the way, this is probably not a good use of your
time. Depending when I can spare some time on this, I’ll either do it or explain
the refactoring in the google doc.
>>> 
>>> In the meantime, I’d suggest to focus on validating the debug info
on your side.
>>> 
>>>> 
>>>> 
>>>>>> * FastISel seems to transform
division-by-constant-power-of-2 into right shift (see
https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468
<https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468>).
GlobalISel does not. It seems to me that at -O0 there may be reasons not perform
this transformation, but maybe there is a good reason why FastISel does this?
>>>>> 
>>>>> I think FastISel tries to generate the best code it can no
matter what. For GISel O0 however, not doing this optimization sounds sensible
to me.
>>>>> Now, I would say that the same remark as the previous
bullet point apply: we shouldn’t do it unless it gets in the way of
running/debugging the program.
>>>> 
>>>> I agree that these optimizations should not be done at -O0. I
think not doing them is actually an improvement: you give the user what they
asked, i.e. "no optimization", and an as-straigthforward-as-possible
mapping from source to assembly.
>>>> 
>>>>>> * FastISel doesn’t\ seem to handle functions with
switch statements, so it falls back to DAGISel. DAGISel produces code that's
a lot better than GlobalISel for switch statement at -O0. I'm not sure if we
need to do something here before enabling GlobalISel by default. I'm
thinking we may need to add a smarter way to lower switch statements rather than
just a cascaded sequence of conditional branches.
>>>>> 
>>>>> Sounds optimization-ish to me. Same remark.
>>>> 
>>>> Agreed, optimization-ish. In comparison to the above point on
FastISel "peepholes", I think that lowering switch statements to
something else than a cascaded sequence of conditional branches doesn't make
the generated code harder to map to the source. So, in comparison to the above
point on FastISel "peepholes", I'm not actively against being
smarter about lowering switch statements at -O0.
>>>> But I agree, this shouldn't hold up turning on GlobalISel
by default at -O0.
>>>> 
>>>> 
>>>> Other than the above remarks, before turning on GlobalISel by
default, we'd better test/verify that debug info quality remains good.
>>>> I haven't looked into that at all, but am hoping to start
looking into that soon, with the help of the DIVA tool
(https://github.com/SNSystems/DIVA <https://github.com/SNSystems/DIVA>)
presented at last EuroLLVM
(http://llvm.org/devmtg/2017-03//assets/slides/diva_debug_information_visual_analyzer.pdf
<http://llvm.org/devmtg/2017-03//assets/slides/diva_debug_information_visual_analyzer.pdf>).
I don't recall anyone so far making any statements about the quality of the
debug info they've measured or experienced with GlobalISel enabled?
>>> 
>>> We ran the lldb test suite with GISel. IIRC, Tim has the details,
GISel was on part with SDISel.
>>> 
>>>> 
>>>> Other than the all of the above, I just wanted to mention that
Oliver recently started running csmith on AArch64 GlobalISel and found one issue
so far that already got fixed (https://bugs.llvm.org//show_bug.cgi?id=32733
<https://bugs.llvm.org//show_bug.cgi?id=32733>).
>>>> If he finds other correctness issues, I'm sure he'll
keep on reporting them.
>>> 
>>> Great!
>>> 
>>> Thanks,
>>> -Quentin
>>>> 
>>>>>> I'll try to add the above content to the document
Diana created at https://goo.gl/IS2Bdw <https://goo.gl/IS2Bdw> too.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Kristof
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 3 Apr 2017, at 17:10, Kristof Beyls
<Kristof.Beyls at arm.com <mailto:Kristof.Beyls at arm.com>> wrote:
>>>>>>> 
>>>>>>> I've kicked off a run to compare "-O0
-g" versus "-O0 -g -mllvm -global-isel -mllvm
-global-isel-abort=2".
>>>>>>> I've selected the test-suite (albeit a version
which is a couple of months old now) and a few short-running proprietary
benchmarks to get data back quickly for an initial feel of where things are.
>>>>>>> This was running on Cortex-A57 AArch64 Linux.
>>>>>>> 
>>>>>>> I saw one assertion failure in GlobalISel, see
http://bugs.llvm.org/show_bug.cgi?id=32471
<http://bugs.llvm.org/show_bug.cgi?id=32471>. This is in a program
compiled at -O2 (my out-dated test-suite still overrides -O0 and instead uses -O
for that program). The root cause of the failure seems to be due to LowLevelType
not supporting vectors of pointers. I think this demonstrates that for
correctness, we should be trying to test more than -O0, or even more than just
LLVM-IR produced by clang, as other front-ends could run into this even at -O0.
>>>>>>> 
>>>>>>> Due to this assertion failure and the
infrastructure I used, the numbers below do not include
test-suite/MultiSource/Benchmarks results.
>>>>>>> 
>>>>>>> On the non-correctness aspects, LNT tells me that:
>>>>>>> - The programs that report execution time, on
geomean are about 17% slower.
>>>>>>> - The programs that report scores, on geomean are
about 21% slower.
>>>>>>> - Code size is up on geomean about 11%.
>>>>>>> I'm afraid I don't have compile time
numbers, nor any feel for debug info quality.
>>>>>>> 
>>>>>>> I'll need quite a bit more time to dig into the
details to come up with something actionable, although the fact that
LowLevelType doesn't support vectors of pointers is already actionable.
>>>>>>> Nevertheless, I thought to share what I see as is,
to see if others see similar results so far.
>>>>>>> 
>>>>>>> I thought Diana was going to look into fallback
rate on the test-suite on AArch64 linux?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Kristof
>>>>>>> 
>>>>>>>> On 30 Mar 2017, at 10:54, Renato Golin
<renato.golin at linaro.org <mailto:renato.golin at linaro.org>>
wrote:
>>>>>>>> 
>>>>>>>> On 30 March 2017 at 00:27, Quentin Colombet
<qcolombet at apple.com <mailto:qcolombet at apple.com>> wrote:
>>>>>>>>> On iOS we are at 100% pass rate in 00 g for
the LLVM test suite, standard
>>>>>>>>> benchmarks and unit tests. In about 5% of
all functions GlobalIsel falls
>>>>>>>>> back to SDIsel.
>>>>>>>>> (Kristof Beyls would have the linux
numbers.)
>>>>>>>>> The self host compiler correctly builds and
runs the LLVM test suite in O0.
>>>>>>>> 
>>>>>>>> Having done no tests at all on my side, I think
we need to have
>>>>>>>> similar numbers on Linux to be able to flip
across the board.
>>>>>>>> 
>>>>>>>> I don't want to flip it only for Darwin and
not Linux, as that will
>>>>>>>> fragment the effort too much.
>>>>>>>> 
>>>>>>>> I'll check with Diana and Kristof to know
what's the best way forward,
>>>>>>>> but it should be reasonably quick.
>>>>>>>> 
>>>>>>>> cheers,
>>>>>>>> --renato
>>> 
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
> 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170509/c28aeaa1/attachment.html>

Eric Christopher via llvm-dev

2017-May-09 18:55 UTC

head link

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

Hi Quentin,

On Tue, May 9, 2017 at 11:47 AM Quentin Colombet via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hi Kristof,
>
> On May 9, 2017, at 3:41 AM, Kristof Beyls <kristof.beyls at arm.com>
wrote:
>
> Great Quentin :).
>
> I've rerun the benchmarks comparing "-O0 -g" with "-O0
-g -mllvm
> -global-isel -mllvm -global-isel-abort=2" on r302453, on AArch64
Cortex-A57.
> I indeed see almost no moves between GPR and FPR registers anymore (see
> details below for where I still see some).
> On geomean, I see 13% slow down (down from 17% on my previous run).
> On geomean, code size increase is about 3% (down from 11% on my previous
> run).
> I also checked debug info quality with the DIVA tool on the test-suite;
> and asked one of our DS-5 validation guys to test debug experience for
> GlobalISel in combination with the ARM DS-5 Development Studio
>
<https://developer.arm.com/products/software-development-tools/ds-5-development-studio>
debugger.
> Both of us didn't find any issues with the debug info produced at -O0
-g
> with GlobalISel.
> I didn't check compile time, nor do I intend to check that as I
don't have
> a good infrastructure set up for that.
>
> Overall, my impression is that GlobalISel is ready to be enabled by
> default for AArch64 at -O0, if others also believe it's ready.
>
>
> Sounds good to me!
>
> Renato, Diana, how does that sound?
>
I'd definitely like to give this a run through on my side as well before
you flip the switch. I'll try to do that this week.

-eric

>
>
> Thanks,
>
> Kristof
>
>
>
> Detailed observations:
>
>
>    1. Debug info analysis with DIVA
>   
<https://github.com/SNSystems/DIVA/blob/master/linux/doc/DIVA_user_guide.pdf>
>    :
>    I used DIVA options "-a --show-generated --show-level
--show-location"
>    to dump and diff debug info differences for the test-suite for "-O0
-g"
>    comparing GlobalISel with non-GlobalISel.
>    The only difference I found was on MultiSource/Benchmarks/tramp3d-v4:
>      001 37577         {Function} "operator*<double>" ->
"Zero<double>"
>                            - No declaration
>                            - Template
>      002                 {TemplateParameter} "T" <-
"double"
>    - 002 37577           {Parameter} <- "Zero<double>"
>      002 37577           {Parameter} <- "const double &"
>    Which indicates that for non-GlobalISel, the Zero<double>
parameter
>    isn't present in the debug info for the instantiation of the below
template
>    function with T=double:
>    template<class T>
>      inline Zero<T> operator*(Zero<T>, const T&) { return
Zero<T>(); }
>    In other words, the debug info looks ever so slightly better with
>    GlobalISel.
>    2. Quick analysis of reasons for slow down for the 10 programs that
>    regress the most when enabling GlobalISel at -O0:
>       - SingleSource/Benchmarks/BenchmarkGame/nsieve-bits (114%
>       slowdown): no conversion of divide by power-of-2 to shift right. I
think
>       this is an improvement for -O0: no such optimization should be done
when
>       not requesting optimizations.
>       - MultiSource/Benchmarks/MallocBench/gs/gs (88%): "interp"
>       function: switch statement lowered as
>       cascaded-sequence-of-conditional-branches.
>       Same issue causes MultiSource/Applications/sqlite3/sqlite3 (71%).
>       Same issue causes MultiSource/Applications/lua/lua (46%).
>       - SingleSource/Benchmarks/Misc/flops-2 (75%): Poor lowering of
>       fneg:
>          - FastISel:
>          ldur d0, [x29,#-16]
>          fneg d0, d0
>          stur d0, [x29,#-16]
>          - GlobalISel:
>          ldur d0, [x29,#-64]
>          orr x8, xzr, #0x8000000000000000
>          fmov d1, x8
>          fsub d0, d1, d0
>          fmov x8, d0
>          stur x8, [x29,#-64]
>       - MultiSource/Benchmarks/Prolangs-C++/city/city (74%): a call to
>       memcpy for copying 4 bytes is present with GlobalISel that isn't
present
>       with FastISel, in function vehicle::select_move().
>       Same issue causes
>       SingleSource/Benchmarks/Shootout-C++/Shootout-C++-moments (58.5%)
>       - MultiSource/Applications/siod/siod (72%): Seems to be mainly due
>       to loading constants in the entry BB, but I'm not sure that
indeed is the
>       biggest cause. (https://bugs.llvm.org//show_bug.cgi?id=32561)
>       - MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset
>       (48%): Due to creating all constants in the entry BB, see function
>       CopyObject. (https://bugs.llvm.org//show_bug.cgi?id=32561)
>       - MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode
>       (46%): Function Reference_IDCT: Probably due to creating all
constants in
>       the entry BB + spilling floating point data through an X register:
>          - FastISel:
>          fadd d0, d1, d0
>          str d0, [sp,#528]
>          - GlobalISel:
>          fadd d0, d1, d0
>          fmov x9, d0
>          stur x9, [x29,#-48]
>
>
> Good finding, I forgot to do stores in my previous fix. I’ll do them
> shortly.
>
>
>    1.
>       -
>    2. Other overall impression when comparing code generation: The code
>    size increase is probably mainly due to creating constants always in the
>    entry BB of the function. For functions with many constants, this leads
to
>    lots and lots of constant creation and then immediately spilling it to
the
>    stack. (https://bugs.llvm.org//show_bug.cgi?id=32561)
>
>
> About a month (r299283), I mistakenly committed a prototype I was working
> on to solve this kind of issue. I reverted it in r299287, but you can
> give it a try.
>
> Basically, that’s an additional generic pass that runs only at O0. It
> localizes (as in basic block local) the definition of the constants to
> workaround the limitations of O0 regalloc.
>
> Cheers,
> -Quentin
>
>
>
> On 8 May 2017, at 20:38, Quentin Colombet <qcolombet at apple.com>
wrote:
>
> Hi Kristof,
>
> I made a fix for PR32550, that is much less involved in r302453.
>
> In particular, it does not rely on the greedy mode of the regbankselect
> pass, thus there is no compile time implication.
>
> Could you try it and check where we stand?
>
> Thanks,
> -Quentin
>
> On Apr 27, 2017, at 10:40 AM, Quentin Colombet via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Hi Kristof,
>
> On Apr 27, 2017, at 9:47 AM, Kristof Beyls <kristof.beyls at arm.com>
wrote:
>
> Hi Quentin,
>
> On 27 Apr 2017, at 00:48, Quentin Colombet <qcolombet at apple.com>
wrote:
>
> Hi Kristof,
>
> On Apr 6, 2017, at 6:53 AM, Kristof Beyls <kristof.beyls at arm.com>
wrote:
>
> I've been digging a little bit deeper into the biggest performance
> regressions I've observed.
>
> What I've observed so far is:
> * A lot of the biggest regressions are caused by unnecessarily moving
> floating point values through general purpose registers. I've raised
> http://bugs.llvm.org/show_bug.cgi?id=32550 for this. I think this one
> definitely needs fixing before enabling GlobalISel by default at -O0.
>
>
> I commented in the PR. This is a known problem and we have a solution.
> Given this is an optimization in the sense that it does not affect the
> correctness of the program, we didn’t push for fixing it now.
>
> For O0 we wanted to focus ourselves on generating correct code. Unless the
> regressions you are seeing are preventing debugging/running of the program,
> I wouldn’t block the flip of the switch on that.
>
> What do you think?
>
>
> For O0, I think most users care about fast compile time and an excellent
> debug experience.
> Next to that, I am told that some users also use -O0 to have a
> straightforward, simple, mapping between source code and the generated
> assembly.
>
> Out of the issues I've seen so far, I think this is the single
> "optimization" issue that I feel gives a negative first
impression of
> GlobalISel.
> If users look at the generated assembly for floating point code it looks
> more like active de-optimization rather than "no optimization".
> My guess is also that this may be the biggest reason for the about 20%
> performance slow-down and 11% code size increase I measured earlier.
> OTOH, I clearly agree this is an optimization issue, not a correctness
> issue.
> Combining the above, I think it would be best if this issue got fixed
> before we turn on GlobalISel by default at -O0, or we may end up hearing
> quite a few (non-critical) complaints about this from users.
> Basically I think this is a tradeoff between giving a better first
> impression of GlobalISel vs getting more people to use and test it earlier.
>
> Thanks for the write-up on the PR, that is very useful.
> Do you have any rough idea how much effort would be involved in getting
> this fixed?
>
>
> I’d say 2-3 weeks. Could actually be shorter if we don’t do all the
> refactoring I have in mind to make the table for the alternative mappings
> smaller, but I don’t think it is worth taking any shortcut here.
>
> I got the impression Daniel is making good progress on the tablegen
> generation, which is key to getting this issue fixed?
>
>
> To be accurate, Daniel’s work on table gen is for the select phase, not
> regbankselect. In other words, right now, all the table for mappings for
> regbankselect are hand written and Daniel’s work is not changing that. The
> (weak) rationale is that it is not on the critical path :).
>
> I think that no matter whether we decide to switch the default before or
> after fixing this issue, this is one of the most urgent issues to fix as
> far as I can see.
>
>
> Agree. This is one of the top items of my todo list.
>
>
> If there is some way I can help contribute to fixing PR32550, I would like
> to help; but with the dependency on tablegen generation, I'm not sure
what
> the best way is to help make that PR get fixed faster?
>
>
> Thanks for offering to help. I agree that with the dependency on the
> tabelgen generation in the way, this is probably not a good use of your
> time. Depending when I can spare some time on this, I’ll either do it or
> explain the refactoring in the google doc.
>
> In the meantime, I’d suggest to focus on validating the debug info on your
> side.
>
>
>
> * FastISel seems to transform division-by-constant-power-of-2 into right
> shift (see
>
https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468).
> GlobalISel does not. It seems to me that at -O0 there may be reasons not
> perform this transformation, but maybe there is a good reason why FastISel
> does this?
>
>
> I think FastISel tries to generate the best code it can no matter what.
> For GISel O0 however, not doing this optimization sounds sensible to me.
> Now, I would say that the same remark as the previous bullet point apply:
> we shouldn’t do it unless it gets in the way of running/debugging the
> program.
>
>
> I agree that these optimizations should not be done at -O0. I think not
> doing them is actually an improvement: you give the user what they asked,
> i.e. "no optimization", and an as-straigthforward-as-possible
mapping from
> source to assembly.
>
> * FastISel doesn’t\ seem to handle functions with switch statements, so it
> falls back to DAGISel. DAGISel produces code that's a lot better than
> GlobalISel for switch statement at -O0. I'm not sure if we need to do
> something here before enabling GlobalISel by default. I'm thinking we
may
> need to add a smarter way to lower switch statements rather than just a
> cascaded sequence of conditional branches.
>
>
> Sounds optimization-ish to me. Same remark.
>
>
> Agreed, optimization-ish. In comparison to the above point on FastISel
> "peepholes", I think that lowering switch statements to something
else than
> a cascaded sequence of conditional branches doesn't make the generated
code
> harder to map to the source. So, in comparison to the above point on
> FastISel "peepholes", I'm not actively against being smarter
about lowering
> switch statements at -O0.
> But I agree, this shouldn't hold up turning on GlobalISel by default at
> -O0.
>
>
> Other than the above remarks, before turning on GlobalISel by default,
> we'd better test/verify that debug info quality remains good.
> I haven't looked into that at all, but am hoping to start looking into
> that soon, with the help of the DIVA tool (
> https://github.com/SNSystems/DIVA) presented at last EuroLLVM (
>
http://llvm.org/devmtg/2017-03//assets/slides/diva_debug_information_visual_analyzer.pdf).
> I don't recall anyone so far making any statements about the quality of
the
> debug info they've measured or experienced with GlobalISel enabled?
>
>
> We ran the lldb test suite with GISel. IIRC, Tim has the details, GISel
> was on part with SDISel.
>
>
> Other than the all of the above, I just wanted to mention that Oliver
> recently started running csmith on AArch64 GlobalISel and found one issue
> so far that already got fixed (
> https://bugs.llvm.org//show_bug.cgi?id=32733).
> If he finds other correctness issues, I'm sure he'll keep on
reporting
> them.
>
>
> Great!
>
> Thanks,
> -Quentin
>
>
> I'll try to add the above content to the document Diana created at
> https://goo.gl/IS2Bdw too.
>
> Thanks,
>
> Kristof
>
>
>
> On 3 Apr 2017, at 17:10, Kristof Beyls <Kristof.Beyls at arm.com>
wrote:
>
> I've kicked off a run to compare "-O0 -g" versus "-O0 -g
-mllvm
> -global-isel -mllvm -global-isel-abort=2".
> I've selected the test-suite (albeit a version which is a couple of
months
> old now) and a few short-running proprietary benchmarks to get data back
> quickly for an initial feel of where things are.
> This was running on Cortex-A57 AArch64 Linux.
>
> I saw one assertion failure in GlobalISel, see
> http://bugs.llvm.org/show_bug.cgi?id=32471. This is in a program compiled
> at -O2 (my out-dated test-suite still overrides -O0 and instead uses -O for
> that program). The root cause of the failure seems to be due to
> LowLevelType not supporting vectors of pointers. I think this demonstrates
> that for correctness, we should be trying to test more than -O0, or even
> more than just LLVM-IR produced by clang, as other front-ends could run
> into this even at -O0.
>
> Due to this assertion failure and the infrastructure I used, the numbers
> below do not include test-suite/MultiSource/Benchmarks results.
>
> On the non-correctness aspects, LNT tells me that:
> - The programs that report execution time, on geomean are about 17% slower.
> - The programs that report scores, on geomean are about 21% slower.
> - Code size is up on geomean about 11%.
> I'm afraid I don't have compile time numbers, nor any feel for
debug info
> quality.
>
> I'll need quite a bit more time to dig into the details to come up with
> something actionable, although the fact that LowLevelType doesn't
support
> vectors of pointers is already actionable.
> Nevertheless, I thought to share what I see as is, to see if others see
> similar results so far.
>
> I thought Diana was going to look into fallback rate on the test-suite on
> AArch64 linux?
>
> Thanks,
>
> Kristof
>
> On 30 Mar 2017, at 10:54, Renato Golin <renato.golin at linaro.org>
wrote:
>
> On 30 March 2017 at 00:27, Quentin Colombet <qcolombet at apple.com>
wrote:
>
> On iOS we are at 100% pass rate in 00 g for the LLVM test suite, standard
> benchmarks and unit tests. In about 5% of all functions GlobalIsel falls
> back to SDIsel.
> (Kristof Beyls would have the linux numbers.)
> The self host compiler correctly builds and runs the LLVM test suite in O0.
>
>
> Having done no tests at all on my side, I think we need to have
> similar numbers on Linux to be able to flip across the board.
>
> I don't want to flip it only for Darwin and not Linux, as that will
> fragment the effort too much.
>
> I'll check with Diana and Kristof to know what's the best way
forward,
> but it should be reasonably quick.
>
> cheers,
> --renato
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170509/70a9ed5b/attachment-0001.html>

Renato Golin via llvm-dev

2017-May-09 19:34 UTC

head link

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

On 9 May 2017 at 19:47, Quentin Colombet <qcolombet at apple.com>
wrote:> Sounds good to me!
>
> Renato, Diana, how does that sound?
Hi Quentin,

I'll defer to Diana, as she was looking into this. If she's happy,
I'm happy. :)

cheers,
--renato

Quentin Colombet via llvm-dev

2017-May-10 15:36 UTC

head link

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

> On May 9, 2017, at 11:47 AM, Quentin Colombet via llvm-dev <llvm-dev at
lists.llvm.org> wrote:
> 
> Hi Kristof,
> 
>> On May 9, 2017, at 3:41 AM, Kristof Beyls <kristof.beyls at arm.com
<mailto:kristof.beyls at arm.com>> wrote:
>> 
>> Great Quentin :).
>> 
>> I've rerun the benchmarks comparing "-O0 -g" with
"-O0 -g -mllvm -global-isel -mllvm -global-isel-abort=2" on r302453,
on AArch64 Cortex-A57.
>> I indeed see almost no moves between GPR and FPR registers anymore (see
details below for where I still see some).
>> On geomean, I see 13% slow down (down from 17% on my previous run).
>> On geomean, code size increase is about 3% (down from 11% on my
previous run).
>> I also checked debug info quality with the DIVA tool on the test-suite;
and asked one of our DS-5 validation guys to test debug experience for
GlobalISel in combination with the ARM DS-5 Development Studio
<https://developer.arm.com/products/software-development-tools/ds-5-development-studio>
debugger. Both of us didn't find any issues with the debug info produced at
-O0 -g with GlobalISel.
>> I didn't check compile time, nor do I intend to check that as I
don't have a good infrastructure set up for that.
>> 
>> Overall, my impression is that GlobalISel is ready to be enabled by
default for AArch64 at -O0, if others also believe it's ready.
> 
> Sounds good to me!
> 
> Renato, Diana, how does that sound?
> 
>> 
>> Thanks,
>> 
>> Kristof
>> 
>> 
>> 
>> Detailed observations:
>> 
>> Debug info analysis with DIVA
<https://github.com/SNSystems/DIVA/blob/master/linux/doc/DIVA_user_guide.pdf>:
>> I used DIVA options "-a --show-generated --show-level
--show-location" to dump and diff debug info differences for the test-suite
for "-O0 -g" comparing GlobalISel with non-GlobalISel.
>> The only difference I found was on MultiSource/Benchmarks/tramp3d-v4:
>>   001 37577         {Function} "operator*<double>"
-> "Zero<double>"
>>                         - No declaration
>>                         - Template
>>   002                 {TemplateParameter} "T" <-
"double"
>> - 002 37577           {Parameter} <- "Zero<double>"
>>   002 37577           {Parameter} <- "const double &"
>> Which indicates that for non-GlobalISel, the Zero<double>
parameter isn't present in the debug info for the instantiation of the below
template function with T=double:
>> template<class T>
>>   inline Zero<T> operator*(Zero<T>, const T&) { return
Zero<T>(); }
>> In other words, the debug info looks ever so slightly better with
GlobalISel.
>> Quick analysis of reasons for slow down for the 10 programs that
regress the most when enabling GlobalISel at -O0:
>> SingleSource/Benchmarks/BenchmarkGame/nsieve-bits (114% slowdown): no
conversion of divide by power-of-2 to shift right. I think this is an
improvement for -O0: no such optimization should be done when not requesting
optimizations.
>> MultiSource/Benchmarks/MallocBench/gs/gs (88%): "interp"
function: switch statement lowered as cascaded-sequence-of-conditional-branches.
>> Same issue causes MultiSource/Applications/sqlite3/sqlite3 (71%).
>> Same issue causes MultiSource/Applications/lua/lua (46%).
>> SingleSource/Benchmarks/Misc/flops-2 (75%): Poor lowering of fneg:
>> FastISel:
>> ldur d0, [x29,#-16]
>> fneg d0, d0
>> stur d0, [x29,#-16]
>> GlobalISel:
>> ldur d0, [x29,#-64]
>> orr x8, xzr, #0x8000000000000000
>> fmov d1, x8
>> fsub d0, d1, d0
>> fmov x8, d0
>> stur x8, [x29,#-64]
>> MultiSource/Benchmarks/Prolangs-C++/city/city (74%): a call to memcpy
for copying 4 bytes is present with GlobalISel that isn't present with
FastISel, in function vehicle::select_move().
>> Same issue causes
SingleSource/Benchmarks/Shootout-C++/Shootout-C++-moments (58.5%)
>> MultiSource/Applications/siod/siod (72%): Seems to be mainly due to
loading constants in the entry BB, but I'm not sure that indeed is the
biggest cause. (https://bugs.llvm.org//show_bug.cgi?id=32561
<https://bugs.llvm.org//show_bug.cgi?id=32561>)
>> MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset (48%):
Due to creating all constants in the entry BB, see function CopyObject.
(https://bugs.llvm.org//show_bug.cgi?id=32561
<https://bugs.llvm.org//show_bug.cgi?id=32561>)
>> MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode (46%):
Function Reference_IDCT: Probably due to creating all constants in the entry BB
+ spilling floating point data through an X register:
>> FastISel:
>> fadd d0, d1, d0
>> str d0, [sp,#528]
>> GlobalISel:
>> fadd d0, d1, d0
>> fmov x9, d0
>> stur x9, [x29,#-48]
> 
> Good finding, I forgot to do stores in my previous fix. I’ll do them
shortly.
Should be fixed by r302679
> 
>> 
>> Other overall impression when comparing code generation: The code size
increase is probably mainly due to creating constants always in the entry BB of
the function. For functions with many constants, this leads to lots and lots of
constant creation and then immediately spilling it to the stack.
(https://bugs.llvm.org//show_bug.cgi?id=32561
<https://bugs.llvm.org//show_bug.cgi?id=32561>)
> 
> About a month (r299283), I mistakenly committed a prototype I was working
on to solve this kind of issue. I reverted it in r299287, but you can give it a
try.
> 
> Basically, that’s an additional generic pass that runs only at O0. It
localizes (as in basic block local) the definition of the constants to
workaround the limitations of O0 regalloc.
> 
> Cheers,
> -Quentin 
>> 
>> 
>>> On 8 May 2017, at 20:38, Quentin Colombet <qcolombet at
apple.com <mailto:qcolombet at apple.com>> wrote:
>>> 
>>> Hi Kristof,
>>> 
>>> I made a fix for PR32550, that is much less involved in r302453.
>>> 
>>> In particular, it does not rely on the greedy mode of the
regbankselect pass, thus there is no compile time implication.
>>> 
>>> Could you try it and check where we stand?
>>> 
>>> Thanks,
>>> -Quentin
>>>> On Apr 27, 2017, at 10:40 AM, Quentin Colombet via llvm-dev
<llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>>
wrote:
>>>> 
>>>> Hi Kristof,
>>>> 
>>>>> On Apr 27, 2017, at 9:47 AM, Kristof Beyls
<kristof.beyls at arm.com <mailto:kristof.beyls at arm.com>> wrote:
>>>>> 
>>>>> Hi Quentin,
>>>>> 
>>>>>> On 27 Apr 2017, at 00:48, Quentin Colombet
<qcolombet at apple.com <mailto:qcolombet at apple.com>> wrote:
>>>>>> 
>>>>>> Hi Kristof,
>>>>>> 
>>>>>>> On Apr 6, 2017, at 6:53 AM, Kristof Beyls
<kristof.beyls at arm.com <mailto:kristof.beyls at arm.com>> wrote:
>>>>>>> 
>>>>>>> I've been digging a little bit deeper into the
biggest performance regressions I've observed.
>>>>>>> 
>>>>>>> What I've observed so far is:
>>>>>>> * A lot of the biggest regressions are caused by
unnecessarily moving floating point values through general purpose registers.
I've raised http://bugs.llvm.org/show_bug.cgi?id=32550
<http://bugs.llvm.org/show_bug.cgi?id=32550> for this. I think this one
definitely needs fixing before enabling GlobalISel by default at -O0.
>>>>>> 
>>>>>> I commented in the PR. This is a known problem and we
have a solution. Given this is an optimization in the sense that it does not
affect the correctness of the program, we didn’t push for fixing it now.
>>>>>> 
>>>>>> For O0 we wanted to focus ourselves on generating
correct code. Unless the regressions you are seeing are preventing
debugging/running of the program, I wouldn’t block the flip of the switch on
that.
>>>>>> 
>>>>>> What do you think? 
>>>>> 
>>>>> For O0, I think most users care about fast compile time and
an excellent debug experience.
>>>>> Next to that, I am told that some users also use -O0 to
have a straightforward, simple, mapping between source code and the generated
assembly.
>>>>> 
>>>>> Out of the issues I've seen so far, I think this is the
single "optimization" issue that I feel gives a negative first
impression of GlobalISel.
>>>>> If users look at the generated assembly for floating point
code it looks more like active de-optimization rather than "no
optimization".
>>>>> My guess is also that this may be the biggest reason for
the about 20% performance slow-down and 11% code size increase I measured
earlier.
>>>>> OTOH, I clearly agree this is an optimization issue, not a
correctness issue.
>>>>> Combining the above, I think it would be best if this issue
got fixed before we turn on GlobalISel by default at -O0, or we may end up
hearing quite a few (non-critical) complaints about this from users.
>>>>> Basically I think this is a tradeoff between giving a
better first impression of GlobalISel vs getting more people to use and test it
earlier.
>>>>> 
>>>>> Thanks for the write-up on the PR, that is very useful.
>>>>> Do you have any rough idea how much effort would be
involved in getting this fixed?
>>>> 
>>>> I’d say 2-3 weeks. Could actually be shorter if we don’t do all
the refactoring I have in mind to make the table for the alternative mappings
smaller, but I don’t think it is worth taking any shortcut here.
>>>> 
>>>>> I got the impression Daniel is making good progress on the
tablegen generation, which is key to getting this issue fixed?
>>>> 
>>>> To be accurate, Daniel’s work on table gen is for the select
phase, not regbankselect. In other words, right now, all the table for mappings
for regbankselect are hand written and Daniel’s work is not changing that. The
(weak) rationale is that it is not on the critical path :).
>>>> 
>>>>> I think that no matter whether we decide to switch the
default before or after fixing this issue, this is one of the most urgent issues
to fix as far as I can see.
>>>> 
>>>> Agree. This is one of the top items of my todo list.
>>>> 
>>>>> 
>>>>> If there is some way I can help contribute to fixing
PR32550, I would like to help; but with the dependency on tablegen generation,
I'm not sure what the best way is to help make that PR get fixed faster?
>>>> 
>>>> Thanks for offering to help. I agree that with the dependency
on the tabelgen generation in the way, this is probably not a good use of your
time. Depending when I can spare some time on this, I’ll either do it or explain
the refactoring in the google doc.
>>>> 
>>>> In the meantime, I’d suggest to focus on validating the debug
info on your side.
>>>> 
>>>>> 
>>>>> 
>>>>>>> * FastISel seems to transform
division-by-constant-power-of-2 into right shift (see
https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468
<https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/FastISel.cpp#L456-L468>).
GlobalISel does not. It seems to me that at -O0 there may be reasons not perform
this transformation, but maybe there is a good reason why FastISel does this?
>>>>>> 
>>>>>> I think FastISel tries to generate the best code it can
no matter what. For GISel O0 however, not doing this optimization sounds
sensible to me.
>>>>>> Now, I would say that the same remark as the previous
bullet point apply: we shouldn’t do it unless it gets in the way of
running/debugging the program.
>>>>> 
>>>>> I agree that these optimizations should not be done at -O0.
I think not doing them is actually an improvement: you give the user what they
asked, i.e. "no optimization", and an as-straigthforward-as-possible
mapping from source to assembly.
>>>>> 
>>>>>>> * FastISel doesn’t\ seem to handle functions with
switch statements, so it falls back to DAGISel. DAGISel produces code that's
a lot better than GlobalISel for switch statement at -O0. I'm not sure if we
need to do something here before enabling GlobalISel by default. I'm
thinking we may need to add a smarter way to lower switch statements rather than
just a cascaded sequence of conditional branches.
>>>>>> 
>>>>>> Sounds optimization-ish to me. Same remark.
>>>>> 
>>>>> Agreed, optimization-ish. In comparison to the above point
on FastISel "peepholes", I think that lowering switch statements to
something else than a cascaded sequence of conditional branches doesn't make
the generated code harder to map to the source. So, in comparison to the above
point on FastISel "peepholes", I'm not actively against being
smarter about lowering switch statements at -O0.
>>>>> But I agree, this shouldn't hold up turning on
GlobalISel by default at -O0.
>>>>> 
>>>>> 
>>>>> Other than the above remarks, before turning on GlobalISel
by default, we'd better test/verify that debug info quality remains good.
>>>>> I haven't looked into that at all, but am hoping to
start looking into that soon, with the help of the DIVA tool
(https://github.com/SNSystems/DIVA <https://github.com/SNSystems/DIVA>)
presented at last EuroLLVM
(http://llvm.org/devmtg/2017-03//assets/slides/diva_debug_information_visual_analyzer.pdf
<http://llvm.org/devmtg/2017-03//assets/slides/diva_debug_information_visual_analyzer.pdf>).
I don't recall anyone so far making any statements about the quality of the
debug info they've measured or experienced with GlobalISel enabled?
>>>> 
>>>> We ran the lldb test suite with GISel. IIRC, Tim has the
details, GISel was on part with SDISel.
>>>> 
>>>>> 
>>>>> Other than the all of the above, I just wanted to mention
that Oliver recently started running csmith on AArch64 GlobalISel and found one
issue so far that already got fixed
(https://bugs.llvm.org//show_bug.cgi?id=32733
<https://bugs.llvm.org//show_bug.cgi?id=32733>).
>>>>> If he finds other correctness issues, I'm sure
he'll keep on reporting them.
>>>> 
>>>> Great!
>>>> 
>>>> Thanks,
>>>> -Quentin
>>>>> 
>>>>>>> I'll try to add the above content to the
document Diana created at https://goo.gl/IS2Bdw <https://goo.gl/IS2Bdw>
too.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Kristof
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 3 Apr 2017, at 17:10, Kristof Beyls
<Kristof.Beyls at arm.com <mailto:Kristof.Beyls at arm.com>> wrote:
>>>>>>>> 
>>>>>>>> I've kicked off a run to compare "-O0
-g" versus "-O0 -g -mllvm -global-isel -mllvm
-global-isel-abort=2".
>>>>>>>> I've selected the test-suite (albeit a
version which is a couple of months old now) and a few short-running proprietary
benchmarks to get data back quickly for an initial feel of where things are.
>>>>>>>> This was running on Cortex-A57 AArch64 Linux.
>>>>>>>> 
>>>>>>>> I saw one assertion failure in GlobalISel, see
http://bugs.llvm.org/show_bug.cgi?id=32471
<http://bugs.llvm.org/show_bug.cgi?id=32471>. This is in a program
compiled at -O2 (my out-dated test-suite still overrides -O0 and instead uses -O
for that program). The root cause of the failure seems to be due to LowLevelType
not supporting vectors of pointers. I think this demonstrates that for
correctness, we should be trying to test more than -O0, or even more than just
LLVM-IR produced by clang, as other front-ends could run into this even at -O0.
>>>>>>>> 
>>>>>>>> Due to this assertion failure and the
infrastructure I used, the numbers below do not include
test-suite/MultiSource/Benchmarks results.
>>>>>>>> 
>>>>>>>> On the non-correctness aspects, LNT tells me
that:
>>>>>>>> - The programs that report execution time, on
geomean are about 17% slower.
>>>>>>>> - The programs that report scores, on geomean
are about 21% slower.
>>>>>>>> - Code size is up on geomean about 11%.
>>>>>>>> I'm afraid I don't have compile time
numbers, nor any feel for debug info quality.
>>>>>>>> 
>>>>>>>> I'll need quite a bit more time to dig into
the details to come up with something actionable, although the fact that
LowLevelType doesn't support vectors of pointers is already actionable.
>>>>>>>> Nevertheless, I thought to share what I see as
is, to see if others see similar results so far.
>>>>>>>> 
>>>>>>>> I thought Diana was going to look into fallback
rate on the test-suite on AArch64 linux?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Kristof
>>>>>>>> 
>>>>>>>>> On 30 Mar 2017, at 10:54, Renato Golin
<renato.golin at linaro.org <mailto:renato.golin at linaro.org>>
wrote:
>>>>>>>>> 
>>>>>>>>> On 30 March 2017 at 00:27, Quentin Colombet
<qcolombet at apple.com <mailto:qcolombet at apple.com>> wrote:
>>>>>>>>>> On iOS we are at 100% pass rate in 00 g
for the LLVM test suite, standard
>>>>>>>>>> benchmarks and unit tests. In about 5%
of all functions GlobalIsel falls
>>>>>>>>>> back to SDIsel.
>>>>>>>>>> (Kristof Beyls would have the linux
numbers.)
>>>>>>>>>> The self host compiler correctly builds
and runs the LLVM test suite in O0.
>>>>>>>>> 
>>>>>>>>> Having done no tests at all on my side, I
think we need to have
>>>>>>>>> similar numbers on Linux to be able to flip
across the board.
>>>>>>>>> 
>>>>>>>>> I don't want to flip it only for Darwin
and not Linux, as that will
>>>>>>>>> fragment the effort too much.
>>>>>>>>> 
>>>>>>>>> I'll check with Diana and Kristof to
know what's the best way forward,
>>>>>>>>> but it should be reasonably quick.
>>>>>>>>> 
>>>>>>>>> cheers,
>>>>>>>>> --renato
>>>> 
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org <mailto:llvm-dev at
lists.llvm.org>
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>
>> 
> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
<http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev>-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170510/7093f825/attachment.html>

Apparently Analagous Threads

Search for more maybe matching threads

llvm dev - May 2017 - [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

Apparently Analagous Threads