thr3ads.net - llvm dev - [llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try! [Jun 2017]

Diana Picus via llvm-dev

2017-Jun-14 14:27 UTC

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

On 12 June 2017 at 18:54, Diana Picus <diana.picus at linaro.org> wrote:
> Hi all,
>
> I added a buildbot [1] running the test-suite with -O0 -global-isel. It
> runs into the same 2 timeouts that I reported previously on this thread
> (paq8p and scimark2). It would be nice to make it green before flipping the
> switch.
>
I did some more investigations on a machine similar to the one running the
buildbot. For paq8p and scimark2, I get these results for O0:

PAQ8p:
Fast isel: 666.344
Global isel: 731.384

SciMark2-C:
Fast isel: 463.908
Global isel: 496.22

The current timeout is 500s (so in this particular case we didn't hit it
for scimark2, and it ran successfully to completion). I don't think the
difference between FastISel and GlobalISel is too atrocious, so I would
propose increasing the timeout for these 2 benchmarks. I'm not sure if we
can do this on a per-bot basis, but I see some precedent for setting custom
timeout thresholds for various benchmarks on different architectures
(sometimes with comments that it's done so we can run O0 on that particular
benchmark).
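
As a rough check on the numbers above, the following Python sketch computes the relative slowdown and compares each run with the 500s timeout. It assumes the figures are wall-clock seconds (the message does not state the unit, but it compares them with the 500s timeout); the names `results`, `slowdown_percent` and `summarize` are made up for this illustration.

```python
# Times copied from the message above; assumed to be seconds.
TIMEOUT_S = 500.0  # current timeout quoted in the message

results = {
    "PAQ8p": {"fast_isel": 666.344, "global_isel": 731.384},
    "SciMark2-C": {"fast_isel": 463.908, "global_isel": 496.22},
}


def slowdown_percent(fast, glob):
    """Relative increase of GlobalISel over FastISel, in percent."""
    return (glob - fast) / fast * 100.0


def summarize(results, timeout):
    """One summary row per benchmark: slowdown and timeout status."""
    rows = []
    for name, t in results.items():
        rows.append({
            "benchmark": name,
            "slowdown_pct": round(slowdown_percent(t["fast_isel"], t["global_isel"]), 2),
            "fast_isel_times_out": t["fast_isel"] > timeout,
            "global_isel_times_out": t["global_isel"] > timeout,
            "headroom_s": round(timeout - t["global_isel"], 3),
        })
    return rows


for row in summarize(results, TIMEOUT_S):
    print(row)
```

On these figures GlobalISel is about 9.8% slower on PAQ8p and about 7.0% slower on SciMark2-C. PAQ8p is over the 500s limit with either selector, and SciMark2-C finishes with under 4s to spare.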

Something along these lines works:
https://reviews.llvm.org/differential/diff/102547/

What do you guys think about this approach?

Thanks,
Diana

PS: The buildbot is using the Makefiles because that's what our other
AArch64 test-suite bots use. Moving all of them to CMake is a transition
for another time.

> At the moment, it lives in an internal buildmaster that I've set up for
> this purpose. If we fix it and it proves to be stable for a week or two,
> I'll move it to the public master.
>
> Cheers,
> Diana
>
> [1] http://master2.llvm.validation.linaro.org/builders/clang-cmake-aarch64-global-isel
>
>
> On 6 June 2017 at 19:11, Quentin Colombet <qcolombet at apple.com>
> wrote:
>
>> Thanks Kristof.
>>
>> Sounds like we'll need to investigate, though I'd say it is not blocking
>> the switch.
>>
>> At this point I think everybody is on board to flip the switch.
>> @Eric, how does that sound to you?
>>
>> Thanks,
>> Q
>>
>> On 1 June 2017 at 07:46, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
>>
>>
>> On 31 May 2017, at 17:07, Quentin Colombet <qcolombet at apple.com> wrote:
>>
>>
>> Latest comparisons on my side, after picking up r304244, i.e. the correct
>> Localizer pass.
>> * CTMark compile time, comparing "-O0 -g" vs "-O0 -g -mllvm
>> -global-isel=true -mllvm -global-isel-abort=0": about 6% increase with
>> GlobalISel. This was about 3.5% before the Localizer pass landed.
>>
>>
>> That one is surprising too. I wouldn’t have expected this pass to show up
>> in the compile time profile. At least not to this extent.
>> What is the biggest offender?
>> What is the biggest offender?
>>
>>
>> Hmmm. So I took the 3.5% compile time overhead from my last measurement
>> before the localizer landed, from around 24th of May.
>> When using -ftime-report, I see the Localizer pass typically taking
>> roughly 1% of compile time.
>> Maybe another part of GlobalISel became a bit slower since I did that
>> 3.5% measurement?
>> Or maybe the Localizer pass changes the structure of the program so that
>> another later pass gets a different compile time profile?
>> Basically, I'd have to do more experiments to figure that one out.
>>
>> As far as where time is spent in the gisel passes themselves, on average, I
>> saw the following on the latest CTMark experiment I ran:
>> Avg compile time spent in IRTranslator: 4.61%
>> Avg compile time spent in InstructionSelect: 7.51%
>> Avg compile time spent in Legalizer: 1.06%
>> Avg compile time spent in Localizer: 0.76%
>> Avg compile time spent in RegBankSelect: 2.12%
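
Summing the per-pass averages above gives a feel for how much of total compile time the GlobalISel pipeline accounts for in this experiment. A small Python sketch, with the dictionary name `pass_share_pct` chosen here for illustration:

```python
# Average share of total compile time per GlobalISel pass (percent),
# copied from the CTMark figures quoted above.
pass_share_pct = {
    "IRTranslator": 4.61,
    "InstructionSelect": 7.51,
    "Legalizer": 1.06,
    "Localizer": 0.76,
    "RegBankSelect": 2.12,
}

total_pct = sum(pass_share_pct.values())
# Largest contributor first.
ranked = sorted(pass_share_pct.items(), key=lambda kv: kv[1], reverse=True)

for name, pct in ranked:
    print(f"{name:<18}{pct:5.2f}%  ({pct / total_pct:6.1%} of the GlobalISel share)")
print(f"{'Total':<18}{total_pct:5.2f}%")
```

The five passes add up to roughly 16% of compile time, with InstructionSelect the largest contributor and the Localizer the smallest.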
>>
>>
>> * My usual performance benchmarking run: 8.5% slow-down. This was about
>> 9.5% before the Localizer pass landed, so a slight improvement.
>> * Code size: 3.14% larger. This was about 2.8% before the Localizer pass
>> landed, so a slight regression.
>>
>>
>> That one is surprising. Do you have an idea of what is happening?
>> Alternatively if you can point me to the biggest offender, I can have a
>> look.
>>
>>
>> So the biggest offenders on the mem_bytes metric in LNT are:
>>
>> Columns: (a) O0 -g, (b) O0 -g gisel-with-localizer,
>> (c) O0 -g gisel-without-localizer
>>
>>                                                              (a)    (b)    (c)
>> SingleSource/Benchmarks/Misc/perlin                        14272  14640  18344  25.95%
>> SingleSource/Benchmarks/Dhrystone/dry                      16560  17144  20160  18.21%
>> SingleSource/Benchmarks/Stanford/QueensProfile             13912  14192  15136   6.79%
>> MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url  71400  72272  75504   4.53%
>>
>> I haven't had time to investigate what exact changes make the code size
>> go up that much with the localizer pass in those cases...
>>
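
The last column of the table above is not labeled in the message. The Python sketch below checks one reading of it: the difference between the two GlobalISel columns, taken relative to the first (O0 -g) column. That formula is an inference from the numbers, not something the message states, and the helper name `reported_delta_pct` is made up here.

```python
# mem_bytes figures copied from the table above, in the order the columns
# appear in the message, plus the percentage given in the last column.
rows = [
    ("SingleSource/Benchmarks/Misc/perlin", 14272, 14640, 18344, 25.95),
    ("SingleSource/Benchmarks/Dhrystone/dry", 16560, 17144, 20160, 18.21),
    ("SingleSource/Benchmarks/Stanford/QueensProfile", 13912, 14192, 15136, 6.79),
    ("MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url", 71400, 72272, 75504, 4.53),
]


def reported_delta_pct(first, second, third):
    """Gap between the two GlobalISel columns, relative to the first column (assumed formula)."""
    return (third - second) / first * 100.0


for name, first, second, third, reported in rows:
    computed = round(reported_delta_pct(first, second, third), 2)
    print(f"{name}: computed {computed:.2f}%, reported {reported:.2f}%")
```

Under that reading, all four computed values agree with the reported percentages to two decimal places.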
>>
>> The only thing I can think of is that we duplicate constants that are
>> expensive to materialize. If that’s the case, we were discussing with Ahmed
>> an alternative to the localizer pass that would operate during
>> InstructionSelect, so it may be worth pursuing.
>>
>>
>>

Quentin Colombet via llvm-dev

2017-Jun-16 22:06 UTC

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

> On Jun 14, 2017, at 7:27 AM, Diana Picus <diana.picus at linaro.org> wrote:
>
> On 12 June 2017 at 18:54, Diana Picus <diana.picus at linaro.org> wrote:
> Hi all,
> 
> I added a buildbot [1] running the test-suite with -O0 -global-isel. It
> runs into the same 2 timeouts that I reported previously on this thread
> (paq8p and scimark2). It would be nice to make it green before flipping the
> switch.
>
>
> I did some more investigations on a machine similar to the one running the
> buildbot. For paq8p and scimark2, I get these results for O0:
> 
> PAQ8p:
> Fast isel: 666.344
> Global isel: 731.384
> 
> SciMark2-C:
> Fast isel: 463.908
> Global isel: 496.22
> 
> The current timeout is 500s (so in this particular case we didn't hit it
> for scimark2, and it ran successfully to completion). I don't think the
> difference between FastISel and GlobalISel is too atrocious, so I would
> propose increasing the timeout for these 2 benchmarks. I'm not sure if we
> can do this on a per-bot basis, but I see some precedent for setting custom
> timeout thresholds for various benchmarks on different architectures
> (sometimes with comments that it's done so we can run O0 on that particular
> benchmark).
>
> Something along these lines works:
> https://reviews.llvm.org/differential/diff/102547/
> 
> What do you guys think about this approach?
Looks reasonable to me.
> 
> Thanks,
> Diana
> 
> PS: The buildbot is using the Makefiles because that's what our other
> AArch64 test-suite bots use. Moving all of them to CMake is a transition
> for another time.
>
> At the moment, it lives in an internal buildmaster that I've set up for
> this purpose. If we fix it and it proves to be stable for a week or two,
> I'll move it to the public master.
> 
> Cheers,
> Diana
> 
> [1] http://master2.llvm.validation.linaro.org/builders/clang-cmake-aarch64-global-isel
> 
> 
> On 6 June 2017 at 19:11, Quentin Colombet <qcolombet at apple.com> wrote:
> Thanks Kristof.
>
> Sounds like we'll need to investigate, though I'd say it is not blocking
> the switch.
> 
> At this point I think everybody is on board to flip the switch.
> @Eric, how does that sound to you?
> 
> Thanks,
> Q
> 
> On 1 June 2017 at 07:46, Kristof Beyls <Kristof.Beyls at arm.com> wrote:
>
>>
>>> On 31 May 2017, at 17:07, Quentin Colombet <qcolombet at apple.com> wrote:
>>>>
>>>> Latest comparisons on my side, after picking up r304244, i.e. the
>>>> correct Localizer pass.
>>>> * CTMark compile time, comparing "-O0 -g" vs "-O0 -g -mllvm
>>>> -global-isel=true -mllvm -global-isel-abort=0": about 6% increase with
>>>> GlobalISel. This was about 3.5% before the Localizer pass landed.
>>>
>>> That one is surprising too. I wouldn’t have expected this pass to show up
>>> in the compile time profile. At least not to this extent.
>>> What is the biggest offender?
>> 
>> Hmmm. So I took the 3.5% compile time overhead from my last measurement
>> before the localizer landed, from around 24th of May.
>> When using -ftime-report, I see the Localizer pass typically taking
>> roughly 1% of compile time.
>> Maybe another part of GlobalISel became a bit slower since I did that
>> 3.5% measurement?
>> Or maybe the Localizer pass changes the structure of the program so that
>> another later pass gets a different compile time profile?
>> Basically, I'd have to do more experiments to figure that one out.
>>
>> As far as where time is spent in the gisel passes themselves, on average, I
>> saw the following on the latest CTMark experiment I ran:
>> Avg compile time spent in IRTranslator: 4.61%
>> Avg compile time spent in InstructionSelect: 7.51%
>> Avg compile time spent in Legalizer: 1.06%
>> Avg compile time spent in Localizer: 0.76%
>> Avg compile time spent in RegBankSelect: 2.12%
>> 
>>> 
>>>> * My usual performance benchmarking run: 8.5% slow-down. This was about
>>>> 9.5% before the Localizer pass landed, so a slight improvement.
>>>> * Code size: 3.14% larger. This was about 2.8% before the Localizer pass
>>>> landed, so a slight regression.
>>>
>>> That one is surprising. Do you have an idea of what is happening?
>>> Alternatively if you can point me to the biggest offender, I can have a
>>> look.
>> 
>> So the biggest offenders on the mem_bytes metric in LNT are:
>>
>> Columns: (a) O0 -g, (b) O0 -g gisel-with-localizer,
>> (c) O0 -g gisel-without-localizer
>>
>>                                                              (a)    (b)    (c)
>> SingleSource/Benchmarks/Misc/perlin                        14272  14640  18344  25.95%
>> SingleSource/Benchmarks/Dhrystone/dry                      16560  17144  20160  18.21%
>> SingleSource/Benchmarks/Stanford/QueensProfile             13912  14192  15136   6.79%
>> MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url  71400  72272  75504   4.53%
>>
>> I haven't had time to investigate what exact changes make the code size
>> go up that much with the localizer pass in those cases...
>> 
>>> 
>>> The only thing I can think of is that we duplicate constants that are
>>> expensive to materialize. If that’s the case, we were discussing with
>>> Ahmed an alternative to the localizer pass that would operate during
>>> InstructionSelect, so it may be worth pursuing.
>> 
> 
> 

Quentin Colombet via llvm-dev

2017-Jun-16 23:43 UTC

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

Hi all,

We had some internal discussions about flipping the default for O0 and we
concluded that we wanted to postpone it.


*** Why Is That? ***

We don’t want to send the wrong message that GlobalISel’s design is set in stone
and ready for broader adoption.
In particular,
1. The APIs are still evolving and can still possibly change significantly
2. The TableGen backend to reuse the existing SD patterns is still at an early
stage
3. We want to investigate closely the performance of global-isel (compile-time,
runtime, code size, fallbacks)

The rationale behind those items is that we want to minimize the pain of moving
forward for everybody. We also want the out-of-the-box experience to be pleasant
(like all/most of the tablegen patterns just work, we have documentation on how
to target a new backend, etc.) Finally, we want to gain confidence we are going
to be able to address the performance issues we have with the current design and
if not, derive a plan for that.

We purposely left out of the conversation what will be the right time and
requirements to flip the switch. We want to gather more data first. Your help
would be appreciated!


*** Short-Term Proposal ***

What we would like to do instead short-term is:
A. Repurpose or create an option “-aarch64-enable-global-isel-at-O” to enable
GISel with fallbacks and warnings enabled (i.e., equivalent of -global-isel
-global-isel-abort=2)
B. Advertise this option in the next open source release to allow compiler
enthusiasts to try it and report problems
C. Have GISel always built so we can push things in the right place,
MachineVerifier in mind, and stop doing some weird gymnastics

What do people think?
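
For anyone who wants to try this before such an option exists, here is a minimal Python sketch that only builds (and prints, but does not run) a clang command line using the -mllvm flags already quoted in this thread. The helper name `gisel_clang_command` and the source file name `test.c` are placeholders invented for this example. It also assumes a clang that targets AArch64 and was built with GlobalISel support.

```python
import shlex


def gisel_clang_command(source, abort_mode=2, compiler="clang", time_report=False):
    """Build (not run) a clang -O0 command line that enables GlobalISel via -mllvm.

    abort_mode is passed through to -global-isel-abort; the thread mentions
    0 (fallback) and 2 (fallback with warnings).
    """
    cmd = [
        compiler, "-O0", "-g",
        "-mllvm", "-global-isel=true",
        "-mllvm", f"-global-isel-abort={abort_mode}",
    ]
    if time_report:
        # -ftime-report is the flag used earlier in the thread to see per-pass timings.
        cmd.append("-ftime-report")
    cmd += ["-c", source]
    return cmd


print(shlex.join(gisel_clang_command("test.c")))
print(shlex.join(gisel_clang_command("test.c", abort_mode=0, time_report=True)))
```

Returning the argument list rather than a shell string keeps the sketch easy to hand to `subprocess.run` later without quoting problems.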


*** Your Help Is Needed ***

- Please share your experience in using the GISel APIs and how we can make them
better. Moving forward we’ll have those conversations in the open instead of
internally/with a narrower audience.
- Report any performance problem you identify
- Propose patches!

Cheers,
-Quentin


