thr3ads.net - llvm dev - [llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try! [May 2017]


Quentin Colombet via llvm-dev

2017-May-24 17:31 UTC

[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

Hi Kristof,

Thanks for the measurements.
> On May 24, 2017, at 6:00 AM, Kristof Beyls <kristof.beyls at arm.com> wrote:
> 
>> 
>> On 23 May 2017, at 21:48, Quentin Colombet <qcolombet at apple.com> wrote:
>> 
>> Great!
>> I thought I had to look at our pipeline at O0 to make sure the optimized
>> regalloc was supported (with https://bugs.llvm.org/show_bug.cgi?id=33022
>> in mind). Glad I was wrong; it saves me some time.
>> 
>>> On May 22, 2017, at 12:51 AM, Kristof Beyls <kristof.beyls at arm.com> wrote:
>>> 
>>> 
>>>> On 22 May 2017, at 09:09, Diana Picus <diana.picus at linaro.org> wrote:
>>>> 
>>>> Hi Quentin,
>>>> 
>>>> I actually did a run with -mllvm -optimize-regalloc -mllvm
>>>> -regalloc=greedy over the weekend, and the test does pass with that.
>>>> I haven't measured the compile time, though.
>>>> 
>>>> Cheers,
>>>> Diana
>>> 
>>> I also did my usual benchmarking run with the same options as Diana
>>> did above:
>>> - Comparing against -O0 without globalisel: 2.5% performance drop, 0.8%
>>> code size improvement.
>> 
>> That’s compared to a 9.5% performance drop and a 2.8% code size
>> regression without that regalloc scheme, right?
> 
> Indeed.
> 
>> 
>>> - Comparing against -O0 without globalisel but with the above regalloc
>>> options: 5.6% performance drop, 1% code size drop.
>>>
>>> In summary, the measurements indicate some good improvements.
>>> I also haven't measured the impact on compile time.
>> 
>> Do you have a means to make this measurement?
>> Ahmed did a bunch of compile time measurements on our side, and I wanted
>> to see if I need to put him on the hook again :).
> 
> I did a quick setup with CTMark (part of the test-suite). I ran each of
> * '-O0 -g',
> * '-O0 -g -mllvm -global-isel=true -mllvm -global-isel-abort=0', and
> * '-O0 -g -mllvm -global-isel=true -mllvm -global-isel-abort=0 -mllvm
>   -optimize-regalloc -mllvm -regalloc=greedy'
> 5 times, cross-compiling from X86 to AArch64, and took the median measured
> compile times.
> In summary, I see GlobalISel having a compile time that's 3.5% higher
> than the current -O0 default.
> With the greedy register allocator enabled, this increases to 28%.
> 28% is probably too high?
I think it is, yes.
I have attached a quick hack that adds a fast mode to the greedy allocator.
Could you give it a try?

To enable the fast mode, please use (-mllvm) -regalloc-greedy-fast=true
(default is false).
> At the moment I can't think of an alternative to having a "constant
> materialization localizer" pass at -O0 to hit all the metrics we thought
> of as necessary before enabling GISel by default.
> 
> It would be good if someone else could also do a compilation time
> experiment, just to make sure I didn't make any silly mistakes in my
> experiment.
> 
> Here are the details I see:
> 
> Benchmark                                  gisel  gisel+greedy
> CTMark/7zip/7zip-benchmark                102.8%  106.5%
> CTMark/Bullet/bullet                      100.5%  105.1%
> CTMark/ClamAV/clamscan                    101.6%  130.8%
> CTMark/SPASS/SPASS                        101.2%  120.0%
> CTMark/consumer-typeset/consumer-typeset  105.7%  138.2%
> CTMark/kimwitu++/kc                       103.1%  122.6%
> CTMark/lencod/lencod                      106.2%  143.4%
> CTMark/mafft/pairlocalalign                96.2%  135.4%
> CTMark/sqlite3/sqlite3                    109.1%  155.1%
> CTMark/tramp3d-v4/tramp3d-v4              109.1%  132.0%
> GEOMEAN                                   103.5%  128.0%
> 
> 
> Thanks,
> 
> Kristof
Thanks,
-Quentin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: regalloc-fastmode.diff
Type: application/octet-stream
Size: 2893 bytes
Desc: not available
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20170524/77fd12b2/attachment.obj>
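The measurement method described above (each configuration compiled 5 times, the median compile time taken, then expressed relative to plain -O0) can be sketched in Python for a single source file. This is only an illustration under assumptions: the real CTMark numbers come from the LLVM test-suite's own build and timing setup, and the `clang --target=aarch64-linux-gnu -c ... -o /dev/null` invocation, the helper names, and the `timer` hook are hypothetical. A real cross-compile may also need a sysroot for the target's headers.

```python
import statistics
import subprocess
import time

# The three configurations compared in the thread. Each LLVM backend option
# needs its own "-mllvm" so that clang forwards it to LLVM.
CONFIGS = {
    "O0": ["-O0", "-g"],
    "gisel": ["-O0", "-g",
              "-mllvm", "-global-isel=true",
              "-mllvm", "-global-isel-abort=0"],
    "gisel+greedy": ["-O0", "-g",
                     "-mllvm", "-global-isel=true",
                     "-mllvm", "-global-isel-abort=0",
                     "-mllvm", "-optimize-regalloc",
                     "-mllvm", "-regalloc=greedy"],
}


def compile_command(flags, source, triple="aarch64-linux-gnu"):
    """Assumed cross-compile invocation (hypothetical; adjust triple/sysroot)."""
    return ["clang", "--target=" + triple, "-c", source, "-o", "/dev/null", *flags]


def time_once(cmd):
    """Wall-clock seconds for one compile; raises if the compiler fails."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def median_time(cmd, runs=5, timer=time_once):
    """Median of `runs` timings, which damps one-off outliers on a noisy host."""
    return statistics.median(timer(cmd) for _ in range(runs))


def relative_to_baseline(source, runs=5, timer=time_once):
    """Median compile time of each config as a percentage of plain -O0."""
    medians = {
        name: median_time(compile_command(flags, source), runs, timer)
        for name, flags in CONFIGS.items()
    }
    baseline = medians["O0"]
    return {name: 100.0 * t / baseline for name, t in medians.items()}
```

Calling `relative_to_baseline("foo.c")` (a hypothetical file) returns a dict keyed by configuration name, in the same percent-of-baseline form as the table. The `timer` parameter exists only so the timing source can be swapped out, for example to check the arithmetic without a compiler installed.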

Kristof Beyls via llvm-dev

2017-May-24 19:57 UTC


[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

On 24 May 2017, at 19:31, Quentin Colombet <qcolombet at apple.com> wrote:


> I think it is, yes.
> I have attached a quick hack that adds a fast mode to the greedy allocator.
> Could you give it a try?
>
> To enable the fast mode, please use (-mllvm) -regalloc-greedy-fast=true
> (default is false).

I'm afraid it doesn't seem to save much compile time. By geomean, I see
about a 26% compile-time increase over the current -O0 default (compared to
a 28% increase for the greedy register allocator without your patch).
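The geomean figures quoted here are consistent with taking the geometric mean of the per-benchmark ratios, exp(mean(log(ratio))), which suits normalized ratios because a 2x slowdown and a 2x speedup cancel out. Assuming that is how the GEOMEAN row of the CTMark table was produced, the short Python sketch below recomputes it from the table's own values and reproduces 103.5% and 128.0%; the dictionary and function names are illustrative.

```python
import math

# Compile time relative to the plain -O0 default, in percent,
# copied from the CTMark table in this thread.
GISEL = {
    "CTMark/7zip/7zip-benchmark": 102.8,
    "CTMark/Bullet/bullet": 100.5,
    "CTMark/ClamAV/clamscan": 101.6,
    "CTMark/SPASS/SPASS": 101.2,
    "CTMark/consumer-typeset/consumer-typeset": 105.7,
    "CTMark/kimwitu++/kc": 103.1,
    "CTMark/lencod/lencod": 106.2,
    "CTMark/mafft/pairlocalalign": 96.2,
    "CTMark/sqlite3/sqlite3": 109.1,
    "CTMark/tramp3d-v4/tramp3d-v4": 109.1,
}
GISEL_GREEDY = {
    "CTMark/7zip/7zip-benchmark": 106.5,
    "CTMark/Bullet/bullet": 105.1,
    "CTMark/ClamAV/clamscan": 130.8,
    "CTMark/SPASS/SPASS": 120.0,
    "CTMark/consumer-typeset/consumer-typeset": 138.2,
    "CTMark/kimwitu++/kc": 122.6,
    "CTMark/lencod/lencod": 143.4,
    "CTMark/mafft/pairlocalalign": 135.4,
    "CTMark/sqlite3/sqlite3": 155.1,
    "CTMark/tramp3d-v4/tramp3d-v4": 132.0,
}


def geomean_percent(percentages):
    """Geometric mean of ratios, exp(mean(log(ratio))), reported in percent."""
    logs = [math.log(p / 100.0) for p in percentages]
    return 100.0 * math.exp(math.fsum(logs) / len(logs))


if __name__ == "__main__":
    print(f"gisel:        {geomean_percent(GISEL.values()):.1f}%")
    print(f"gisel+greedy: {geomean_percent(GISEL_GREEDY.values()):.1f}%")
```

An arithmetic mean of the same gisel+greedy column would come out higher, because the large outliers such as sqlite3 would not be damped by the logarithm.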




Quentin Colombet via llvm-dev

2017-May-24 20:01 UTC


[llvm-dev] [GlobalISel][AArch64] Toward flipping the switch for O0: Please give it a try!

Hi Kristof,

Thanks for getting back to me so fast!
> On May 24, 2017, at 12:57 PM, Kristof Beyls <kristof.beyls at arm.com> wrote:
> I'm afraid it doesn't seem to save much compile time. By geomean, I see
> about a 26% compile-time increase over the current -O0 default (compared to
> a 28% increase for the greedy register allocator without your patch).
Interesting. I guess a lot of time is spent in the coalescer. Could you give it
a try with -join-liveintervals=false?

Do you know where the time is spent (-time-passes)?

Anyhow, although I think fixing all of those is the right approach, it will
take time, so we can go with the localizer.

Cheers,
-Quentin 
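The follow-up experiments raised in this thread (the fast mode from the patch, turning off copy coalescing with -join-liveintervals=false, and a -time-passes report) are all LLVM backend options, so with clang each one has to go through its own -mllvm. A small Python sketch that builds those flag lists on top of the gisel+greedy configuration is below; the helper names are hypothetical, and -regalloc-greedy-fast=true is only recognized with Quentin's regalloc-fastmode.diff applied, not by an unpatched LLVM.

```python
# LLVM options of the "gisel+greedy" configuration from the thread,
# without the clang-side "-mllvm" prefixes.
GISEL_GREEDY_OPTS = [
    "-global-isel=true",
    "-global-isel-abort=0",
    "-optimize-regalloc",
    "-regalloc=greedy",
]


def mllvm(*opts):
    """Prefix every LLVM option with -mllvm, since -mllvm forwards one option."""
    args = []
    for opt in opts:
        args += ["-mllvm", opt]
    return args


def experiment_flags(extra_llvm_opts=()):
    """clang flags: -O0 -g, the gisel+greedy options, then any extra options."""
    return ["-O0", "-g", *mllvm(*GISEL_GREEDY_OPTS, *extra_llvm_opts)]


EXPERIMENTS = {
    # Only recognized with regalloc-fastmode.diff applied; not an upstream flag.
    "greedy-fast": experiment_flags(["-regalloc-greedy-fast=true"]),
    # Disable copy coalescing to gauge how much time the coalescer takes.
    "no-coalescing": experiment_flags(["-join-liveintervals=false"]),
    # Ask LLVM to print a per-pass timing report.
    "time-passes": experiment_flags(["-time-passes"]),
}

if __name__ == "__main__":
    for name, flags in EXPERIMENTS.items():
        print(name + ":", " ".join(flags))
```

Each list can be appended to the same clang cross-compile command used for the baseline runs, so that only the backend options differ between measurements.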
