thr3ads.net - llvm dev - [llvm-dev] Questions about code-size optimizations in ARM backend [Nov 2017]

Gabor Ballabas via llvm-dev

2017-Nov-07 17:02 UTC

[llvm-dev] Questions about code-size optimizations in ARM backend

Hi All,

I started to work on code-size improvements on ARM target by comparing 
GCC and LLVM generated code.
My first candidate was switch-case lowering.
I also created a Bugzilla issue for this topic: 
https://bugs.llvm.org/show_bug.cgi?id=34902
The full example code and the generated assembly for GCC and for LLVM is 
in the Bugzilla issue.

My first idea was to simplify the following instruction pattern
    lsl     r0, r0, #2
    ldr     pc, [r0, r1]
to this:
    ldr     pc, [r1, r0, lsl #2]

but then I got really confused when I started to look into the 
machine-dependent optimization passes in the backend.

I get a dump with the '-print-machineinstrs' option from the 
MachineFunctionPass and I can see these instructions in the beginning of 
the passes

    %vreg2<def> = MOVsi %vreg1, 18, pred:14, pred:%noreg, opt:%noreg; GPR:%vreg2,%vreg1
    %vreg3<def> = LEApcrelJT <jt#0>, pred:14, pred:%noreg; GPR:%vreg3
    BR_JTm %vreg2<kill>, %vreg3<kill>, 0, <jt#0>; mem:LD4[JumpTable] GPR:%vreg2,%vreg3

and these at the end

    %R0<def> = MOVsi %R0<kill>, 18, pred:14, pred:%noreg, opt:%noreg
    %R1<def> = LEApcrelJT <jt#0>, pred:14, pred:%noreg
    BR_JTm %R0<kill>, %R1<kill>, 0, <jt#0>; mem:LD4[JumpTable]

So basically I want to catch the pattern with the possible 
simplification using the shifter,
but I'm not even sure that I am looking into this issue at the right 
optimization level.
Maybe this idea should be implemented in a higher level, or as a fixup 
in ARMConstantIslands,
like the Thumb jumptable optimizations mentioned in the Bugzilla issue.

I hope someone more familiar with this part of the backend can give me 
some pointers about how to proceed with this idea
( or why it is complete rubbish in the first place :) )


Best regards,

Gabor Ballabas
Software Developer
Department of Software Engineering,
University of Szeged,
Hungary

Momchil Velikov via llvm-dev

2017-Nov-07 19:35 UTC

[llvm-dev] Questions about code-size optimizations in ARM backend

On Tue, Nov 7, 2017 at 5:02 PM, Gabor Ballabas via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> I started to work on code-size improvements on ARM target by comparing GCC
> and LLVM generated code.
> My first candidate was switch-case lowering.
> I also created a Bugzilla issue for this topic:
> https://bugs.llvm.org/show_bug.cgi?id=34902
> The full example code and the generated assembly for GCC and for LLVM is in
> the Bugzilla issue.
>
> My first idea was to simplify the following instruction pattern
>         lsl     r0, r0, #2
>         ldr     pc, [r0, r1]
> to this:
>         ldr     pc, [r1, r0, lsl #2]
>
Your post prompted me to finally send this patch that I had lying
around since Jan/Feb :/ (https://reviews.llvm.org/D39752)

The LDR-with-shift instruction that we want to emit is defined in
ARMInstrInfo.td:2583

    defm LDR  : AI_ldr1<0, "ldr", IIC_iLoad_r, IIC_iLoad_si, load>;

and then we have in ARMInstrInfo.td:1807

    multiclass AI_ldr1<bit isByte, string opc, InstrItinClass iii,
               InstrItinClass iir, PatFrag opnode> {

       ...
      def rs : AI2ldst<0b011, 1, isByte, (outs GPR:$Rt), (ins ldst_so_reg:$shift),
                       AddrModeNone, LdFrm, iir, opc, "\t$Rt, $shift",
                       [(set GPR:$Rt, (opnode ldst_so_reg:$shift))]> {
       ...


The operands of this instruction have to match `ldst_so_reg`, and that
matching is eventually done in `ARMDAGToDAGISel::SelectLdStSOReg`.
So my approach was to rearrange the operands so that
`ARMDAGToDAGISel::SelectLdStSOReg` can find what it is looking for.
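
To illustrate why operand order can matter for this kind of matching, here is a minimal toy model in Python. It is not the code from D39752 and not a description of what `SelectLdStSOReg` really accepts: the node classes, the deliberately strict matcher `match_shifted_offset`, and the helper `rearrange` are all hypothetical names made up for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Reg:
    name: str


@dataclass(frozen=True)
class Shl:
    value: Reg      # register being shifted left
    amount: int     # constant shift amount


@dataclass(frozen=True)
class Add:
    lhs: Union[Reg, Shl]
    rhs: Union[Reg, Shl]


def match_shifted_offset(addr: Add) -> Optional[str]:
    """Toy matcher (assumption): accepts only base + (index << n),
    with the shift on the right-hand side of the add."""
    if isinstance(addr.lhs, Reg) and isinstance(addr.rhs, Shl):
        return f"[{addr.lhs.name}, {addr.rhs.value.name}, lsl #{addr.rhs.amount}]"
    return None


def rearrange(addr: Add) -> Add:
    """Swap the operands of the commutative add so the shift ends up on the right."""
    if isinstance(addr.lhs, Shl) and isinstance(addr.rhs, Reg):
        return Add(addr.rhs, addr.lhs)
    return addr


shift_first = Add(Shl(Reg("r0"), 2), Reg("r1"))
print(match_shifted_offset(shift_first))             # None
print(match_shifted_offset(rearrange(shift_first)))  # [r1, r0, lsl #2]
```

The address computed is the same either way, since addition is commutative; only the shape presented to the matcher changes.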

~chill

Friedman, Eli via llvm-dev

2017-Nov-07 20:08 UTC

[llvm-dev] Questions about code-size optimizations in ARM backend

On 11/7/2017 9:02 AM, Gabor Ballabas wrote:
>
> Hi All,
>
> I started to work on code-size improvements on ARM target by comparing 
> GCC and LLVM generated code.
> My first candidate was switch-case lowering.
> I also created a Bugzilla issue for this topic: 
> https://bugs.llvm.org/show_bug.cgi?id=34902
> The full example code and the generated assembly for GCC and for LLVM 
> is in the Bugzilla issue.
>
> My first idea was to simplify the following instruction pattern
>     lsl     r0, r0, #2
>     ldr     pc, [r0, r1]
> to this:
>     ldr     pc, [r1, r0, lsl #2]
>
> but then I got really confused when I started to look into the 
> machine-dependent optimization passes in the backend.
>
> I get a dump with the '-print-machineinstrs' option from the 
> MachineFunctionPass and I can see these instructions in the beginning 
> of the passes
>
>     %vreg2<def> = MOVsi %vreg1, 18, pred:14, pred:%noreg, opt:%noreg; GPR:%vreg2,%vreg1
>     %vreg3<def> = LEApcrelJT <jt#0>, pred:14, pred:%noreg; GPR:%vreg3
>     BR_JTm %vreg2<kill>, %vreg3<kill>, 0, <jt#0>; mem:LD4[JumpTable] GPR:%vreg2,%vreg3
>
> and these at the end
>
>     %R0<def> = MOVsi %R0<kill>, 18, pred:14, pred:%noreg, opt:%noreg
>     %R1<def> = LEApcrelJT <jt#0>, pred:14, pred:%noreg
>     BR_JTm %R0<kill>, %R1<kill>, 0, <jt#0>; mem:LD4[JumpTable]
>

"lsl r0, r0, #2" is an alias for "mov r0, r0, lsl #2", which is the
MachineInstr "MOVsi".
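
As a side note on the dump above, the immediate 18 on MOVsi can be read as a packed shift operand. The sketch below assumes the packing I recall from LLVM's ARM addressing-mode helpers (shift kind in the low three bits, shift amount in the bits above, kinds numbered in the order listed); treat that layout and the Python names as assumptions, although the result does agree with the "lsl #2" in the assembly.

```python
# Assumed numbering of shift kinds (list index = numeric value).
SHIFT_KINDS = ["no_shift", "asr", "lsl", "lsr", "ror", "rrx"]

# pred:14 in the dump is the "always" (AL) condition code.
CONDITION_AL = 14


def encode_shift_operand(kind: str, amount: int) -> int:
    """Pack a shift kind and an immediate amount into one integer."""
    return SHIFT_KINDS.index(kind) | (amount << 3)


def decode_shift_operand(value: int) -> tuple:
    """Unpack the integer into (shift kind, shift amount)."""
    return SHIFT_KINDS[value & 7], value >> 3


print(decode_shift_operand(18))  # ('lsl', 2)
```

Under that assumption, "MOVsi %vreg1, 18" is "mov rd, rm, lsl #2", i.e. the index multiplied by the 4-byte jump-table entry size.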

LEApcrelJT and BR_JTm are pseudo-instructions which correspond to "adr"
and "ldr" respectively.  We use a special opcode for the jump-table
address because we have to do some extra work in ARMConstantIslands for
instructions which use constant pools.  We use a special opcode for the
load so we can mark it as a branch (which matters for modeling the CFG).

> So basically I want to catch the pattern with the possible
> simplification using the shifter,
> but I'm not even sure that I am looking into this issue at the right 
> optimization level.
> Maybe this idea should be implemented in a higher level, or as a fixup 
> in ARMConstantIslands,
> like the Thumb jumptable optimizations mentioned in the Bugzilla issue.
>
> I hope someone more familiar with this part of the backend can give me 
> some pointers about how to proceed with this idea
> ( or why it is complete rubbish in the first place :) )
>
If you just want to pull the shift into the load, you can probably get 
away with just messing with instruction selection for BR_JTm. There's 
actually a FIXME in ARMInstrInfo.td which is relevant ("FIXME: This 
shouldn't use the generic addrmode2, but rather be split into i12 and rs 
suffixed versions.")

If you want to do the fancy version where "pc" is part of the addressing
mode, you probably need to do something in ARMConstantIslands (since the
transform requires the jump table to be placed directly after the jump).
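
A rough sketch of the placement constraint, assuming ARM (A32) mode, where reading pc as an operand yields the address of the current instruction plus 8, and assuming 4-byte table entries. The function names and the example address are made up for illustration.

```python
ARM_PC_READ_OFFSET = 8   # A32: pc reads as the instruction's own address + 8
ENTRY_SIZE = 4           # assumed size of one jump-table entry, in bytes


def jump_table_entry_address(ldr_address: int, index: int) -> int:
    """Address loaded by `ldr pc, [pc, rIndex, lsl #2]` located at ldr_address."""
    return ldr_address + ARM_PC_READ_OFFSET + ENTRY_SIZE * index


def required_table_start(ldr_address: int) -> int:
    """Where entry 0 has to live for the pc-relative form to work."""
    return jump_table_entry_address(ldr_address, 0)


print(hex(required_table_start(0x1000)))         # 0x1008
print(hex(jump_table_entry_address(0x1000, 3)))  # 0x1014
```

Because the base is fixed by the position of the load itself, the table cannot be moved independently of the jump, which is why a pass that decides final placement would have to be involved.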

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux
Foundation Collaborative Project

Gabor Ballabas via llvm-dev

2017-Nov-08 14:00 UTC

[llvm-dev] Questions about code-size optimizations in ARM backend

Seeing that Momchil already has a patch in Phabricator for the shift
elimination, I think I'm going to proceed with the "pc"-related
addressing in ARMConstantIslands.

Thanks for the advice!

Best regards,
Gabor Ballabas


Tim Northover via llvm-dev

2017-Nov-08 17:08 UTC

[llvm-dev] Questions about code-size optimizations in ARM backend

Hi Gabor,

On 7 November 2017 at 17:02, Gabor Ballabas via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
> The full example code and the generated assembly for GCC and for LLVM is in
> the Bugzilla issue.

I notice that this discussion seems focused around ARM-mode
instructions. Is that intentional? In my experience everyone that
actually cares about code size is using Thumb mode (mostly because
they're on M-class CPUs).

It might be intentional, but I just wanted to make sure before a lot
of effort was spent on marginal cases.

Cheers.

Tim.
