Hiroshi 7 Inoue via llvm-dev
2017-May-09 13:05 UTC
[llvm-dev] RFC: SROA for method argument
Hi, I am working to improve SROA to generate better code when a method has a struct in its arguments. I would appreciate any suggestions or comments on how best to proceed with this optimization.

* Problem *
I observed that LLVM often generates redundant instructions around glibc’s istreambuf_iterator. The problem comes from scalar replacement (SROA) for methods with an aggregate as an argument. Here is a simplified example in C:

struct record {
  long long a;
  int b;
  int c;
};

int func(struct record r) {
  for (int i = 0; i < r.c; i++)
    r.b++;
  return r.b;
}

When updating r.b (or r.c), SROA generates redundant instructions on some platforms (such as x86_64 and ppc64); here, r.b and r.c are packed into one 64-bit GPR when the struct is passed as a method argument. The problem arises when the same memory location is accessed by load/store instructions of different types.
For this example, Clang generates the following IR to initialize the struct on ppc64 and x86_64. On both platforms, the 64-bit values are first stored into memory allocated by an alloca; later, the same memory location is accessed as 32-bit integer values (r.b and r.c).

for ppc64
%struct.record = type { i64, i32, i32 }

define signext i32 @ppc64le_func([2 x i64] %r.coerce) #0 {
entry:
  %r = alloca %struct.record, align 8
  %0 = bitcast %struct.record* %r to [2 x i64]*
  store [2 x i64] %r.coerce, [2 x i64]* %0, align 8
  ....

for x86_64
define i32 @x86_64_func(i64 %r.coerce0, i64 %r.coerce1) #0 {
entry:
  %r = alloca %struct.record, align 8
  %0 = bitcast %struct.record* %r to { i64, i64 }*
  %1 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %0, i32 0, i32 0
  store i64 %r.coerce0, i64* %1, align 8
  %2 = getelementptr inbounds { i64, i64 }, { i64, i64 }* %0, i32 0, i32 1
  store i64 %r.coerce1, i64* %2, align 8
  ....

For such a code sequence, the current SROA generates instructions that update only the upper (or lower) half of the 64-bit value when storing r.b (or r.c).
SROA can split an i64 value into two i32 values under some conditions (e.g., when the struct contains only int b and int c in this example), but it cannot split more complex cases.

* Approach *
In the SROA pass, AggLoadStoreRewriter splits a load or store of an aggregate into multiple loads or stores of simple values. In the ppc64 case above, the store of [2 x i64] is split into two stores of i64. I am extending AggLoadStoreRewriter to split a store of an aggregate that comes from a method argument based on the layout of the aggregate (here, { i64, i32, i32 } instead of [2 x i64]). I have submitted a work-in-progress patch on Phabricator (https://reviews.llvm.org/D32998). This optimization depends on the ABI, so I have enabled it only for ppc64 with the ELFv2 ABI so far.

I truly appreciate any advice and comments.

Best regards, Hiroshi

-----
Hiroshi Inoue <inouehrs at jp.ibm.com>
IBM Research - Tokyo
On 5/9/2017 6:05 AM, Hiroshi 7 Inoue via llvm-dev wrote:
> [quoted message elided]

When there are accesses of mixed type to an alloca, SROA just treats the whole alloca as a big integer, and generates PHI nodes appropriately. In many cases, instcombine would then slice up the generated PHI nodes to use more appropriate types, but that doesn't work out here. (See InstCombiner::SliceUpIllegalIntegerPHI.)

Probably the right solution is to make instcombine more aggressive here; it's hard to come up with a generally useful transform in SROA without reasoning about control flow.

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project
Hiroshi 7 Inoue via llvm-dev
2017-May-11 13:39 UTC
[llvm-dev] RFC: SROA for method argument
Thank you for the advice.

InstCombiner::SliceUpIllegalIntegerPHI splits a PHI node only when its output is used solely by trunc or by lshr+trunc (and the integer size is not legal), so it cannot handle an update to a field. If we extend SliceUpIllegalIntegerPHI to support more generic cases, I feel it may become similar to the split functionality in SROA.
Extending the split functionality (in SROA or InstCombiner) would give higher coverage, but it requires complicated (and potentially costly) analysis. My approach of splitting arguments in SROA preprocessing aims to cover a common case without such complicated analysis.
Which approach is more preferable/acceptable to the community?

[FYI] Here I attach the IR generated by SROA for the example in my first mail. The output of the PHI (%r.sroa.2.0) is used by the and instruction.

define signext i32 @_Z4func6record([2 x i64] %r.coerce) #0 {
entry:
  %r.coerce.fca.0.extract = extractvalue [2 x i64] %r.coerce, 0
  %r.coerce.fca.1.extract = extractvalue [2 x i64] %r.coerce, 1
  br label %for.cond

for.cond:                                         ; preds = %for.body, %entry
  %r.sroa.2.0 = phi i64 [ %r.coerce.fca.1.extract, %entry ], [ %r.sroa.2.8.insert.insert, %for.body ]
  %i.0 = phi i32 [ 0, %entry ], [ %inc1, %for.body ]
  %conv = sext i32 %i.0 to i64
  %cmp = icmp slt i64 %conv, %r.coerce.fca.0.extract
  br i1 %cmp, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond
  %r.sroa.2.8.extract.trunc6 = trunc i64 %r.sroa.2.0 to i32
  ret i32 %r.sroa.2.8.extract.trunc6

for.body:                                         ; preds = %for.cond
  %r.sroa.2.8.extract.trunc = trunc i64 %r.sroa.2.0 to i32
  %inc = add nsw i32 %r.sroa.2.8.extract.trunc, 1
  %r.sroa.2.8.insert.ext = zext i32 %inc to i64
  %r.sroa.2.8.insert.mask = and i64 %r.sroa.2.0, -4294967296
  %r.sroa.2.8.insert.insert = or i64 %r.sroa.2.8.insert.mask, %r.sroa.2.8.insert.ext
  %inc1 = add nsw i32 %i.0, 1
  br label %for.cond
}

-----
Hiroshi Inoue <inouehrs at jp.ibm.com>
IBM Research - Tokyo

"Friedman, Eli" <efriedma at codeaurora.org> wrote on
2017/05/10 02:53:09:
> [quoted message elided]
I'll propose a different heuristic: SROA should ignore stores of arguments into allocas in the entry block when deciding what slices to form. Such stores happen exactly once, and they are usually coercions that we have to do for ABI reasons. SROA should generate code like this before promoting allocas to SSA form:

define i32 @func(i64 %r.coerce.0, i64 %r.coerce.1) {
  %r.slice.0 = alloca i64
  %r.slice.1 = alloca i32
  %r.slice.2 = alloca i32
  store i64 %r.coerce.0, i64* %r.slice.0
  %r.1.shr = lshr i64 %r.coerce.1, 32
  %r.1 = trunc i64 %r.1.shr to i32
  %r.2 = trunc i64 %r.coerce.1 to i32
  store i32 %r.1, i32* %r.slice.1
  store i32 %r.2, i32* %r.slice.2
  ...
}

This is basically "reasoning about the CFG" without actually looking at loop info: stores of arguments in the entry block can't be in a loop. Even if they end up in one after inlining, instcombine should be able to simplify the {i32,i32}->i64->{i32,i32} code.

On Tue, May 9, 2017 at 10:53 AM, Friedman, Eli via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> [quoted thread elided]