Nirav Davé via llvm-dev
2018-Sep-11 15:16 UTC
[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores
Andres:

FWIW, codegen will do the merge if you turn on global alias analysis for
it, "-combiner-global-alias-analysis". That said, we should be able to
do this merging earlier.

-Nirav

On Mon, Sep 10, 2018 at 8:33 PM, Andres Freund via llvm-dev
<llvm-dev at lists.llvm.org> wrote:

> Hi,
>
> On 2018-09-10 13:42:21 -0700, Andres Freund wrote:
> > I have, in postgres, a piece of IR that, after inlining and constant
> > propagation, boils (when cooked on really high heat) down to the
> > following (also attached for your convenience):
> >
> >   source_filename = "pg"
> >   target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
> >   target triple = "x86_64-pc-linux-gnu"
> >
> >   define void @evalexpr_0_0(i8* align 8 noalias, i32* align 8 noalias) {
> >   entry:
> >     %a01 = getelementptr i8, i8* %0, i16 0
> >     store i8 0, i8* %a01
> >
> >     ; in the real case this also loads data
> >     %b01 = getelementptr i32, i32* %1, i16 0
> >     store i32 0, i32* %b01
> >
> >     %a02 = getelementptr i8, i8* %0, i16 1
> >     store i8 0, i8* %a02
> >
> >     ; in the real case this also loads data
> >     %b02 = getelementptr i32, i32* %1, i16 1
> >     store i32 0, i32* %b02
> >
> >     ; in the real case this also loads data
> >     %a03 = getelementptr i8, i8* %0, i16 2
> >     store i8 0, i8* %a03
> >
> >     ; in the real case this also loads data
> >     %b03 = getelementptr i32, i32* %1, i16 2
> >     store i32 0, i32* %b03
> >
> >     %a04 = getelementptr i8, i8* %0, i16 3
> >     store i8 0, i8* %a04
> >
> >     ; in the real case this also loads data
> >     %b04 = getelementptr i32, i32* %1, i16 3
> >     store i32 0, i32* %b04
> >
> >     ret void
> >   }
> >
> > So, here we finally come to my question: Is it really expected that,
> > unless largely independent optimizations (SLP in this case) happen to
> > move instructions *within the same basic block* out of the way, these
> > stores don't get coalesced? And then only if either the optimization
> > pipeline is run again, or instruction selection can do so?
> >
> > On IRC Roman Lebedev pointed out https://reviews.llvm.org/D48725,
> > which might address this indirectly. But I'm somewhat doubtful that
> > that's the most straightforward way to optimize this kind of code?
>
> That doesn't help, but it turns out that https://reviews.llvm.org/D30703
> can kinda somewhat help, by adding a redundant
>
>   %i32ptr = bitcast i8* %0 to i32*
>   store i32 0, i32* %i32ptr
>
> at the start. Then dse-partial-store-merging does its magic and
> optimizes the sub-stores away. But it's fairly ugly to have to manually
> add superfluous stores at the right granularity (a larger llvm.memset
> doesn't work).
>
> gcc, since 7, detects such cases in its "new" -fstore-merging pass.
>
> - Andres
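For concreteness, splicing the redundant covering store described above
into the posted test case gives roughly the following. This is only a
sketch: the two added lines are quoted from the mail, the rest is the
original function, and the claim that dse-partial-store-merging then
folds the byte stores away is the mail's, not re-verified here.

  source_filename = "pg"
  target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
  target triple = "x86_64-pc-linux-gnu"

  define void @evalexpr_0_0(i8* align 8 noalias, i32* align 8 noalias) {
  entry:
    ; redundant covering store: every "store i8" below rewrites bytes
    ; this store has already zeroed, so dse-partial-store-merging can
    ; fold the byte stores into it
    %i32ptr = bitcast i8* %0 to i32*
    store i32 0, i32* %i32ptr

    %a01 = getelementptr i8, i8* %0, i16 0
    store i8 0, i8* %a01
    %b01 = getelementptr i32, i32* %1, i16 0
    store i32 0, i32* %b01

    %a02 = getelementptr i8, i8* %0, i16 1
    store i8 0, i8* %a02
    %b02 = getelementptr i32, i32* %1, i16 1
    store i32 0, i32* %b02

    %a03 = getelementptr i8, i8* %0, i16 2
    store i8 0, i8* %a03
    %b03 = getelementptr i32, i32* %1, i16 2
    store i32 0, i32* %b03

    %a04 = getelementptr i8, i8* %0, i16 3
    store i8 0, i8* %a04
    %b04 = getelementptr i32, i32* %1, i16 3
    store i32 0, i32* %b04

    ret void
  }

If it works as described, "opt -O3 -S" over this should leave a single
"store i32 0" to %0 (plus the four i32 stores to %1) instead of four
byte-wide stores.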
Andres Freund via llvm-dev
2018-Sep-11 18:21 UTC
[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores
Hi,

On 2018-09-11 11:16:25 -0400, Nirav Davé wrote:
> Andres:
>
> FWIW, codegen will do the merge if you turn on global alias analysis
> for it, "-combiner-global-alias-analysis". That said, we should be
> able to do this merging earlier.

Interesting. That does *something* for my real case, but certainly not
as much as I'd expected, or what I can get dse-partial-store-merging to
do if I emit a "superfluous" earlier store (one which encompasses all
the subsequent smaller stores) that allows it to do its job.

In the case at hand, with a manual 64-bit store (this is on a 64-bit
target), llvm then combines 8 byte-wide stores into one.

Without -combiner-global-alias-analysis it generates:

  movb $0, 1(%rdx)
  movl 4(%rsi,%rdi), %ebx
  movq %rbx, 8(%rcx)
  movb $0, 2(%rdx)
  movl 8(%rsi,%rdi), %ebx
  movq %rbx, 16(%rcx)
  movb $0, 3(%rdx)
  movl 12(%rsi,%rdi), %ebx
  movq %rbx, 24(%rcx)
  movb $0, 4(%rdx)
  movq 16(%rsi,%rdi), %rbx
  movq %rbx, 32(%rcx)
  movb $0, 5(%rdx)
  movq 24(%rsi,%rdi), %rbx
  movq %rbx, 40(%rcx)
  movb $0, 6(%rdx)
  movq 32(%rsi,%rdi), %rbx
  movq %rbx, 48(%rcx)
  movb $0, 7(%rdx)
  movq 40(%rsi,%rdi), %rsi

where (%rdx) is the array of 1-byte values in which I hope to get the
stores combined, and which is guaranteed to be 8-byte aligned.

With -combiner-global-alias-analysis it generates:

  movw $0, (%rsi)
  movl (%rcx,%rdi), %ebx
  movq %rbx, (%rdx)
  movl 4(%rcx,%rdi), %ebx
  movl 8(%rcx,%rdi), %r8d
  movq %rbx, 8(%rdx)
  movl $0, 2(%rsi)
  movq %r8, 16(%rdx)
  movl 12(%rcx,%rdi), %ebx
  movq %rbx, 24(%rdx)
  movq 16(%rcx,%rdi), %rbx
  movq %rbx, 32(%rdx)
  movq 24(%rcx,%rdi), %rbx
  movq %rbx, 40(%rdx)
  movb $0, 6(%rsi)
  movq 32(%rcx,%rdi), %rbx
  movq %rbx, 48(%rdx)
  movb $0, 7(%rsi)

where (%rsi) is the array of 1-byte values. So that's a 2-, a 4-, and
two 1-byte stores. Huh?

Whereas, if I emit a superfluous 8-byte store beforehand, it becomes:

  movq $0, (%rsi)
  movl (%rcx,%rdi), %ebx
  movq %rbx, (%rdx)
  movl 4(%rcx,%rdi), %ebx
  movq %rbx, 8(%rdx)
  movl 8(%rcx,%rdi), %ebx
  movq %rbx, 16(%rdx)
  movl 12(%rcx,%rdi), %ebx
  movq %rbx, 24(%rdx)
  movq 16(%rcx,%rdi), %rbx
  movq %rbx, 32(%rdx)
  movq 24(%rcx,%rdi), %rbx
  movq %rbx, 40(%rdx)
  movq 32(%rcx,%rdi), %rbx
  movq %rbx, 48(%rdx)
  movq 40(%rcx,%rdi), %rcx

so just a single 8-byte store.

I've attached the two test files (which unfortunately are somewhat
messy):

  24703.1.bc - file without the "superfluous" store
  25256.0.bc - file with the "superfluous" store

The workflow I have, emulating the current pipeline, is:

  opt -O3 -disable-slp-vectorization -S < /srv/dev/pgdev-dev/25256.0.bc | \
    llc -O3 [-combiner-global-alias-analysis]

Note that the problem can also occur without -disable-slp-vectorization;
it just requires a larger example.
Greetings,

Andres Freund

> [...]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: 24703.1.bc
Type: application/octet-stream
Size: 12852 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180911/54fbc469/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 25256.0.bc
Type: application/octet-stream
Size: 12324 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180911/54fbc469/attachment-0003.obj>
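To make the 64-bit variant above concrete: below is a hypothetical
reconstruction of what the "superfluous store" version (25256.0.bc)
presumably boils down to. The function and value names are invented,
the real file is messier, and per the first mail the i32 stores also
load the data they store; only the shape of the trick is the point.

  target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
  target triple = "x86_64-pc-linux-gnu"

  ; %nulls is the 8-byte-aligned array of 1-byte values; %values is the
  ; interleaved array whose stores keep the byte stores from merging.
  define void @evalexpr_real(i8* align 8 noalias %nulls,
                             i32* align 8 noalias %values) {
  entry:
    ; the manually emitted, superfluous covering store: all eight
    ; "store i8" instructions below rewrite bytes it already zeroed,
    ; so dse-partial-store-merging can delete them, which is what
    ; yields the single "movq $0, (%rsi)" in the last listing
    %wide = bitcast i8* %nulls to i64*
    store i64 0, i64* %wide

    %a0 = getelementptr i8, i8* %nulls, i16 0
    store i8 0, i8* %a0
    %b0 = getelementptr i32, i32* %values, i16 0
    store i32 0, i32* %b0
    %a1 = getelementptr i8, i8* %nulls, i16 1
    store i8 0, i8* %a1
    %b1 = getelementptr i32, i32* %values, i16 1
    store i32 0, i32* %b1
    %a2 = getelementptr i8, i8* %nulls, i16 2
    store i8 0, i8* %a2
    %b2 = getelementptr i32, i32* %values, i16 2
    store i32 0, i32* %b2
    %a3 = getelementptr i8, i8* %nulls, i16 3
    store i8 0, i8* %a3
    %b3 = getelementptr i32, i32* %values, i16 3
    store i32 0, i32* %b3
    %a4 = getelementptr i8, i8* %nulls, i16 4
    store i8 0, i8* %a4
    %b4 = getelementptr i32, i32* %values, i16 4
    store i32 0, i32* %b4
    %a5 = getelementptr i8, i8* %nulls, i16 5
    store i8 0, i8* %a5
    %b5 = getelementptr i32, i32* %values, i16 5
    store i32 0, i32* %b5
    %a6 = getelementptr i8, i8* %nulls, i16 6
    store i8 0, i8* %a6
    %b6 = getelementptr i32, i32* %values, i16 6
    store i32 0, i32* %b6
    %a7 = getelementptr i8, i8* %nulls, i16 7
    store i8 0, i8* %a7
    %b7 = getelementptr i32, i32* %values, i16 7
    store i32 0, i32* %b7

    ret void
  }

Deleting the leading "store i64" from this sketch should, if the above
is right, bring back the interleaved movb/movq sequence from the first
listing.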
Nirav Davé via llvm-dev
2018-Sep-11 19:06 UTC
[llvm-dev] Byte-wide stores aren't coalesced if interspersed with other stores
Hmm. This looks like the backend conservatively giving up early on
merging.

It looks like you're running clang 5.0.2. There have been some
improvements to the backend's memory aliasing and store merging that
have landed since then. Can you check if this is fixed in a newer
version?

-Nirav

On Tue, Sep 11, 2018 at 2:21 PM, Andres Freund <andres at anarazel.de> wrote:

> Hi,
>
> On 2018-09-11 11:16:25 -0400, Nirav Davé wrote:
> > Andres:
> >
> > FWIW, codegen will do the merge if you turn on global alias analysis
> > for it, "-combiner-global-alias-analysis". That said, we should be
> > able to do this merging earlier.
>
> Interesting. That does *something* for my real case, but certainly not
> as much as I'd expected, or what I can get dse-partial-store-merging
> to do if I emit a "superfluous" earlier store (one which encompasses
> all the subsequent smaller stores) that allows it to do its job.
>
> [...]