thr3ads.net - llvm dev - [llvm-dev] enabling interleaved access loop vectorization [Aug 2016]


Nema, Ashutosh via llvm-dev

2016-Aug-05 11:20 UTC

[llvm-dev] enabling interleaved access loop vectorization

Hi Michael,

Some time back I ran some experiments with the interleave vectorizer and did not
find any degradation,
though my tests/benchmarks are probably not extensive enough to cover much.

Elena is the right person to comment on this, as she has already seen cases
where it hinders performance.

For the interleave vectorizer on X86 we do not have any target-specific costing; it
falls back to BasicTTI, where the costing is not appropriate (w.r.t. X86).
It counts the cost of extracts & inserts for extracting elements from a wide
vector, which is really expensive.
E.g. in your test case the cost of the load associated with “in[i * 2]” is 10 (for
VF4).
The interleave vectorizer will generate the following instructions for it:
  %wide.vec = load <8 x i32>, <8 x i32>* %14, align 4, !tbaa !1, !alias.scope !5
  %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>

For the wide load it gets a cost of 2 (as it has to generate 2 loads), but for
extracting the elements (the shuffle operation) it gets a cost of 8 (4 for
extracts + 4 for inserts).
The cost should be 3 here: 2 for the loads & 1 for the shuffle.
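
The arithmetic above can be written out as a toy model. This is only a sketch of the two estimates described in this message, not LLVM's actual TTI code: the function names, parameters, and the unit cost of 1 per load, extract, insert, and shuffle are all assumptions made for illustration.

```c
/* Toy model of the two estimates discussed above; NOT LLVM's TTI code.
   All names and unit costs here are illustrative assumptions. */

/* Number of register-width loads needed to cover the wide vector. */
static unsigned num_wide_loads(unsigned wide_elts, unsigned elts_per_reg) {
    return (wide_elts + elts_per_reg - 1) / elts_per_reg;
}

/* Generic-style estimate: every result lane pays one extract and one insert. */
static unsigned cost_extract_insert(unsigned wide_elts, unsigned elts_per_reg,
                                    unsigned vf) {
    return num_wide_loads(wide_elts, elts_per_reg) + vf /* extracts */
                                                   + vf /* inserts */;
}

/* Estimate proposed above: the whole strided shuffle counts as one operation. */
static unsigned cost_single_shuffle(unsigned wide_elts, unsigned elts_per_reg) {
    return num_wide_loads(wide_elts, elts_per_reg) + 1;
}
```

For the `<8 x i32>` wide vector with four i32 lanes per 128-bit register and VF4, `cost_extract_insert(8, 4, 4)` gives 2 + 4 + 4 = 10 and `cost_single_shuffle(8, 4)` gives 2 + 1 = 3, matching the numbers quoted in the message.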

To enable the interleave vectorizer on X86 we should implement a proper cost
estimation.

The test you mentioned is indeed a candidate for strided memory vectorization.

Regards,
Ashutosh

From: Michael Kuperstein [mailto:mkuper at google.com]
Sent: Friday, August 5, 2016 4:53 AM
To: Demikhovsky, Elena <elena.demikhovsky at intel.com>
Cc: Renato Golin <renato.golin at linaro.org>; Sanjay Patel <spatel at
rotateright.com>; Nema, Ashutosh <Ashutosh.Nema at amd.com>; Matthew
Simpson <mssimpso at codeaurora.org>; llvm-dev <llvm-dev at
lists.llvm.org>
Subject: Re: [llvm-dev] enabling interleaved access loop vectorization

Hi Elena,

Circling back to this, do you know of any concrete cases where enabling
interleaved access on x86 is unprofitable?
Right now, there are some cases where we lose significantly, because (a) we
consider gathers (on architectures that don't have them) extremely
expensive, so we won't vectorize them at all without interleaved access, and
(b) we have interleaved access turned off.

Consider something like this:

void foo(int *in, int *out) {
  int i = 0;
  for (i = 0; i < 256; ++i) {
    out[i] = in[i] + in[i + 1] + in[i + 2] + in[i * 2];
  }
}

We don't vectorize this loop at all, because we calculate the cost of the
in[i * 2] gather to be 14 cycles per lane (!).
This is an overestimate we need to fix, since the vectorized code is actually
fairly decent - e.g. forcing vectorization, with SSE4.2, we get:

.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
movdqu (%rdi,%rax,4), %xmm3
movd %xmm0, %rcx
movdqu 4(%rdi,%rcx,4), %xmm4
paddd %xmm3, %xmm4
movdqu 8(%rdi,%rcx,4), %xmm3
paddd %xmm4, %xmm3
movdqa %xmm1, %xmm4
paddq %xmm4, %xmm4
movdqa %xmm0, %xmm5
paddq %xmm5, %xmm5
movd %xmm5, %rcx
pextrq $1, %xmm5, %rdx
movd %xmm4, %r8
pextrq $1, %xmm4, %r9
movd (%rdi,%rcx,4), %xmm4    # xmm4 = mem[0],zero,zero,zero
pinsrd $1, (%rdi,%rdx,4), %xmm4
pinsrd $2, (%rdi,%r8,4), %xmm4
pinsrd $3, (%rdi,%r9,4), %xmm4
paddd %xmm3, %xmm4
movdqu %xmm4, (%rsi,%rax,4)
addq $4, %rax
paddq %xmm2, %xmm0
paddq %xmm2, %xmm1
cmpq $256, %rax              # imm = 0x100
jne .LBB0_3

But the real point is that with interleaved access enabled, we vectorize, and
get:

.LBB0_3:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
movdqu (%rdi,%rcx), %xmm0
movdqu 4(%rdi,%rcx), %xmm1
movdqu 8(%rdi,%rcx), %xmm2
paddd %xmm0, %xmm1
paddd %xmm2, %xmm1
movdqu (%rdi,%rcx,2), %xmm0
movdqu 16(%rdi,%rcx,2), %xmm2
pshufd $132, %xmm2, %xmm2      # xmm2 = xmm2[0,1,0,2]
pshufd $232, %xmm0, %xmm0      # xmm0 = xmm0[0,2,2,3]
pblendw $240, %xmm2, %xmm0      # xmm0 = xmm0[0,1,2,3],xmm2[4,5,6,7]
paddd %xmm1, %xmm0
movdqu %xmm0, (%rsi,%rcx)
cmpq $992, %rcx              # imm = 0x3E0
jne .LBB0_7
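
What the interleaved listing does with `in[i * 2]` can be modeled in portable C: one contiguous wide load of eight elements, then a selection of the even lanes (the `<0, 2, 4, 6>` mask). This is a rough scalar sketch and not the vectorizer's output. The function names and the `n` parameter are hypothetical, and it assumes `in` and `out` do not overlap, that `n` is a multiple of 4, and that `in` holds at least `2 * n` elements.

```c
#include <stddef.h>

/* Scalar reference: the loop from the message, with the trip count as a
   parameter (hypothetical; the original hard-codes 256). */
void foo_scalar(const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] + in[i + 1] + in[i + 2] + in[i * 2];
}

/* Portable model of the interleaved form for VF = 4.
   Assumptions: n is a multiple of 4, in/out do not overlap, and `in` has at
   least 2 * n elements. The wide load reads in[2*i .. 2*i + 7], i.e. one
   element past the last one the scalar loop touches (in[2*i + 6]). */
void foo_interleaved_model(const int *in, int *out, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        int wide[8];
        /* Models the two contiguous 4-element loads of the wide vector. */
        for (int k = 0; k < 8; ++k)
            wide[k] = in[2 * i + k];
        for (int lane = 0; lane < 4; ++lane) {
            /* Models the shuffle with mask <0, 2, 4, 6>. */
            int strided = wide[2 * lane];
            out[i + lane] = in[i + lane] + in[i + lane + 1]
                          + in[i + lane + 2] + strided;
        }
    }
}
```

With `n = 256` the wide load touches `in[511]`, which the scalar loop never reads, so a caller of this sketch needs a 512-element input buffer. No per-lane index computation is left in the inner loop, which is the saving the message attributes to the interleaved form.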

The performance I see out of the 3 versions (with a 500K-iteration outer loop):

Scalar: 0m10.320s
Vector (Non-interleaved): 0m8.054s
Vector (Interleaved): 0m3.541s

This is far from being the perfect use case for interleaved access:
1) There's no real interleaving, just one strided gather, so this would be
better served by Ashutosh's full "strided access" proposal.
2) It looks like the actual move + shuffle sequence is not better, and even
probably worse, than just inserting directly from memory - but it's still
worthwhile because of how much we save on the index computations.
Regardless of all that, the fact of the matter is that we get much better code
by treating it as interleaved, and I think this may be a good enough motivation
to enable it, unless we significantly regress in other cases.

I was going to look at benchmarks to see if we get any regressions, but if you
already have examples you're aware of, that would be great.

Thanks,
  Michael

On Thu, May 26, 2016 at 12:35 PM, Demikhovsky, Elena via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
Interleaved access is not enabled on X86 yet.
We looked at this feature and came to the conclusion that interleaving (as loads +
shuffles) is not always profitable on X86. We should provide the right cost,
which depends on the number of shuffles. The number of shuffles depends on the
permutation (shuffle mask). And even if we estimate the number of shuffles, the
shuffles are not generated in place. The vectorizer produces a long sequence of
"extracts" and "inserts" that will hopefully be combined into shuffles by a later
instcombine pass.

-  Elena

   >-----Original Message-----
   >From: Renato Golin [mailto:renato.golin at linaro.org]
   >Sent: Thursday, May 26, 2016 21:25
   >To: Sanjay Patel <spatel at rotateright.com>; Demikhovsky, Elena
   ><elena.demikhovsky at intel.com>
   >Cc: llvm-dev <llvm-dev at lists.llvm.org>
   >Subject: Re: [llvm-dev] enabling interleaved access loop vectorization
   >
   >On 26 May 2016 at 19:12, Sanjay Patel via llvm-dev
   ><llvm-dev at lists.llvm.org> wrote:
   >> Is there a compile-time and/or potential runtime cost that makes
   >> enableInterleavedAccessVectorization() default to 'false'?
   >>
   >> I notice that this is set to true for ARM, AArch64, and PPC.
   >>
   >> In particular, I'm wondering if there's a reason it's not enabled for
   >> x86 in relation to PR27881:
   >> https://llvm.org/bugs/show_bug.cgi?id=27881
   >
   >Hi Sanjay,
   >
   >The feature was originally developed for ARM's VLDn/VSTn instructions
   >and then extended to AArch64 and PPC, but not x86/64 yet.
   >
   >I believe Elena was working on that, but needed to get the scatter/gather
   >intrinsics working first. I just copied her in case I'm wrong. :)
   >
   >cheers,
   >--renato
---------------------------------------------------------------------
Intel Israel (74) Limited

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
_______________________________________________
LLVM Developers mailing list
llvm-dev at lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev


Matthew Simpson via llvm-dev

2016-Aug-05 14:02 UTC

[llvm-dev] enabling interleaved access loop vectorization

Isn't our current interleaved access vectorization just a special case of
the more general strided access proposal? If so, from a development perspective,
it might make sense to begin incorporating some of that work into the existing
framework (with appropriate target hooks and costs). This could probably be done
piecemeal rather than all at once.

Also, keep in mind that ARM/AArch64 run an additional IR pass
(InterleavedAccessPass) that matches the load/store plus shuffle sequences that
the vectorizer generates to target-specific intrinsics.

-- Matt
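
The kind of matching described above relies on recognizing a strided shuffle mask such as the `<0, 2, 4, 6>` mask in the IR quoted earlier in this thread. A minimal sketch of such a check follows. It is not the code used by InterleavedAccessPass; the function name and signature are hypothetical.

```c
#include <stdbool.h>

/* Returns true if mask[0..vf) is <start, start + factor, start + 2*factor, ...>
   for some start in [0, factor), and stores that start in *index.
   Illustrative sketch only; not LLVM's implementation. */
bool is_strided_mask(const int *mask, unsigned vf, unsigned factor,
                     unsigned *index) {
    if (vf == 0 || factor < 2 || mask[0] < 0 || (unsigned)mask[0] >= factor)
        return false;
    unsigned start = (unsigned)mask[0];
    for (unsigned lane = 0; lane < vf; ++lane) {
        /* Each lane must pick the element `factor` positions after the last. */
        if (mask[lane] < 0 || (unsigned)mask[lane] != start + lane * factor)
            return false;
    }
    *index = start;
    return true;
}
```

For an interleave factor of 2 and four lanes, `{0, 2, 4, 6}` is accepted with index 0 and `{1, 3, 5, 7}` with index 1, while an identity mask `{0, 1, 2, 3}` is rejected. A target with interleaved load instructions could map an accepted mask to one such instruction, whereas x86 has to lower it to ordinary loads plus shuffles.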

Michael Kuperstein via llvm-dev

2016-Aug-05 16:57 UTC

[llvm-dev] enabling interleaved access loop vectorization

I agree the BasicTTI cost for interleaving is fairly conservative, but I
don't think that's "inappropriate" for x86.

The cost we have for gathers right now is very conservative (as I wrote in
the original email, 14 per lane). So, enabling interleaving, even with the
BasicTTI cost, will only reduce the total estimated cost for the vectorized
versions - which should be a good thing (since the cost is *still*
conservative).


Michael Kuperstein via llvm-dev

2016-Aug-05 17:05 UTC

[llvm-dev] enabling interleaved access loop vectorization

Regarding InterleavedAccessPass - sure, but proper strided/interleaved
access optimization ought to have a positive impact even without target
support.
Case in point - Hal enabled it on PPC last September. An important
difference vs. x86 seems to be that arbitrary shuffles are cheap on PPC,
but, as I said in my other reply, I hope we can enable it on x86 with a conservative
cost function, and still get improvement.

