thr3ads.net - llvm dev - [llvm-dev] Understanding and controlling some of the AVX shuffle emission paths [Nov 2021]

Wang, Pengfei via llvm-dev

2021-Nov-11 08:34 UTC

[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths

> As I am very new to this part of LLVM, I am not sure what is feasible or
> not. Would it be conceivable to either:
> 1. have a way to inject some numeric cost to influence the value of some
> resulting combinations?
> 2. revive some form of intrinsic and guarantee that the instruction would be
> generated?
I think a feasible way is to add a new tuningXXX feature for the given targets
and have the combine behave differently when the flag is set.
1) seems over-engineered, and 2) seems like overkill given the potential
opportunities offered by the combine.
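
To make the tuning-flag idea concrete, here is a toy Python model of a combine decision gated by such a flag. It is only a sketch: the flag name `prefer_low_port5`, the function `should_fold`, and the port table are hypothetical and do not correspond to any existing LLVM API or tuning feature.

```python
# Hypothetical model: which target shuffles are assumed to be port-5-only.
PORT5_ONLY = {"SHUFPS", "UNPCKLPD", "UNPCKHPD"}

def port5_count(ops):
    """Number of ops in the sequence that can only issue on port 5."""
    return sum(1 for op in ops if op in PORT5_ONLY)

def should_fold(before, after, prefer_low_port5=False):
    """Decide whether to replace the `before` shuffle chain with `after`.

    Default behaviour: fold whenever the instruction count goes down.
    With the (hypothetical) tuning flag set: additionally refuse folds
    that increase the number of port-5-only instructions.
    """
    fewer_ops = len(after) < len(before)
    if not prefer_low_port5:
        return fewer_ops
    return fewer_ops and port5_count(after) <= port5_count(before)

# The chain discussed in this thread: one SHUFPS feeding two BLENDPS,
# versus the two UNPCK*PD nodes the combiner produces instead.
before = ["SHUFPS", "BLENDPS", "BLENDPS"]
after = ["UNPCKLPD", "UNPCKHPD"]

default_decision = should_fold(before, after)
tuned_decision = should_fold(before, after, prefer_low_port5=True)
print(default_decision, tuned_decision)
```

With the flag off, the fold is accepted because three instructions become two; with the flag on, it is rejected because port-5-only instructions go from one to two.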

Thanks
Phoebe

From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Nicolas
Vasilache via llvm-dev
Sent: Wednesday, November 10, 2021 5:46 PM
To: Diego Caballero <diegocaballero at google.com>
Cc: llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] Understanding and controlling some of the AVX shuffle
emission paths



On Wed, Nov 10, 2021 at 10:30 AM Diego Caballero <diegocaballero at google.com> wrote:
+Nicolas Vasilache <ntv at google.com> :)

Thanks Diego, email is hard, I could not find ways to inject myself into my own
discussion...


On Tue, Nov 9, 2021 at 10:32 PM Simon Pilgrim via llvm-dev <llvm-dev at lists.llvm.org> wrote:
On 09/11/2021 20:44, Simon Pilgrim wrote:
> On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:
>> Hi everyone,
>>
>> I am experimenting with LLVM lowering, intrinsics and shufflevector
>> in general.
>>
>> Here is an IR that I produce with the objective of emitting some
>> vblendps instructions:
>>
>> https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a
>>
> From what I can see, the original IR code was (effectively):
>
> 8 x UNPCKLPS/UNPCKHPS
> 4 x SHUFPS
> 8 x BLENDPS
> 4 x INSERTF128
> 4 x PERM2F128
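
For readers without the gist at hand, the sketch below models one common blend-based 8x8 transpose that uses exactly these instruction counts. It is an assumption that the gist follows this shape; the immediates (`0x4E`, `0xCC`, `0x33`, and the lane selections) and the helper names are illustrative, and vectors are modelled as Python lists of 8 elements with two 128-bit lanes (elements 0-3 and 4-7).

```python
from collections import Counter

ops = Counter()  # how many times each modelled instruction is used

def per_lane(fn, a, b):
    """Apply a 4-element function to both 128-bit lanes of two vectors."""
    return fn(a[:4], b[:4]) + fn(a[4:], b[4:])

def unpcklps(a, b):
    ops["UNPCKLPS/UNPCKHPS"] += 1
    return per_lane(lambda x, y: [x[0], y[0], x[1], y[1]], a, b)

def unpckhps(a, b):
    ops["UNPCKLPS/UNPCKHPS"] += 1
    return per_lane(lambda x, y: [x[2], y[2], x[3], y[3]], a, b)

def shufps(a, b, imm):
    ops["SHUFPS"] += 1
    s = [(imm >> (2 * i)) & 3 for i in range(4)]  # four 2-bit selectors
    return per_lane(lambda x, y: [x[s[0]], x[s[1]], y[s[2]], y[s[3]]], a, b)

def blendps(a, b, imm):
    ops["BLENDPS"] += 1
    # Bit i of imm set: take element i from b, otherwise from a.
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

def insertf128_high(a, b):
    ops["INSERTF128"] += 1
    return a[:4] + b[:4]  # keep a's low lane, place b's low lane on top

def perm2f128_0x31(a, b):
    ops["PERM2F128"] += 1
    return a[4:] + b[4:]  # high lane of a, then high lane of b

def transpose8x8(rows):
    t = []
    for i in range(0, 8, 2):
        t.append(unpcklps(rows[i], rows[i + 1]))
        t.append(unpckhps(rows[i], rows[i + 1]))
    tt = [None] * 8
    for base in (0, 4):
        for k in (0, 1):
            x, y = t[base + k], t[base + k + 2]
            v = shufps(x, y, 0x4E)  # one shuffle shared by two blends
            tt[base + 2 * k] = blendps(x, v, 0xCC)
            tt[base + 2 * k + 1] = blendps(y, v, 0x33)
    low = [insertf128_high(tt[i], tt[i + 4]) for i in range(4)]
    high = [perm2f128_0x31(tt[i], tt[i + 4]) for i in range(4)]
    return low + high

rows = [[8 * i + j for j in range(8)] for i in range(8)]
result = transpose8x8(rows)
print(result == [list(col) for col in zip(*rows)])
print(dict(ops))
```

The model only checks data movement and instruction counts; it says nothing about scheduling or latency.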
>
>> I compile this further with
>>
>> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3
>> -mcpu=haswell - -o -
>>
>> to obtain:
>>
>>
>> https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a
>>
>
> and after the x86 shuffle combines:
>
> 8 x UNPCKLPS/UNPCKHPS
> 8 x UNPCKLPD/UNPCKHPD
> 4 x INSERTF128
> 4 x PERM2F128
>
> Starting from each BLENDPS, they've combined with the SHUFPS to create
> the UNPCK*PD nodes. We nearly always benefit from folding shuffle
> chains to reduce total instruction counts, even if some inner nodes
> have multiple uses (like the SHUFPS), and I'd hate to lose that.
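
The fold described above can be checked on a single 128-bit lane with a small Python model. This is a sketch under assumptions: the immediates `0x4E`, `0xC` and `0x3` are the ones a typical blend-based transpose would use, not values read from the gist, and the functions model only element selection.

```python
def shufps(a, b, imm):
    """Two elements chosen from a, then two from b, by 2-bit selectors."""
    return [a[imm & 3], a[(imm >> 2) & 3], b[(imm >> 4) & 3], b[(imm >> 6) & 3]]

def blendps(a, b, imm):
    """Bit i of imm set: take element i from b, otherwise from a."""
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(4)]

def unpcklpd(a, b):
    """Low 64-bit halves of a and b (two floats each)."""
    return [a[0], a[1], b[0], b[1]]

def unpckhpd(a, b):
    """High 64-bit halves of a and b (two floats each)."""
    return [a[2], a[3], b[2], b[3]]

x = ["x0", "x1", "x2", "x3"]
y = ["y0", "y1", "y2", "y3"]

v = shufps(x, y, 0x4E)        # [x2, x3, y0, y1], used by both blends
blend_lo = blendps(x, v, 0xC)  # elements 2 and 3 come from v
blend_hi = blendps(y, v, 0x3)  # elements 0 and 1 come from v

print(blend_lo == unpcklpd(x, y))
print(blend_hi == unpckhpd(x, y))
```

Three instructions (one SHUFPS and two BLENDPS) compute the same values as two UNPCK*PD instructions, which is why the combiner prefers the latter when it only counts nodes.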
>
>> At this point, I would expect to see some vblendps instructions
>> generated for the pieces of IR that produce %48/%49 %51/%52 %54/%55
>> and %57/%58 to reduce pressure on port 5 (vblendps can also go on
>> ports 0 and 1). However the expected instruction does not get
>> generated and llvm-mca continues to show me high port 5 contention.
>>
>> Could people suggest some steps / commands to help better understand
>> why my expectation is not met and whether I can do something to make
>> the compiler generate what I want? Thanks in advance!
> So on Haswell, we've gained 4 extra Port5-only shuffles but removed
> the 8 Port015 blends.
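
The arithmetic behind that statement can be tallied in a few lines of Python. The port assignments are assumptions based on the wording in this thread (every shuffle except BLENDPS treated as port-5-only, BLENDPS free to use ports 0, 1 or 5), and the cycle bound is a crude throughput estimate that ignores loads, stores, and dependencies.

```python
import math

# Assumed Haswell behaviour: these issue on port 5 only.
PORT5_ONLY = {"UNPCKPS", "UNPCKPD", "SHUFPS", "INSERTF128", "PERM2F128"}

# Instruction counts quoted in the thread.
before = {"UNPCKPS": 8, "SHUFPS": 4, "BLENDPS": 8, "INSERTF128": 4, "PERM2F128": 4}
after = {"UNPCKPS": 8, "UNPCKPD": 8, "INSERTF128": 4, "PERM2F128": 4}

def summarize(seq):
    """Return (total ops, port-5-only ops, crude lower bound in cycles)."""
    total = sum(seq.values())
    p5 = sum(n for op, n in seq.items() if op in PORT5_ONLY)
    # Port 5 handles at most one op per cycle; ports 0, 1 and 5 together
    # handle at most three ops per cycle.
    bound = max(p5, math.ceil(total / 3))
    return total, p5, bound

summary_before = summarize(before)
summary_after = summarize(after)
print("before combine:", summary_before)
print("after combine: ", summary_after)
```

Under these assumptions the combined sequence has fewer instructions overall but more work that must go through port 5.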
>
> We have very few arch-specific shuffle combines, just the
> fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask
> loads; the shuffle combines just aim to reduce the number of simple
> target shuffle nodes. And tbh I'm reluctant to add to this as shuffle
> combining is complex already.
>
> We should be preferring to lower/combine to BLENDPS in more
> circumstances (it's commutable and never slower than any other target
> shuffle, although demanded elts can do less with 'undef' elements),
> but that won't help us here.
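
As a small illustration of what "commutable" means for a blend: swapping the two operands gives the same result as long as the immediate mask is inverted. The Python model below is a sketch of that property for an 8-element blend; the helper name `commuted_mask` is made up for the example.

```python
def blendps(a, b, imm):
    """Bit i of imm set: take element i from b, otherwise from a."""
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

def commuted_mask(imm):
    """Mask to use after swapping the operands of an 8-element blend."""
    return ~imm & 0xFF

a = [f"a{i}" for i in range(8)]
b = [f"b{i}" for i in range(8)]

# Check every possible 8-bit immediate.
all_commute = all(
    blendps(a, b, imm) == blendps(b, a, commuted_mask(imm))
    for imm in range(256)
)
print(all_commute)
```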
>
> So far I've failed to find a BLEND-based 8x8 transpose pattern that
> the shuffle combiner doesn't manage to combine back to the
> 8xUNPCK/SHUFPS ops :(
If you are referring to this specific code, yes same for me.
If you are thinking about the general 8x8 transpose problem, I have a version
with vector<4xf32> loads that ends up using blends; as expected, the port
5 pressure reduction helps and both llvm-mca and runtime agree that this is
20-30% faster.


The only thing I can think of is you might want to see if you can
reorder the INSERTF128/PERM2F128 shuffles in between the UNPCK*PS and
the SHUFPS/BLENDPS:

8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128
4 x SHUFPS
8 x BLENDPS

Splitting the per-lane shuffles with the subvector-shuffles could help
stop the shuffle combiner.

Right, I tried different variations here but invariably got the same result.
The vector<4xf32> based version is something that I also want to target
for a bunch of orthogonal reasons.
I'll note that my use case is MLIR codegen with explicit vectors and
intrinsics -> LLVM, so I have quite some flexibility.
But it feels unnatural in the compiler flow to have to branch off at a
significantly higher level of abstraction to sidestep concerns related to X86
microarchitecture details.

As I am very new to this part of LLVM, I am not sure what is feasible or not.
Would it be conceivable to either:
1. have a way to inject some numeric cost to influence the value of some
resulting combinations?
2. revive some form of intrinsic and guarantee that the instruction would be
generated?

I realize point 2. is contrary to the evolution of LLVM as these intrinsics were
removed ca. 2015 in favor of the combiner-based approach.
Still it seems that `we have very little arch-specific shuffle combines` could
be the signal that such intrinsics are needed?

>> I have verified independently that in isolation, a single such
>> shuffle creates a vblendps. I see them being recombined in the
>> produced assembly, and I am looking for ways to prevent
>> vshufps + vblendps + vblendps from being recombined into vunpckxxx +
>> vunpckxxx instructions.
>>

--
N

Simon Pilgrim via llvm-dev

2021-Nov-14 15:52 UTC

[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths

Nicolas - have you investigated just using inline asm instead?

