thr3ads.net - llvm dev - [llvm-dev] Understanding and controlling some of the AVX shuffle emission paths [Nov 2021]

Nicolas Vasilache via llvm-dev

2021-Nov-14 22:16 UTC

[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths

Not yet, the InlineAsmOp in MLIR is still generally unused.
It has been used a bit in the IREE project though (
https://github.com/google/iree/blob/49a81c60329437e64791ee1abd09d47fe1cde205/iree/compiler/Codegen/LLVMCPU/VectorContractToAArch64InlineAsmOp.cpp#L103
).

I should indeed be able to intersperse my lowering (
https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp#L124)
with some InlineAsmOp uses.
I'll report back when I have something.

On Sun, Nov 14, 2021 at 4:53 PM Simon Pilgrim <llvm-dev at redking.me.uk>
wrote:
> Nicolas - have you investigated just using inline asm instead?
> On 11/11/2021 08:34, Wang, Pengfei via llvm-dev wrote:
>
> >As I am very new to this part of LLVM, I am not sure what is feasible or
> >not. Would it be envisionable to either:
>
> >1. have a way to inject some numeric cost to influence the value of some
> >resulting combinations?
>
> >2. revive some form of intrinsic and guarantee that the instruction would
> >be generated?
>
>
>
> I think a feasible way is to add a new tuningXXX feature for given targets
> and do something different with the flag in the combine.
>
> 1) seems over-engineered and 2) seems like overkill given the potential
> opportunities in the combine.
>
>
>
> Thanks
>
> Phoebe
>
>
>
> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> *On Behalf Of* Nicolas
> Vasilache via llvm-dev
> *Sent:* Wednesday, November 10, 2021 5:46 PM
> *To:* Diego Caballero <diegocaballero at google.com>
> *Cc:* llvm-dev at lists.llvm.org
> *Subject:* Re: [llvm-dev] Understanding and controlling some of the AVX
> shuffle emission paths
>
>
>
>
>
>
>
> On Wed, Nov 10, 2021 at 10:30 AM Diego Caballero <
> diegocaballero at google.com> wrote:
>
> +Nicolas Vasilache <ntv at google.com> :)
>
>
>
> Thanks Diego, email is hard, I could not find ways to inject myself into
> my own discussion...
>
>
>
>
>
> On Tue, Nov 9, 2021 at 10:32 PM Simon Pilgrim via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> On 09/11/2021 20:44, Simon Pilgrim wrote:
>
> > On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:
> >> Hi everyone,
> >>
> >> I am experimenting with LLVM lowering, intrinsics and shufflevector
> >> in general.
> >>
> >> Here is an IR that I produce with the objective of emitting some
> >> vblendps instructions:
> >>
> >> https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a
> >>
> > From what I can see, the original IR code was (effectively):
> >
> > 8 x UNPCKLPS/UNPCKHPS
> > 4 x SHUFPS
> > 8 x BLENDPS
> > 4 x INSERTF128
> > 4 x PERM2F128
> >
> >> I compile this further with
> >>
> >> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3
> >> -mcpu=haswell - -o -
> >>
> >> to obtain:
> >>
> >> https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a
> >>
> >
> > and after the x86 shuffle combines:
> >
> > 8 x UNPCKLPS/UNPCKHPS
> > 8 x UNPCKLPD/UNPCKHPD
> > 4 x INSERTF128
> > 4 x PERM2F128
> >
> > Starting from each BLENDPS, they've combined with the SHUFPS to create
> > the UNPCK*PD nodes. We nearly always benefit from folding shuffle
> > chains to reduce total instruction counts, even if some inner nodes
> > have multiple uses (like the SHUFPS), and I'd hate to lose that.
> >
> >> At this point, I would expect to see some vblendps instructions
> >> generated for the pieces of IR that produce %48/%49, %51/%52, %54/%55
> >> and %57/%58 to reduce pressure on port 5 (vblendps can also go on
> >> ports 0 and 1). However the expected instruction does not get
> >> generated and llvm-mca continues to show me high port 5 contention.
> >>
> >> Could people suggest some steps / commands to help better understand
> >> why my expectation is not met and whether I can do something to make
> >> the compiler generate what I want? Thanks in advance!
> >
> > So on Haswell, we've gained 4 extra Port5-only shuffles but removed
> > the 8 Port015 blends.
> >
> > We have very little arch-specific shuffle combines, just the
> > fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask
> > loads; the shuffle combines just aim for a reduction in simple
> > target shuffle nodes. And tbh I'm reluctant to add to this as shuffle
> > combining is complex already.
> >
> > We should be preferring to lower/combine to BLENDPS in more
> > circumstances (it's commutable and never slower than any other target
> > shuffle, although demanded elts can do less with 'undef' elements),
> > but that won't help us here.
> >
> > So far I've failed to find a BLEND-based 8x8 transpose pattern that
> > the shuffle combiner doesn't manage to combine back to the
> > 8xUNPCK/SHUFPS ops :(
>
>
>
> If you are referring to this specific code, yes same for me.
>
> If you are thinking about the general 8x8 transpose problem, I have a
> version with vector<4xf32> loads that ends up using blends; as expected,
> the port 5 pressure reduction helps and both llvm-mca and runtime agree
> that this is 20-30% faster.
>
>
>
>
> The only thing I can think of is you might want to see if you can
> reorder the INSERTF128/PERM2F128 shuffles in between the UNPCK*PS and
> the SHUFPS/BLENDPS:
>
> 8 x UNPCKLPS/UNPCKHPS
> 4 x INSERTF128
> 4 x PERM2F128
> 4 x SHUFPS
> 8 x BLENDPS
>
> Splitting the per-lane shuffles with the subvector-shuffles could help
> stop the shuffle combiner.
>
>
>
> Right, I tried different variations here but invariably got the same
> result.
>
> The vector<4xf32> based version is something that I also want to target
> for a bunch of orthogonal reasons.
>
> I'll note that my use case is MLIR codegen with explicit vectors and
> intrinsics -> LLVM so I have quite some flexibility.
>
> But it feels unnatural in the compiler flow to have to branch off at a
> significantly higher level of abstraction to sidestep concerns related to
> X86 microarchitecture details.
>
>
>
> As I am very new to this part of LLVM, I am not sure what is feasible or
> not. Would it be envisionable to either:
>
> 1. have a way to inject some numeric cost to influence the value of some
> resulting combinations?
>
> 2. revive some form of intrinsic and guarantee that the instruction would
> be generated?
>
>
>
> I realize point 2. is contrary to the evolution of LLVM as these
> intrinsics were removed ca. 2015 in favor of the combiner-based approach.
>
> Still it seems that `we have very little arch-specific shuffle combines`
> could be the signal that such intrinsics are needed?
>
>
>
>
> >> I have verified independently that in isolation, a single such
> >> shuffle creates a vblendps. I see them being recombined in the
> >> produced assembly and I am looking to experiment with preventing
> >> vshufps + vblendps + vblendps from being recombined into vunpckxxx +
> >> vunpckxxx instructions.
> >>
> >> --
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
>
>
>
> --
>
> N
>
>
>
-- 
N
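
As a rough sketch of the arithmetic behind the port 5 discussion above, the two instruction mixes Simon lists can be tallied in a few lines of Python. The port classes are assumptions taken from the thread's wording (blends on ports 0, 1 or 5; everything else, including the register form of INSERTF128, treated as port-5-only). The "bound" is only a naive throughput lower bound that ignores loads, stores, latency and the rest of the pipeline; it is not an llvm-mca result.

```python
# Assumed Haswell port classes, following the thread's description:
# BLENDPS may issue on port 0, 1 or 5; every other shuffle is port-5-only.
P5_ONLY = "p5"
P015 = "p015"

PORTS = {
    "UNPCKLPS/UNPCKHPS": P5_ONLY,
    "UNPCKLPD/UNPCKHPD": P5_ONLY,
    "SHUFPS": P5_ONLY,
    "INSERTF128": P5_ONLY,  # assumption: register form
    "PERM2F128": P5_ONLY,
    "BLENDPS": P015,
}

# Instruction mixes as listed in the thread.
original = {"UNPCKLPS/UNPCKHPS": 8, "SHUFPS": 4, "BLENDPS": 8,
            "INSERTF128": 4, "PERM2F128": 4}
combined = {"UNPCKLPS/UNPCKHPS": 8, "UNPCKLPD/UNPCKHPD": 8,
            "INSERTF128": 4, "PERM2F128": 4}


def summarize(seq):
    """Return (total ops, port-5-only ops, naive cycle lower bound)."""
    total = sum(seq.values())
    p5_only = sum(n for op, n in seq.items() if PORTS[op] == P5_ONLY)
    # Port 5 must take every port-5-only op (one per cycle), and the three
    # ports together retire at most three of these ops per cycle.
    bound = max(p5_only, -(-total // 3))  # ceiling division
    return total, p5_only, bound


for name, seq in (("original", original), ("combined", combined)):
    total, p5_only, bound = summarize(seq)
    print(f"{name}: {total} shuffles, {p5_only} port-5-only, "
          f"at least {bound} cycles")
```

Under these assumptions the combined form has fewer instructions (24 versus 28) but more port-5-only work (24 versus 20), which matches the "gained 4 extra Port5-only shuffles" remark.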

Nicolas Vasilache via llvm-dev

2021-Nov-21 14:58 UTC

[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths

FYI, a commit is up for the inline asm change:
https://reviews.llvm.org/D114335.

On Sun, Nov 14, 2021 at 11:16 PM Nicolas Vasilache <ntv at google.com>
wrote:
> Not yet, the InlineAsmOp in MLIR is still generally unused.
> It has been used a bit in the IREE project though (
> https://github.com/google/iree/blob/49a81c60329437e64791ee1abd09d47fe1cde205/iree/compiler/Codegen/LLVMCPU/VectorContractToAArch64InlineAsmOp.cpp#L103
> ).
>
> I should indeed be able to intersperse my lowering (
> https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp#L124)
> with some InlineAsmOp uses.
> I'll report back when I have something.

-- 
N
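
For readers wondering why the shuffle combiner is entitled to make the fold discussed in this thread, here is a minimal Python model of the lane-wise semantics of the instructions involved. It checks that one SHUFPS followed by two BLENDPS produces the same values as an UNPCKLPD/UNPCKHPD pair. The immediates (0x4E, 0xCC, 0x33) are an assumption: they are the ones typically used in blend-based transposes and are not copied from the gists linked above. The helper names are hypothetical and model 256-bit registers as 8-element Python lists.

```python
LANE = 4  # 32-bit floats per 128-bit lane


def per_lane(fn, a, b):
    """Apply a 128-bit-lane operation independently to each lane."""
    out = []
    for i in range(0, len(a), LANE):
        out += fn(a[i:i + LANE], b[i:i + LANE])
    return out


def shufps(a, b, imm):
    """Low two results come from a, high two from b, 2 bits of imm each."""
    def lane(x, y):
        return [x[imm & 3], x[(imm >> 2) & 3],
                y[(imm >> 4) & 3], y[(imm >> 6) & 3]]
    return per_lane(lane, a, b)


def blendps(a, b, imm):
    """Bit i of imm set: take element i from b, otherwise from a."""
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(len(a))]


def unpcklpd(a, b):
    """Low 64 bits (two floats) of each lane of a, then of b."""
    return per_lane(lambda x, y: [x[0], x[1], y[0], y[1]], a, b)


def unpckhpd(a, b):
    """High 64 bits (two floats) of each lane of a, then of b."""
    return per_lane(lambda x, y: [x[2], x[3], y[2], y[3]], a, b)


# Symbolic inputs so the element movement is visible when printed.
t0 = [f"a{i}" for i in range(8)]
t2 = [f"b{i}" for i in range(8)]

v = shufps(t0, t2, 0x4E)       # shared intermediate, used by both blends
lo = blendps(t0, v, 0xCC)      # blend-based form, first result
hi = blendps(t2, v, 0x33)      # blend-based form, second result

assert lo == unpcklpd(t0, t2)  # same values as a single UNPCKLPD
assert hi == unpckhpd(t0, t2)  # same values as a single UNPCKHPD
print(lo)
print(hi)
```

Because the three-instruction form and the two-instruction form compute identical values, a combiner that minimizes shuffle-node count will pick the latter, even though (per the thread) the blends would spread over more execution ports on Haswell.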