thr3ads.net - llvm dev - [llvm-dev] Understanding and controlling some of the AVX shuffle emission paths [Nov 2021]

Nicolas Vasilache via llvm-dev

2021-Nov-09 08:57 UTC

[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths

Hi everyone,

I am experimenting with LLVM lowering, intrinsics and shufflevector in
general.

Here is an IR that I produce with the objective of emitting some vblendps
instructions:
https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a.

I compile this further with

clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3 -mcpu=haswell - -o -

to obtain:

https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a

At this point, I would expect to see some vblendps instructions generated
for the pieces of IR that produce %48/%49 %51/%52 %54/%55 and %57/%58 to
reduce pressure on port 5 (vblendps can also go on ports 0 and 1). However
the expected instruction does not get generated and llvm-mca continues to
show me high port 5 contention.

Could people suggest some steps / commands to help better understand why my
expectation is not met and whether I can do something to make the compiler
generate what I want? Thanks in advance!

I have verified independently that, in isolation, a single such shuffle
produces a vblendps. In the full assembly, however, I see them being
recombined, and I would like to experiment with preventing vshufps +
vblendps + vblendps from being recombined into vunpckxxx + vunpckxxx
instructions.

-- 
N

Simon Pilgrim via llvm-dev

2021-Nov-09 20:44 UTC


On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:
> Hi everyone,
>
> I am experimenting with LLVM lowering, intrinsics and shufflevector in 
> general.
>
> Here is an IR that I produce with the objective of emitting some 
> vblendps instructions: 
> https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a. 

From what I can see, the original IR code was (effectively):

8 x UNPCKLPS/UNPCKHPS
4 x SHUFPS
8 x BLENDPS
4 x INSERTF128
4 x PERM2F128

> I compile this further with
>
> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3 -mcpu=haswell - -o -
>
> to obtain:
>
> https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a

and after the x86 shuffle combines:

8 x UNPCKLPS/UNPCKHPS
8 x UNPCKLPD/UNPCKHPD
4 x INSERTF128
4 x PERM2F128

Starting from each BLENDPS, they've combined with the SHUFPS to create 
the UNPCK*PD nodes. We nearly always benefit from folding shuffle chains 
to reduce total instruction counts, even if some inner nodes have 
multiple uses (like the SHUFPS), and I'd hate to lose that.
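
To make that fold concrete, here is a small Python model of the lane-wise semantics of VSHUFPS, VBLENDPS and VUNPCK{L,H}PD on 256-bit vectors. The immediates 0x4E, 0xCC and 0x33 are an assumption taken from the commonly used blend-based transpose; the exact masks in the gist are not reproduced in this thread, and the names t0/t2 are only illustrative. Under that assumption, one SHUFPS feeding two BLENDPS yields exactly the same results as UNPCKLPD and UNPCKHPD of the two inputs, which is consistent with 4 x SHUFPS + 8 x BLENDPS becoming 8 x UNPCK*PD.

```python
# Model a 256-bit float vector as an 8-element list; the two 128-bit lanes
# are elements 0-3 and 4-7.

def shufps(a, b, imm):
    """VSHUFPS ymm: per lane, two elements picked from a, then two from b."""
    out = []
    for lane in (0, 4):
        out += [a[lane + (imm & 3)],
                a[lane + ((imm >> 2) & 3)],
                b[lane + ((imm >> 4) & 3)],
                b[lane + ((imm >> 6) & 3)]]
    return out

def blendps(a, b, imm):
    """VBLENDPS ymm: element i comes from b if bit i of imm is set, else a."""
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

def unpcklpd(a, b):
    """VUNPCKLPD ymm, viewed as floats: low 64 bits of a, then of b, per lane."""
    out = []
    for lane in (0, 4):
        out += [a[lane], a[lane + 1], b[lane], b[lane + 1]]
    return out

def unpckhpd(a, b):
    """VUNPCKHPD ymm, viewed as floats: high 64 bits of a, then of b, per lane."""
    out = []
    for lane in (0, 4):
        out += [a[lane + 2], a[lane + 3], b[lane + 2], b[lane + 3]]
    return out

# Symbolic inputs so the data movement is easy to read.
t0 = [f"a{i}" for i in range(8)]
t2 = [f"b{i}" for i in range(8)]

# Assumed blend-based pattern: one shared SHUFPS, two BLENDPS.
v = shufps(t0, t2, 0x4E)    # per lane: [a2, a3, b0, b1]
lo = blendps(t0, v, 0xCC)   # keep a0, a1; take b0, b1 from v
hi = blendps(t2, v, 0x33)   # take a2, a3 from v; keep b2, b3

# The same two results, each from a single UNPCK*PD.
assert lo == unpcklpd(t0, t2)
assert hi == unpckhpd(t0, t2)
```

Three instructions are replaced by two, and the shared SHUFPS disappears once both of its users have been folded.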

> At this point, I would expect to see some vblendps instructions
> generated for the pieces of IR that produce %48/%49 %51/%52 %54/%55
> and %57/%58 to reduce pressure on port 5 (vblendps can also go on
> ports 0 and 1). However the expected instruction does not get
> generated and llvm-mca continues to show me high port 5 contention.
>
> Could people suggest some steps / commands to help better understand
> why my expectation is not met and whether I can do something to make
> the compiler generate what I want? Thanks in advance!

So on Haswell, we've gained 4 extra Port5-only shuffles but removed the
8 Port015 blends.
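
The trade-off can be tallied with a rough sketch in Python. It uses only the instruction counts and port assignments quoted in this thread (UNPCK*/SHUFPS on port 5 only, BLENDPS on ports 0, 1 or 5) and leaves out INSERTF128 and PERM2F128, which are the same in both versions. The cycle bound is a deliberately simplistic throughput model (one uop per port per cycle, no dependencies), so treat it as an illustration, not as a replacement for llvm-mca.

```python
import math

# Port assignments as stated in the thread (Haswell).
PORT5_ONLY = {"UNPCKPS", "UNPCKPD", "SHUFPS"}
PORT015 = {"BLENDPS"}

# Instruction counts from the thread; INSERTF128/PERM2F128 are unchanged
# by the combine and therefore omitted.
before = {"UNPCKPS": 8, "SHUFPS": 4, "BLENDPS": 8}
after = {"UNPCKPS": 8, "UNPCKPD": 8}

def tally(counts):
    """Return (port-5-only uops, uops that may use port 0, 1 or 5)."""
    p5 = sum(n for op, n in counts.items() if op in PORT5_ONLY)
    flexible = sum(n for op, n in counts.items() if op in PORT015)
    return p5, flexible

def min_cycles(counts):
    """Lower bound on cycles under the simplistic model described above."""
    p5, flexible = tally(counts)
    return max(p5, math.ceil((p5 + flexible) / 3))

print("before:", tally(before), "total", sum(before.values()),
      "min cycles", min_cycles(before))
print("after: ", tally(after), "total", sum(after.values()),
      "min cycles", min_cycles(after))
```

Under this model the combine removes 4 instructions overall (20 down to 16) but raises the port 5 lower bound from 12 to 16 cycles, which is the contention reported above.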

We have very few arch-specific shuffle combines, just the
fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask
loads; otherwise the shuffle combiner simply aims to reduce the number
of simple target shuffle nodes. And tbh I'm reluctant to add to this, as
shuffle combining is complex already.

We should be preferring to lower/combine to BLENDPS in more
circumstances (it's commutable and never slower than any other target
shuffle, although demanded elts can do less with 'undef' elements), but
that won't help us here.
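
As a brief illustration of the "commutable" remark, the Python sketch below models VBLENDPS on 8 floats and checks that swapping the operands while inverting the 8-bit immediate gives the same result. The function name and the symbolic inputs are only for illustration.

```python
def blendps(a, b, imm):
    """VBLENDPS ymm: element i comes from b if bit i of imm is set, else a."""
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

a = [f"a{i}" for i in range(8)]
b = [f"b{i}" for i in range(8)]

# Swapping the sources and inverting the mask is equivalent for every
# possible immediate.
for imm in range(256):
    assert blendps(a, b, imm) == blendps(b, a, ~imm & 0xFF)
```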

So far I've failed to find a BLEND-based 8x8 transpose pattern that the 
shuffle combiner doesn't manage to combine back to the 8xUNPCK/SHUFPS ops :(

> I have verified independently that in isolation, a single such shuffle
> creates a vblendps. I see them being recombined in the produced 
> assembly and I am looking for experimenting with avoiding that vshufps 
> + vblendps + vblendps get recombined into vunpckxxx + vunpckxxx 
> instructions.
>
> --
