Simon Pilgrim via llvm-dev
2021-Nov-09 21:32 UTC
[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths
On 09/11/2021 20:44, Simon Pilgrim wrote:
> On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:
>> Hi everyone,
>>
>> I am experimenting with LLVM lowering, intrinsics and shufflevector
>> in general.
>>
>> Here is an IR that I produce with the objective of emitting some
>> vblendps instructions:
>> https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a.
>>
> From what I can see, the original IR code was (effectively):
>
> 8 x UNPCKLPS/UNPCKHPS
> 4 x SHUFPS
> 8 x BLENDPS
> 4 x INSERTF128
> 4 x PERM2F128
>
>> I compile this further with
>>
>> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3
>> -mcpu=haswell - -o -
>>
>> to obtain:
>>
>> https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a
>>
>
> and after the x86 shuffle combines:
>
> 8 x UNPCKLPS/UNPCKHPS
> 8 x UNPCKLPD/UNPCKHPD
> 4 x INSERTF128
> 4 x PERM2F128
>
> Starting from each BLENDPS, they've combined with the SHUFPS to create
> the UNPCK*PD nodes. We nearly always benefit from folding shuffle
> chains to reduce total instruction counts, even if some inner nodes
> have multiple uses (like the SHUFPS), and I'd hate to lose that.
>
>> At this point, I would expect to see some vblendps instructions
>> generated for the pieces of IR that produce %48/%49 %51/%52 %54/%55
>> and %57/%58 to reduce pressure on port 5 (vblendps can also go on
>> ports 0 and 1). However the expected instruction does not get
>> generated and llvm-mca continues to show me high port 5 contention.
>>
>> Could people suggest some steps / commands to help better understand
>> why my expectation is not met and whether I can do something to make
>> the compiler generate what I want? Thanks in advance!
>
> So on Haswell, we've gained 4 extra Port5-only shuffles but removed
> the 8 Port015 blends.
>
> We have very few arch-specific shuffle combines, just the
> fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask
> loads; the shuffle combines just aim for a reduction in simple
> target shuffle nodes. And tbh I'm reluctant to add to this as shuffle
> combining is complex already.
>
> We should be preferring to lower/combine to BLENDPS in more
> circumstances (it's commutable and never slower than any other target
> shuffle, although demanded elts can do less with 'undef' elements),
> but that won't help us here.
>
> So far I've failed to find a BLEND-based 8x8 transpose pattern that
> the shuffle combiner doesn't manage to combine back to the
> 8xUNPCK/SHUFPS ops :(

The only thing I can think of is you might want to see if you can
reorder the INSERTF128/PERM2F128 shuffles in between the UNPCK*PS and
the SHUFPS/BLENDPS:

8 x UNPCKLPS/UNPCKHPS
4 x INSERTF128
4 x PERM2F128
4 x SHUFPS
8 x BLENDPS

Splitting the per-lane shuffles with the subvector-shuffles could help
stop the shuffle combiner.

>> I have verified independently that in isolation, a single such
>> shuffle creates a vblendps. I see them being recombined in the
>> produced assembly and I am looking for experimenting with avoiding
>> that vshufps + vblendps + vblendps get recombined into vunpckxxx +
>> vunpckxxx instructions.
>>
>> --
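For readers following along, the fold and the proposed reordering can be sanity-checked with a toy model. The Python sketch below is not LLVM code and does not use real intrinsics: it models each 8 x float vector as a list with two 128-bit lanes. The immediates 0x4E, 0xCC and 0x33 are assumptions taken from a commonly used blend-based transpose, not read from the gists, and `low_lanes`/`high_lanes` are hypothetical helper names standing in for INSERTF128 and PERM2F128 (imm 0x31). It only checks element positions; whether the shuffle combiner would actually leave the reordered form alone is not tested here.

```python
# Toy lane-level model of AVX 8 x float shuffles (lists of 8 values, two 128-bit lanes).

def per_lane(f, a, b):
    """Apply a 4-element shuffle f independently to both 128-bit lanes."""
    return f(a[:4], b[:4]) + f(a[4:], b[4:])

def unpcklps(a, b):
    return per_lane(lambda x, y: [x[0], y[0], x[1], y[1]], a, b)

def unpckhps(a, b):
    return per_lane(lambda x, y: [x[2], y[2], x[3], y[3]], a, b)

def unpcklpd(a, b):
    return per_lane(lambda x, y: x[:2] + y[:2], a, b)

def unpckhpd(a, b):
    return per_lane(lambda x, y: x[2:] + y[2:], a, b)

def shufps(a, b, imm):
    # Two 2-bit selectors pick from a, the next two pick from b (per lane).
    s = [(imm >> (2 * i)) & 3 for i in range(4)]
    return per_lane(lambda x, y: [x[s[0]], x[s[1]], y[s[2]], y[s[3]]], a, b)

def blendps(a, b, imm):
    # Bit i set means element i comes from b, otherwise from a.
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

def low_lanes(a, b):   # hypothetical stand-in for INSERTF128: [a.lo | b.lo]
    return a[:4] + b[:4]

def high_lanes(a, b):  # hypothetical stand-in for PERM2F128 0x31: [a.hi | b.hi]
    return a[4:] + b[4:]

# 1) One shared SHUFPS plus two BLENDPS equals UNPCKLPD/UNPCKHPD, which is
#    why 4 x SHUFPS + 8 x BLENDPS can fold into 8 x UNPCK*PD.
a = [f"a{i}" for i in range(8)]
b = [f"b{i}" for i in range(8)]
v = shufps(a, b, 0x4E)
fold_lo = blendps(a, v, 0xCC) == unpcklpd(a, b)
fold_hi = blendps(b, v, 0x33) == unpckhpd(a, b)

# 2) One possible reordered 8x8 transpose: UNPCK*PS, then the 128-bit
#    subvector shuffles, then SHUFPS/BLENDPS.
def transpose_reordered(r):
    t = []
    for i in range(0, 8, 2):                                  # 8 x UNPCKLPS/UNPCKHPS
        t += [unpcklps(r[i], r[i + 1]), unpckhps(r[i], r[i + 1])]
    lo = [low_lanes(t[i], t[i + 4]) for i in range(4)]        # 4 x INSERTF128
    hi = [high_lanes(t[i], t[i + 4]) for i in range(4)]       # 4 x PERM2F128
    out = []
    for u in (lo, hi):
        for k in (0, 1):
            x, y = u[k], u[k + 2]
            s = shufps(x, y, 0x4E)                            # 4 x SHUFPS in total
            out += [blendps(x, s, 0xCC), blendps(y, s, 0x33)] # 8 x BLENDPS in total
    return out

rows = [[8 * i + j for j in range(8)] for i in range(8)]
expected = [list(col) for col in zip(*rows)]
is_transpose = transpose_reordered(rows) == expected
print(fold_lo, fold_hi, is_transpose)
```

In this model the reordered sequence has the same instruction counts as the list above (8 UNPCK*PS, 4 + 4 subvector shuffles, 4 SHUFPS, 8 BLENDPS) and still yields a transpose, so it is at least a valid starting point for experimenting with the real IR.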
Diego Caballero via llvm-dev
2021-Nov-10 09:30 UTC
[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths
+Nicolas Vasilache <ntv at google.com> :)