Nicolas Vasilache via llvm-dev
2021-Nov-14 22:16 UTC
[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths
Not yet, the InlineAsmOp in MLIR is still generally unused. It has been used a bit in the IREE project though ( https://github.com/google/iree/blob/49a81c60329437e64791ee1abd09d47fe1cde205/iree/compiler/Codegen/LLVMCPU/VectorContractToAArch64InlineAsmOp.cpp#L103 ).

I should indeed be able to intersperse my lowering ( https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/X86Vector/Transforms/AVXTranspose.cpp#L124 ) with some InlineAsmOp uses.
I'll report back when I have something.

On Sun, Nov 14, 2021 at 4:53 PM Simon Pilgrim <llvm-dev at redking.me.uk> wrote:

> Nicolas - have you investigated just using inline asm instead?
>
> On 11/11/2021 08:34, Wang, Pengfei via llvm-dev wrote:
>
> >As I am very new to this part of LLVM, I am not sure what is feasible or not. Would it be conceivable to either:
> >1. have a way to inject some numeric cost to influence the value of some resulting combinations?
> >2. revive some form of intrinsic and guarantee that the instruction would be generated?
>
> I think a feasible way is to add a new tuningXXX feature for given targets and do something different with the flag in the combine.
> 1) seems over-engineered and 2) seems overkill for potential opportunities by the combine.
>
> Thanks
> Phoebe
>
> *From:* llvm-dev <llvm-dev-bounces at lists.llvm.org> *On Behalf Of* Nicolas Vasilache via llvm-dev
> *Sent:* Wednesday, November 10, 2021 5:46 PM
> *To:* Diego Caballero <diegocaballero at google.com>
> *Cc:* llvm-dev at lists.llvm.org
> *Subject:* Re: [llvm-dev] Understanding and controlling some of the AVX shuffle emission paths
>
> On Wed, Nov 10, 2021 at 10:30 AM Diego Caballero <diegocaballero at google.com> wrote:
>
> +Nicolas Vasilache <ntv at google.com> :)
>
> Thanks Diego, email is hard, I could not find ways to inject myself into my own discussion...
>
> On Tue, Nov 9, 2021 at 10:32 PM Simon Pilgrim via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> On 09/11/2021 20:44, Simon Pilgrim wrote:
>
> > On 09/11/2021 08:57, Nicolas Vasilache via llvm-dev wrote:
> >> Hi everyone,
> >>
> >> I am experimenting with LLVM lowering, intrinsics and shufflevector in general.
> >>
> >> Here is an IR that I produce with the objective of emitting some vblendps instructions:
> >> https://gist.github.com/nicolasvasilache/0fe30c83cbfe5b4776ec9f0ee465611a
> >>
> > From what I can see, the original IR code was (effectively):
> >
> > 8 x UNPCKLPS/UNPCKHPS
> > 4 x SHUFPS
> > 8 x BLENDPS
> > 4 x INSERTF128
> > 4 x PERM2F128
> >
> >> I compile this further with
> >>
> >> clang -x ir -emit-llvm -S -mcpu=haswell -O3 -o - | llc -O3 -mcpu=haswell - -o -
> >>
> >> to obtain:
> >>
> >> https://gist.github.com/nicolasvasilache/2c773b86fcda01cc28711828a0a9ce0a
> >>
> >
> > and after the x86 shuffle combines:
> >
> > 8 x UNPCKLPS/UNPCKHPS
> > 8 x UNPCKLPD/UNPCKHPD
> > 4 x INSERTF128
> > 4 x PERM2F128
> >
> > Starting from each BLENDPS, they've combined with the SHUFPS to create the UNPCK*PD nodes. We nearly always benefit from folding shuffle chains to reduce total instruction counts, even if some inner nodes have multiple uses (like the SHUFPS), and I'd hate to lose that.
> >
> >> At this point, I would expect to see some vblendps instructions generated for the pieces of IR that produce %48/%49 %51/%52 %54/%55 and %57/%58 to reduce pressure on port 5 (vblendps can also go on ports 0 and 1). However the expected instruction does not get generated and llvm-mca continues to show me high port 5 contention.
> >>
> >> Could people suggest some steps / commands to help better understand why my expectation is not met and whether I can do something to make the compiler generate what I want? Thanks in advance!
> > So on Haswell, we've gained 4 extra Port5-only shuffles but removed the 8 Port015 blends.
> >
> > We have very little arch-specific shuffle combines, just the fast-variable-shuffle tuning flags to avoid unnecessary shuffle mask loads; the shuffle combines just aim for a reduction in simple target shuffle nodes. And tbh I'm reluctant to add to this as shuffle combining is complex already.
> >
> > We should be preferring to lower/combine to BLENDPS in more circumstances (it's commutable and never slower than any other target shuffle, although demanded elts can do less with 'undef' elements), but that won't help us here.
> >
> > So far I've failed to find a BLEND-based 8x8 transpose pattern that the shuffle combiner doesn't manage to combine back to the 8xUNPCK/SHUFPS ops :(
>
> If you are referring to this specific code, yes same for me.
> If you are thinking about the general 8x8 transpose problem, I have a version with vector<4xf32> loads that ends up using blends; as expected, the port 5 pressure reduction helps and both llvm-mca and runtime agree that this is 20-30% faster.
>
> The only thing I can think of is you might want to see if you can reorder the INSERTF128/PERM2F128 shuffles in between the UNPCK*PS and the SHUFPS/BLENDPS:
>
> 8 x UNPCKLPS/UNPCKHPS
> 4 x INSERTF128
> 4 x PERM2F128
> 4 x SHUFPS
> 8 x BLENDPS
>
> Splitting the per-lane shuffles with the subvector-shuffles could help stop the shuffle combiner.
>
> Right, I tried different variations here but invariably got the same result.
> The vector<4xf32> based version is something that I also want to target for a bunch of orthogonal reasons.
> I'll note that my use case is MLIR codegen with explicit vectors and intrinsics -> LLVM so I have quite some flexibility.
> But it feels unnatural in the compiler flow to have to branch off at a significantly higher level of abstraction to sidestep concerns related to X86 microarchitecture details.
>
> As I am very new to this part of LLVM, I am not sure what is feasible or not. Would it be conceivable to either:
> 1. have a way to inject some numeric cost to influence the value of some resulting combinations?
> 2. revive some form of intrinsic and guarantee that the instruction would be generated?
>
> I realize point 2. is contrary to the evolution of LLVM as these intrinsics were removed ca. 2015 in favor of the combiner-based approach.
> Still it seems that `we have very little arch-specific shuffle combines` could be the signal that such intrinsics are needed?
>
> >> I have verified independently that in isolation, a single such shuffle creates a vblendps. I see them being recombined in the produced assembly and I am looking to experiment with preventing vshufps + vblendps + vblendps from being recombined into vunpckxxx + vunpckxxx instructions.
>
> --
> N

--
N
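For readers of the archive, the recombination discussed in this message can be checked with a small model. The sketch below is not taken from the thread or the linked gists: it assumes the immediates 0x4E (vshufps) and 0xCC / 0x33 (vblendps) used in the common blend-based 8x8 transpose formulation, and it models a 256-bit register as a Python list of 8 elements. Under those assumptions, one vshufps followed by two vblendps yields exactly what vunpcklpd and vunpckhpd would produce, which is consistent with the combiner folding them. The port classification in the second half only restates the Haswell numbers quoted in the thread for the in-lane shuffles.

```python
# Toy model of AVX single-precision shuffles. A 256-bit register is a list of
# 8 values; the two 128-bit lanes are elements 0-3 and 4-7.
# All immediates below are assumptions (not confirmed by the thread).

def shufps(a, b, imm):
    """vshufps: per lane, two elements selected from a, then two from b."""
    out = []
    for lane in (0, 4):
        out += [a[lane + (imm & 3)], a[lane + ((imm >> 2) & 3)],
                b[lane + ((imm >> 4) & 3)], b[lane + ((imm >> 6) & 3)]]
    return out

def blendps(a, b, imm):
    """vblendps: element i comes from b if bit i of imm is set, else from a."""
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

def unpcklpd(a, b):
    """vunpcklpd seen as f32 pairs: low half of a's lane, then low half of b's."""
    return [x for lane in (0, 4) for x in a[lane:lane + 2] + b[lane:lane + 2]]

def unpckhpd(a, b):
    """vunpckhpd seen as f32 pairs: high half of a's lane, then high half of b's."""
    return [x for lane in (0, 4) for x in a[lane + 2:lane + 4] + b[lane + 2:lane + 4]]

a = [f"a{i}" for i in range(8)]
b = [f"b{i}" for i in range(8)]

v = shufps(a, b, 0x4E)    # per lane: [a2, a3, b0, b1]
lo = blendps(a, v, 0xCC)  # per lane: [a0, a1, b0, b1]
hi = blendps(b, v, 0x33)  # per lane: [a2, a3, b2, b3]

print(lo == unpcklpd(a, b), hi == unpckhpd(a, b))

# In-lane shuffle counts from the thread, with the Haswell port classes
# stated there: shuffles/unpacks are port-5 only, blends can use ports 0/1/5.
PORT5_ONLY = {"unpckps", "unpckpd", "shufps"}
blend_based = {"unpckps": 8, "shufps": 4, "blendps": 8}
combined = {"unpckps": 8, "unpckpd": 8}

def port5_only(seq):
    """Number of instructions in seq that can only issue on port 5."""
    return sum(n for op, n in seq.items() if op in PORT5_ONLY)

print(port5_only(blend_based), port5_only(combined))
```

The difference between the two counts is the "4 extra Port5-only shuffles" mentioned above: the combined form has fewer instructions overall, but all of them compete for the same port.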
Nicolas Vasilache via llvm-dev
2021-Nov-21 14:58 UTC
[llvm-dev] Understanding and controlling some of the AVX shuffle emission paths
FYI, a commit is up for the inline asm change: https://reviews.llvm.org/D114335.

--
N
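As context for the inline asm change, the instruction sequence that the blend-based lowering tries to keep intact can be modeled end to end. This is a sketch under assumptions, not the code from D114335 or from AVXTranspose.cpp: the immediates (0x4E, 0xCC, 0x33, 0x20, 0x31) follow the widely used blend-based 8x8 transpose, the cross-lane step is modeled with vperm2f128 only (the thread's summary uses 4 INSERTF128 plus 4 PERM2F128 for the same role), and registers are plain Python lists.

```python
# Toy model of an 8x8 f32 transpose built from in-lane unpacks, shuffles and
# blends plus a cross-lane permute. Immediates are assumptions, see lead-in.

def unpcklps(a, b):
    """vunpcklps: interleave the two low elements of each lane of a and b."""
    return [x for l in (0, 4) for x in (a[l], b[l], a[l + 1], b[l + 1])]

def unpckhps(a, b):
    """vunpckhps: interleave the two high elements of each lane of a and b."""
    return [x for l in (0, 4) for x in (a[l + 2], b[l + 2], a[l + 3], b[l + 3])]

def shufps(a, b, imm):
    """vshufps: per lane, two elements selected from a, then two from b."""
    out = []
    for l in (0, 4):
        out += [a[l + (imm & 3)], a[l + ((imm >> 2) & 3)],
                b[l + ((imm >> 4) & 3)], b[l + ((imm >> 6) & 3)]]
    return out

def blendps(a, b, imm):
    """vblendps: element i comes from b if bit i of imm is set, else from a."""
    return [b[i] if (imm >> i) & 1 else a[i] for i in range(8)]

def perm2f128(a, b, imm):
    """vperm2f128 without the zeroing bits: pick one 128-bit half per output half."""
    halves = [a[0:4], a[4:8], b[0:4], b[4:8]]
    return halves[imm & 3] + halves[(imm >> 4) & 3]

def transpose8x8(rows):
    # Stage 1: 8 x UNPCKLPS/UNPCKHPS on adjacent row pairs.
    t = []
    for i in range(0, 8, 2):
        t += [unpcklps(rows[i], rows[i + 1]), unpckhps(rows[i], rows[i + 1])]
    # Stage 2: 4 x SHUFPS, each feeding 2 x BLENDPS (8 blends in total).
    s = []
    for x, y in ((t[0], t[2]), (t[1], t[3]), (t[4], t[6]), (t[5], t[7])):
        v = shufps(x, y, 0x4E)
        s += [blendps(x, v, 0xCC), blendps(y, v, 0x33)]
    # Stage 3: cross-lane recombination of matching 128-bit halves.
    low = [perm2f128(s[i], s[i + 4], 0x20) for i in range(4)]
    high = [perm2f128(s[i], s[i + 4], 0x31) for i in range(4)]
    return low + high

matrix = [[8 * r + c for c in range(8)] for r in range(8)]
result = transpose8x8(matrix)
print(result == [list(col) for col in zip(*matrix)])
```

A model like this only checks data movement; whether the backend actually emits vblendps for stage 2 is the part that the inline asm route is meant to pin down.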