Adrien Guinet via llvm-dev
2020-May-18 18:29 UTC
[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions
On 5/18/20 8:24 PM, Craig Topper wrote:
> I can tell you that your avx512 issue is that v64i8 gfni instructions also
> require avx512bw to be enabled to make v64i8 a supported type. The C
> intrinsics handling in the front end knows this rule. But since you
> generated your own intrinsics you bypassed that.

Indeed, that's the issue... I was sticking to what Intel announces here
(https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gf2p&expand=2907),
but I guess I should have checked the C intrinsics.

I will fix my code to verify the presence of avx512bw if I ever need v64i8.

Thanks for the hint!
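The rule discussed above can be written down as a small feature check. This is only a sketch: the helper names and the feature strings are hypothetical and do not correspond to any LLVM API. Only the v64i8 entry comes from the thread; the 16-lane and 32-lane entries are assumptions based on the CPUID flags Intel lists for the 128-bit and 256-bit forms.

```python
# Hypothetical table: features that must be enabled before emitting a
# <N x i8> GFNI intrinsic. None of these names are LLVM APIs.
REQUIRED_FEATURES = {
    16: {"gfni"},              # assumption: 128-bit form
    32: {"gfni", "avx"},       # assumption: 256-bit VEX form
    64: {"gfni", "avx512bw"},  # from the thread: v64i8 needs avx512bw
}


def gfni_width_is_usable(num_lanes, enabled_features):
    """Return True if a <num_lanes x i8> GFNI intrinsic may be emitted."""
    required = REQUIRED_FEATURES.get(num_lanes)
    if required is None:
        return False  # unknown width: be conservative
    return required <= set(enabled_features)


def widest_usable_width(enabled_features):
    """Pick the widest byte-vector width whose requirements are all met."""
    for lanes in (64, 32, 16):
        if gfni_width_is_usable(lanes, enabled_features):
            return lanes
    return None
```

With such a check, a target that has gfni and avx512f but not avx512bw would fall back to 32 lanes instead of emitting a v64i8 intrinsic that the backend cannot handle.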
Craig Topper via llvm-dev
2020-May-19 00:23 UTC
[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions
I'm guessing AggressiveInstCombine only runs once in the pipeline and it's probably before the vectorizers. Its existing transforms probably output things the vectorizer can understand and vectorize. In your case you're fully dependent on vectorized code.

We don't like to form target-specific intrinsics in the middle-end pipeline. We'd prefer to do something in the X86-specific IR pipeline or the Machine IR pipeline run by llc. Or have a generic concept in IR that we can express, like llvm.ctlz, llvm.cttz, llvm.ctpop or llvm.bitreverse. We have methods in TargetTransformInfo to query for targets supporting them, or in the worst case we're able to generate reasonable code if the target doesn't support it natively.

I'll try to point some more people here at Intel towards this thread.

~Craig
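To illustrate what "reasonable code if the target doesn't support it natively" can mean for a generic operation such as a byte-wise bit reversal, here is a minimal sketch of a shift-and-mask expansion. It is not LLVM's actual lowering; `native_op` is a hypothetical stand-in for a target instruction, and passing `None` models a target without native support.

```python
def bitreverse8_generic(x):
    """Reverse the bits of one byte using only shifts, ANDs and ORs."""
    x &= 0xFF
    x = ((x & 0xF0) >> 4) | ((x & 0x0F) << 4)  # swap the two nibbles
    x = ((x & 0xCC) >> 2) | ((x & 0x33) << 2)  # swap bit pairs within each nibble
    x = ((x & 0xAA) >> 1) | ((x & 0x55) << 1)  # swap adjacent bits
    return x


def bitreverse8(x, native_op=None):
    """Use the native operation when the target offers one, else expand."""
    if native_op is not None:
        return native_op(x)
    return bitreverse8_generic(x)
```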
Adrien Guinet via llvm-dev
2020-May-19 07:58 UTC
[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions
On 5/19/20 2:23 AM, Craig Topper wrote:
> I'm guessing AggressiveInstCombine only runs once in the pipeline and it's
> probably before the vectorizers. Its existing transforms probably output
> things the vectorizer can understand and vectorize. In your case you're
> fully dependent on vectorized code.

Yes. I think it would generate more efficient code if I did the combination within the loop vectorization algorithm (that is, if I understood correctly, in VPlan). It would, for instance, give more opportunities to "hide" the latency of GF2P8AFFINEQB, e.g. by fine-tuning the loop unrolling factor. But I need more time to figure out how to make this happen.

> We don't like to form target specific intrinsics in the middle end
> pipeline. We'd prefer to do something in the X86 specific IR pipeline or
> Machine IR pipeline run by llc. Or have a generic concept in IR that we can
> express like llvm.ctlz, llvm.cttz, llvm.ctpop or llvm.bitreverse. We have
> methods in TargetTransformInfo to query for targets supporting them or in
> the worst case we're able to generate reasonable code if the target doesn't
> support it natively.

I thought about putting that in the X86 pipeline, but it might remove some opportunities:

* supporting equivalent instructions from another vendor without rewriting everything
* fine-tuning the loop vectorization process, as mentioned above

The approach of defining a generic intrinsic in the IR is an interesting one. The remaining question is which API we should put there, so that it's generic enough (to be a generic intrinsic) but doesn't prevent some optimization opportunities. That being said, something like:

  llvm.gf2p8.XX(<i8, XX> value, <i8, XX> matrix, i8 cst)

seems reasonable to me.

I guess that's the eternal debate on where optimization X should happen and how... :)

> I'll try to point some more people here at Intel towards this thread.

Thanks a lot!

Adrien.
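For readers unfamiliar with the operation behind the proposed (value, matrix, cst) signature, here is a scalar reference model. It is a sketch under assumptions: the thread does not define the semantics of the generic intrinsic, so the model follows Intel's documented per-byte behaviour of GF2P8AFFINEQB (result bit i is the parity of matrix byte 7-i ANDed with the input, XORed with bit i of the constant). All function names are hypothetical, and the matrix is passed as one 64-bit integer rather than as eight i8 lanes.

```python
def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def gf2p8affine_byte(x, matrix, cst):
    """Affine transform of one byte over GF(2): A*x + b."""
    result = 0
    for i in range(8):
        row = (matrix >> (8 * (7 - i))) & 0xFF  # byte 7-i of the 64-bit matrix
        bit = parity(row & x) ^ ((cst >> i) & 1)
        result |= bit << i
    return result


def matrix_from_affine(f):
    """Derive (matrix, cst) from a byte function f assumed to be affine over GF(2)."""
    cst = f(0) & 0xFF
    matrix = 0
    for i in range(8):      # output bit i
        row = 0
        for j in range(8):  # input bit j
            if ((f(1 << j) ^ cst) >> i) & 1:
                row |= 1 << j
        matrix |= row << (8 * (7 - i))
    return matrix, cst


IDENTITY = 0x0102040810204080    # leaves every byte unchanged
BITREVERSE = 0x8040201008040201  # reverses the bits of every byte
```

As a usage example, a chain of affine byte operations such as `(x ^ (x >> 1)) ^ 0x5A` can be folded into a single (matrix, cst) pair with `matrix_from_affine`, which is the kind of combination the subject of this thread refers to. The helper does not check that `f` really is affine; it silently produces a wrong matrix otherwise.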