Adrien Guinet via llvm-dev
2020-May-18 13:58 UTC
[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions
Hello everyone, On the last couple of days, I have been experimenting with teaching LLVM how to combine a set of affine instructions into an instruction that uses the GFNI [1] AVX512 extension, especially GF2P8AFFINEQB [2]. While the general idea seems to work, I have some questions about my current implementation (see below). FTR, I have named this transformation AffineCombineExpr (ACE). Let's first introduce the general idea, which is to transform code like this: static uint8_t rol2(uint8_t v) { return (v<<2)^(v>>6); } uint8_t func(uint8_t v) { v = rol2(v); v ^= 0xAA; return v; } into this: define zeroext i8 @func(i8 zeroext %v) local_unnamed_addr #0 { %0 = insertelement <16 x i8> undef, i8 %v, i64 0 %1 = call <16 x i8> @llvm.x86.vgf2p8affineqb.128(<16 x i8> %0, <16 x i8> bitcast (<2 x i64> <i64 4647715923615551520, i64 undef> to <16 x i8>), i8 -86) %2 = extractelement <16 x i8> %1, i64 0 ret i8 %2 } (if that's profitable, which might not be the case here, see below) Another more interesting example where we could see potential benefits is this one: https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-21dd247f3b8aa49860ae8122fe3ea698R22 This gets even more interesting with vectorized code, with an example here: * original C code: https://pastebin.com/4JjF7DPu * LLVM IR after clang -O2 -mgfni -mavx2: https://pastebin.com/Ti0Vm2gj [3] * LLVM IR after ACE (using opt -aggressive-instcombine -S): https://pastebin.com/2zFU7J6g (interesting things happened at line 67) If, like me, you don't have a GFNI-enabled CPU, you can use Intel SDE [4] to run the compiled code. The code of the pass is available here: https://github.com/aguinet/llvm-project/blob/feature/gfni_combine/llvm/lib/Transforms/AggressiveInstCombine/AffineExprCombine.cpp And there are test cases here: https://github.com/aguinet/llvm-project/tree/feature/gfni_combine/llvm/test/Transforms/AggressiveInstCombine (aec_*.ll) Questions ======== The high-level view of the algorithm is the following: a) gather, from a basic block, suites of instructions that process an 8-bit integer using affine instructions, and generate another 8-bit integer. b) compute the linear matrix and affine constant related to that set of instructions c) emit the GFNI instructions Even thought the current code showcases the idea, there are quite a few things I'm unhappy with and would like some advice: 1) about a): what we want is an analysis that can gather, from a given basic block, the largest DAG*s* of instructions based on a given predicate (in our case: is this an affine transformation?), that process an 8-bit value and output another 8-bit value. After looking at {Aggressive,}InstCombine, I hadn't find exactly what I wanted, so I've rewritten something from scratch to validate the overall idea. But is there some facility within LLVM I could reuse for this purpose? This feels like for instance the same kind of analyses that might be already done in ScheduleDAG (?). 2) profitability: according to [6], the latency of the GFNI instructions is 3 cycles. and in the general case insertelement and extractelement also maps to 3-cycle latency instructions [8] [9]. This makes the whole replacement a latency of 9 cycles (in the scalar case). Is there any example on how can I compare this 9 cycles latency against the set of instructions I am replacing against? What I do for now is really simple and I think far from reality (see https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R115) 3) loop vectorization: related to 2), it seems that we could generate more efficient code if this instruction combining process would happen directly within the loop vectorization algorithm. Indeed, we could benefit from the cost model analysis that already exists there, and also tweak the loop unrolling factor to better hide this latency of 3 cycles.>From the documentation [7], it looks like this need to happen in the VPlan representation,but I've had a hard time figuring out where I should plug myself in. Is there any example that showcases instruction combining within this representation? 4) I inserted this transformation into AggressiveInstCombine, because it is indeed on paper a combination of instructions, and the analysis to make it work can be quite costly. That being said, it seems to be ran too early in the pipeline, and just running clang -03 -mgfni does not combine anything at all (I still got to investigate this). It might be linked to 1) and the fact that my analysis is very naïve, and assume some "clean" and optimized LLVM IR. Question is: where does that kind of transformation should happen in the current optimization pipeline? Current limitations/issues ========================= 1) compile-time performances: I haven't run benchmarks yet to see the compile-cost of this. Some efforts have been made on this preliminary version to drop as early as possible basic blocks that does not seem interesting [5], but that deserves more work 2) run-time performances: if someone has a GFNI-enabled CPU and can ran some benchmarks (for instance on the aec_vec.ll one), that could be very interesting :) 3) if we run the "vectorization" test above with avx512 (and thus 32-bytes vectors), we generate the LLVM IR here: https://pastebin.com/Rwn43N4x, but llc crashes with this message: https://pastebin.com/bbSZPFe5 . Am I doing something wrong or is there an actual issue in the X86 backend? 4) for now we limit ourselves to 8x8 functions, but there are chances we could extend this to bigger inputs/outputs (eg. 32x32 for CRC32-like functions would be nice) Thanks for any help! Regards, [1] https://en.wikipedia.org/wiki/AVX-512#GFNI [2] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gf2p&expand=2901 [3] if you wonder why not -mavx512f , see section about current issues below [4] https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html [5] https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R308 [6] https://rizmediateknologi.blogspot.com/2019/08/the-ice-lake-benchmark-preview-inside_1.html [7] https://llvm.org/docs/Proposals/VectorizationPlan.html [8] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=insert_epi8&expand=3145 [9] https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=extract_epi8&expand=3145,2432
Craig Topper via llvm-dev
2020-May-18 18:24 UTC
[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions
I can tell you that your avx512 issue is that v64i8 gfni instructions also require avx512bw to be enabled to make v64i8 a supported type. The C intrinsics handling in the front end know this rule. But since you generated your own intrinsics you bypassed that. On Mon, May 18, 2020 at 6:58 AM Adrien Guinet via llvm-dev < llvm-dev at lists.llvm.org> wrote:> Hello everyone, > > On the last couple of days, I have been experimenting with teaching LLVM > how to combine a > set of affine instructions into an instruction that uses the GFNI [1] > AVX512 extension, > especially GF2P8AFFINEQB [2]. While the general idea seems to work, I have > some questions > about my current implementation (see below). FTR, I have named this > transformation > AffineCombineExpr (ACE). > > Let's first introduce the general idea, which is to transform code like > this: > > static uint8_t rol2(uint8_t v) { > return (v<<2)^(v>>6); > } > uint8_t func(uint8_t v) { > v = rol2(v); > v ^= 0xAA; > return v; > } > > into this: > > define zeroext i8 @func(i8 zeroext %v) local_unnamed_addr #0 { > %0 = insertelement <16 x i8> undef, i8 %v, i64 0 > %1 = call <16 x i8> @llvm.x86.vgf2p8affineqb.128(<16 x i8> %0, <16 x i8> > bitcast (<2 x > i64> <i64 4647715923615551520, i64 undef> to <16 x i8>), i8 -86) > %2 = extractelement <16 x i8> %1, i64 0 > ret i8 %2 > } > > (if that's profitable, which might not be the case here, see below) > > Another more interesting example where we could see potential benefits is > this one: > > https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-21dd247f3b8aa49860ae8122fe3ea698R22 > > This gets even more interesting with vectorized code, with an example here: > > * original C code: https://pastebin.com/4JjF7DPu > * LLVM IR after clang -O2 -mgfni -mavx2: https://pastebin.com/Ti0Vm2gj [3] > * LLVM IR after ACE (using opt -aggressive-instcombine -S): > https://pastebin.com/2zFU7J6g > (interesting things happened at line 67) > > If, like me, you don't have a GFNI-enabled CPU, you can use Intel SDE [4] > to run the > compiled code. > > The code of the pass is available here: > > > https://github.com/aguinet/llvm-project/blob/feature/gfni_combine/llvm/lib/Transforms/AggressiveInstCombine/AffineExprCombine.cpp > > And there are test cases here: > > https://github.com/aguinet/llvm-project/tree/feature/gfni_combine/llvm/test/Transforms/AggressiveInstCombine > (aec_*.ll) > > Questions > ========> > The high-level view of the algorithm is the following: > > a) gather, from a basic block, suites of instructions that process an > 8-bit integer using > affine instructions, and generate another 8-bit integer. > b) compute the linear matrix and affine constant related to that set of > instructions > c) emit the GFNI instructions > > Even thought the current code showcases the idea, there are quite a few > things I'm unhappy > with and would like some advice: > > 1) about a): what we want is an analysis that can gather, from a given > basic block, the > largest DAG*s* of instructions based on a given predicate (in our case: is > this an affine > transformation?), that process an 8-bit value and output another 8-bit > value. > > After looking at {Aggressive,}InstCombine, I hadn't find exactly what I > wanted, so I've > rewritten something from scratch to validate the overall idea. But is > there some facility > within LLVM I could reuse for this purpose? This feels like for instance > the same kind of > analyses that might be already done in ScheduleDAG (?). > > 2) profitability: according to [6], the latency of the GFNI instructions > is 3 cycles. and > in the general case insertelement and extractelement also maps to 3-cycle > latency > instructions [8] [9]. This makes the whole replacement a latency of 9 > cycles (in the > scalar case). Is there any example on how can I compare this 9 cycles > latency against the > set of instructions I am replacing against? What I do for now is really > simple and I think > far from reality (see > > https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R115 > ) > > 3) loop vectorization: related to 2), it seems that we could generate more > efficient code > if this instruction combining process would happen directly within the > loop vectorization > algorithm. Indeed, we could benefit from the cost model analysis that > already exists > there, and also tweak the loop unrolling factor to better hide this > latency of 3 cycles. > > From the documentation [7], it looks like this need to happen in the VPlan > representation, > but I've had a hard time figuring out where I should plug myself in. Is > there any example > that showcases instruction combining within this representation? > > 4) I inserted this transformation into AggressiveInstCombine, because it > is indeed on > paper a combination of instructions, and the analysis to make it work can > be quite costly. > That being said, it seems to be ran too early in the pipeline, and just > running clang -03 > -mgfni does not combine anything at all (I still got to investigate this). > It might be > linked to 1) and the fact that my analysis is very naïve, and assume some > "clean" and > optimized LLVM IR. > Question is: where does that kind of transformation should happen in the > current > optimization pipeline? > > Current limitations/issues > =========================> > 1) compile-time performances: I haven't run benchmarks yet to see the > compile-cost of > this. Some efforts have been made on this preliminary version to drop as > early as possible > basic blocks that does not seem interesting [5], but that deserves more > work > > 2) run-time performances: if someone has a GFNI-enabled CPU and can ran > some benchmarks > (for instance on the aec_vec.ll one), that could be very interesting :) > > 3) if we run the "vectorization" test above with avx512 (and thus 32-bytes > vectors), we > generate the LLVM IR here: https://pastebin.com/Rwn43N4x, but llc crashes > with this > message: https://pastebin.com/bbSZPFe5 . Am I doing something wrong or is > there an actual > issue in the X86 backend? > > 4) for now we limit ourselves to 8x8 functions, but there are chances we > could extend this > to bigger inputs/outputs (eg. 32x32 for CRC32-like functions would be nice) > > Thanks for any help! > > Regards, > > [1] https://en.wikipedia.org/wiki/AVX-512#GFNI > [2] > https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gf2p&expand=2901 > [3] if you wonder why not -mavx512f , see section about current issues > below > [4] > > https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html > [5] > > https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R308 > [6] > > https://rizmediateknologi.blogspot.com/2019/08/the-ice-lake-benchmark-preview-inside_1.html > [7] https://llvm.org/docs/Proposals/VectorizationPlan.html > [8] > https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=insert_epi8&expand=3145 > [9] > > https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=extract_epi8&expand=3145,2432 > _______________________________________________ > LLVM Developers mailing list > llvm-dev at lists.llvm.org > https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev >-- ~Craig -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20200518/91f55bba/attachment.html>
Adrien Guinet via llvm-dev
2020-May-18 18:29 UTC
[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions
On 5/18/20 8:24 PM, Craig Topper wrote:> I can tell you that your avx512 issue is that v64i8 gfni instructions also > require avx512bw to be enabled to make v64i8 a supported type. The C > intrinsics handling in the front end know this rule. But since you > generated your own intrinsics you bypassed that.Indeed that's the issue... I was stick with what Intel announces here (https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gf2p&expand=2907), but I guess I should have checked the C intrinsics. I will fix my code to verify the presence of avx512bw if I ever need v64i8. Thanks for the hint!
Possibly Parallel Threads
- Use Galois field New Instructions (GFNI) to combine affine instructions
- How to implement load/store for vector predicate register
- How to implement load/store for vector predicate register
- OpenSSH and Galois/Counter mode i.e. GCM
- invalid code generated on Windows x86_64 using skylake-specific features