thr3ads.net - llvm dev - [llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions [May 2020]

If this information is useful, please help other people find it:
Share via:

Adrien Guinet via llvm-dev

2020-May-18 13:58 UTC

[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions

Hello everyone,

On the last couple of days, I have been experimenting with teaching LLVM how to
combine a
set of affine instructions into an instruction that uses the GFNI [1] AVX512
extension,
especially GF2P8AFFINEQB [2]. While the general idea seems to work, I have some
questions
about my current implementation (see below). FTR, I have named this
transformation
AffineCombineExpr (ACE).

Let's first introduce the general idea, which is to transform code like
this:

static uint8_t rol2(uint8_t v) {
  return (v<<2)^(v>>6);
}
uint8_t func(uint8_t v) {
  v = rol2(v);
  v ^= 0xAA;
  return v;
}

into this:

define zeroext i8 @func(i8 zeroext %v) local_unnamed_addr #0 {
  %0 = insertelement <16 x i8> undef, i8 %v, i64 0
  %1 = call <16 x i8> @llvm.x86.vgf2p8affineqb.128(<16 x i8> %0,
<16 x i8> bitcast (<2 x
i64> <i64 4647715923615551520, i64 undef> to <16 x i8>), i8 -86)
  %2 = extractelement <16 x i8> %1, i64 0
  ret i8 %2
}

(if that's profitable, which might not be the case here, see below)

Another more interesting example where we could see potential benefits is this
one:
https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-21dd247f3b8aa49860ae8122fe3ea698R22

This gets even more interesting with vectorized code, with an example here:

* original C code: https://pastebin.com/4JjF7DPu
* LLVM IR after clang -O2 -mgfni -mavx2: https://pastebin.com/Ti0Vm2gj [3]
* LLVM IR after ACE (using opt -aggressive-instcombine -S):
https://pastebin.com/2zFU7J6g
(interesting things happened at line 67)

If, like me, you don't have a GFNI-enabled CPU, you can use Intel SDE [4] to
run the
compiled code.

The code of the pass is available here:

https://github.com/aguinet/llvm-project/blob/feature/gfni_combine/llvm/lib/Transforms/AggressiveInstCombine/AffineExprCombine.cpp

And there are test cases here:
https://github.com/aguinet/llvm-project/tree/feature/gfni_combine/llvm/test/Transforms/AggressiveInstCombine
(aec_*.ll)

Questions
========
The high-level view of the algorithm is the following:

a) gather, from a basic block, suites of instructions that process an 8-bit
integer using
affine instructions, and generate another 8-bit integer.
b) compute the linear matrix and affine constant related to that set of
instructions
c) emit the GFNI instructions

Even thought the current code showcases the idea, there are quite a few things
I'm unhappy
with and would like some advice:

1) about a): what we want is an analysis that can gather, from a given basic
block, the
largest DAG*s* of instructions based on a given predicate (in our case: is this
an affine
transformation?), that process an 8-bit value and output another 8-bit value.

After looking at {Aggressive,}InstCombine, I hadn't find exactly what I
wanted, so I've
rewritten something from scratch to validate the overall idea. But is there some
facility
within LLVM I could reuse for this purpose? This feels like for instance the
same kind of
analyses that might be already done in ScheduleDAG (?).

2) profitability: according to [6], the latency of the GFNI instructions is 3
cycles. and
in the general case insertelement and extractelement also maps to 3-cycle
latency
instructions [8] [9]. This makes the whole replacement a latency of 9 cycles (in
the
scalar case). Is there any example on how can I compare this 9 cycles latency
against the
set of instructions I am replacing against? What I do for now is really simple
and I think
far from reality (see
https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R115)

3) loop vectorization: related to 2), it seems that we could generate more
efficient code
if this instruction combining process would happen directly within the loop
vectorization
algorithm. Indeed, we could benefit from the cost model analysis that already
exists
there, and also tweak the loop unrolling factor to better hide this latency of 3
cycles.
>From the documentation [7], it looks like this need to happen in the VPlan
representation,but I've had a hard time figuring out where I should plug myself in. Is
there any example
that showcases instruction combining within this representation?

4) I inserted this transformation into AggressiveInstCombine, because it is
indeed on
paper a combination of instructions, and the analysis to make it work can be
quite costly.
That being said, it seems to be ran too early in the pipeline, and just running
clang -03
-mgfni does not combine anything at all (I still got to investigate this). It
might be
linked to 1) and the fact that my analysis is very naïve, and assume some
"clean" and
optimized LLVM IR.
Question is: where does that kind of transformation should happen in the current
optimization pipeline?

Current limitations/issues
=========================
1) compile-time performances: I haven't run benchmarks yet to see the
compile-cost of
this. Some efforts have been made on this preliminary version to drop as early
as possible
basic blocks that does not seem interesting [5], but that deserves more work

2) run-time performances: if someone has a GFNI-enabled CPU and can ran some
benchmarks
(for instance on the aec_vec.ll one), that could be very interesting :)

3) if we run the "vectorization" test above with avx512 (and thus
32-bytes vectors), we
generate the LLVM IR here: https://pastebin.com/Rwn43N4x, but llc crashes with
this
message: https://pastebin.com/bbSZPFe5 . Am I doing something wrong or is there
an actual
issue in the X86 backend?

4) for now we limit ourselves to 8x8 functions, but there are chances we could
extend this
to bigger inputs/outputs (eg. 32x32 for CRC32-like functions would be nice)

Thanks for any help!

Regards,

[1] https://en.wikipedia.org/wiki/AVX-512#GFNI
[2]
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gf2p&expand=2901
[3] if you wonder why not -mavx512f , see section about current issues below
[4]
https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html
[5]
https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R308
[6]
https://rizmediateknologi.blogspot.com/2019/08/the-ice-lake-benchmark-preview-inside_1.html
[7] https://llvm.org/docs/Proposals/VectorizationPlan.html
[8]
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=insert_epi8&expand=3145
[9]
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=extract_epi8&expand=3145,2432

Craig Topper via llvm-dev

2020-May-18 18:24 UTC

head link

[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions

I can tell you that your avx512 issue is that v64i8 gfni instructions also
require avx512bw to be enabled to make v64i8 a supported type. The C
intrinsics handling in the front end know this rule. But since you
generated your own intrinsics you bypassed that.

On Mon, May 18, 2020 at 6:58 AM Adrien Guinet via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> Hello everyone,
>
> On the last couple of days, I have been experimenting with teaching LLVM
> how to combine a
> set of affine instructions into an instruction that uses the GFNI [1]
> AVX512 extension,
> especially GF2P8AFFINEQB [2]. While the general idea seems to work, I have
> some questions
> about my current implementation (see below). FTR, I have named this
> transformation
> AffineCombineExpr (ACE).
>
> Let's first introduce the general idea, which is to transform code like
> this:
>
> static uint8_t rol2(uint8_t v) {
>   return (v<<2)^(v>>6);
> }
> uint8_t func(uint8_t v) {
>   v = rol2(v);
>   v ^= 0xAA;
>   return v;
> }
>
> into this:
>
> define zeroext i8 @func(i8 zeroext %v) local_unnamed_addr #0 {
>   %0 = insertelement <16 x i8> undef, i8 %v, i64 0
>   %1 = call <16 x i8> @llvm.x86.vgf2p8affineqb.128(<16 x i8>
%0, <16 x i8>
> bitcast (<2 x
> i64> <i64 4647715923615551520, i64 undef> to <16 x i8>), i8
-86)
>   %2 = extractelement <16 x i8> %1, i64 0
>   ret i8 %2
> }
>
> (if that's profitable, which might not be the case here, see below)
>
> Another more interesting example where we could see potential benefits is
> this one:
>
>
https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-21dd247f3b8aa49860ae8122fe3ea698R22
>
> This gets even more interesting with vectorized code, with an example here:
>
> * original C code: https://pastebin.com/4JjF7DPu
> * LLVM IR after clang -O2 -mgfni -mavx2: https://pastebin.com/Ti0Vm2gj [3]
> * LLVM IR after ACE (using opt -aggressive-instcombine -S):
> https://pastebin.com/2zFU7J6g
> (interesting things happened at line 67)
>
> If, like me, you don't have a GFNI-enabled CPU, you can use Intel SDE
[4]
> to run the
> compiled code.
>
> The code of the pass is available here:
>
>
>
https://github.com/aguinet/llvm-project/blob/feature/gfni_combine/llvm/lib/Transforms/AggressiveInstCombine/AffineExprCombine.cpp
>
> And there are test cases here:
>
>
https://github.com/aguinet/llvm-project/tree/feature/gfni_combine/llvm/test/Transforms/AggressiveInstCombine
> (aec_*.ll)
>
> Questions
> ========>
> The high-level view of the algorithm is the following:
>
> a) gather, from a basic block, suites of instructions that process an
> 8-bit integer using
> affine instructions, and generate another 8-bit integer.
> b) compute the linear matrix and affine constant related to that set of
> instructions
> c) emit the GFNI instructions
>
> Even thought the current code showcases the idea, there are quite a few
> things I'm unhappy
> with and would like some advice:
>
> 1) about a): what we want is an analysis that can gather, from a given
> basic block, the
> largest DAG*s* of instructions based on a given predicate (in our case: is
> this an affine
> transformation?), that process an 8-bit value and output another 8-bit
> value.
>
> After looking at {Aggressive,}InstCombine, I hadn't find exactly what I
> wanted, so I've
> rewritten something from scratch to validate the overall idea. But is
> there some facility
> within LLVM I could reuse for this purpose? This feels like for instance
> the same kind of
> analyses that might be already done in ScheduleDAG (?).
>
> 2) profitability: according to [6], the latency of the GFNI instructions
> is 3 cycles. and
> in the general case insertelement and extractelement also maps to 3-cycle
> latency
> instructions [8] [9]. This makes the whole replacement a latency of 9
> cycles (in the
> scalar case). Is there any example on how can I compare this 9 cycles
> latency against the
> set of instructions I am replacing against? What I do for now is really
> simple and I think
> far from reality (see
>
>
https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R115
> )
>
> 3) loop vectorization: related to 2), it seems that we could generate more
> efficient code
> if this instruction combining process would happen directly within the
> loop vectorization
> algorithm. Indeed, we could benefit from the cost model analysis that
> already exists
> there, and also tweak the loop unrolling factor to better hide this
> latency of 3 cycles.
>
> From the documentation [7], it looks like this need to happen in the VPlan
> representation,
> but I've had a hard time figuring out where I should plug myself in. Is
> there any example
> that showcases instruction combining within this representation?
>
> 4) I inserted this transformation into AggressiveInstCombine, because it
> is indeed on
> paper a combination of instructions, and the analysis to make it work can
> be quite costly.
> That being said, it seems to be ran too early in the pipeline, and just
> running clang -03
> -mgfni does not combine anything at all (I still got to investigate this).
> It might be
> linked to 1) and the fact that my analysis is very naïve, and assume some
> "clean" and
> optimized LLVM IR.
> Question is: where does that kind of transformation should happen in the
> current
> optimization pipeline?
>
> Current limitations/issues
> =========================>
> 1) compile-time performances: I haven't run benchmarks yet to see the
> compile-cost of
> this. Some efforts have been made on this preliminary version to drop as
> early as possible
> basic blocks that does not seem interesting [5], but that deserves more
> work
>
> 2) run-time performances: if someone has a GFNI-enabled CPU and can ran
> some benchmarks
> (for instance on the aec_vec.ll one), that could be very interesting :)
>
> 3) if we run the "vectorization" test above with avx512 (and thus
32-bytes
> vectors), we
> generate the LLVM IR here: https://pastebin.com/Rwn43N4x, but llc crashes
> with this
> message: https://pastebin.com/bbSZPFe5 . Am I doing something wrong or is
> there an actual
> issue in the X86 backend?
>
> 4) for now we limit ourselves to 8x8 functions, but there are chances we
> could extend this
> to bigger inputs/outputs (eg. 32x32 for CRC32-like functions would be nice)
>
> Thanks for any help!
>
> Regards,
>
> [1] https://en.wikipedia.org/wiki/AVX-512#GFNI
> [2]
>
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gf2p&expand=2901
> [3] if you wonder why not -mavx512f , see section about current issues
> below
> [4]
>
>
https://software.intel.com/content/www/us/en/develop/articles/intel-software-development-emulator.html
> [5]
>
>
https://github.com/aguinet/llvm-project/commit/9ed424cbac0fe3566f801167e2190fad5ad07507#diff-3a29c490bdd8d147d4044818c2da0509R308
> [6]
>
>
https://rizmediateknologi.blogspot.com/2019/08/the-ice-lake-benchmark-preview-inside_1.html
> [7] https://llvm.org/docs/Proposals/VectorizationPlan.html
> [8]
>
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=insert_epi8&expand=3145
> [9]
>
>
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=extract_epi8&expand=3145,2432
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>-- 
~Craig
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
<http://lists.llvm.org/pipermail/llvm-dev/attachments/20200518/91f55bba/attachment.html>

Adrien Guinet via llvm-dev

2020-May-18 18:29 UTC

head link

[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions

On 5/18/20 8:24 PM, Craig Topper wrote:> I can tell you that your avx512 issue is that v64i8 gfni instructions also
> require avx512bw to be enabled to make v64i8 a supported type. The C
> intrinsics handling in the front end know this rule. But since you
> generated your own intrinsics you bypassed that.
Indeed that's the issue... I was stick with what Intel announces here
(https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=gf2p&expand=2907),
but
I guess I should have checked the C intrinsics.

I will fix my code to verify the presence of avx512bw if I ever need v64i8.

Thanks for the hint!

Apparently Analagous Threads

Search for more possibly parallel threads

llvm dev - May 2020 - Use Galois field New Instructions (GFNI) to combine affine instructions

[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions

[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions

[llvm-dev] Use Galois field New Instructions (GFNI) to combine affine instructions

Apparently Analagous Threads