thr3ads.net - llvm dev - [llvm-dev] LLVM 11 and trunk selecting 4 wide instead of 8 wide loop vectorization for AVX-enabled target [Jul 2020]

Neil Henning via llvm-dev

2020-Jul-16 18:54 UTC

[llvm-dev] LLVM 11 and trunk selecting 4 wide instead of 8 wide loop vectorization for AVX-enabled target

So for us we use SLEEF to actually implement the libcalls (LLVM intrinsics)
that LLVM by default would generate - and since SLEEF has highly optimal
8-wide pow, optimized for AVX and AVX2, we really want to use that.

So we would not see 4/8 libcalls and instead see 1 call to something that
lights up the ymm registers. I guess the problem then is that the default
expectation is that pow would be implemented using N scalar libcalls?

Cheers,
-Neil.

On Thu, Jul 16, 2020 at 6:08 PM Sanjay Patel <spatel at rotateright.com>
wrote:
> The debug spew for loop vectorization shows:
> LV: Found an estimated cost of 49 for VF 4 For instruction:   %14 = tail call float @llvm.pow.f32(float %10, float %13)
> LV: Vector loop of width 4 costs: 13.
>
> LV: Found an estimated cost of 107 for VF 8 For instruction:   %14 = tail call float @llvm.pow.f32(float %10, float %13)
> LV: Vector loop of width 8 costs: 14.
> LV: Selecting VF: 4.
>
> So rounding of the integer division could be to blame?
>
> But before we focus on that, there's a lot of hand-waving involved in
> creating these costs beginning with the base cost implementation:
>     unsigned SingleCallCost = 10; // Library call cost. Make it expensive.
>
> But before we focus on that... :)
>
> Are we modeling the right thing? I.e., are you not expecting to see 4 or 8
> libcalls when the vector pow call gets expanded on this example? If we are
> doing those libcalls, then it's not clear to me how anything else in the
> loop matters for performance.
>
> On Thu, Jul 16, 2020 at 10:20 AM Neil Henning via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> Tried a bunch of them there (x86-64, haswell, znver2) and they all
>> defaulted to 4-wide - haswell additionally caused some extra loop unrolling
>> but still with 8-wide pows.
>>
>> Cheers,
>> -Neil.
>>
>> On Thu, Jul 16, 2020 at 2:39 PM Roman Lebedev <lebedev.ri at gmail.com>
>> wrote:
>>
>>> Did you specify the target CPU the code should be optimized for?
>>> For clang that is -march=native/znver2/... / -mtune=<same>
>>> For opt/llc that is --mcpu=<same>
>>> I would expect that by default, some generic baseline is picked.
>>>
>>> On Thu, Jul 16, 2020 at 4:25 PM Neil Henning via llvm-dev <
>>> llvm-dev at lists.llvm.org> wrote:
>>>
>>>> Hey list,
>>>>
>>>> I've recently done the first test run of bumping our Burst compiler
>>>> from LLVM 10 -> 11 now that the branch has been cut, and have noticed an
>>>> apparent loop vectorization codegen regression for X86 with AVX or AVX2
>>>> enabled. The following IR example is vectorized to 4 wide with LLVM 11 and
>>>> trunk whereas in LLVM 10 it (correctly as per what we want) vectorized it 8
>>>> wide matching the ymm registers.
>>>>
>>>> ; ModuleID = '../test.ll'
>>>> source_filename = "main"
>>>> target datalayout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
>>>> target triple = "x86_64-pc-windows-msvc-coff"
>>>>
>>>> %"Burst.Compiler.IL.Tests.VectorsMaths/FloatPointer.0" = type { float*, i32, [4 x i8] }
>>>>
>>>> ; Function Attrs: nofree
>>>> define dllexport void @func(float* noalias nocapture %output, %"Burst.Compiler.IL.Tests.VectorsMaths/FloatPointer.0"* nocapture nonnull readonly dereferenceable(16) %a, %"Burst.Compiler.IL.Tests.VectorsMaths/FloatPointer.0"* nocapture nonnull readonly dereferenceable(16) %b) local_unnamed_addr #0 !ubaa. !1 {
>>>> entry:
>>>>   %0 = getelementptr %"Burst.Compiler.IL.Tests.VectorsMaths/FloatPointer.0", %"Burst.Compiler.IL.Tests.VectorsMaths/FloatPointer.0"* %a, i64 0, i32 1
>>>>   %1 = load i32, i32* %0, align 1
>>>>   %.not = icmp eq i32 %1, 0
>>>>   br i1 %.not, label %BL.0042, label %BL.0005.lr.ph
>>>>
>>>> BL.0005.lr.ph:                                    ; preds = %entry
>>>>   %2 = bitcast %"Burst.Compiler.IL.Tests.VectorsMaths/FloatPointer.0"* %a to i8**
>>>>   %3 = load i8*, i8** %2, align 1
>>>>   %4 = bitcast %"Burst.Compiler.IL.Tests.VectorsMaths/FloatPointer.0"* %b to i8**
>>>>   %5 = load i8*, i8** %4, align 1
>>>>   %wide.trip.count = zext i32 %1 to i64
>>>>   br label %BL.0005
>>>>
>>>> BL.0005:                                          ; preds = %BL.0005, %BL.0005.lr.ph
>>>>   %indvars.iv = phi i64 [ 0, %BL.0005.lr.ph ], [ %indvars.iv.next, %BL.0005 ]
>>>>   %6 = shl nuw nsw i64 %indvars.iv, 2
>>>>   %7 = getelementptr float, float* %output, i64 %indvars.iv
>>>>   %8 = getelementptr i8, i8* %3, i64 %6
>>>>   %9 = bitcast i8* %8 to float*
>>>>   %10 = load float, float* %9, align 4
>>>>   %11 = getelementptr i8, i8* %5, i64 %6
>>>>   %12 = bitcast i8* %11 to float*
>>>>   %13 = load float, float* %12, align 4
>>>>   %14 = tail call float @llvm.pow.f32(float %10, float %13)
>>>>   store float %14, float* %7, align 4
>>>>   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
>>>>   %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
>>>>   br i1 %exitcond.not, label %BL.0042, label %BL.0005
>>>>
>>>> BL.0042:                                          ; preds = %BL.0005, %entry
>>>>   ret void
>>>> }
>>>>
>>>> ; Function Attrs: norecurse readnone
>>>> define dllexport void @burst.initialize(i8* (i8*)* nocapture readnone %callback) local_unnamed_addr #1 !ubaa. !0 {
>>>> entry:
>>>>   ret void
>>>> }
>>>>
>>>> ; Function Attrs: nounwind readnone speculatable willreturn
>>>> declare float @llvm.pow.f32(float, float) #2
>>>>
>>>> attributes #0 = { nofree }
>>>> attributes #1 = { norecurse readnone }
>>>> attributes #2 = { nounwind readnone speculatable willreturn }
>>>>
>>>> !ubaa.Burst.Compiler.IL.Tests.VectorsMaths\2FFloatPointer.0 = !{!0, !0, !0, !0}
>>>>
>>>> !0 = !{i1 false}
>>>> !1 = !{i1 true, i1 false, i1 false}
>>>>
>>>> If I run this with ../llvm-project/llvm/build/bin/opt.exe -o - -S -O3
>>>> ../avx_sad_4.ll -mattr=avx -debug, I can see that the loop vectorizer
>>>> correctly considers using 8-wide ymm registers for this, but has decided
>>>> that the 4-wide variant is cheaper based on some cost modelling I don't
>>>> understand.
>>>>
>>>> So is this expected behaviour? I know there were some cost model changes
>>>> in the 10->11 timeframe.
>>>>
>>>> Thanks for any help,
>>>>
>>>> Cheers,
>>>> -Neil.
>>>>
>>> Roman
>>>
>>>
>>>> --
>>>> Neil Henning
>>>> Senior Software Engineer Compiler
>>>> unity.com
-- 
Neil Henning
Senior Software Engineer Compiler
unity.com
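
A note on the rounding question raised in the quoted debug output above. The
sketch below is not LLVM code. It assumes that each "Vector loop of width N
costs: C" line reports the integer part of (total loop cost / N) and that the
total is an integer; both are assumptions, and the helper name is made up for
illustration. Under those assumptions it brackets the exact per-lane cost for
each width.

```python
from fractions import Fraction


def per_lane_cost_range(reported, vf):
    """Smallest and largest exact per-lane cost that would print as `reported`.

    Assumption: the debug line shows floor(total_cost / vf) and total_cost
    is an integer, so total_cost lies in [reported*vf, reported*vf + vf - 1].
    """
    lowest_total = reported * vf
    highest_total = reported * vf + vf - 1
    return Fraction(lowest_total, vf), Fraction(highest_total, vf)


vf4 = per_lane_cost_range(13, 4)  # "Vector loop of width 4 costs: 13."
vf8 = per_lane_cost_range(14, 8)  # "Vector loop of width 8 costs: 14."

# True when even the most expensive VF 4 candidate is cheaper per lane
# than the least expensive VF 8 candidate.
vf4_always_cheaper = vf4[1] < vf8[0]
print(vf4, vf8, vf4_always_cheaper)
```

If those assumptions hold, the two ranges do not overlap, so truncation alone
would not flip the choice between VF 4 and VF 8; the pow cost estimates (49
versus 107) would be what drives it.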

Sanjay Patel via llvm-dev

2020-Jul-16 19:11 UTC


Right - the vectorizer doesn't know that we have SLEEF, so the cost model
is assuming the pow gets expanded.

I'm not familiar with the status of SLEEF support, but we have support for
other veclibs in TargetLibraryInfo. So we can see that we are willing to
generate an <8 x float> call if that is known to be supported:
$ ./opt -loop-vectorize vec4.ll -S -vector-library=SVML -mattr=avx | grep pow
  %50 = call <8 x float> @__svml_powf8(<8 x float> %wide.load, <8 x float> %wide.load4)
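
To make "the cost model is assuming the pow gets expanded" concrete, here is
a rough sketch of what the two reported pow costs would imply. It is not the
LLVM implementation. It assumes the expanded call is priced as one scalar
libcall per lane at SingleCallCost = 10 (the value quoted earlier in the
thread) plus a lump overhead for moving lanes in and out of vectors; that
decomposition and the function names are assumptions for illustration.

```python
SINGLE_CALL_COST = 10  # value quoted from the base cost implementation


def scalarized_call_cost(vf, overhead):
    """Assumed model: vf scalar libcalls plus a lump packing overhead."""
    return vf * SINGLE_CALL_COST + overhead


def implied_overhead(reported_cost, vf):
    """Whatever is left of the reported cost after the vf scalar libcalls."""
    return reported_cost - vf * SINGLE_CALL_COST


overhead_vf4 = implied_overhead(49, 4)    # reported cost of pow at VF 4
overhead_vf8 = implied_overhead(107, 8)   # reported cost of pow at VF 8
print(overhead_vf4, overhead_vf8)
```

Under this model the leftover overhead grows faster than the lane count when
going from VF 4 to VF 8, which would make the wider loop look more expensive
per lane whenever no vector pow is known to the cost model.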




Neil Henning via llvm-dev

2020-Jul-17 08:01 UTC


So we already have our own custom TargetLibraryInfo for the SLEEF functions
that LLVM does not have matching intrinsics for, but we were using the LLVM
intrinsics where possible so that during the rest of the optimization
passes it'd understand that SLEEF's pow acts for all intents and purposes
like the intrinsic pow, and it can optimize based on that.

If the default X86 behaviour is to assume scalar replacement with libcalls
then I guess all I can do is not use intrinsic pow? Sucks that we'd lose
out on all the benefits of using the intrinsic :(

Cheers,
-Neil.


Florian Hahn via llvm-dev

2020-Jul-17 11:09 UTC


> On 16 Jul 2020, at 19:54, Neil Henning via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> So for us we use SLEEF to actually implement the libcalls (LLVM intrinsics)
> that LLVM by default would generate - and since SLEEF has highly optimal 8-wide
> pow, optimized for AVX and AVX2, we really want to use that.

Right, the way vector versions of library functions are accessed by the
vectoriser has changed since the last release. I think the initial patch was
https://reviews.llvm.org/D70107.

Vector functions now must be annotated with a vector-function-abi-variant
function attribute. There’s the -inject-tli-mappings pass, that is supposed to
add the attributes for vector functions from TLI. It seems like this is
currently not happening for your custom TLI mappings for some reason.

For example, the Accelerate library has a vector version of log10. Running
`opt -vector-library=Accelerate -inject-tli-mappings` on the IR below will add
the following attribute to the llvm.log10 call-site, indicating that there’s a
<4 x float> version of log10 called vlog10f.

{ "vector-function-abi-variant"="_ZGV_LLVM_N4v_llvm.log10.f32(vlog10f)" }

To double-check: if running -inject-tli-mappings on your example does not add
the vector-function-abi-variant attribute for `pow`, the vectorisers won’t know
about the vector variants. If the vector-function-abi-variant attribute is
actually created, but the vector version is not used nonetheless, it would be
great if you could share the IR with the attributes, as they depend on the
downstream TLI.

I am also CC’ing Francesco, who might be able to help you pin down where
exactly things go wrong with the mapping.

Cheers,
Florian

——

define float @call_llvm.log10.f32(float %in) {
  %call = tail call float @llvm.log10.f32(float %in)
  ret float %call
}

declare float @llvm.log10.f32(float)
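
For readers unfamiliar with the attribute string in the message above, the
following sketch splits it into its parts and builds the analogous string for
an 8-wide pow. It is a simplified illustration rather than LLVM's own parser:
it handles only fixed vector lengths and assumes the layout
_ZGV<isa><mask><vlen><parameters>_<scalar name>(<vector name>). The helper
names and the vector function name my_vec_powf8 are hypothetical
placeholders, not real SLEEF or LLVM symbols.

```python
import re
from typing import NamedTuple


class VariantMapping(NamedTuple):
    isa: str          # "_LLVM_" marks a mapping internal to LLVM
    masked: bool      # "M" = masked variant, "N" = unmasked
    vf: int           # number of lanes
    params: str       # one token per parameter, "v" = vector parameter
    scalar_name: str
    vector_name: str


_VARIANT = re.compile(
    r"_ZGV(?P<isa>_LLVM_|[a-zA-Z])(?P<mask>[NM])(?P<vf>\d+)"
    r"(?P<params>[a-zA-Z0-9]+)_(?P<scalar>[^()]+)\((?P<vector>[^()]+)\)"
)


def parse_variant(mangled):
    """Split a fixed-width variant string into its components."""
    match = _VARIANT.fullmatch(mangled)
    if match is None:
        raise ValueError("unrecognised variant string: " + mangled)
    return VariantMapping(
        isa=match["isa"],
        masked=match["mask"] == "M",
        vf=int(match["vf"]),
        params=match["params"],
        scalar_name=match["scalar"],
        vector_name=match["vector"],
    )


def build_variant(scalar_name, vector_name, vf, num_vector_params,
                  isa="_LLVM_", masked=False):
    """Compose a variant string whose parameters are all plain vectors."""
    mask = "M" if masked else "N"
    params = "v" * num_vector_params
    return f"_ZGV{isa}{mask}{vf}{params}_{scalar_name}({vector_name})"


log10_mapping = parse_variant("_ZGV_LLVM_N4v_llvm.log10.f32(vlog10f)")
# my_vec_powf8 is a made-up name standing in for a downstream 8-wide pow.
pow_variant = build_variant("llvm.pow.f32", "my_vec_powf8", 8, 2)
print(log10_mapping)
print(pow_variant)
```

Read this way, the log10 example says: unmasked, 4 lanes, one vector
parameter, scalar llvm.log10.f32 mapped to vlog10f. A two-argument pow at
8 lanes would carry two "v" tokens and a vector length of 8.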

Neil Henning via llvm-dev

2020-Jul-17 11:51 UTC


Oh interesting - I hadn't even considered registering vector descriptors
for the LLVM intrinsics, but right enough when I just registered that pow
has a vector variant (itself of a bigger size) I got the correct 8-wide
variants like I was expecting - nice!

Thanks for the help!

Cheers,
-Neil.

