thr3ads.net - llvm dev - [LLVMdev] SLP vectorizer on AVX feature [Jul 2015]

Renato Golin

2015-Jul-01 20:18 UTC

[LLVMdev] SLP vectorizer on AVX feature

Hi Frank,

What does --debug-only=vectorize say?

You could try putting the datalayout and the triple in the IR header,
just to make sure you got everything right. LLVM will honour those,
and front-ends should create them correctly.

--renato

On 1 July 2015 at 19:06, Frank Winter <fwinter at jlab.org> wrote:
> I realized that the function parameters had no alignment attributes on them.
> However, even adding an alignment suitable for aligned loads on YMM, i.e. 32
> bytes, didn't convince the vectorizer to use [8 x float].
>
> define void @main(i64 %lo, i64 %hi, float* noalias align 32 %arg0, float*
> noalias align 32 %arg1, float* noalias align 32 %arg2) {
> ...
>
> still results in code using only [4 x float].
>
> Thanks,
> Frank
>
>
>
> On 07/01/2015 10:51 AM, Frank Winter wrote:
>>
>> I seem to have a problem getting the SLP vectorizer to make use of the full 8
>> floats available in a SIMD vector on a Sandy Bridge CPU with AVX. The
>> function is attached, the CPU flags are:
>>
>> flags        : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov
>> pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
>> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc
>> aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16
>> xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave avx lahf_lm ida arat epb
>> xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid
>>
>> I use LLVM 3.6 checked out yesterday
>>
>> ~/toolchain/install/llvm-3.6/bin/opt -datalayout -basicaa -slp-vectorizer
>> -instcombine < func_4x4x4_scalar_p_scalar.ll -S
>>
>> the output goes like:
>>
>> ; ModuleID = '<stdin>'
>>
>> define void @main(i64 %lo, i64 %hi, float* noalias %arg0, float* noalias
>> %arg1, float* noalias %arg2) {
>> entrypoint:
>>   %0 = bitcast float* %arg1 to <4 x float>*
>>   %1 = load <4 x float>* %0, align 4
>>   %2 = bitcast float* %arg2 to <4 x float>*
>>   %3 = load <4 x float>* %2, align 4
>>   %4 = fadd <4 x float> %3, %1
>>   %5 = bitcast float* %arg0 to <4 x float>*
>>   store <4 x float> %4, <4 x float>* %5, align 4
>> ....
>>
>> So, it could make use of <8 x float> available in that machine. But it
>> doesn't. Then I thought, that maybe the YMM registers get used when lowering
>> the IR to machine code. However, the generated assembly doesn't seem to
>> support this assumption :-(
>>
>>
>> main:
>>     .cfi_startproc
>>     xorl    %eax, %eax
>>     xorl    %esi, %esi
>>     .align    16, 0x90
>> .LBB0_1:
>>     vmovups    (%r8,%rax), %xmm0
>>     vaddps    (%rcx,%rax), %xmm0, %xmm0
>>     vmovups    %xmm0, (%rdx,%rax)
>>     addq    $4, %rsi
>>     addq    $16, %rax
>>     cmpq    $61, %rsi
>>     jb    .LBB0_1
>>     retq
>>
>> I played with -mcpu and -march switches without success. In any case, the
>> target architecture should be detected with the -datalayout pass, right?
>>
>> Any idea what I am missing?
>>
>> Frank
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
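
The attached func_4x4x4_scalar_p_scalar.ll is not reproduced in the thread. Purely as an illustration, the kind of straight-line scalar code the SLP vectorizer looks for might resemble the C++ sketch here: eight independent float adds on consecutive addresses, which could in principle be packed into one <8 x float> operation on AVX. The function name add8 and its signature are assumptions made for this example, not Frank's code; __restrict is a compiler extension (Clang, GCC, MSVC) standing in for the noalias attributes in his IR.

```cpp
#include <cassert>

// Hypothetical stand-in for the attached kernel (the real .ll file is not in
// the thread). __restrict plays the role of the noalias parameters in the IR.
void add8(float* __restrict out, const float* __restrict a,
          const float* __restrict b) {
    assert(out && a && b);
    // Eight isomorphic, independent operations on adjacent elements: the
    // pattern an SLP vectorizer tries to bundle into one vector instruction.
    out[0] = a[0] + b[0];
    out[1] = a[1] + b[1];
    out[2] = a[2] + b[2];
    out[3] = a[3] + b[3];
    out[4] = a[4] + b[4];
    out[5] = a[5] + b[5];
    out[6] = a[6] + b[6];
    out[7] = a[7] + b[7];
}
```

Whether a given LLVM build bundles these into one 8-wide group or two 4-wide groups is exactly the question raised in the thread; checking the generated assembly for ymm versus xmm registers, as Frank did, is one way to tell.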

Frank Winter

2015-Jul-01 20:22 UTC

[LLVMdev] SLP vectorizer on AVX feature

Hi Renato,

there were two follow-up emails. The issue is solved. The SLP vectorizer 
has a magic number built into the code which determines the max. vector 
length to search for. That was set to 128 bits. Increasing it to 256 
bits solved the issue.

For inconsistency reasons it must be '--debug-only=SLP' and the output 
can be found in one of the follow-up emails.

Thanks,
Frank
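
To make the arithmetic behind this fix concrete: a float is 32 bits wide, so a 128-bit cap allows 4 lanes and a 256-bit cap allows 8. What follows is a minimal C++ model of that relationship, not the SLP vectorizer's actual source. The names lanesFor and widestGroup are hypothetical, invented for this sketch, and the thread does not give the name or location of the real constant.

```cpp
#include <cassert>

// Hypothetical helper: how many elements of elemBits fit in a register of
// regBits (e.g. 128-bit XMM or 256-bit YMM).
constexpr unsigned lanesFor(unsigned regBits, unsigned elemBits) {
    return regBits / elemBits;
}

// Toy model of a width-capped search: starting from the lane count allowed by
// maxRegBits, halve until the group fits the number of available scalars.
unsigned widestGroup(unsigned available, unsigned maxRegBits,
                     unsigned elemBits) {
    assert(elemBits > 0 && "element width must be non-zero");
    unsigned vf = lanesFor(maxRegBits, elemBits);
    while (vf > 1 && vf > available) {
        vf /= 2;
    }
    return vf;
}

static_assert(lanesFor(128, 32) == 4, "XMM holds four floats");
static_assert(lanesFor(256, 32) == 8, "YMM holds eight floats");
```

In this model, widestGroup(16, 128, 32) never proposes more than 4 floats no matter how many adjacent scalars exist, which matches the <4 x float> output Frank saw; with a 256-bit cap the same call yields 8.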



Renato Golin

2015-Jul-01 20:29 UTC

[LLVMdev] SLP vectorizer on AVX feature

On 1 July 2015 at 21:22, Frank Winter <fwinter at jlab.org> wrote:
> there were two follow-up emails.
I only got one... weird...

> The issue is solved. The SLP vectorizer has
> a magic number built into the code which determines the max. vector length
> to search for. That was set to 128 bits. Increasing it to 256 bits solved
> the issue.
That looks like a simple fix. Is it upstream yet? :)

> For inconsistency reasons it must be '--debug-only=SLP' and the output can
> be found in one of the follow-up emails.
Of course. Maybe "vectorize" should cover all of them? Anyway,
that's unrelated.

cheers,
--renato

cbergstrom at pathscale.com

2015-Jul-01 20:30 UTC

[LLVMdev] SLP vectorizer on AVX feature

Is there a patch that will get upstreamed?

