thr3ads.net - similar to: "[LLVMdev] Modifications to SLP"

Displaying 20 results from an estimated 9000 matches similar to: "[LLVMdev] Modifications to SLP"

Data structure improvement for the SLP vectorizer

2017 Mar 15

Data structure improvement for the SLP vectorizer

Maybe it would illustrative to give an IR example of the case I'm interested in. Consider define void @"julia_transform_bvn_derivs_hessian!"(double* %data, double* %data2, double *%data3, double *%out) { %element11 = getelementptr inbounds double, double* %data, i32 1 %load10 = load double, double* %data %load11 = load double, double* %element11 %element21 =

[LLVMdev] loop vectorizer erroneously finds 256 bit vectors

2013 Nov 10

[LLVMdev] loop vectorizer erroneously finds 256 bit vectors

The loop vectorizer is doing an amazing job so far. Most of the time. I just came across one function which led to unexpected behavior: On this function the loop vectorizer finds a 256 bit vector as the wides vector type for the x86-64 architecture. (!) This is strange, as it was always finding the correct size of 128 bit as the widest type. I isolated the IR of the function to check if this is

[LLVMdev] loop vectorizer erroneously finds 256 bit vectors

2013 Nov 10

[LLVMdev] loop vectorizer erroneously finds 256 bit vectors

I looked more into this. For the previously sent IR the vector width of 256 bit is found mistakenly (and reproducibly) on this hardware: model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz For the same IR the loop vectorizer finds the correct vector width (128 bit) on: model name : Intel(R) Xeon(R) CPU E5630 @ 2.53GHz model name : Intel(R) Core(TM) i7 CPU M 640 @

[LLVMdev] loop vectorizer erroneously finds 256 bit vectors

2013 Nov 10

[LLVMdev] loop vectorizer erroneously finds 256 bit vectors

Hi Frank, I'm not an Intel expert, but it seems that your Xeon E5 supports AVX, which does have 256-bit vectors. The other two only supports SSE instructions, which are only 128-bit long. cheers, --renato On 10 November 2013 06:05, Frank Winter <fwinter at jlab.org> wrote: > I looked more into this. For the previously sent IR the vector width of > 256 bit is found mistakenly

Data structure improvement for the SLP vectorizer

2017 Mar 15

Data structure improvement for the SLP vectorizer

There was some discussion of this on the llvm-commits list, but I wanted to raise the topic for discussion here. The background of the -commits discussion was that r296863 added the ability to sort memory access when the SLP vectorizer reached a load (the SLP vectorizer starts at a store or some other sink, and tries to go up the tree vectorizing as it goes along - if the input is in a different

[LLVMdev] bb-vectorizer transforms only part of the block

2015 Jun 22

[LLVMdev] bb-vectorizer transforms only part of the block

The loads, stores and float arithmetic in attached function should be completely vectorizable. The bb-vectorizer does a good job at first, but from instruction %96 on it messes up by adding unnecessary vectorshuffles. (The function was designed so that no shuffle would be needed in order to vectorize it). I tested this with llvm 3.6 with the following command:

[LLVMdev] loop vectorizer misses opportunity, exploit

2013 Oct 31

[LLVMdev] loop vectorizer misses opportunity, exploit

----- Original Message ----- > > Hi Nadav, > > that's the whole point of it. I can't in general make the index > calculation simpler. The example given is the simplest non-trivial > index function that is needed. It might well be that it's that > simple that the index calculation in this case can be thrown aways > altogether and - as you say - be replaced by

[LLVMdev] loop vectorizer: Unexpected extract/insertelement

2013 Nov 06

[LLVMdev] loop vectorizer: Unexpected extract/insertelement

The following IR implements the following nested loop: for (int i = start ; i < end ; ++i ) for (int p = 0 ; p < 4 ; ++p ) a[i*4+p] = b[i*4+p] + c[i*4+p]; define void @main(i64 %arg0, i64 %arg1, i1 %arg2, i64 %arg3, float* noalias %arg4, float* noalias %arg5, float* noalias %arg6) { entrypoint: br i1 %arg2, label %L0, label %L1 L0:

[LLVMdev] How to broaden the SLP vectorizer's search

2014 Aug 07

[LLVMdev] How to broaden the SLP vectorizer's search

On 7 August 2014 17:33, Chad Rosier <mcrosier at codeaurora.org> wrote: > You might consider filing a bug (llvm.org/bugs) requesting a flag, but I > don't know if the code owners want to expose such a flag. I'm not sure that's a good idea as a raw access to that limit, as there are no guarantees that it'll stay the same. But maybe a flag turning some

[LLVMdev] loop vectorizer: Unexpected extract/insertelement

2013 Nov 06

[LLVMdev] loop vectorizer: Unexpected extract/insertelement

The loop vectorizer relies on cleanup passes to be run after it: from Transforms/IPO/PassManagerBuilder.cpp: // Add the various vectorization passes and relevant cleanup passes for // them since we are no longer in the middle of the main scalar pipeline. MPM.add(createLoopVectorizePass(DisableUnrollLoops)); MPM.add(createInstructionCombiningPass());

[LLVMdev] loop vectorizer: Unexpected extract/insertelement

2013 Nov 06

[LLVMdev] loop vectorizer: Unexpected extract/insertelement

The instcombine pass cleans up a lot. Any idea why there are still shufflevector, insertelement, *and* bitcast (!!) etc. instructions left? The original loop is so clean, a textbook example I'd say. There is no need to shuffle anything.At least I don't see it. Frank vector.ph: ; preds = %L5 %broadcast.splatinsert1 = insertelement <4 x

[LLVMdev] loop vectorizer: this loop is not worth vectorizing

2013 Nov 01

[LLVMdev] loop vectorizer: this loop is not worth vectorizing

In the case when coming from C it was probably the loop unroller and SLP vectorizer which vectorized the code. Potentially I could do the same in the IR. However, the loop body that is generated in the IR can get very large. Thus, the loop unroller will refuse to unroll the loop in a large number of (important) cases. Isn't there a way to convince the loop vectorizer that it should

[LLVMdev] loop vectorizer: this loop is not worth vectorizing

2013 Nov 01

[LLVMdev] loop vectorizer: this loop is not worth vectorizing

I am trying a setup where the one loop is rewritten as two loops. This avoids the 'rem' and 'div' instructions in the index calculation (which give the loop vectorizer a hard time). However, with this setup the loop vectorizer complains about a too small loop. LV: Checking a loop in "main" LV: Found a loop: L3 LV: Found a loop with a very small trip count. This loop

[LLVMdev] SIMD instructions and memory alignment on X86

2013 Jul 18

[LLVMdev] SIMD instructions and memory alignment on X86

Are you able to send any IR for others to reproduce this issue? On Wed, Jul 17, 2013 at 11:23 PM, Peter Newman <peter at uformia.com> wrote: > Unfortunately, this doesn't appear to be the bug I'm hitting. I applied > the fix to my source and it didn't make a difference. > > Also further testing found me getting the same behavior with other SIMD > instructions.

[LLVMdev] SIMD instructions and memory alignment on X86

2013 Jul 18

[LLVMdev] SIMD instructions and memory alignment on X86

Unfortunately, this doesn't appear to be the bug I'm hitting. I applied the fix to my source and it didn't make a difference. Also further testing found me getting the same behavior with other SIMD instructions. The common factor is in each case, ECX is set to 0x7fffffff, and it's an operation using xmm ptr ecx+offset . Additionally, turning the optimization level passed to

[LLVMdev] SLP vectorizer on AVX feature

2015 Jul 01

[LLVMdev] SLP vectorizer on AVX feature

Frank, It sounds like the SLP vectorizer thinks that it is more profitable to use 128bit wide operations (because 256bit operations are double pumped on Sandybridge). Did you see a different result on Haswell? Thanks, Nadav > On Jul 1, 2015, at 11:06 AM, Frank Winter <fwinter at jlab.org> wrote: > > I realized that the function parameters had no alignment attributes on them.

[LLVMdev] SLP vectorizer on AVX feature

2015 Jul 01

[LLVMdev] SLP vectorizer on AVX feature

Hi Frank, What does --debug-only=vectorize says? You may try to get the datalayout and the triple on the IR header, just to make sure you got everything right. LLVM will honour those, and front-ends should create them correctly. --renato On 1 July 2015 at 19:06, Frank Winter <fwinter at jlab.org> wrote: > I realized that the function parameters had no alignment attributes on them.

[LLVMdev] Bug 16257 - fmul of undef ConstantExpr not folded to undef

2014 Sep 22

[LLVMdev] Bug 16257 - fmul of undef ConstantExpr not folded to undef

Hi Duncan, On 17.09.2014 21:10, Duncan Sands wrote: > Hi Oleg, > > On 17/09/14 18:45, Oleg Ranevskyy wrote: >> Hi, >> >> Thank you for all your helpful comments. >> >> To sum up, below is the list of correct folding examples for fadd: >> (1) fadd %x, -0.0 -> %x >> (2) fadd undef, undef -> undef

[LLVMdev] loop vectorizer erroneously finds 256 bit vectors

2013 Nov 10

[LLVMdev] loop vectorizer erroneously finds 256 bit vectors

Hi Renato, you are right! There is 'avx' support: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 x2apic popcnt aes xsave

Fusing contract fadd/fsub with normal fmul

2017 Jun 10

Fusing contract fadd/fsub with normal fmul

Hi, On LLVM 5.0 (current trunk), fadd/fsub and fmul that are both marked with `contract` or `fast` can be merged to a fma instruction by the backend. I'm wondering about the exact semantic of this new flag as well as `fast` and in particular, would it be valid to do this when only the `fadd`/`fsub` (and not the `fmul`) is marked with `contract` or at least `fast`. The reasoning is that doing

similar to: [LLVMdev] Modifications to SLP