similar to: [LLVMdev] loop vectorizer: this loop is not worth vectorizing

Displaying 20 results from an estimated 700 matches similar to: "[LLVMdev] loop vectorizer: this loop is not worth vectorizing"

2013 Nov 01
0
[LLVMdev] loop vectorizer: this loop is not worth vectorizing
In the case when coming from C it was probably the loop unroller and SLP vectorizer which vectorized the code. Potentially I could do the same in the IR. However, the loop body that is generated in the IR can get very large. Thus, the loop unroller will refuse to unroll the loop in a large number of (important) cases. Isn't there a way to convince the loop vectorizer that it should
2013 Oct 30
3
[LLVMdev] loop vectorizer
----- Original Message ----- > > > I ran the BB vectorizer as I guess this is the SLP vectorizer. No, while the BB vectorizer is doing a form of SLP vectorization, there is a separate SLP vectorization pass which uses a different algorithm. You can pass -vectorize-slp to opt. -Hal > > BBV: using target information > BBV: fusing loop #1 for for.body in _Z3barmmPfS_S_...
2013 Oct 30
2
[LLVMdev] loop vectorizer
The debug messages are misleading. They should read “trying to vectorize a list of …”; The problem is that the SCEV analysis is unable to detect that C[ir0] and C[ir1] are consecutive. Is this loop from an important benchmark ? Thanks, Nadav On Oct 30, 2013, at 11:13 AM, Frank Winter <fwinter at jlab.org> wrote: > The SLP vectorizer apparently did something in the prologue of the
2013 Oct 30
0
[LLVMdev] loop vectorizer
The SLP vectorizer apparently did something in the prologue of the function (where storing of arguments on the stack happens) which then got eliminated later on (since I don't see any vector instructions in the final IR). Below the debug output of the SLP pass: Args: opt -O1 -vectorize-slp -debug loop.ll -S SLP: Analyzing blocks in _Z3barmmPfS_S_. SLP: Found 2 stores to vectorize. SLP:
2013 Oct 30
3
[LLVMdev] loop vectorizer
On 30 October 2013 09:25, Nadav Rotem <nrotem at apple.com> wrote: > The access pattern to arrays a and b is non-linear. Unrolled loops are > usually handled by the SLP-vectorizer. Are ir0 and ir1 consecutive for all > values for i ? > Based on his list of values, it seems that the induction stride is linear within each block of 4 iterations, but it's not a clear
2013 Oct 30
0
[LLVMdev] loop vectorizer
I ran the BB vectorizer as I guess this is the SLP vectorizer. BBV: using target information BBV: fusing loop #1 for for.body in _Z3barmmPfS_S_... BBV: found 2 instructions with candidate pairs BBV: found 0 pair connections. BBV: done! However, this was run on the unrolled loop (I guess). Here is the IR printed by 'opt': entry: %cmp9 = icmp ult i64 %start, %end br i1 %cmp9, label
2013 Oct 31
2
[LLVMdev] loop vectorizer
On Oct 30, 2013, at 6:10 PM, Frank Winter <fwinter at jlab.org> wrote: > the only option I see is to unroll the loop by hand. Since the array access is consecutive over 4 loop iterations I gave it a try and unrolled the loop by a factor of 4. Which gives the following array accesses: > > loop iter 0: > index_0 = 0 index_1 = 4 > index_0 = 1 index_1 = 5 > index_0 = 2
2013 Oct 31
0
[LLVMdev] loop vectorizer
>> What needs to be done (on a high level) in order to have the auto vectorizer succeed on the test function as given erlier? > Maybe you could rewrite the loop in a way that will expose contiguous memory accesses. Is this something you could do ? > Hi Nadav, the only option I see is to unroll the loop by hand. Since the array access is consecutive over 4 loop iterations I gave it a
2013 Oct 30
3
[LLVMdev] loop vectorizer
Hi Frank, > We are looking at a variety of target architectures. Ultimately we aim to run on BG/Q and Intel Xeon Phi (native). However, running on those architectures with the LLVM technology is planned in some future. As a first step we would target vanilla x86 with SSE/AVX 128/256 as a proof-of-concept. Great! It should be easy to support these targets. When you said wide-vectors I assumed
2013 Nov 03
2
[LLVMdev] loop vectorizer issue
Hello, I was trying to trace the Loop vectorizer of the LLVM, I wrote a simple loop with a clear dependency. But found that the debug shows that 'we can vectorize this loop' Here you are my loop with dependency: for(int k=20;k<50;k++) dataY[k] = dataY[k-1]; And the debug prints: LV: Checking a loop in "main" LV: Found a loop: for.body4 LV: Found an
2013 Nov 03
0
[LLVMdev] loop vectorizer issue
Notice that the code you provided, for globals and stack allocations, at least, is semantically equivalent to: int a = d[19]; for(int k = 20; k < 50; k++) dataY[k] = a; Like so, the load you see missing was redundant, probably hoisted by GVN/PRE and replaced with "%.pre". H. On Sun, Nov 3, 2013 at 11:26 AM, Sara Elshobaky <sara.elshobaky at gmail.com>wrote: >
2013 Nov 03
3
[LLVMdev] loop vectorizer issue
Actually what I meant in my original loop, that there is a dependency between every two consecutive iterations. So, how the loop vectorizer says 'we can vectorize this loop'? for(int k=20;k<50;k++) dataY[k] = dataY[k-1]; From: Henrique Santos [mailto:henrique.nazare.santos at gmail.com] Sent: Sunday, November 03, 2013 4:28 PM To: Sara Elshobaky Cc: <llvmdev at
2013 Nov 03
0
[LLVMdev] loop vectorizer issue
Hi Sarah, the loop vectorizer runs not on the C code but on LLVM IR this c code was lowered to. Before the loop vectorizer runs many other optimization change the shape of this IR. You can see in the LLVM IR you referenced below, a preceding LLVM IR transformation has change your loop from: > for(int k=20;k<50;k++) > dataY[k] = dataY[k-1]; to > int a = d[19]; >
2015 Mar 19
2
[LLVMdev] Cast to SCEVAddRecExpr
Hi Nick, Thanks for looking into it. I have tried that as well but it didn't worked. "AddExpr->getOperand(0))" node is: " (4 * (sext i32 {2,+,2}<%for.body4> to i64))<nsw>" When I cast this to "SCEVAddRecExpr" it returns NULL. Regards, Ashutosh -----Original Message----- From: Nick Lewycky [mailto:nicholas at mxc.ca] Sent: Thursday, March 19,
2013 Oct 30
0
[LLVMdev] loop vectorizer
Well, they are not directly consecutive. They are consecutive with a constant offset or stride: ir1 = ir0 + 4 If I rewrite the function in this form void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b) { const std::uint64_t inner = 4; for (std::uint64_t i = start ; i < end ; ++i ) { const std::uint64_t ir0 = (
2013 Oct 30
2
[LLVMdev] loop vectorizer
The loop vectorizer seems to be not able to vectorize the following code: void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b) { const std::uint64_t inner = 4; for (std::uint64_t i = start ; i < end ; ++i ) { const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4; const std::uint64_t ir1 = ( (i/inner)
2015 Mar 19
3
[LLVMdev] Cast to SCEVAddRecExpr
Yes, I can get "SCEVAddRecExpr" from operands of "(sext i32 {2,+,2}<%for.body4> to i64)". So whenever SCEV cast to "SCEVAddRecExpr" fails, we have drill down for such patterns ? Is that the right way ? Regards, Ashutosh -----Original Message----- From: Nick Lewycky [mailto:nicholas at mxc.ca] Sent: Thursday, March 19, 2015 1:02 PM To: Nema, Ashutosh Cc:
2013 Aug 15
0
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
Codeprepare and independent blocks are introducing these loads and stores. These are prepasses that polly runs prior to building the dependence graph to transform scalar dependences into data dependences. Ether was working on eliminating the rewrite of scalar dependences. On Thu, Aug 15, 2013 at 5:32 AM, Star Tan <tanmx_star at yeah.net> wrote: > Hi all, > > I have investigated the
2013 Aug 16
2
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
Hi Sebpop, Thanks for your explanation. I noticed that Polly would finally run the SROA pass to transform these load/store instructions into scalar operations. Is it possible to run such a pass before polly-dependence analysis? Star Tan At 2013-08-15 21:12:53,"Sebastian Pop" <sebpop at gmail.com> wrote: >Codeprepare and independent blocks are introducing these loads and
2013 Aug 16
0
[LLVMdev] [Polly] Analysis of extra compile-time overhead for simple nested loops
I do not think that running SROA before polly is a good idea: it would defeat the purpose of the code preparation passes that polly intentionally schedules for the data dependence analysis. If you remove the data references before polly runs, you would miss them in the dependence graph: that could lead to incorrect transforms. On Thu, Aug 15, 2013 at 7:28 PM, Star Tan <tanmx_star at