Displaying 20 results from an estimated 700 matches similar to: "[LLVMdev] loop vectorizer"
2013 Oct 30
0
[LLVMdev] loop vectorizer
Hi Frank,
The access pattern to arrays a and b is non-linear. Unrolled loops are usually handled by the SLP-vectorizer. Are ir0 and ir1 consecutive for all values for i ?
Thanks,
Nadav
On Oct 30, 2013, at 9:05 AM, Frank Winter <fwinter at jlab.org> wrote:
> The loop vectorizer seems to be not able to vectorize the following code:
>
> void bar(std::uint64_t start,
2013 Oct 30
3
[LLVMdev] loop vectorizer
----- Original Message -----
>
>
> I ran the BB vectorizer as I guess this is the SLP vectorizer.
No, while the BB vectorizer is doing a form of SLP vectorization, there is a separate SLP vectorization pass which uses a different algorithm. You can pass -vectorize-slp to opt.
-Hal
>
> BBV: using target information
> BBV: fusing loop #1 for for.body in _Z3barmmPfS_S_...
2013 Oct 30
0
[LLVMdev] loop vectorizer
The SLP vectorizer apparently did something in the prologue of the
function (where storing of arguments on the stack happens) which then
got eliminated later on (since I don't see any vector instructions in
the final IR). Below the debug output of the SLP pass:
Args: opt -O1 -vectorize-slp -debug loop.ll -S
SLP: Analyzing blocks in _Z3barmmPfS_S_.
SLP: Found 2 stores to vectorize.
SLP:
2013 Oct 30
2
[LLVMdev] loop vectorizer
The debug messages are misleading. They should read “trying to vectorize a list of …”; The problem is that the SCEV analysis is unable to detect that C[ir0] and C[ir1] are consecutive. Is this loop from an important benchmark ?
Thanks,
Nadav
On Oct 30, 2013, at 11:13 AM, Frank Winter <fwinter at jlab.org> wrote:
> The SLP vectorizer apparently did something in the prologue of the
2013 Oct 31
0
[LLVMdev] loop vectorizer
>> What needs to be done (on a high level) in order to have the auto vectorizer succeed on the test function as given erlier?
> Maybe you could rewrite the loop in a way that will expose contiguous memory accesses. Is this something you could do ?
>
Hi Nadav,
the only option I see is to unroll the loop by hand. Since the array
access is consecutive over 4 loop iterations I gave it a
2013 Oct 31
2
[LLVMdev] loop vectorizer
On Oct 30, 2013, at 6:10 PM, Frank Winter <fwinter at jlab.org> wrote:
> the only option I see is to unroll the loop by hand. Since the array access is consecutive over 4 loop iterations I gave it a try and unrolled the loop by a factor of 4. Which gives the following array accesses:
>
> loop iter 0:
> index_0 = 0 index_1 = 4
> index_0 = 1 index_1 = 5
> index_0 = 2
2013 Oct 30
0
[LLVMdev] loop vectorizer
I ran the BB vectorizer as I guess this is the SLP vectorizer.
BBV: using target information
BBV: fusing loop #1 for for.body in _Z3barmmPfS_S_...
BBV: found 2 instructions with candidate pairs
BBV: found 0 pair connections.
BBV: done!
However, this was run on the unrolled loop (I guess).
Here is the IR printed by 'opt':
entry:
%cmp9 = icmp ult i64 %start, %end
br i1 %cmp9, label
2013 Oct 30
3
[LLVMdev] loop vectorizer
On 30 October 2013 09:25, Nadav Rotem <nrotem at apple.com> wrote:
> The access pattern to arrays a and b is non-linear. Unrolled loops are
> usually handled by the SLP-vectorizer. Are ir0 and ir1 consecutive for all
> values for i ?
>
Based on his list of values, it seems that the induction stride is linear
within each block of 4 iterations, but it's not a clear
2013 Oct 30
0
[LLVMdev] loop vectorizer
Well, they are not directly consecutive. They are consecutive with a
constant offset or stride:
ir1 = ir0 + 4
If I rewrite the function in this form
void bar(std::uint64_t start, std::uint64_t end, float * __restrict__
c, float * __restrict__ a, float * __restrict__ b)
{
const std::uint64_t inner = 4;
for (std::uint64_t i = start ; i < end ; ++i ) {
const std::uint64_t ir0 = (
2013 Oct 30
3
[LLVMdev] loop vectorizer
Hi Frank,
> We are looking at a variety of target architectures. Ultimately we aim to run on BG/Q and Intel Xeon Phi (native). However, running on those architectures with the LLVM technology is planned in some future. As a first step we would target vanilla x86 with SSE/AVX 128/256 as a proof-of-concept.
Great! It should be easy to support these targets. When you said wide-vectors I assumed
2013 Oct 31
0
[LLVMdev] loop vectorizer
I tried the following on the hand-unrolled loop:
const std::uint64_t ir0 = i*8+0; // working
const std::uint64_t ir0 = i%4+0; // working
const std::uint64_t ir0 = (i+0)%4; // not working
'+0' means +1,+2,+3 in the unrolled iterations.
'Working' means the SLP vectorizer succeeded.
Thus, when working 'towards' the correct index function, auto
2013 Nov 01
2
[LLVMdev] loop vectorizer: this loop is not worth vectorizing
I am trying a setup where the one loop is rewritten as two loops. This
avoids the 'rem' and 'div' instructions in the index calculation (which
give the loop vectorizer a hard time).
However, with this setup the loop vectorizer complains about a too small
loop.
LV: Checking a loop in "main"
LV: Found a loop: L3
LV: Found a loop with a very small trip count. This loop
2013 Nov 01
0
[LLVMdev] loop vectorizer: this loop is not worth vectorizing
In the case when coming from C it was probably the loop unroller and SLP
vectorizer which vectorized the code. Potentially I could do the same in
the IR. However, the loop body that is generated in the IR can get very
large. Thus, the loop unroller will refuse to unroll the loop in a large
number of (important) cases.
Isn't there a way to convince the loop vectorizer that it should
2017 Feb 13
2
RFC: Representing unions in TBAA
Hello all,
I'm new to the llvm community. I'm learning how things work. I noticed
that there has been some interest in improving how unions are handled. Bug
21725 is one example. I figured it might be a interesting place to start.
I discussed this with a couple people, and below is a suggestion on how
to represent unions. I would like some comments on how this fits in with
how
2012 Jul 15
2
[LLVMdev] Issue with Machine Verifier and earlyclobber
On Jul 15, 2012, at 9:20 AM, Borja Ferrer <borja.ferav at gmail.com> wrote:
> Jakob, one more hint, I've placed some asserts around the code you added and noticed that the InlineSpiller::insertReload() function is not being called.
>
> 2012/7/14 Borja Ferrer <borja.ferav at gmail.com>
> Hello Jakob,
>
> I'm still getting the error, I can give you any other
2016 Apr 08
2
LIBCLC with LLVM 3.9 Trunk
It's not clear what is actually wrong from your original message, I think
you need to give some more information as to what you are doing: Example
source, what target GPU, compiler error messages or other evidence of "it's
wrong" (llvm IR, disassembly, etc) ...
--
Mats
On 8 April 2016 at 09:55, Liu Xin via llvm-dev <llvm-dev at lists.llvm.org>
wrote:
> I built it
2013 Oct 31
5
[LLVMdev] loop vectorizer
On 30 October 2013 18:40, Frank Winter <fwinter at jlab.org> wrote:
> const std::uint64_t ir0 = (i+0)%4; // not working
>
I thought this would be the case when I saw the original expression. Maybe
we need to teach module arithmetic to SCEV?
--renato
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
2013 Oct 31
3
[LLVMdev] loop vectorizer misses opportunity, exploit
Hi Frank,
This loop should be vectorized by the SLP-vectorizer. It has several scalars (C[0], C[1] … ) that can be merged into a vector. The SLP vectorizer can’t figure out that the stores are consecutive because SCEV can’t analyze the OR in the index calculation:
%2 = and i64 %i.04, 3
%3 = lshr i64 %i.04, 2
%4 = shl i64 %3, 3
%5 = or i64 %4, %2
%11 = getelementptr inbounds float*
2013 Oct 31
3
[LLVMdev] loop vectorizer misses opportunity, exploit
----- Original Message -----
>
> Hi Nadav,
>
> that's the whole point of it. I can't in general make the index
> calculation simpler. The example given is the simplest non-trivial
> index function that is needed. It might well be that it's that
> simple that the index calculation in this case can be thrown aways
> altogether and - as you say - be replaced by
2013 Oct 31
0
[LLVMdev] loop vectorizer misses opportunity, exploit
A quite small but yet complete example function which all vectorization
passes fail to optimize:
#include <cstdint>
#include <iostream>
void bar(std::uint64_t start, std::uint64_t end, float * __restrict__
c, float * __restrict__ a, float * __restrict__ b)
{
for ( std::uint64_t i = start ; i < end ; i += 4 ) {
{
const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);